Created on 26 Mar 2020 ; Modified on 23 Aug 2020 ; Translation: italian
I wrote a couple of articles about the progress of the Coronavirus COVID-19 outbreak. After that, I had the problem of updating their data daily. Doing it manually was a significant waste of time. So, why not make the computer do the job?
In this period, here in Italy, we are hypnotized by the data relating to the Coronavirus COVID-19 epidemic. Data that change day after day.
For those who, like myself, have written some articles about this epidemic, the problem arises of updating the data referred to in those articles.
Doing it manually is not a viable way. It takes at least half an hour, if not more, to be performed every day for months. Boring!
So, let's grab monitor and keyboard and put to work our beloved/hated life partner: the computer.
The idea is to have an article made up of text that doesn't change over time, interspersed with some areas which will host graphs and data tables, whose content is automatically updated as often as we wish.
In the following, we will call this artifact a template.
To achieve this we use:
Sure, this is no small thing: we have a lot of stuff cooking on the fire. But, I assure you, using another language, for example C, faster in execution but also much more demanding during development, it would have been much worse. As we will see, with Python the task is less challenging than it seems at first glance.
First things first: the template.
As said, it is a matter of having a fixed structure within which to drop the data that can vary. And these data can be of any type: text, images, tables.
Remember that we usually use reStructuredText to write our articles? Well, let's keep it that way. Simply, where we want to write something that changes over time, we put a word that begins with $.
Why the $? This is the sign that tells the Python library which manages the templates where to make a replacement, and with what. For example, let's take a piece of text from the template for the trend of infection in the world [1]:
  ... <cut> ...

  Below is a summary table of the situation for the twenty most affected
  countries as of the update date of this document. Total cases are
  reported, then deaths, and the ratio between deaths and total cases.

  .. csv-table:: situation of the twenty most affected countries as of $UPDATED

  $DATA_TABLE

  ... <cut> ...
In this example we have the placeholders $UPDATED and $DATA_TABLE. As you can guess, we want to replace $UPDATED with the update date, and $DATA_TABLE with a table of updated data, using the csv format.
How do we merge template and data? Suppose we have:
we can proceed as follows:
  from string import Template

  # ...<cut>... code to instantiate template_text, adate, data_table

  d = {'UPDATED':    adate,
       'DATA_TABLE': data_table, }
  tmpl = Template(template_text)
  article = tmpl.safe_substitute(d)

  # ...<cut>... code to write article to file
In practice:
We did it in six lines of code, not counting the accessory code to prepare the data in the previous list and to save the result to a file. Not bad.
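Just to see the mechanism at work, here is a minimal, self-contained run with toy values in place of the real template text and data (the names mirror the snippet above, the values are made up):

  from string import Template

  # toy values, just to show the substitution at work (not the real article data)
  template_text = ".. csv-table:: twenty most affected countries as of $UPDATED\n\n$DATA_TABLE\n"
  adate = '2020-03-26'
  data_table = '  date,cases,country\n  2020-03-26,81968,China\n  2020-03-26,74386,Italy'

  d = {'UPDATED': adate, 'DATA_TABLE': data_table, }
  print(Template(template_text).safe_substitute(d))   # prints the csv-table directive with date and data filled in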
Now let's consider why we are doing all this. The table that we want to insert in place of $DATA_TABLE will be recalculated every day, as the epidemic data changes.
Let's say we want to insert such a table:
  date       | cases | country
  2020-03-26 | 81968 | China
  2020-03-26 | 74386 | Italy
we understand that the data in the table on March 27 will be different from those reported here, which were calculated with the data of March 26.
We need:
The source from which to download the updated data is chosen based on what interests us. In this case, we refer to this URL of the European Centre for Disease Prevention and Control.
Suppose we have the aforementioned URL stored in the variable url and we want to save the data to a file whose name is stored in the variable fname. Using the standard library urllib, to download and save the file we can proceed like this [2]:
  import urllib.request

  # ...<cut>... code to instantiate variables url and fname

  with urllib.request.urlopen(url) as response:
      from_web = response.read()

  with open(fname, 'wb') as f:
      f.write(from_web)
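As a side note, since the file is a csv, Pandas (which we are about to use anyway) could probably read it straight from the URL; a sketch, assuming the resource at url really is a plain csv file. Here we prefer to keep a local copy of the data anyway.

  import pandas as pd

  # ...<cut>... code to instantiate variable url

  # a sketch: read the csv directly from the URL, skipping the local copy
  df = pd.read_csv(url)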
Having saved the data file from the Internet to our computer, we now use Pandas to import the data in tabular form and start processing it.
Importing data is not complex. If the name of the file containing the data is in fname:
  import pandas as pd

  # ...<cut>... code to instantiate variable fname

  df = pd.read_csv(fname)
imports the data into the variable df.
df is short for DataFrame: the tabular data structure you typically work with when using Pandas.
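If you want a first feel for what has just been imported, a couple of quick inspection calls are enough. A sketch; the actual output depends on the downloaded file:

  # ...<cut>... code to instantiate variable df

  print(df.shape)      # number of rows and columns
  print(df.columns)    # the column names (date, cases, country, ... in our case)
  print(df.head())     # the first five rows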
In this article we are not going to explore the possibilities of Pandas. We don't know it that well, and they are really many. So we are happy to quickly understand how to pass from a DataFrame coming from the csv file, structured like this [3]:
              date  ...  cases  ...                   country  id
  1060  2020-03-18  ...     33  ...                     China  CN
  1061  2020-03-17  ...    110  ...                     China  CN
  1062  2020-03-16  ...     25  ...                     China  CN
  ...          ...  ...    ...  ...                       ...  ..
  5441  2020-01-01  ...      0  ...  United_States_of_America  US
  5442  2019-12-31  ...      0  ...  United_States_of_America  US
where every single row reports, among other things, the new positive cases observed during the day for the given country, to a DataFrame structured as follows:
        date  cases  country
  2020-03-25  81847  China
  2020-03-25  69176  Italy
  ...
where each row reports the total positive cases for the country and the date of the last observation received.
In practice we can run this code:
  import pandas as pd

  # ...<cut>... code to instantiate variable df

  l = []                                                    # will be: [[date, sum, country], [date, sum, country], ...]
  countries = df['country'].drop_duplicates()               # get a copy of all countries
  for country in countries:
      country_df = df.loc[df['country']==country].copy()    # get a copy of all records of a country
      the_cases = country_df['cases'].sum()                 # total of cases
      the_date  = country_df['date'].max()                  # the greatest date for this country
      l.append((the_date, the_cases, country[:]))
  df2 = pd.DataFrame(l, columns=['date', 'cases', 'country'])   # create a new dataframe with these values
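As an aside, Pandas could probably do the same aggregation in a more compact way with groupby; a sketch, assuming the column names above (the explicit loop remains the version actually used here):

  import pandas as pd

  # ...<cut>... code to instantiate variable df

  # a sketch: total cases and last date per country, via groupby instead of the explicit loop
  df2_alt = (df.groupby('country', as_index=False)
               .agg(date=('date', 'max'), cases=('cases', 'sum')))
  df2_alt = df2_alt[['date', 'cases', 'country']]    # same column order as before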
With the eight lines of the explicit loop above we get the dataframe that interests us. And with the following we assign it to the variable data_table to bring it into the template:
  import pandas as pd

  # ...<cut>... code to instantiate variable df2

  # a single big string, with every line starting at the first column on the left and ending with \n
  data_table = df2.to_csv(index=False)

  # we need a couple of spaces at the beginning of every line; so:
  lines = data_table.split('\n')             # splitting in lines
  lines = ['  ' + line for line in lines]    # two spaces on the left side
  data_table = '\n'.join(lines)              # rejoining lines
Here we had to exaggerate: four more lines to add those blessed white spaces on the left, which we need to indent the contents of the table.
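For the record, the standard library has a shortcut for exactly this kind of indentation: textwrap.indent adds a prefix to every (non-blank) line in a single call. A sketch, assuming df2 as above:

  import textwrap

  # ...<cut>... code to instantiate variable df2

  # same result as the four lines above, in one call
  data_table = textwrap.indent(df2.to_csv(index=False), '  ')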
At this point we have all the bricks we need to build our automatic update system of data entered in the article.
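Glued together, the daily update looks more or less like the sketch below; names and file handling are illustrative, the real scripts (world.py and italy.py) are in the repository:

  import urllib.request
  import pandas as pd
  from string import Template

  def update_article(url, data_fname, template_fname, article_fname):
      # 1. download the csv data and save a local copy
      with urllib.request.urlopen(url) as response:
          from_web = response.read()
      with open(data_fname, 'wb') as f:
          f.write(from_web)
      # 2. build the per-country summary (last date, total cases)
      df = pd.read_csv(data_fname)
      l = []
      for country in df['country'].drop_duplicates():
          country_df = df.loc[df['country'] == country]
          l.append((country_df['date'].max(), country_df['cases'].sum(), country))
      df2 = pd.DataFrame(l, columns=['date', 'cases', 'country'])
      data_table = '\n'.join('  ' + line for line in df2.to_csv(index=False).split('\n'))
      # 3. drop date and table into the template and write the article
      with open(template_fname) as f:
          tmpl = Template(f.read())
      article = tmpl.safe_substitute({'UPDATED': df['date'].max(),   # last date present in the data
                                      'DATA_TABLE': data_table})
      with open(article_fname, 'w') as f:
          f.write(article)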
If you are interested in reading all the project code, you can read it from this repository on GitHub.
Enjoy. ldfa
As mentioned, this project was born to update daily the four articles that graph the temporal trends of covid-19 in Italy and in the world.
Until now I was running these two commands:
  python world.py
  python italy.py
and then I updated the site by manually uploading the four generated articles.
After several months of manually uploading these articles every day, I got tired of this repetitive action. So I decided to use the requests library to make the necessary POSTs and upload the articles using a program.
The program is fu.py: acronym for file upload.
I talk about it here because I ran into some problems while developing it. Until I put a breakpoint on the development server, I couldn't figure out why the files weren't being uploaded. If you are interested, here is a summary of how to do it. As usual, you can get the full details by reading the entire program on GitHub.
We will have a configuration file in which to put sensitive data:
  [fu]
  ARTICLES = 001_article1.rst,002article1.it.rst
  URLL = https://website/login/
  URL = https://website/blog/load-article
  AUTH = user,password
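Reading it back could look more or less like this; a sketch, in which the configuration file name is hypothetical (the real code is in the repository):

  import configparser

  config = configparser.ConfigParser()
  config.read('fu.ini')                            # hypothetical name of the configuration file

  ARTICLES = config['fu']['ARTICLES'].split(',')   # list of article file names
  URLL     = config['fu']['URLL']                  # login URL
  URL      = config['fu']['URL']                   # article upload URL
  AUTH     = config['fu']['AUTH'].split(',')       # [user, password]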
While the bulk of the program is as below:
  #...
  import configparser
  #...
  import requests

  # ... loading article list, user, password, URL to load article,
  #     URLL to login user, from configuration file (see https://github.com/l-dfa/COVID-19-italia)

  def main():
      with requests.Session() as s:                        # using session 'cause we need cookies to handle csrf token and user login
          lg = s.get(URL)                                  # get article AND ...
          csrf_token = lg.cookies['csrftoken']             # ... csrf
          login_data = {'csrfmiddlewaretoken': csrf_token, # program MUST login user to upload files
                        'username': AUTH[0],
                        'password': AUTH[1],
                        'next': '/blog/load-article'}
          rl = s.post(URLL, data=login_data)               # posting to login (NOT article: weird, isn't it?). it switches to load-article AND ...
          csrf_token = rl.cookies['csrftoken']             # ... it CHANGES csrf
          # ...

          load_article_data = {'csrfmiddlewaretoken': csrf_token, }
          for article in ARTICLES:
              files = {'article': open(ARTICLE_PATH + '/' + article, 'rb')}   # the field name is article, not file
              ru = s.post(URL, files=files, data=load_article_data)
              # ...

  # ...
The main function assumes the following global variables are defined:
Therefore:
[1] For those who don't know reStructuredText:
Pay attention to the fact that the content of the table must be:
Instead, the indicator ... <cut> ... in the example marks the fact that we omitted part of the template.
[2] I admit it: when I read something from the Internet in a program, I am amazed by the ease with which you can do this when using Python.
[3] This is a time series: each nation has a group of dates, and for each date the number of observations during that day is reported.