Created on 26 Mar 2020 ;    Modified on 23 Aug 2020 ;    Translation: italian

How I update articles about Coronavirus

I wrote a couple of articles about the progress of the Coronavirus COVID-19 outbreak. After that, I had the problem of updating their data daily. Doing it manually was a significant waste of time. So, why not make the computer do the job?

Introduction

In this period, here in Italy, we are hypnotized by the data relating to the Coronavirus Covid-19 epidemic. Data that change day after day.

For those who, like myself, have written some articles about this epidemic, the problem arises of updating the data referred to in the articles.

Doing it manually is not a viable way. It takes at least half an hour, if not more, and must be performed every day for months. What a bore!

So, let's grab monitor and keyboard and put to work our beloved/hated life partner: the computer.

General organization

The idea is to have an article made up of text that doesn't change over time, interspersed with some areas, which will house graphs and data tables whose content is automatically updated as often as we wish.

In the following, we will call this artifact the template.

To achieve this we use:

  • the reStructuredText language, to write the text of the template;
  • the Python language, to create the general algorithm to follow, to download the updated data from the Internet, and to merge the data into the template;
  • the Python library Pandas, to import the updated data in a standard tabular form, to process it as needed, and to build the tables to be displayed.

Sure, this is not a small thing: we have a lot of stuff cooking on the fire. But, I assure you, with another language, for example C, faster in execution but also much more demanding during development, it would have been much worse. As we will see, with Python the job is less challenging than it seems at first glance.

The template

First things first: the template.

As said, it is a matter of having a fixed structure into which to drop the data that can vary. And these data can be of any type: text, images, tables.

Remember that we usually use reStructuredText to write our articles? Well, let's keep it that way. Simply, wherever we want to write something that changes over time, we put a word that begins with $.

Why the $? This is the sign that tells the Python library which manages the templates where to make a replacement, and with what. For example, let's take a piece of text from the template for the trend of infection in the world [1]:

... <cut> ...
Below is a summary table of the situation for the twenty most affected countries
as of the update date of this document. Total cases, deaths,
and the ratio of deaths to total cases are reported.

.. csv-table:: situation of the twenty most affected countries as of $UPDATED

$DATA_TABLE

... <cut> ...

In this example we have the placeholders $UPDATED and $DATA_TABLE. As you can guess, we want to substitute the update date for $UPDATED, and a table with updated data, in csv format, for $DATA_TABLE.

How do we merge template and data? Suppose we have:

  • in template_text the text of the previous example;
  • in adate the update date, as a string;
  • in data_table the csv lines which we want to insert in place of $DATA_TABLE;

we can proceed as follows:

from string   import Template

# ...<cut>... code to instantiate template_text, adate, data_table
d = {'UPDATED': adate,
    'DATA_TABLE': data_table,
}
tmpl = Template(template_text)
article = tmpl.safe_substitute(d)
# ...<cut>... code to write article to file

Practically:

  • we insert our data in the dictionary d;
  • we create the object tmpl of type Template using template_text;
  • we ask the object tmpl to merge the data present in d.

We did it in six lines of code, excluding the accessory code to prepare the data listed above and to save the result to file. Not bad.
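A side note on safe_substitute: unlike substitute, it does not raise an error when a placeholder has no matching key, so the merge can even be done in several passes. A minimal sketch, using the placeholder names of the example above:

```python
from string import Template

# A template with the two placeholders of the example.
tmpl = Template("updated at $UPDATED\n\n$DATA_TABLE\n")

# safe_substitute replaces what it can and leaves the rest untouched...
partial = tmpl.safe_substitute({'UPDATED': '2020-03-26'})
print(partial)            # $DATA_TABLE is still in place

# ...so a later pass can fill in the remaining placeholder.
article = Template(partial).safe_substitute({'DATA_TABLE': '  a,b,c'})
print(article)
```

Had we used substitute instead, the first pass would have raised a KeyError for the missing DATA_TABLE key.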

Data table in csv format

Now let's consider why we are doing all this. The table that we want to insert in place of $DATA_TABLE will be recalculated every day, as the epidemic data changes.

Let's say we want to insert a table like this:

situation of the two most affected countries until 2020-03-26

date         cases   country
2020-03-26   81968   China
2020-03-26   74386   Italy

we understand that the data in this table on March 27th will differ from those reported here, which were calculated with the data of March 26th.

We need:

  • a source from which to download updated data, and
  • a tool that allows us to easily load, process, and represent the data.

The source from which to download the updated data is chosen based on what interests us. In this case, we refer to this URL of the European Centre for Disease Prevention and Control.

Suppose we have the aforementioned URL stored in the url variable, and we want to save the data in a file whose name is stored in the variable fname. Using the standard library urllib, we can download and save the file like this [2]:

import urllib.request

# ...<cut>... code to instantiate variables url and fname
with urllib.request.urlopen(url) as response:
    from_web = response.read()
with open(fname, 'wb') as f:
    f.write(from_web)

Having saved the data file from the Internet to our computer, we now use Pandas to import the data in tabular form and start processing it.

Importing data is not complex. If the name of the file containing the data is in fname:

import pandas as pd

# ...<cut>... code to instantiate variable fname
df = pd.read_csv(fname)

imports the data into the variable df.

df is short for DataFrame: the tabular data structure on which you typically work when using Pandas.
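To get a first feel for what read_csv produces, here is a quick sketch on a tiny synthetic csv; the column names are our assumption, mirroring the fields used later, and io.StringIO stands in for the downloaded file:

```python
import io
import pandas as pd

# A tiny csv in the same shape we use later (date, cases, country).
csv_text = ("date,cases,country\n"
            "2020-03-26,3500,Italy\n"
            "2020-03-25,5000,Italy\n")
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)           # rows and columns: (2, 3)
print(list(df.columns))   # ['date', 'cases', 'country']
print(df.head())          # first rows, to eyeball the content
```

df.shape, df.columns and df.head() are the usual quick checks that the import went as expected.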

In this article we are not going to explore the possibilities of Pandas. We don't know it that well, and they are really many. So we are content to quickly understand how to pass from a DataFrame coming from the csv file, structured like this [3]:

date             ...  cases  ...                   country  id
1060 2020-03-18  ...     33  ...                     China  CN
1061 2020-03-17  ...    110  ...                     China  CN
1062 2020-03-16  ...     25  ...                     China  CN
...         ...  ...    ...  ...                       ...  ..
5441 2020-01-01  ...      0  ...  United_States_of_America  US
5442 2019-12-31  ...      0  ...  United_States_of_America  US

where every single row reports, among other things, the new positive cases observed during the day for the indicated country, to a DataFrame structured as follows:

date         cases   country
2020-03-25   81847     China
2020-03-25   69176     Italy
...

where each row reports the total positive cases for the country and the date of the last observation received.

In practice we can run this code:

import pandas as pd

# ...<cut>... code to instantiate variable df
l = []           # will be: [[date, sum, country], [date, sum, country], ...]
countries = df['country'].drop_duplicates()                 # get a copy of all countries
for country in countries:
    country_df = df.loc[df['country']==country].copy()      # get a copy of all records of a country
    the_cases = country_df['cases'].sum()                   # total of cases
    the_date = country_df['date'].max()                     # the greatest date for this country
    l.append((the_date, the_cases, country[:]))
df2 = pd.DataFrame(l, columns=['date', 'cases','country'])  # create a new dataframe with these values

With eight lines of code we get the dataframe that interests us. And with the following lines we assign it to the variable data_table, to bring it into the template:

import pandas as pd

# ...<cut>... code to instantiate variable df2
# a single big string with every line starting from 1st column on left side and ending with \n
data_table = df2.to_csv(index=False)
# we need a couple of spaces from 1st column on left side for every line; so:
lines = data_table.split('\n')              # splitting in lines
lines = ['  '+line for line in lines]       # two spaces on left side
data_table = '\n'.join(lines)               # rejoining lines

Here we had to overdo it: four more lines to add those blessed white spaces on the left, which we need to indent the contents of the table.
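As an aside, both the aggregation loop and the indentation step can be written more compactly with Pandas' groupby and the standard textwrap module. A hedged sketch on a tiny synthetic csv (the column names mirror those used above; io.StringIO stands in for the real file):

```python
import io
import textwrap
import pandas as pd

# Synthetic data in the same assumed shape: date, cases, country.
csv_text = ("date,cases,country\n"
            "2020-03-26,121,China\n"
            "2020-03-25,81847,China\n"
            "2020-03-26,5210,Italy\n")
df = pd.read_csv(io.StringIO(csv_text))

# One row per country: the greatest date and the summed cases.
df2 = (df.groupby('country', as_index=False)
         .agg(date=('date', 'max'), cases=('cases', 'sum'))
         [['date', 'cases', 'country']])

# Indent every line of the csv output with two spaces in one call.
data_table = textwrap.indent(df2.to_csv(index=False), '  ')
print(data_table)
```

The result is equivalent to the loop-plus-split version above; whether it is more readable is a matter of taste.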

Conclusion

At this point we have all the bricks we need to build our automatic update system of data entered in the article.

If you are interested in the full project code, you can read it in this repository on GitHub.

Enjoy. ldfa


Update on the 23rd of August 2020

As mentioned, this project was born to update daily four articles that graph the temporal trends of Covid-19 in Italy and in the world.

Up to now I was running these two commands:

python world.py
python italy.py

then I updated the site by manually uploading the four generated articles.

After several months of manually uploading these articles every day, I got tired of this repetitive action. So I decided to use the requests library to make the necessary POSTs to upload the articles from a program.

The program is fu.py, short for file upload.

I talk about it here because I had some problems developing it. Until I put a breakpoint in the development server, I couldn't figure out why I wasn't uploading the files. If you are interested, here is a summary of how to do it. As usual, you can get the full details by reading the entire program on GitHub.

We will have a configuration file in which to put sensitive data:

[fu]

ARTICLES = 001_article1.rst,002article1.it.rst
URLL = https://website/login/
URL  = https://website/blog/load-article
AUTH = user,password

Most of the program is as below:

 1  #...
 2  import configparser
 3  #...
 4  import requests
 5 
 6  # ... loading article list, user, password, URL to load article,
 7  #     URLL to login user, from configuration file (see https://github.com/l-dfa/COVID-19-italia)
 8 
 9  def main():
10      with requests.Session() as s:          # using session 'cause we need cookies to handle csrf token and user login
 11          lg = s.get(URL)                                      # get the upload page AND ...
12          csrf_token = lg.cookies['csrftoken']                 # ... csrf
13          login_data = {'csrfmiddlewaretoken': csrf_token,     # program MUST login user to upload files
14                        'username': AUTH[0],
15                        'password': AUTH[1],
16                        'next': '/blog/load-article'}
17          rl = s.post(URLL, data=login_data)                   # posting to login (NOT article: weird, isn't it?). it switches to load-article AND ...
18          csrf_token = rl.cookies['csrftoken']                 # ... it CHANGES csrf
19          # ...
20 
21          load_article_data = {'csrfmiddlewaretoken': csrf_token, }
22          for article in ARTICLES:
23              files = {'article': open(ARTICLE_PATH + '/' + article, 'rb')}  # the field name is article, not file
24              ru = s.post(URL, files=files, data=load_article_data)
25          # ...
26 
27  # ...

The main function assumes the following global variables:

  • ARTICLES, the list of articles to upload;
  • URLL, the address of the login page;
  • URL, the address of the upload page;
  • AUTH, a tuple with the username and password to use;
  • ARTICLE_PATH, the directory where the articles are found.

Therefore:

  • on line 10 it instantiates a session that will keep track of the cookies; we need it because here we have the csrf token used by Django in the forms;
  • on lines 13-16 we set the data to login, since only a logged in user can upload files;
  • and on line 17 we send them to the login form; finding this out cost us quite a bit of time;
  • just as it took us a long time to understand that Django also changed the csrf token (line 18);
  • now we are ready: we can load the desired files with the loop at lines 22-24, remembering
    • that in our form the name of the field is article, and not file, as usually indicated in various articles on the Internet,
    • that the URL to send is the upload one and not the login one.
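A small remark on line 23: the file opened for files= is never explicitly closed. A hedged variant of the upload loop using a with-statement, written as a self-contained sketch (a stub stands in for the requests session, since we cannot reach the real site here; the names mirror the fu.py globals):

```python
import os
import tempfile

class StubSession:
    """Stands in for requests.Session, so the loop can run offline."""
    def post(self, url, files=None, data=None):
        # reading proves the file is still open while posting
        return files['article'].read()

# A hypothetical article on disk, just for the sketch.
ARTICLE_PATH = tempfile.mkdtemp()
ARTICLES = ['001_article1.rst']
with open(os.path.join(ARTICLE_PATH, ARTICLES[0]), 'wb') as f:
    f.write(b'some rst text')

s = StubSession()
load_article_data = {'csrfmiddlewaretoken': 'dummy'}
uploaded = []
for article in ARTICLES:
    # the with-statement closes the handle even if the post raises
    with open(os.path.join(ARTICLE_PATH, article), 'rb') as fh:
        uploaded.append(s.post('https://website/blog/load-article',
                               files={'article': fh},
                               data=load_article_data))
```

With only four files uploaded per day the leak is harmless, but the context manager costs nothing and keeps the loop tidy.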

[1]

For those who don't know reStructuredText:

  • .. csv-table:: title is the directive that tells the reStructuredText library to interpret what follows as a table in csv format.

Note that the content of the table must be:

  • spaced from the directive with an empty line;
  • indented with respect to the directive, for example with a couple of whitespaces before each line that forms the content.

The indicator ... <cut> ... in the example instead marks the fact that we omitted part of the template.

[2]I admit it: when I read something from the Internet in a program, I am amazed at how easily this can be done using Python.
[3]This is a time series: each nation has a group of dates, and for each date the number of observations during the day is reported.