Created on 26 Mar 2020; Modified on 27 Mar 2020; Translation: Italian

How I update articles about Coronavirus

I wrote a couple of articles about the progress of the Coronavirus COVID-19 outbreak. After that, I had the problem of updating the data daily. Doing it manually was a significant waste of time. So why not make the computer do its job?

Introduction

In this period, here in Italy, we are hypnotized by the data on the Coronavirus COVID-19 epidemic: data that change day after day.

Those who, like me, have written articles about this epidemic face the problem of updating the data those articles refer to.

Doing it manually is not viable. It takes at least half an hour, if not more, and has to be done every day for months. Boring!

So, let's grab monitor and keyboard and put our beloved/hated life partner to work: the computer.

General organization

The idea is to have an article made up of text that doesn't change over time, interspersed with areas that hold graphs and data tables whose content is automatically updated as often as we wish.

In the following, we call this artifact the template.

To achieve this we use:

  • the reStructuredText language, to write the text of the template;
  • the Python language, to create the overall algorithm, download the updated data from the Internet, and merge the data into the template;
  • the Python library Pandas, to import the updated data in a standard tabular form, process them as needed, and build the tables to be displayed.

Sure, this is no small thing: we have a lot on the fire. But, I assure you, using another language such as C, which is faster in execution but much more demanding during development, would have been much worse. As we will see, with Python the task is less challenging than it seems at first glance.

The template

First things first: the template.

As said, it is a matter of having a fixed structure into which we drop the data that vary. These data can be of any type: text, images, tables.

Remember that we usually write our articles in reStructuredText? Well, let's keep doing so. Simply, wherever we want to write something that changes over time, we put a word that begins with $.

Why the $? It is the sign that tells the Python library that manages templates where to make a replacement and with what. For example, let's take a piece of text from the template for the trend of the infection in the world [1]:

... <cut> ...
Below is a summary table of the situation for the twenty most affected countries
as of the update date of this document. It reports total cases,
deaths, and the ratio of deaths to total cases.

.. csv-table:: situation of the twenty most affected countries as of $UPDATED

$DATA_TABLE

... <cut> ...

In this example we have the placeholders $UPDATED and $DATA_TABLE. As you might guess, we want to put the update date in place of $UPDATED, and a table with updated data, in csv format, in place of $DATA_TABLE.

How do we merge template and data? Suppose we have:

  • in template_text the text of the previous example;
  • in adate the update date, as a string;
  • in data_table the csv lines which we want to insert in place of $DATA_TABLE;

we can proceed as follows:

from string import Template

# ...<cut>... code to instantiate template_text, adate, data_table
d = {'UPDATED': adate,
     'DATA_TABLE': data_table,
}
tmpl = Template(template_text)
article = tmpl.safe_substitute(d)
# ...<cut>... code to write article to file

In practice:

  • we insert our data in the dictionary d;
  • we create the object tmpl of type Template using template_text;
  • we ask the object tmpl to merge in the data present in d.

We did it in six lines of code, not counting the ancillary code that prepares the data in the previous list and saves the result to a file. Not bad.
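
As a self-contained illustration of the merge step, here is a minimal sketch. The template is a shortened, hypothetical version of the real one, and the values of adate and data_table are made up for the example. It also shows how safe_substitute differs from substitute: the former leaves placeholders without a value in place, while the latter raises KeyError.

```python
from string import Template

# Shortened, hypothetical template; the real one is longer.
template_text = """\
.. csv-table:: situation as of $UPDATED

$DATA_TABLE
"""

# Illustrative values only.
adate = "2020-03-26"
data_table = "  date,cases,country\n  2020-03-26,81968,China"

tmpl = Template(template_text)
article = tmpl.safe_substitute({"UPDATED": adate, "DATA_TABLE": data_table})
print(article)

# safe_substitute leaves a placeholder untouched when no value is given ...
partial = tmpl.safe_substitute({"UPDATED": adate})

# ... whereas substitute raises KeyError for the missing placeholder.
try:
    tmpl.substitute({"UPDATED": adate})
    missing = None
except KeyError as err:
    missing = err.args[0]
```

Using safe_substitute is forgiving while a template is still being written; substitute may be preferable once we want a forgotten placeholder to be reported as an error.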

Data table in csv format

Now let's consider why we are doing all this. The table that we want to insert in place of $DATA_TABLE will be recalculated every day, as the epidemic data change.

Let's say we want to insert a table like this:

situation of the two most affected countries until 2020-03-26

date         cases   country
2020-03-26   81968     China
2020-03-26   74386     Italy

Clearly, the data in the table on March 27 will differ from those shown here, which were calculated with the data of March 26.

We need:

  • a source from which to download updated data, and
  • a tool that allows us to easily load, process and represent the data.

The source from which to download the updated data depends on what interests us. In this case, we refer to a URL published by the European Centre for Disease Prevention and Control (ECDC).

Suppose we have this URL in the variable url and we want to save the data in a file whose name is stored in the variable fname. Using the standard library urllib, we can download and save the file like this [2]:

import urllib.request

# ...<cut>... code to instantiate variables url and fname
with urllib.request.urlopen(url) as response:
    from_web = response.read()
with open(fname, 'wb') as f:
    f.write(from_web)
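
The same download can be wrapped in a small helper, sketched below. The function name download, its timeout parameter and the 30 second default are assumptions made for this example, not part of the original project. Instead of reading the whole response into memory, it streams it to disk with shutil.copyfileobj.

```python
import shutil
import urllib.request


def download(url, fname, timeout=30):
    """Stream the resource at url into the local file fname.

    Hypothetical helper: the name and the timeout default are
    illustrative choices.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(fname, "wb") as f:
            # Copy in chunks, so large files never sit entirely in memory.
            shutil.copyfileobj(response, f)
    return fname
```

With the variables described above, the call would simply be download(url, fname).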

Having saved the data file from the Internet to our computer, we now use Pandas to import the data in tabular form and start processing them.

Importing data is not complex. If the name of the file containing the data is in fname:

import pandas as pd

# ...<cut>... code to instantiate variable fname
df = pd.read_csv(fname)

imports the data into the variable df.

df is short for DataFrame: the tabular data structure you typically work with in Pandas.

In this article we are not going to explore the possibilities of Pandas: we don't know it well enough, and there are really many of them. So we are content to understand quickly how to go from a DataFrame loaded from the csv file, structured like this [3]:

           date  ...  cases  ...                   country  id
1060 2020-03-18  ...     33  ...                     China  CN
1061 2020-03-17  ...    110  ...                     China  CN
1062 2020-03-16  ...     25  ...                     China  CN
...         ...  ...    ...  ...                       ...  ..
5441 2020-01-01  ...      0  ...  United_States_of_America  US
5442 2019-12-31  ...      0  ...  United_States_of_America  US

where every single row reports, among other things, the new positive cases observed during the day for the given country, to a DataFrame structured as follows:

date         cases   country
2020-03-25   81847     China
2020-03-25   69176     Italy
...

where each row reports the total positive cases for the country and the date of the last observation received.

In practice we can run this code:

import pandas as pd

# ...<cut>... code to instantiate variable df
l = []           # will be: [(date, sum, country), (date, sum, country), ...]
countries = df['country'].drop_duplicates()                 # get the distinct countries
for country in countries:
    country_df = df.loc[df['country']==country].copy()      # get a copy of all records of a country
    the_cases = country_df['cases'].sum()                   # total of cases
    the_date = country_df['date'].max()                     # the greatest date for this country
    l.append((the_date, the_cases, country))
df2 = pd.DataFrame(l, columns=['date', 'cases','country'])  # create a new dataframe with these values
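
For comparison, the same per-country summary can probably be obtained with a Pandas groupby. This is only a sketch, not the code used in the project: the sample data are invented (in the shape of the time series shown earlier), and the function name summarize and its top parameter are assumptions introduced for the example.

```python
import pandas as pd

# Hypothetical sample: one row per country per day, with the new cases of that day.
df = pd.DataFrame({
    "date": ["2020-03-16", "2020-03-17", "2020-03-18", "2020-03-17", "2020-03-18"],
    "cases": [25, 110, 33, 40, 60],
    "country": ["China", "China", "China", "Italy", "Italy"],
})


def summarize(frame, top=20):
    """Return total cases and latest date per country, most affected first."""
    totals = (
        frame.groupby("country", as_index=False)
             .agg(date=("date", "max"), cases=("cases", "sum"))
    )
    # Keep only the `top` countries with the highest totals.
    totals = totals.sort_values("cases", ascending=False).head(top)
    # Same column order as the table shown in the article.
    return totals[["date", "cases", "country"]].reset_index(drop=True)


df2 = summarize(df)
print(df2)
```

The explicit loop above is easier to follow step by step; the groupby version lets Pandas do the iteration and also takes care of sorting and of keeping only the most affected countries.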

With eight lines of code we get the dataframe we are interested in. With the following, we assign it to the variable data_table, ready to be merged into the template:

import pandas as pd

# ...<cut>... code to instantiate variable df2
# a single big string with every line starting at the first column and ending with \n
data_table = df2.to_csv(index=False)
# every line must be indented by a couple of spaces; so:
lines = data_table.splitlines()             # splitting in lines (no empty trailing item)
lines = ['  '+line for line in lines]       # two spaces on left side
data_table = '\n'.join(lines)               # rejoining lines
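
As a possible shortcut, the standard library function textwrap.indent adds a prefix to every non-empty line of a string. The sketch below uses an illustrative csv string in place of the output of df2.to_csv(index=False).

```python
import textwrap

# Illustrative csv text, shaped like the output of df2.to_csv(index=False).
csv_text = "date,cases,country\n2020-03-26,81968,China\n2020-03-26,74386,Italy\n"

# Prefix each non-empty line with two spaces.
data_table = textwrap.indent(csv_text, "  ")
print(data_table)
```

Lines that contain only whitespace are left without the prefix, so no stray indented empty line ends up in the table.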

Here we had to overdo it: four more lines just to add that blessed white space on the left, which we need to indent the contents of the table.

Conclusion

At this point we have all the bricks we need to build a system that automatically updates the data in the article.

If you are interested in reading all the project code, stay tuned: I have to add some comments and then I will publish it as open source on GitHub. When I do, I will update this article with the related link.

Enjoy. ldfa


[1]

For those who don't know reStructuredText:

  • .. csv-table:: title is the directive that tells the reStructuredText library to interpret what follows as a table in csv format.

Note that the content of the table must be:

  • separated from the directive by an empty line;
  • indented with respect to the directive, for example with a couple of spaces before each line of the content.

The marker ... <cut> ... in the example, instead, indicates that we omitted part of the template.

[2] I admit it: whenever I read something from the Internet in a program, I am amazed by how easy Python makes it.
[3] This is a time series: each country has a set of dates, and for each date the number of cases observed during the day is reported.