- Introduction
- Scenario
- Original Datasets and Mashup
- Data Analysis
- Website and Data Visualization
- Conclusion
The goal of this website is to explain, in a comprehensible way, the evolution of COVID-19 in Italy during the first wave of the pandemic (February to May 2020). We focus on the causes that made our country one of the most affected by the pandemic and on how these causes influenced each other, all by looking at 2019 data. We took into consideration many aspects of the virus that were probably underestimated or ignored at first, in order to give people a clear idea of what COVID-19 is and which countermeasures could be adopted to deal with it. Virulence (that's the name of the project) aims to tell this story through legal and ethical analysis, metadata exploitation and dataset cleaning, in order to provide a neat mashup of all the information sources gathered during the research. For convenience, the work has been split into four parts:
- Metadata, featuring the metadata of the Virulence dataset and of the datasets from the primary sources;
- Visualization, where data results are represented through comprehensible maps and graphs;
- Documentation, where every step of the research is explained and where we analyze the collected data;
- License, the license of our project.
Why did COVID-19 spread so rapidly in Italy, giving us one of the highest death rates in the world? To answer, we needed to look at the conditions that made it possible, gathering information through data and articles from multiple sources. The most relevant events often originate from the most unexpected causes, and here we have the chance to look at them carefully.
We now discuss the aspects taken into consideration during the research. The following factors were considered for all 20 Italian regions:
- Air pollution due to PM10;
- Temperatures;
- Density of population;
- Age of population;
- Number of hospitals.
All these factors were considered for 2019 only, that is, shortly before the official appearance of COVID-19. We did so in order to get an idea of how predisposed Italy actually was to an extensive spread of the virus.
The table below lists the datasets used for our project. In the next sections we analyze them from various points of view.
| ID | FILE | DESCRIPTION | DATASET | CATALOGUE | URI | LICENSE | LAST UPDATE | DOWNLOADED |
|---|---|---|---|---|---|---|---|---|
| 1:COVID | dpc-covid-19-ita-regioni.csv | COVID-19 data for every Italian region. We took the cases and deaths of the first pandemic wave (February to May 2020). | COVID-19 Monitoraggio situazione Italia (RNDT - Serie) - Versione 2.0 | RNDT - Repertorio Nazionale dei Dati Territoriali - Servizio di ricerca | https://github.com/pcm-dpc/COVID-19/blob/master/dati-regioni/dpc-covid19-ita-regioni.csv | https://creativecommons.org/licenses/by/4.0/ | 2021-12-10 | 2022-01-12 |
| 2:POP | DCIS_POPORESBIL1_12012022143315331.csv | Population of every Italian region in 2019. We only took the ‘popolazione inizio periodo’ row from the database. | Popolazione residente - bilancio | I.Stat | http://dati.istat.it/viewhtml.aspx?il=blank&vh=0000&vf=0&vcq=1100&graph=0&view-metadata=1&lang=it&QueryId=18461&metadata=DCIS_POPORESBIL1 | https://creativecommons.org/licenses/by/3.0/it/ | 2022-02-08 | |
| 3:PM10 | DataExtract.csv | Mean PM10 level in 2019 for every Italian station that measures it. We averaged the values of all stations in the same region to obtain a single value per region. | Air quality annual statistics calculated by the EEA | Air Quality e-Reporting (AQ e-Reporting) | https://discomap.eea.europa.eu/App/AirQualityStatistics/index.html?Country=Italy&AirPollutant=PM10&DataAggregationProcess=Annual%20mean%20/%201%20calendar%20year&ReportingYear=2019 | https://creativecommons.org/licenses/by/4.0/ | 2022-02-18 | 2022-03-10 |
| 4:SUP | DCCV_CARGEOMOR_ST_COM_27032022165808849.csv | The total land area of every Italian region, calculated at the beginning of 2020. We used it, together with the population data, to calculate the population density. | Superfici territoriali | I.Stat | http://dati.istat.it/Index.aspx?DataSetCode=DCCV_CARGEOMOR_ST_COM# | https://creativecommons.org/licenses/by/3.0/it/ | 2022-03-27 | |
| 5:AGE | DCIS_INDDEMOG1_28032022142732546.csv | The mean age of the population of every Italian region in 2019. | Indicatori demografici | I.Stat | http://dati.istat.it/Index.aspx?DataSetCode=DCIS_INDDEMOG1# | https://creativecommons.org/licenses/by/3.0/it/ | 2022-03-28 | |
| 6:HOSP | C_17_dataset_68_0_upFileUTF8CODREG.csv | The list of hospitals in Italy in 2019. We grouped and counted them by region. Information about Valle d'Aosta, Trentino Alto Adige, Molise and Abruzzo is missing from the original dataset. | Aziende Ospedaliere, Aziende Ospedaliere Universitarie e IRCCS pubblici (anche costituiti in fondazione) | Open Data Ministero della Salute | http://www.dati.salute.gov.it/dataset/aziende_ospedaliere_e_aziende_ospedaliere_universitarie.jsp | https://www.dati.gov.it/content/italian-open-data-license-v20 | 2019-12-31 | 2022-05-12 |
| 7:TEMP | Tavole_dati_meteo_2019_capoluoghi-provincia.xlsx | The tables show information about temperatures and rainfall in 2019 for every provincial capital in the Italian regions. We only took the temperature data and grouped it by region. | TEMPERATURA E PRECIPITAZIONE NELLE CITTÀ CAPOLUOGO DI PROVINCIA | Istat | https://www.istat.it/it/files//2020/12/Tavole_dati_meteo_2019_capoluoghi-provincia.xlsx | https://creativecommons.org/licenses/by/4.0/ | 2022-05-15 | |
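The per-region aggregation described for 3:PM10 (one value per region, obtained by averaging the stations that fall in it) can be sketched roughly as follows. This is only an illustration using the Python standard library, not the project's actual script: the station names and values are made up, the function name `regional_mean` is hypothetical, and it assumes every station has already been assigned to a region.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical station-level records: (region, station name, annual mean PM10).
# These values are invented for illustration, not taken from the EEA extract.
stations = [
    ("Lombardia", "Station A", 30.0),
    ("Lombardia", "Station B", 34.0),
    ("Lazio", "Station C", 24.0),
    ("Lazio", "Station D", 26.0),
    ("Lazio", "Station E", 28.0),
]


def regional_mean(records):
    """Average the station values that belong to the same region."""
    by_region = defaultdict(list)
    for region, _station, value in records:
        by_region[region].append(value)
    return {region: mean(values) for region, values in by_region.items()}


pm10_by_region = regional_mean(stations)
print(pm10_by_region)
```

The same group-then-aggregate pattern would apply to counting hospitals per region (6:HOSP), replacing the mean with a count.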
0:VIR:
- Formats: we decided to publish the data in CSV, GeoJSON and RDF formats;
- Metadata: we paired the data with RDF metadata in DCAT-AP IT standard;
- Last update: 2022-06-02;
- Description: for each Italian region the dataset contains: region name, region ISTAT code, COVID-19 cases at the beginning of the pandemic, COVID-19 deaths at the beginning of the pandemic, COVID-19 cases at the beginning of the pandemic per 100,000 people, COVID-19 deaths at the beginning of the pandemic per 100,000 people, mean PM10 level in 2019, population density in 2019, average age of the population in 2019, average temperature in 2019, number of hospitals in 2019;
- Methodology: we manipulated and merged the data coming from the previously described datasets and followed the Italian guidelines for the enhancement of public information assets, pairing our merged data with the appropriate metadata about both the original and mashup datasets.
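As a rough sketch of the derived fields listed in the description, the snippet below joins per-region inputs and computes the rates per 100,000 people and the population density. It uses only the standard library, all names (`per_100k`, `density`, `build_mashup`, the dictionary keys) are hypothetical, and "Region X" with its numbers is a made-up example rather than real data.

```python
def per_100k(count, population):
    """Rate per 100,000 inhabitants."""
    return count * 100_000 / population


def density(population, area_km2):
    """Inhabitants per square kilometre."""
    return population / area_km2


def build_mashup(covid, population, area_km2):
    """Join the per-region inputs on the region name and add derived fields."""
    rows = []
    for region in sorted(covid):
        pop = population[region]
        rows.append({
            "region": region,
            "cases": covid[region]["cases"],
            "deaths": covid[region]["deaths"],
            "cases_per_100k": per_100k(covid[region]["cases"], pop),
            "deaths_per_100k": per_100k(covid[region]["deaths"], pop),
            "density": density(pop, area_km2[region]),
        })
    return rows


# Made-up inputs for a single fictional region.
covid = {"Region X": {"cases": 5000, "deaths": 500}}
population = {"Region X": 1_000_000}
area_km2 = {"Region X": 10_000.0}

mashup = build_mashup(covid, population, area_km2)
print(mashup[0])
```

Joining on the region name (or, more robustly, on the ISTAT region code) assumes that every source spells its key the same way, which in practice may require a cleaning step first.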
The initial datasets have been subjected to a quality analysis, following the principles indicated in the "linee guida nazionali per la valorizzazione del patrimonio informativo pubblico":
| ID | Completeness | Accuracy | Coherence | Currentness |
|---|---|---|---|---|
| 1:COVID | Yes | Yes | Yes | Yes |
| 2:POP | Yes | Yes | Yes | Yes |
| 3:PM10 | No | Yes | Yes | Yes |
| 4:SUP | Yes | Yes | Yes | No |
| 5:AGE | Yes | Yes | Yes | No |
| 6:HOSP | No | Yes | Yes | No |
| 7:TEMP | No | Yes | Yes | Yes |
Below we justify our assessments for every dataset:
- 1:COVID - The dataset provides plenty of information broken down by date, region code, latitude/longitude, and many other details regarding the patients' conditions, so it contains far more than we actually needed to extract. The whole dataset is updated to 2022, but we considered only the period between February and May 2020, because we wanted to focus on the first months of COVID-19's spread.
- 2:POP - Provided by ISTAT, this dataset, like the previous one, is constantly updated, and it doesn't show any issue related to the aspects considered in the table above.
- 3:PM10 - Some fields, such as "city", "city code" and "city population", are missing in almost every row, but the "air quality station name" often compensates for this lack. It's updated to 2022.
- 4:SUP - Updated to the beginning of 2020, this dataset provides the total surface area of every Italian region, expressed in hectares (ha) and square kilometers (km²).
- 5:AGE - Updated to 2021, the dataset features information about life expectancy in Italy by age group, and the percentage of the population in each of those groups.
- 6:HOSP - The dataset covers only the period between 2010 and 2017; in addition, it lacks data about some regions (the ones most often absent are Molise, Basilicata, Valle d'Aosta and Trentino-Alto Adige). That said, the information it does contain shows no accuracy or coherence issues.
- 7:TEMP - Updated to 2019, this dataset deals only with provincial capital municipalities and regional capital municipalities (comuni capoluogo di provincia e comuni capoluogo di regione). However, it is exhaustive in every other aspect.
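One way to check the regional side of completeness mechanically is to compare the regions present in a dataset against the full list of 20 Italian regions. The sketch below is an assumption about how such a check could look, not the procedure we followed: `missing_regions` is a hypothetical helper, the sample input is invented, and it assumes region names are spelled exactly as in the reference set.

```python
# The 20 Italian regions, using their Italian names.
ITALIAN_REGIONS = frozenset({
    "Abruzzo", "Basilicata", "Calabria", "Campania", "Emilia-Romagna",
    "Friuli-Venezia Giulia", "Lazio", "Liguria", "Lombardia", "Marche",
    "Molise", "Piemonte", "Puglia", "Sardegna", "Sicilia", "Toscana",
    "Trentino-Alto Adige", "Umbria", "Valle d'Aosta", "Veneto",
})


def missing_regions(present):
    """Return, in alphabetical order, the regions absent from a dataset."""
    return sorted(ITALIAN_REGIONS - set(present))


# Invented example: a dataset that covers every region except two.
sample = ITALIAN_REGIONS - {"Molise", "Valle d'Aosta"}
print(missing_regions(sample))
```

A dataset would count as regionally complete when the returned list is empty.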
- 1:COVID - Provided by the RNDT (Repertorio Nazionale dei Dati Territoriali), this dataset is released under a "Creative Commons Attribution 4.0 International (CC BY 4.0)" license: anyone can host, modify and share its information, even for commercial use, provided the source is credited.
- 2:POP - This dataset, provided by ISTAT, is released under a "Creative Commons – Attribution – version 3.0" license. Hence, data is available for reproduction, distribution and broadcasting, and no permission is needed to create hypertext links to the site itself. The only condition is that the source is cited.
- 3:PM10 - This dataset is under the "EEA (European Environment Agency) standard re-use policy": its content is freely available for both commercial and non-commercial use, as long as the source is acknowledged. This is possible thanks to "Discomap", a website that allows the re-use of map services created by developers and GIS (Geographic Information System) experts.
- 4:SUP - Same as the 2:POP dataset.
- 5:AGE - Same as the 2:POP dataset.
- 6:HOSP - Provided by the website of the Ministry of Health, this dataset is covered by the "Italian Open Data Licence 2.0": data under this license can be freely downloaded, consulted and shared. In addition, users may merge this data with further information in order to obtain a mashup for a product or an application. The only requirement is to mention the source and to include, if possible, a link to the license.
- 7:TEMP - Same as the 2:POP dataset.
1:COVID:
- Formats: CSV;
- Provenance: the COVID-19 GitHub repository made available by the Italian Protezione Civile;
- Metadata: we first found the metadata available in the same repository (this XML file), which followed not the DCAT-AP standard but the RNDT standard, used for territorial data in Italy.
  Then, in this md file describing the dataset, we found this link to the metadata on the geodati.gov.it site. Although it is not possible to access the actual RDF file of this metadata, we realized that it was more up to date than the GitHub one, so the GitHub XML file contained outdated metadata.
  Doing more research, we found this manual, which explains that territorial data that is also open can be “translated” to the DCAT-AP IT standard through the GeoDCAT-AP IT specification, and that the version of the metadata in the DCAT-AP IT standard can be found at dati.gov.it.
  So we searched for the dataset in the dati.gov.it catalogue and found this; once again, it was not possible to access the actual metadata file in RDF. Browsing the site more carefully, we found this page and finally the download URL of the RDF metadata of all the datasets in the geodati.gov.it catalogue, including the COVID-19 one.
  In conclusion, finding the right metadata for a government dataset should not require such a long and complicated search: it should have been made available in the GitHub repository;
2:POP:
- Formats: Excel, CSV, PC-Axis, SDMX. There is no real download URI for any of the formats;
- Provenance: I.Stat, the ISTAT database;
- Metadata: there is some metadata next to the data presented in the tool, but it follows no official standard. So we created it ourselves, gathering information from the ISTAT site and following the DCAT-AP IT standard;
3:PM10:
- Formats: CSV, TSV, JSON. There is no real download URI for any format;
- Provenance: Air Quality e-Reporting by the European Environment Agency;
- Metadata: there is some metadata, but it doesn’t follow the DCAT-AP standard and there is no download URL. So we decided to translate the information into the DCAT-AP IT standard;
4:SUP:
- Formats: Excel, CSV, PC-Axis, SDMX. There is no real download URI for any format;
- Provenance: I.Stat, the ISTAT database;
- Metadata: there is some metadata next to the data presented in the tool, but it follows no official standard. So we created it ourselves, gathering information from the ISTAT site and following the DCAT-AP IT standard;
5:AGE:
- Formats: Excel, CSV, PC-Axis, SDMX. There is no real download URI for any format;
- Provenance: I.Stat, the ISTAT database;
- Metadata: there is some metadata next to the data presented in the tool, but it follows no official standard. So we created it ourselves, gathering information from the ISTAT site and following the DCAT-AP IT standard;
6:HOSP:
- Formats: CSV;
- Provenance: the Open Data Ministero della Salute catalogue site;
- Metadata: we found the metadata in the same way as for the 1:COVID dataset: on this dati.gov.it page, where we found this link containing the RDF metadata in the DCAT-AP IT standard for the catalogue and the dataset that we used;
7:TEMP:
- Formats: XLSX;
- Provenance: this Istat archive page;
- Metadata: there is no RDF metadata, just these PDF files: tables index, methodology note, glossary. So we created it ourselves, gathering information from the ISTAT site and following the DCAT-AP IT standard.
The Virulence dataset contains information concerning the factors that could have impacted the first wave of the COVID-19 pandemic in the Italian regions. The catalog was created for the Open Access and Digital Ethics course at the University of Bologna and will not be actively maintained in the future. However, all of our datasets and scripts are openly available under the CC-BY-4.0 license here on GitHub to be used, reproduced and updated.
In this section we present how we decided to visualize the data and metadata that compose our project, and the external resources used to do so.
Some libraries that we used throughout the development of the website:
- Bootstrap 5, the world’s most popular front-end open source toolkit;
- jQuery, a JavaScript library designed to simplify HTML DOM tree traversal and manipulation, event handling, and more;
- FontAwesome, Internet's icon library and toolkit;
- pandas, a fast, powerful, flexible and easy to use open source data analysis and manipulation tool for Python;
- rdflib, a pure Python package for working with RDF;
- urllib, a package that collects several modules for working with URLs.
We decided to visually represent the data we gathered in a Visualization page. There, the user can explore the datasets with an interactive map and graph.
For the Map section we used the following libraries and data:
- Leaflet, an open-source JavaScript library for mobile-friendly interactive maps, for the map interactions;
- OpenStreetMap, for the map data;
- Mapbox, for the map style;
- Openpolis geojson-italy repository, in particular the GeoJSON file of the Italian regions, based on ISTAT data (both under the CC BY 4.0 license);
- we loosely followed this Leaflet tutorial.
For the Graph section we used Chart.js, a library for simple, clean and engaging HTML5 based JavaScript charts.
It was difficult to try our hand at this field, since many different factors needed to be considered in order to have even a vague idea of the work done by the Italian public authorities in the last two years. On the other hand, having so much material available made the research easier, though still challenging. It wasn't always possible to obtain exhaustive details about metadata, and the datasets needed some cleaning before they could actually be used. Nonetheless, the website is able to provide a broad view of the Italian situation between 2019 and 2020. Although it can certainly be improved, the project has been completed in full compliance with web standards and the intellectual property of data, and we are quite happy and proud of the result.