chiarasharp/virulence

Virulence

Table of Contents

  1. Introduction
  2. Scenario
  3. Original Datasets and Mashup
    1. Original Datasets
    2. Mashup Dataset
  4. Data Analysis
    1. Quality Analysis
    2. Legal Analysis
    3. Technical Analysis
    4. Sustainability
  5. Website and Data Visualization
  6. Conclusion

Introduction

The goal of this website is to explain, in a comprehensible way, how COVID-19 evolved in Italy during the first wave of the pandemic (February to May 2020). We focus on the causes that made our country one of the most affected by the pandemic and on how these causes influenced each other, all by looking at 2019 data. We took into consideration many aspects of the virus that were probably underestimated or ignored at first, in order to give people a clear idea of what COVID-19 is and which countermeasures could be adopted to deal with it. Virulence (the name of the project) aims to tell this story through legal and ethical analysis, metadata exploitation and dataset cleaning, in order to provide a neat mashup of all the information sources gathered during the research. For convenience, the work has been split into 4 parts:

  • Metadata, featuring the metadata of the Virulence dataset and of the datasets from the primary sources;
  • Visualization, where data results are represented through comprehensible maps and graphs;
  • Documentation, where every step of the research is explained and where we analyze the collected data;
  • License, the license of our project.

Scenario

Why did COVID-19 spread so rapidly in Italy, giving us one of the highest death rates in the world? To answer, we needed to look at the conditions that made it possible, gathering information from data and articles from multiple sources. The most relevant events often originate from the most unexpected causes, and we will have the chance to look at them carefully.

We now discuss the aspects taken into consideration during the research. The following factors were considered for all 20 Italian regions:

  • Air pollution due to PM10;
  • Temperatures;
  • Density of population;
  • Age of population;
  • Number of hospitals.

All these factors were considered only for 2019, that is, shortly before the official appearance of COVID-19. We did so to get an idea of how predisposed Italy actually was to an extensive spread of the virus.

Original Datasets and Mashup

Original Datasets

The following datasets were used for our project. In the next sections we analyze them from various points of view.

| ID | FILE | DESCRIPTION | DATASET | CATALOGUE | URI | LICENSE | LAST UPDATE / DOWNLOADED |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1:COVID | dpc-covid-19-ita-regioni.csv | COVID-19 data for every Italian region. We took the cases and deaths of the first pandemic wave (February to May 2020). | COVID-19 Monitoraggio situazione Italia (RNDT - Serie) - Versione 2.0 | RNDT - Repertorio Nazionale dei Dati Territoriali - Servizio di ricerca | https://github.com/pcm-dpc/COVID-19/blob/master/dati-regioni/dpc-covid19-ita-regioni.csv | https://creativecommons.org/licenses/by/4.0/ | December 10, 2021 / January 12, 2022 |
| 2:POP | DCIS_POPORESBIL1_12012022143315331.csv | Population of every Italian region in 2019. We only took the ‘popolazione inizio periodo’ row from the database. | Popolazione residente - bilancio | I.Stat | http://dati.istat.it/viewhtml.aspx?il=blank&vh=0000&vf=0&vcq=1100&graph=0&view-metadata=1&lang=it&QueryId=18461&metadata=DCIS_POPORESBIL1 | https://creativecommons.org/licenses/by/3.0/it/ | February 8, 2022 |
| 3:PM10 | DataExtract.csv | Mean PM10 level in 2019 for every Italian station that measures it. We averaged the values of all the stations in the same region to obtain a single value for every Italian region. | Air quality annual statistics calculated by the EEA | Air Quality e-Reporting (AQ e-Reporting) | https://discomap.eea.europa.eu/App/AirQualityStatistics/index.html?Country=Italy&AirPollutant=PM10&DataAggregationProcess=Annual%20mean%20/%201%20calendar%20year&ReportingYear=2019 | https://creativecommons.org/licenses/by/4.0/ | February 18, 2022 / March 10, 2022 |
| 4:SUP | DCCV_CARGEOMOR_ST_COM_27032022165808849.csv | The total land area of every Italian region, calculated at the beginning of 2020. We used it, together with the population data, to calculate the population density. | Superfici territoriali | I.Stat | http://dati.istat.it/Index.aspx?DataSetCode=DCCV_CARGEOMOR_ST_COM# | https://creativecommons.org/licenses/by/3.0/it/ | March 27, 2022 |
| 5:AGE | DCIS_INDDEMOG1_28032022142732546.csv | The mean age of the population of every Italian region in 2019. | Indicatori demografici | I.Stat | http://dati.istat.it/Index.aspx?DataSetCode=DCIS_INDDEMOG1# | https://creativecommons.org/licenses/by/3.0/it/ | March 28, 2022 |
| 6:HOSP | C_17_dataset_68_0_upFileUTF8CODREG.csv | The list of hospitals in Italy in 2019. We grouped and counted them by region. Information about Valle d'Aosta, Trentino Alto Adige, Molise and Abruzzo is missing from the original dataset. | Aziende Ospedaliere, Aziende Ospedaliere Universitarie e IRCCS pubblici (anche costituiti in fondazione) | Open Data Ministero della Salute | http://www.dati.salute.gov.it/dataset/aziende_ospedaliere_e_aziende_ospedaliere_universitarie.jsp | https://www.dati.gov.it/content/italian-open-data-license-v20 | 2019-12-31 / 2022-05-12 |
| 7:TEMP | Tavole_dati_meteo_2019_capoluoghi-provincia.xlsx | The tables show temperatures and rainfall in 2019 for every provincial capital of the Italian regions. We only took the temperature information and aggregated it by region. | TEMPERATURA E PRECIPITAZIONE NELLE CITTÀ CAPOLUOGO DI PROVINCIA | Istat | https://www.istat.it/it/files//2020/12/Tavole_dati_meteo_2019_capoluoghi-provincia.xlsx | https://creativecommons.org/licenses/by/4.0/ | 2022-05-15 |
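
Several of the descriptions above come down to the same operation: collapsing many rows into one value per region (a mean for the PM10 stations, a count for the hospitals). The following is a minimal sketch of that idea using only the Python standard library. The project lists pandas among its tools, so this is not the project's actual script; the helper names `mean_by_region` and `count_by_region` and the sample readings are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean


def mean_by_region(rows, region_key, value_key):
    """Average value_key over all rows that share the same region."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[region_key]].append(float(row[value_key]))
    return {region: mean(values) for region, values in grouped.items()}


def count_by_region(rows, region_key):
    """Count how many rows fall in each region."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[region_key]] += 1
    return dict(counts)


# Hypothetical station readings (not real EEA values).
stations = [
    {"region": "Lombardia", "pm10": "30.0"},
    {"region": "Lombardia", "pm10": "34.0"},
    {"region": "Lazio", "pm10": "25.0"},
]

pm10_by_region = mean_by_region(stations, "region", "pm10")
station_counts = count_by_region(stations, "region")
print(pm10_by_region)
print(station_counts)
```

The same two helpers would cover both cases: `mean_by_region` for a per-region average and `count_by_region` for a per-region tally.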

Mashup Dataset

0:VIR:

  • Formats: we decided to publish the data in CSV, GeoJSON and RDF format;
  • Metadata: we paired the data with RDF metadata in the DCAT-AP IT standard;
  • Last update: 2022-06-02;
  • Description: for each Italian region the dataset contains: region name, region ISTAT code, COVID-19 cases at the beginning of the pandemic, COVID-19 deaths at the beginning of the pandemic, COVID-19 cases at the beginning of the pandemic per 100,000 people, COVID-19 deaths at the beginning of the pandemic per 100,000 people, mean PM10 level in 2019, population density in 2019, average age of the population in 2019, average temperature in 2019, number of hospitals in 2019;
  • Methodology: we manipulated and merged the data coming from the previously described datasets and followed the Italian guidelines for the enhancement of public information assets, pairing our merged data with the appropriate metadata about both the original and mashup datasets.
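
The derived columns in the description (cases and deaths per 100,000 people, population density) follow from simple ratios. Below is a minimal sketch of how one merged record could be built, assuming the inputs are keyed by ISTAT region code and the area is expressed in square kilometres. The function names, the region code 99 and all the numbers are hypothetical and are not taken from the project's data or scripts.

```python
def per_100k(count, population):
    """Scale an absolute count to a rate per 100,000 inhabitants."""
    return count * 100_000 / population


def population_density(population, area_km2):
    """Inhabitants per square kilometre."""
    return population / area_km2


def build_row(code, covid, population, area_km2):
    """Merge the per-region inputs (all keyed by ISTAT code) into one record."""
    cases, deaths = covid[code]
    pop = population[code]
    return {
        "istat_code": code,
        "cases": cases,
        "deaths": deaths,
        "cases_per_100k": per_100k(cases, pop),
        "deaths_per_100k": per_100k(deaths, pop),
        "density": population_density(pop, area_km2[code]),
    }


# Hypothetical inputs for a single made-up region with code 99.
covid = {99: (5_000, 250)}
population = {99: 2_000_000}
area_km2 = {99: 10_000.0}

row = build_row(99, covid, population, area_km2)
print(row)
```

Normalizing by population is what makes regions of very different sizes comparable, which is presumably why the mashup carries both the absolute and the per-100,000 figures.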

Data Analysis

Quality Analysis

The initial datasets underwent a quality analysis, following the principles indicated in the "linee guida nazionali per la valorizzazione del patrimonio informativo pubblico" (the Italian guidelines for the enhancement of public information assets):

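| ID | Completeness | Accuracy | Coherence | Currentness |
| --- | --- | --- | --- | --- |
| 1:COVID | Yes | Yes | Yes | Yes |
| 2:POP | Yes | Yes | Yes | Yes |
| 3:PM10 | No | Yes | Yes | Yes |
| 4:SUP | Yes | Yes | Yes | No |
| 5:AGE | Yes | Yes | Yes | No |
| 6:HOSP | No | Yes | Yes | No |
| 7:TEMP | No | Yes | Yes | Yes |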
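
A completeness judgement like the one in the first column can be partly checked mechanically, for example by listing the regions that never appear in a dataset or the rows with empty fields. The sketch below shows one possible way to do this with the standard library. The field names, the three-region list and the sample rows are hypothetical, and this is not the procedure the guidelines prescribe.

```python
def missing_regions(rows, region_key, expected):
    """Return the expected regions that never appear in the rows."""
    present = {row[region_key] for row in rows}
    return sorted(set(expected) - present)


def rows_with_empty_fields(rows, fields):
    """Return the indexes of rows where any of the given fields is blank."""
    return [
        index
        for index, row in enumerate(rows)
        if any(not (row.get(field) or "").strip() for field in fields)
    ]


# Hypothetical subset of regions and rows, for illustration only.
expected = ["Lazio", "Lombardia", "Molise"]
rows = [
    {"region": "Lombardia", "city": ""},
    {"region": "Lazio", "city": "Roma"},
]

absent = missing_regions(rows, "region", expected)
incomplete = rows_with_empty_fields(rows, ["city"])
print(absent, incomplete)
```

A non-empty result from either helper would be a reason to mark a dataset as incomplete, although the final assessment still needs human judgement about whether the gap matters.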

Below we justify our assessments for every dataset:

  • 1:COVID - The dataset shows plenty of information divided by date, region code, latitude/longitude, and many other details regarding the patients' conditions, so it contains far more than we needed to extract. The whole dataset is updated to 2022, but we considered only the period between February and May 2020, because we wanted to focus on the first months of COVID-19's spread.
  • 2:POP - Provided by ISTAT, this dataset, like the previous one, is constantly updated, and it doesn't show any issue related to the aspects considered in the table above.
  • 3:PM10 - Some fields such as "city", "city code" and "city population" are missing in almost every row, but the "air quality station name" often compensates for this lack. It's updated to 2022.
  • 4:SUP - Updated to the beginning of 2020, this dataset provides the total surface of every Italian region, expressed in hectares (ha) and square kilometers (km²).
  • 5:AGE - Updated to 2021, the dataset features information about life expectancy in Italy, depending on the age group, and the percentage of the same groups in the peninsula.
  • 6:HOSP - The dataset covers only the period between 2010 and 2017, and it lacks data about some regions (most notably Molise, Basilicata, Valle d'Aosta and Trentino-Alto Adige). The information it does contain shows no accuracy or coherence issues.
  • 7:TEMP - Updated to 2019, this dataset deals only with provincial capital municipalities and regional capital municipalities (comuni capoluogo di provincia e comuni capoluogo di regione). However, it is exhaustive in every other aspect.

Legal Analysis

  • 1:COVID - Provided by the RNDT (Repertorio Nazionale dei Dati Territoriali), this dataset is released under a "Creative Commons Attribution 4.0 International (CC BY 4.0)" license: anyone can host, modify and share its information, even for commercial use, provided the source is credited.
  • 2:POP - This dataset, provided by ISTAT, is released under a "Creative Commons – Attribution – version 3.0" license. Data can therefore be reproduced, distributed and broadcast, and no permission is needed to create hypertext links to the ISTAT site. The only condition is that the source is cited.
  • 3:PM10 - This dataset is under the "EEA (European Environment Agency) standard re-use policy": its content is freely available for both commercial and non-commercial use, as long as the source is acknowledged. This is possible thanks to "Discomap", a website that allows the re-use of map services created by developers and GIS (Geographic Information System) experts.
  • 4:SUP - Same as the 2:POP dataset.
  • 5:AGE - Same as the 2:POP dataset.
  • 6:HOSP - Provided by the website of the Ministry of Health, this dataset is covered by the "Italian Open Data Licence 2.0": data under this license can be freely downloaded, consulted and shared. Users can also merge this data with further information to obtain a mashup for a product or an application. The only requirement is to mention the source and to include, if possible, a link to the license.
  • 7:TEMP - Same as the 2:POP dataset.

Technical Analysis

  • 1:COVID:

    • Formats: CSV;

    • Provenance: the COVID-19 GitHub repository made available by the Italian Protezione Civile;

    • Metadata: we first found the metadata available in the same repository (this XML file), which follows not the DCAT-AP standard but the RNDT standard, used for territorial data in Italy.

      Then, in this md file describing the dataset, we found this link to the metadata on the geodati.gov.it site. Although the actual RDF file of this metadata cannot be accessed there, we realized that it was more up to date than the GitHub one, so the GitHub XML file contained outdated metadata.

      Doing more research we found this manual, which explains that territorial data that is also open can be “translated” to the DCAT-AP IT standard through the GeoDCAT-AP IT specification, and that the version of the metadata in the DCAT-AP IT standard can be found at dati.gov.it.

      So we searched for the dataset in the dati.gov.it catalogue and found this; once again, it was not possible to access the actual RDF metadata file. Browsing the site more carefully, we found this page and finally the download URL of the RDF metadata of all the datasets in the geodati.gov.it catalogue, including the COVID-19 one.

      In conclusion, finding the right metadata for a government dataset should not require such a long and complicated search: the metadata should have been made available in the GitHub repository;

  • 2:POP:

    • Formats: Excel, CSV, PC-Axis, SDMX. There is no real download URI for any of the formats;

    • Provenance: I.Stat, the ISTAT database;

    • Metadata: there is some metadata next to the data presented in the tool, but it follows no official standard. So we wrote it ourselves, gathering information from the ISTAT site and following the DCAT-AP IT standard;

  • 3:PM10:

    • Formats: CSV, TSV, JSON. There is no real download URI for any format;
    • Provenance: Air Quality e-Reporting by the European Environment Agency;
    • Metadata: there is some metadata but it doesn’t follow the DCAT-AP standard and there is no download URL. So we decided to translate the information into the DCAT-AP IT standard;
  • 4:SUP:

    • Formats: Excel, CSV, PC-Axis, SDMX. There is no real download URI for any format;

    • Provenance: I.Stat, the ISTAT database;

    • Metadata: there is some metadata next to the data presented in the tool, but it follows no official standard. So we wrote it ourselves, gathering information from the ISTAT site and following the DCAT-AP IT standard;

  • 5:AGE:

    • Formats: Excel, CSV, PC-Axis, SDMX. There is no real download URI for any format;

    • Provenance: I.Stat, the ISTAT database;

    • Metadata: there is some metadata next to the data presented in the tool, but it follows no official standard. So we wrote it ourselves, gathering information from the ISTAT site and following the DCAT-AP IT standard;

  • 6:HOSP:

    • Formats: CSV;

    • Provenance: the Open Data Ministero della Salute catalogue site;

    • Metadata: we found the metadata in the same way as for the 1:COVID dataset: on this dati.gov.it page, where we found this link to the RDF metadata in the DCAT-AP IT standard for the catalogue and the dataset that we used;

  • 7:TEMP:
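
Where a source offered no standard metadata, the items above say the description was written by hand following DCAT-AP IT. As a rough illustration of what such a record involves, the sketch below builds a plain DCAT description for 4:SUP and serializes it as JSON-LD with the standard library. It uses `dcat:accessURL` because the source has no direct download URI. The dataset identifier under example.org and the function name `describe_dataset` are hypothetical, only generic DCAT and Dublin Core terms are used, and a real DCAT-AP IT record needs additional mandatory properties that are not shown here. The project lists rdflib among its tools, so its own metadata was presumably produced differently.

```python
import json

# Namespace prefixes for the W3C DCAT vocabulary and Dublin Core terms.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def describe_dataset(uri, title, description, license_uri, access_url):
    """Build a minimal JSON-LD description of one dataset and its distribution."""
    return {
        "@context": CONTEXT,
        "@id": uri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dct:license": {"@id": license_uri},
            # accessURL points at a landing page rather than a file.
            "dcat:accessURL": {"@id": access_url},
        },
    }


metadata = describe_dataset(
    uri="https://example.org/virulence/dataset/sup",  # hypothetical identifier
    title="Superfici territoriali",
    description="Total land area of every Italian region at the beginning of 2020.",
    license_uri="https://creativecommons.org/licenses/by/3.0/it/",
    access_url="http://dati.istat.it/Index.aspx?DataSetCode=DCCV_CARGEOMOR_ST_COM#",
)

serialized = json.dumps(metadata, indent=2, ensure_ascii=False)
print(serialized)
```

JSON-LD is used here only to keep the sketch free of third-party dependencies; the same statements could be expressed in RDF/XML or Turtle.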

Sustainability

The Virulence datasets contain information concerning the factors that could have impacted the first wave of the COVID-19 pandemic in the Italian regions. The catalog was created for the Open Access and Digital Ethics course at the University of Bologna and will not be actively maintained in the future. However, all of our datasets and scripts are openly available under the CC-BY-4.0 license here on GitHub to be used, reproduced and updated.

Website and Data Visualization

In this section we present how we decided to visualize the data and metadata that compose our project, and the external resources used to do so.

Libraries that we used throughout the development of the website:

  • Bootstrap 5, the world’s most popular front-end open source toolkit;
  • jQuery, a JavaScript library designed to simplify HTML DOM tree traversal and manipulation, event handling, and more;
  • FontAwesome, Internet's icon library and toolkit;
  • pandas, a fast, powerful, flexible and easy to use open source data analysis and manipulation tool for Python;
  • rdflib, a pure Python package for working with RDF;
  • urllib, a package that collects several modules for working with URLs.

We decided to visually represent the data we gathered in a Visualization page. There, the user can explore the datasets with an interactive map and graph.

For the Map section we used the following libraries and data:

  • Leaflet, an open-source JavaScript library for mobile-friendly interactive maps, for the map interactions;
  • OpenStreetMap, for the map data;
  • Mapbox, for the map style;
  • Openpolis geojson-italy repository, in particular the GeoJSON file of the Italian regions, based on ISTAT data (both under CC BY 4.0 license);
  • we loosely followed this Leaflet tutorial.
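
To colour regions on a map, the per-region values have to be attached to the region shapes first. The sketch below shows one way to merge mashup values into the properties of each GeoJSON feature before handing the result to a map library. The property name `reg_istat_code_num`, the function name `attach_values`, the two made-up regions and their values are assumptions for illustration; the project's own pipeline may differ.

```python
import copy


def attach_values(feature_collection, values_by_code, code_property):
    """Return a copy of the GeoJSON with extra values merged into each feature."""
    enriched = copy.deepcopy(feature_collection)
    for feature in enriched["features"]:
        code = feature["properties"].get(code_property)
        # Regions without a matching record keep their original properties.
        feature["properties"].update(values_by_code.get(code, {}))
    return enriched


# Tiny stand-in for the regions file; geometry is omitted (null is valid GeoJSON).
regions = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": None,
         "properties": {"reg_istat_code_num": 99, "reg_name": "Exampleland"}},
        {"type": "Feature", "geometry": None,
         "properties": {"reg_istat_code_num": 98, "reg_name": "Otherland"}},
    ],
}
values = {99: {"cases_per_100k": 250.0, "pm10": 32.0}}

enriched = attach_values(regions, values, "reg_istat_code_num")
print(enriched["features"][0]["properties"])
```

Working on a deep copy leaves the original shapes untouched, so the same base file can be reused for several indicators.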

For the Graph section we used Chart.js, a library for simple, clean and engaging HTML5 based JavaScript charts.

Conclusion

It's been difficult to try our hand at this field, since many different factors needed to be considered in order to have even a vague idea of the work done by the Italian public authorities in the last 2 years. On the other hand, having so much material available made the research both easier and more challenging. It wasn't always possible to obtain exhaustive details about metadata, and the datasets needed some cleaning before their actual use. Nonetheless, the website is able to provide a broad view of the Italian situation between 2019 and 2020. Although it can certainly be improved, the project has been completed in full compliance with web standards and the intellectual property of data, and we are quite happy and proud of the result.

About

Project for the Open Access and Digital Ethics course at UniBo, academic year 21/22.
