Use Python, Pandas, Spark, etc. to demonstrate that correlation can be used as a basis for decision making.
This project consists of finding the correlation between the GDP (Gross Domestic Product) and social and economic indicators, such as population growth, fertility rates, investment in specific sectors, or prices.
The Hypothesis: It is assumed that there is a correlation between economic growth and indicators such as infant mortality and access to education, among others. We want to demonstrate the validity of this assumption based on available datasets.
To check the validity of this hypothesis, execute the notebooks in the following order:
- Data_load
- Data_normalization and outliers
- Data_filling
- Data clustering by countries
- Data clustering by indicators
- Data predictions
- Data sequencies
This will create a series of output DataFrames as .csv files.
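As a rough illustration of what the normalization and filling steps might look like, here is a minimal sketch in Pandas. It is not the notebooks' actual code: the column names (`country`, `year`, `gdp`), the sample values, the choice of per-country min-max scaling, and linear interpolation for gaps are all assumptions made for this example. Outlier handling is left out.

```python
import io

import pandas as pd

# Hypothetical long-format sample: one row per (country, year).
# The values are invented for illustration only.
RAW_CSV = """country,year,gdp
A,2000,100
A,2001,
A,2002,120
B,2000,50
B,2001,55
B,2002,60
"""


def min_max_normalize(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Scale `column` to [0, 1] within each country (assumed method)."""
    out = df.copy()
    grouped = out.groupby("country")[column]
    low = grouped.transform("min")
    high = grouped.transform("max")
    # A country with a constant series gives 0/0 here, i.e. NaN.
    out[column + "_norm"] = (out[column] - low) / (high - low)
    return out


def fill_gaps(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Linearly interpolate missing values within each country (assumed method)."""
    out = df.sort_values(["country", "year"]).copy()
    out[column] = out.groupby("country")[column].transform(
        lambda s: s.interpolate(limit_direction="both")
    )
    return out


raw = pd.read_csv(io.StringIO(RAW_CSV))
normalized = min_max_normalize(raw, "gdp")
result = fill_gaps(normalized, "gdp_norm")

# Each notebook writes its output DataFrame as .csv; here we only build the text.
csv_text = result.to_csv(index=False)
```

To write a real file instead of building the CSV text in memory, pass a path to `to_csv`.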
To study the correlation between the economic indicators and some socio-demographic indicators, the following indicators were chosen:
- GDP from 1850 to 2020, in pounds
- Infant mortality of children under 5 years old
- Percentage of population aged 15+ with tertiary schooling
- Fertility rate
- Gender inequality
- Life expectancy
I chose GDP as the measure of economic growth, so each indicator is compared with the GDP of the country.
I chose to extract the datasets for these indicators from the website Our World in Data.
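The comparison between GDP and the indicators can be sketched as follows. This is only an illustration under stated assumptions: the `Entity` and `Year` column names, the indicator column names, and all numeric values are hypothetical and are not taken from the real datasets. The sketch merges one table per indicator on country and year, then computes the Pearson correlation of each indicator with GDP.

```python
import pandas as pd

YEARS = [2000, 2005, 2010, 2015, 2020]

# Hypothetical per-indicator tables for a single made-up country "X".
gdp = pd.DataFrame(
    {"Entity": ["X"] * 5, "Year": YEARS,
     "gdp": [100.0, 120.0, 150.0, 170.0, 200.0]}
)
mortality = pd.DataFrame(
    {"Entity": ["X"] * 5, "Year": YEARS,
     "infant_mortality": [50.0, 45.0, 35.0, 30.0, 20.0]}
)
life = pd.DataFrame(
    {"Entity": ["X"] * 5, "Year": YEARS,
     "life_expectancy": [60.0, 62.0, 66.0, 67.0, 70.0]}
)

# Align the indicators on (country, year); the default inner join keeps
# only the rows that are present in every table.
keys = ["Entity", "Year"]
merged = gdp.merge(mortality, on=keys).merge(life, on=keys)

# Pearson correlation (the pandas default) of each indicator with GDP.
indicators = ["infant_mortality", "life_expectancy"]
correlations = merged[indicators].corrwith(merged["gdp"])
print(correlations)
```

With the invented values above, infant mortality moves against GDP and life expectancy moves with it, so the two coefficients have opposite signs. With several countries in the data, the same computation could be done per country by grouping on `Entity` first.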