Using the "Life Expectancy (WHO)" dataset, we explored the correlations among its 22 variables.
We also trained linear regression models to predict life expectancy (in years). The 22 variables are:
- Country
- Year
- Status: Developed or Developing status (for each country)
- Life Expectancy (in years)
- Adult Mortality: mortality rate for both sexes (probability of dying between ages 15 and 60, per 1000 population)
- infant deaths: Number of Infant Deaths per 1000 population
- Alcohol: recorded per capita (15+) consumption (in litres of pure alcohol)
- percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita (%)
- Hepatitis B: HepB immunization coverage among 1-year-olds (%)
- Measles: number of reported cases per 1000 population
- BMI: Average Body Mass Index of entire population
- under-five deaths: Number of under-five deaths per 1000 population
- Polio: Pol3 immunization coverage among 1-year-olds (%)
- Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)
- Diphtheria: diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
- HIV/AIDS: Deaths from HIV/AIDS per 1000 live births (ages 0-4)
- GDP: Gross Domestic Product per capita (in USD)
- Population
- thinness 1-19 years: Prevalence of thinness among children and adolescents for Age 10 to 19 (%)
- thinness 5-9 years: Prevalence of thinness among children for Age 5 to 9 (%)
- income composition of resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
- Schooling (in years)
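
As a rough illustration of the two analyses mentioned above (correlation and linear regression), here is a minimal sketch. It does not reproduce the project's notebook: the data is a small synthetic stand-in with made-up values, and only three of the column names listed above are used. With the real dataset you would load the CSV with `pd.read_csv` instead of building `df` by hand.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset; all values are made up for illustration.
rng = np.random.default_rng(0)
n = 200
schooling = rng.uniform(4, 18, n)
adult_mortality = rng.uniform(50, 400, n)
life_expectancy = 50 + 1.5 * schooling - 0.03 * adult_mortality + rng.normal(0, 1, n)
df = pd.DataFrame({
    "Schooling": schooling,
    "Adult Mortality": adult_mortality,
    "Life Expectancy": life_expectancy,
})

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()
print(corr["Life Expectancy"].sort_values(ascending=False))

# Fit a linear regression and evaluate it on a held-out split.
X = df[["Schooling", "Adult Mortality"]]
y = df["Life Expectancy"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print(f"R^2 on the held-out split: {r2:.3f}")
```
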
You can try this code yourself by opening Google Colab, choosing "File" > "Open notebook" > "GitHub", and entering the URL of this project. Then select the notebook file that is listed there.
When the code runs in a notebook environment, two files are generated: "fill_missing_gdp.csv" and "fill_missing_population.csv". You may fill them in with real data so that the code uses it instead of removing these records.
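
The fill-or-remove behaviour described above could look roughly like the sketch below. This is not the notebook's actual code: the helper `fill_from_lookup` is hypothetical, and the assumed file layout (one row per `Country` and `Year`, plus the column to fill) is a guess, since the real layout of the generated CSVs is not described here. The values are made up.

```python
import numpy as np
import pandas as pd

def fill_from_lookup(df, lookup, column, keys=("Country", "Year")):
    """Fill missing values of `column` from `lookup`, then drop rows still missing.

    Hypothetical helper; assumes `lookup` has the key columns plus `column`.
    """
    keys = list(keys)
    merged = df.merge(
        lookup[keys + [column]], on=keys, how="left", suffixes=("", "_filled")
    )
    # Prefer the original value; fall back to the hand-filled one.
    merged[column] = merged[column].fillna(merged[f"{column}_filled"])
    merged = merged.drop(columns=f"{column}_filled")
    # Records that nobody filled in are removed, as described above.
    return merged.dropna(subset=[column]).reset_index(drop=True)

df = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2014, 2015, 2015],
    "GDP": [1000.0, np.nan, np.nan],
})
# Stand-in for a hand-edited "fill_missing_gdp.csv"; only one of the two gaps was filled.
lookup = pd.DataFrame({
    "Country": ["A", "B"],
    "Year": [2015, 2015],
    "GDP": [1100.0, np.nan],
})

result = fill_from_lookup(df, lookup, "GDP")
print(result)
```

In real use, `lookup` would come from something like `pd.read_csv("fill_missing_gdp.csv")`, provided the file's columns match the assumption above.
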
Python, Pandas, Data Visualization (Matplotlib, Seaborn), Scikit-learn (for training and evaluation models)
- Understand why the model performs so well on the test data (there may be data leakage)
- Input missing data
- Search for better ways to treat data
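
One way to start on the first two items is sketched below. It rests on an assumption that has not been verified for this project: that the suspected leakage comes from rows of the same country landing in both the training and the test set. The sketch splits by country with scikit-learn's `GroupShuffleSplit` and places a `SimpleImputer` inside the pipeline so that imputation statistics are learned from training rows only. The data is synthetic and the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 20 made-up countries with 10 yearly rows each.
rng = np.random.default_rng(0)
countries = np.repeat([f"country_{i}" for i in range(20)], 10)
schooling = rng.uniform(4, 18, countries.size)
gdp = rng.uniform(500, 50000, countries.size)
life_expectancy = 50 + 1.5 * schooling + rng.normal(0, 1, countries.size)
gdp[rng.random(countries.size) < 0.1] = np.nan  # simulate missing GDP values

df = pd.DataFrame({
    "Country": countries,
    "Schooling": schooling,
    "GDP": gdp,
    "Life Expectancy": life_expectancy,
})
X = df[["Schooling", "GDP"]]
y = df["Life Expectancy"]

# Keep every row of a given country on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=df["Country"]))

# The imputer sits inside the pipeline, so medians come from training rows only.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(X.iloc[train_idx], y.iloc[train_idx])
score = model.score(X.iloc[test_idx], y.iloc[test_idx])
print(f"R^2 on unseen countries: {score:.3f}")

train_countries = set(df["Country"].iloc[train_idx])
test_countries = set(df["Country"].iloc[test_idx])
```

If the score on unseen countries drops well below the score from a plain random split, that would support the leakage hypothesis; if it does not, the cause likely lies elsewhere.
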
This project was initially developed during ADA's Data Science Path course - Statistics II module, along with collaborators.
