Developed by
This project presents the development of a complete ETL (Extract, Transform, Load) pipeline designed to process, analyze, and deliver insights on leukemia-related risk factors across global regions. It uses a dataset extracted from Kaggle containing leukemia-related health data, which comprises 143,194 patient records from 22 different countries.
The technologies used are:
- Python → Core programming language used throughout the project for data extraction, transformation, validation, and integration tasks.
- Jupyter Notebook → For the execution and documentation of Python code.
- PostgreSQL → Used as the central DB for storing the processed data in a dimensional model optimized for analytical queries.
- Airflow → Served as the orchestrator of the ETL pipeline, managing task dependencies and automating the data flow from extraction to loading.
- Power BI Desktop → Used to design and publish interactive dashboards that visualize key leukemia risk indicators and regional trends.
- Great Expectations → Integrated into the pipeline to enforce data quality through expectations on schema, null values, ranges, and distributions (a minimal sketch follows this list).
- Kafka → Implemented to enable real-time data streaming, allowing a live feed of key leukemia metrics to be consumed and visualized dynamically.
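As an illustration of the Great Expectations checks mentioned above, here is a minimal sketch using the classic (pre-1.0) pandas-dataset API; the column names come from the dataset description below, while the file path and thresholds are assumptions:

```python
import great_expectations as ge
import pandas as pd

# Wrap a pandas DataFrame so expectation methods become available
# (hypothetical path inside the repo's data/ folder)
df = ge.from_pandas(pd.read_csv("data/leukemia_data.csv"))

# Schema and null checks
df.expect_column_to_exist("patient_id")
df.expect_column_values_to_not_be_null("age")

# Range and categorical checks (thresholds are illustrative)
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
df.expect_column_values_to_be_in_set("leukemia_status", ["Positive", "Negative"])

# Run every expectation recorded above and report the overall outcome
results = df.validate()
print(results.success)
```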
The Python dependencies used throughout the project are listed in a requirements.txt file.
The dataset contains the following columns:
- patient_id: an auto-incremented patient identifier.
- age: patient's age.
- gender: the patient's gender, which could be Female or Male.
- country: the patient's country.
- wbc_count: white blood cell count.
- rbc_count: red blood cell count.
- platelet_count: the patient's platelet count.
- hemoglobin_level: the patient's hemoglobin level.
- bone_marrow_blasts: the measured level of blasts in the bone marrow.
- genetic_mutation: indicates whether the patient has a genetic mutation, set with “yes” or “no”.
- family_history: indicates whether the patient has a family history of leukemia, set with “yes” or “no”.
- smoking_status: indicates whether the patient smokes or has smoked, set with “yes” or “no”.
- alcohol_consumption: indicates whether the patient consumes or has consumed alcohol, set with “yes” or “no”.
- radiation_exposure: indicates whether the patient has been exposed to radiation, set with “yes” or “no”.
- infection_history: indicates whether the patient has a history of infections, set with “yes” or “no”.
- BMI: the patient's body mass index.
- chronic_illness: indicates whether the patient has a chronic disease, set with “yes” or “no”.
- immune_disorders: indicates whether the patient has an immune disorder, set with “yes” or “no”.
- ethnicity: the patient's ethnicity, encoded as “A”, “B”, or “C”.
- socioeconomic_status: the patient's socioeconomic status, set as “Low”, “Medium”, or “High”.
- urban_rural: indicates whether the patient lives in a rural or urban area.
- leukemia_status: indicates whether the patient has leukemia, set as “Negative” or “Positive”.
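To get a quick feel for these fields, here is a minimal pandas sketch that loads the dataset and normalizes the yes/no columns; the file path is an assumption:

```python
import pandas as pd

# Hypothetical location of the Kaggle extract inside the repo's data/ folder
df = pd.read_csv("data/leukemia_data.csv")

# The binary risk-factor columns are stored as "yes"/"no" strings;
# map them to booleans for easier aggregation
yes_no_cols = [
    "genetic_mutation", "family_history", "smoking_status",
    "alcohol_consumption", "radiation_exposure", "infection_history",
    "chronic_illness", "immune_disorders",
]
df[yes_no_cols] = df[yes_no_cols].apply(lambda s: s.str.lower().eq("yes"))

print(df.shape)  # expected: 143,194 rows
print(df["leukemia_status"].value_counts())
```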
The project is organized as follows:
Leukemia-Cancer-Risk-ETL/
├── airflow/ # Airflow-related files
│ ├── dags/
│ │ ├── dag_etl.py # Airflow DAG
│ ├── functions/ # Folder with all the functions used in the DAG
├── api/ # API data extraction and EDA
├── dashboard/
├── data/ # Stored data
├── kafka/ # Scripts for the producer and the consumer in the streaming
├── notebooks/ # Jupyter notebooks
├── tests/ # Unit tests of the transformations
├── venv/ # Virtual environment
├── .gitignore # Git ignore file
├── docker-compose.yml # Docker Compose file used to run Kafka
└── requirements.txt # Project dependencies
Execute the following command to clone the repository:
git clone https://github.com/valentinabc19/Leukemia-Cancer-Risk-ETL.git
From this point on, all steps are performed in Visual Studio Code.
Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

To connect to the database, you must have the database credentials in a JSON file called credentials. Create this file in the project folder with the following syntax:
{
  "db_host": "DB_HOST",
  "db_name": "DB_NAME",
  "db_user": "DB_USER",
  "db_password": "DB_PASSWORD",
  "db_port": "DB_PORT"
}

Ensure this file is included in .gitignore.
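For reference, here is a minimal sketch of how these credentials might be loaded and used in Python, assuming psycopg2 as the PostgreSQL driver (check requirements.txt for the actual one, and adjust the filename if the project stores it without an extension):

```python
import json
import psycopg2

# Load the credentials file described above (filename is an assumption)
with open("credentials.json") as f:
    creds = json.load(f)

# Open a PostgreSQL connection with the loaded values
conn = psycopg2.connect(
    host=creds["db_host"],
    dbname=creds["db_name"],
    user=creds["db_user"],
    password=creds["db_password"],
    port=creds["db_port"],
)

# Quick sanity check that the connection works
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()
```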
Before running the pipeline, make sure the target database is properly created and configured, as it will serve as the destination for storing the processed data. This step is essential to ensure smooth data loading and prevent connection or schema-related issues during execution.
The necessary dependencies are stored in a file named requirements.txt. To install them, run:
pip install -r requirements.txt

Set the AIRFLOW_HOME environment variable and initialize Airflow:

export AIRFLOW_HOME=$(pwd)/airflow
airflow db init
airflow webserver --port 8080
airflow scheduler

Open a terminal in Visual Studio Code and start Docker:
docker-compose up -d --build

Use this command to see the running containers:
docker ps

Select the ID of the Kafka container and open a bash shell inside it:
docker exec -it ID bash

Run this command to initialize the consumer:
kafka-console-consumer --bootstrap-server IDContainer:9092 --topic fact_table --from-beginning
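For reference, the producer side that feeds the fact_table topic could look like this minimal kafka-python sketch (the serialization format and sample row are assumptions; the actual scripts live in kafka/):

```python
import json
from kafka import KafkaProducer

# Serialize each record as JSON before sending it to the broker
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical row; the real producer streams records emitted by the ETL
row = {"patient_id": 1, "country": "Spain", "leukemia_status": "Positive"}
producer.send("fact_table", value=row)
producer.flush()  # make sure the message is actually delivered
```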
Run this command to initialize Airflow:

airflow standalone

Open your browser and go to http://localhost:8080.
Use the credentials given to you in the Airflow standalone runtime, for example:
- Username: airflow
- Password: airflow

Then:
- Locate the leukemia_etl DAG.
- Turn it On and click "Trigger DAG".
- Use the Airflow UI to track task status.
- Logs are available under airflow/logs/.
- The data sent to the consumer is, in turn, forwarded to a Streamlit dashboard, which can be accessed at http://localhost:8501.
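A hedged sketch of how such a Streamlit page could consume the stream (the layout and consumer settings are assumptions; the actual app lives in dashboard/):

```python
import json
import streamlit as st
from kafka import KafkaConsumer

st.title("Live leukemia metrics")

# Read messages from the beginning of the fact_table topic
consumer = KafkaConsumer(
    "fact_table",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop polling after 5 s of silence
)

rows = [msg.value for msg in consumer]
if rows:
    st.dataframe(rows)  # render the latest batch as a table
else:
    st.info("No messages received yet.")
```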
The leukemia_etl DAG includes:
- extract_leukemia_op
- extract_api_op
- process_api_op
- merge_op
- transform_op
- validate_op
- load_op
- kafka_op
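Here is a skeletal view of how these tasks might be wired in airflow/dags/dag_etl.py; the callables and dependency order are plausible assumptions, not the verified graph:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callable; the real implementations live in airflow/functions/
def _noop(**kwargs):
    pass

with DAG(
    dag_id="leukemia_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered manually from the UI
    catchup=False,
) as dag:
    task_ids = [
        "extract_leukemia_op", "extract_api_op", "process_api_op", "merge_op",
        "transform_op", "validate_op", "load_op", "kafka_op",
    ]
    tasks = {t: PythonOperator(task_id=t, python_callable=_noop) for t in task_ids}

    # One plausible ordering: both extractions feed the merge, then the
    # data is transformed, validated, loaded, and finally streamed to Kafka
    tasks["extract_api_op"] >> tasks["process_api_op"]
    [tasks["extract_leukemia_op"], tasks["process_api_op"]] >> tasks["merge_op"]
    tasks["merge_op"] >> tasks["transform_op"] >> tasks["validate_op"]
    tasks["validate_op"] >> tasks["load_op"] >> tasks["kafka_op"]
```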