Developed by
This project presents the development of a complete ETL (Extract, Transform, Load) pipeline designed to process, analyze, and deliver insights on leukemia-related risk factors across global regions. It uses a dataset extracted from Kaggle containing leukemia-related health data, which comprises 143,194 patient records from 22 different countries.
The technologies used are:
- Python → Core programming language used throughout the project for data extraction, transformation, validation, and integration tasks.
- Jupyter Notebook → For the execution and documentation of Python code.
- PostgreSQL → Used as the central DB for storing the processed data in a dimensional model optimized for analytical queries.
- Airflow → Served as the orchestrator of the ETL pipeline, managing task dependencies and automating the data flow from extraction to loading.
- Power BI Desktop → Used to design and publish interactive dashboards that visualize key leukemia risk indicators and regional trends.
- Great Expectations → Integrated into the pipeline to enforce data quality through expectations on schema, null values, ranges, and distributions (a minimal sketch follows this list).
- Kafka → Implemented to enable real-time data streaming, allowing a live feed of key leukemia metrics to be consumed and visualized dynamically.
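As an illustration of the Great Expectations checks mentioned above, here is a minimal sketch using the classic (pre-1.0) pandas-dataset API; the column names come from the dataset description below, while the file path and thresholds are assumptions:

```python
import great_expectations as ge
import pandas as pd

# Wrap a pandas DataFrame so expectation methods become available
# (hypothetical path inside the repo's data/ folder)
df = ge.from_pandas(pd.read_csv("data/leukemia_data.csv"))

# Schema and null checks
df.expect_column_to_exist("patient_id")
df.expect_column_values_to_not_be_null("age")

# Range and categorical checks (thresholds are illustrative)
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
df.expect_column_values_to_be_in_set("leukemia_status", ["Positive", "Negative"])

# Run every expectation recorded above and report the overall outcome
results = df.validate()
print(results.success)
```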
The Python dependencies used throughout the project are listed in a requirements.txt file.
The dataset contains the following columns:
- patient_id: an auto-incremented patient identifier.
- age: patient's age.
- gender: the patient's gender, which could be Female or Male.
- country: the patient's country.
- wbc_count: white blood cell count.
- rbc_count: red blood cell count.
- platelet_count: the patient's platelet count.
- hemoglobin_level: the patient's hemoglobin level.
- bone_marrow_blasts: the measured level of blasts in the bone marrow.
- genetic_mutation: indicates whether the patient has a genetic mutation, set with “yes” or “no”.
- family_history: indicates whether the patient has a family history of leukemia, set with “yes” or “no”.
- smoking_status: indicates whether the patient smokes or has smoked, set with “yes” or “no”.
- alcohol_consumption: indicates whether the patient consumes or has consumed alcohol, set with “yes” or “no”.
- radiation_exposure: indicates whether the patient has been exposed to radiation, set with “yes” or “no”.
- infection_history: indicates whether the patient has a history of infections, set with “yes” or “no”.
- BMI: the patient's body mass index.
- chronic_illness: indicates whether the patient has a chronic disease, set with “yes” or “no”.
- immune_disorders: indicates whether the patient has an immune disorder, set with “yes” or “no”.
- ethnicity: the patient's ethnicity, encoded as “A”, “B”, or “C”.
- socioeconomic_status: the patient's socioeconomic status, set as “Low”, “Medium”, or “High”.
- urban_rural: indicates whether the patient lives in a rural or urban area.
- leukemia_status: indicates whether the patient has leukemia, set as “Negative” or “Positive”.
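To get a quick feel for these fields, here is a minimal pandas sketch that loads the dataset and normalizes the yes/no columns; the file path is an assumption:

```python
import pandas as pd

# Hypothetical location of the Kaggle extract inside the repo's data/ folder
df = pd.read_csv("data/leukemia_data.csv")

# The binary risk-factor columns are stored as "yes"/"no" strings;
# map them to booleans for easier aggregation
yes_no_cols = [
    "genetic_mutation", "family_history", "smoking_status",
    "alcohol_consumption", "radiation_exposure", "infection_history",
    "chronic_illness", "immune_disorders",
]
df[yes_no_cols] = df[yes_no_cols].apply(lambda s: s.str.lower().eq("yes"))

print(df.shape)  # expected: 143,194 rows
print(df["leukemia_status"].value_counts())
```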
The project is organized as follows:
Leukemia-Cancer-Risk-ETL/
├── airflow/ # Airflow-related files
│ ├── dags/
│ │ ├── dag_etl.py # Airflow DAG
│ ├── functions/ # Folder with all the functions used in the DAG
├── api/ # API data extraction and EDA
├── dashboard/
├── data/ # Stored data
├── kafka/ # Scripts for the producer and the consumer in the streaming
├── notebooks/ # Jupyter notebooks
├── tests/ # Unit tests of the transformations
├── venv/ # Virtual environment
├── .gitignore # Git ignore file
├── docker-compose.yml # Docker Compose file used to run Kafka
└── requirements.txt # Project dependencies
Execute the following command to clone the repository:
git clone https://github.com/valentinabc19/Leukemia-Cancer-Risk-ETL.git
From this point on, all steps are performed in Visual Studio Code.
Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

To connect to the database, you must have the database credentials in a JSON file called credentials. Create this file in the project folder with the following syntax:
{
  "db_host": "DB_HOST",
  "db_name": "DB_NAME",
  "db_user": "DB_USER",
  "db_password": "DB_PASSWORD",
  "db_port": "DB_PORT"
}

Ensure this file is included in .gitignore.
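For reference, here is a minimal sketch of how these credentials might be loaded and used in Python, assuming psycopg2 as the PostgreSQL driver (check requirements.txt for the actual one, and adjust the filename if the project stores it without an extension):

```python
import json
import psycopg2

# Load the credentials file described above (filename is an assumption)
with open("credentials.json") as f:
    creds = json.load(f)

# Open a PostgreSQL connection with the loaded values
conn = psycopg2.connect(
    host=creds["db_host"],
    dbname=creds["db_name"],
    user=creds["db_user"],
    password=creds["db_password"],
    port=creds["db_port"],
)

# Quick sanity check that the connection works
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()
```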
Before running the pipeline, make sure the target database is properly created and configured, as it will serve as the destination for storing the processed data. This step is essential to ensure smooth data loading and prevent connection or schema-related issues during execution.
The necessary dependencies are stored in a file named requirements.txt. To install them, run:
pip install -r requirements.txt

Set the AIRFLOW_HOME environment variable and initialize Airflow:

export AIRFLOW_HOME=$(pwd)/airflow
airflow db init
airflow webserver --port 8080
airflow scheduler

Open a terminal in Visual Studio Code and start Docker:
docker-compose up -d --build

Use this command to see the running containers:
docker ps

Select the ID of the Kafka container and open a bash shell inside it:
docker exec -it ID bash

Run this command to initialize the consumer:
kafka-console-consumer --bootstrap-server IDContainer:9092 --topic fact_table --from-beginning
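For reference, the producer side that feeds the fact_table topic could look like this minimal kafka-python sketch (the serialization format and sample row are assumptions; the actual scripts live in kafka/):

```python
import json
from kafka import KafkaProducer

# Serialize each record as JSON before sending it to the broker
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical row; the real producer streams records emitted by the ETL
row = {"patient_id": 1, "country": "Spain", "leukemia_status": "Positive"}
producer.send("fact_table", value=row)
producer.flush()  # make sure the message is actually delivered
```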
Run this command to initialize Airflow:

airflow standalone

Open your browser and go to http://localhost:8080.
Use the credentials given to you in the Airflow standalone runtime, for example:
- Username: airflow
- Password: airflow

Then:
- Locate the leukemia_etl DAG.
- Turn it On and click "Trigger DAG".
- Use the Airflow UI to track task status.
- Logs are available under airflow/logs/.
- The data sent to the consumer is, in turn, forwarded to a Streamlit dashboard, which can be accessed at http://localhost:8501.
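A hedged sketch of how such a Streamlit page could consume the stream (the layout and consumer settings are assumptions; the actual app lives in dashboard/):

```python
import json
import streamlit as st
from kafka import KafkaConsumer

st.title("Live leukemia metrics")

# Read messages from the beginning of the fact_table topic
consumer = KafkaConsumer(
    "fact_table",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop polling after 5 s of silence
)

rows = [msg.value for msg in consumer]
if rows:
    st.dataframe(rows)  # render the latest batch as a table
else:
    st.info("No messages received yet.")
```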
The leukemia_etl DAG includes:
- extract_leukemia_op
- extract_api_op
- process_api_op
- merge_op
- transform_op
- validate_op
- load_op
- kafka_op
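Here is a skeletal view of how these tasks might be wired in airflow/dags/dag_etl.py; the callables and dependency order are plausible assumptions, not the verified graph:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callable; the real implementations live in airflow/functions/
def _noop(**kwargs):
    pass

with DAG(
    dag_id="leukemia_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered manually from the UI
    catchup=False,
) as dag:
    task_ids = [
        "extract_leukemia_op", "extract_api_op", "process_api_op", "merge_op",
        "transform_op", "validate_op", "load_op", "kafka_op",
    ]
    tasks = {t: PythonOperator(task_id=t, python_callable=_noop) for t in task_ids}

    # One plausible ordering: both extractions feed the merge, then the
    # data is transformed, validated, loaded, and finally streamed to Kafka
    tasks["extract_api_op"] >> tasks["process_api_op"]
    [tasks["extract_leukemia_op"], tasks["process_api_op"]] >> tasks["merge_op"]
    tasks["merge_op"] >> tasks["transform_op"] >> tasks["validate_op"]
    tasks["validate_op"] >> tasks["load_op"] >> tasks["kafka_op"]
```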