Feel free to view our website, which summarizes our Machine Learning Project.
- Machine Learning Modelling on Malaria Cell Image Recognition
- FLASK
- Front End Development for User Integration
- Presentation
- References
The primary goal of this project is to use machine learning to analyze cell images from individuals both with and without malaria. The analysis aims to predict whether a subject has malaria, offering valuable assistance to healthcare professionals during diagnosis and making it more accessible to the general public. To achieve this, we developed a web application that integrates an optimized machine learning model, designed for use by students in the fields of science and medicine. Users can select a set of images they suspect may contain malaria and compare their assessments to the predictions made by the machine learning model. The application is intended primarily as an educational tool, helping users more easily identify malaria-infected cells.
- Julia Liou - Data Analyst/Engineer & Product Manager
- Kevin Wan - Flask & Web Developer
- Manpreet Sharma - Data Scientist & Back End Developer
- Srinivas Jayaram - Machine Learning Engineer & Data Scientist
- Python: Matplotlib, Numpy, Pandas, Scipy.stats, Sklearn, Tensorflow, PIL, OS, CV2, boto3, Keras, Shutil, IO, Random
- Flask
- SQLite
- AWS S3 Bucket
- CSV Files
- HTML/CSS: Jinja2, Bootstrap
- JavaScript: Plotly
- Sweetalert2
- GitHub
- Tableau
- Canva
- Miro
- Trello
- Contact Kevin Wan for the config.py file that accompanies the web app.
- Clone the repository: `git clone https://github.com/jnliou/project4.git`
- Change into the project directory: `cd project4/Project lab`
- Start the Flask app by running `python app.py` in the terminal. This launches the web app.
- Project Management - Utilized Trello to keep track of deadlines.
- Solution Architecture - Utilized Miro to create our solution architecture and map out our project timeline.
- As the dataset was larger than 100 MB, we were unable to upload it to GitHub, even after compressing the folder into a ZIP archive.
- To access the original dataset, please download it from Kaggle: Kaggle Dataset on Malaria.
- Extract the folder into the `Dataset` folder.
- Run the code via the Jupyter Notebook for data preprocessing.
- The original dataset contained three folders: `cell_images\cell_images`, `cell_images\Uninfected`, and `cell_images\Parasitized`. As the `cell_images\cell_images` folder contained the same data as the Uninfected and Parasitized folders, it was deleted to make processing easier.
- We then resized the photos to 25x25 pixels.
- Using Jupyter Notebook and Python, we selected 2,500 photos from `Dataset\cell_images\Parasitized` and 2,500 from `Dataset\cell_images\Uninfected`. For training data, we placed 1,750 infected photos into `cell_images\clean\train\infected_processed` and 1,750 uninfected photos into `cell_images\clean\train\uninfected_processed`; for testing data, we placed 750 infected photos into `cell_images\clean\test\infected_processed` and 750 uninfected photos into `cell_images\clean\test\uninfected_processed`. This cut our photos from 27,558 to 5,000.
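For illustration, here is a minimal Python sketch of this sampling-and-resizing step. The folder layout and counts follow the description above, but the code itself is an assumption, not the project's actual notebook:

```python
# Hedged sketch of the preprocessing described above (infected class shown);
# the same steps would be repeated for the Uninfected folder.
import os
import random
from PIL import Image

SRC = "Dataset/cell_images/Parasitized"
random.seed(42)
files = random.sample(os.listdir(SRC), 2500)   # sample 2,500 infected images
train, test = files[:1750], files[1750:]       # 1,750 train / 750 test

for split, names in (("train", train), ("test", test)):
    out_dir = f"cell_images/clean/{split}/infected_processed"
    os.makedirs(out_dir, exist_ok=True)
    for name in names:
        img = Image.open(os.path.join(SRC, name)).convert("RGB")
        img.resize((25, 25)).save(os.path.join(out_dir, name))   # 25x25 pixels
```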
Tableau Dashboard of Exploratory Data Analysis: https://public.tableau.com/app/profile/julia.liou6123/viz/EDAonCellImagesofMalaria-Tableau/RGB
This repository contains an analysis of cell images comparing unprocessed vs. processed and uninfected vs. infected cells. Various image analysis techniques, including blob detection, edge detection, edge density, and RGB color channel distribution, were used to identify differences in characteristics between the two groups.

Blob detection was performed to identify and analyze blobs within the images. For both uninfected and infected cells, the mean blob size and the maximum blob size were calculated. Statistical differences were assessed using histograms and T-tests.
Edge detection was employed to visualize the differences between uninfected and infected cells in terms of their edge structures. The resulting images provide a clear visual representation of the variations in edges.
Edge density comparison between uninfected and infected cells was conducted. Histograms and T-tests were used to analyze the differences in edge density characteristics.
The average RGB color distribution of each image for infected and uninfected cells was compared. Histograms and T-tests were used to evaluate any statistical distinctions in average RGB color distribution.
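To make the statistics below concrete, here is a hedged sketch of one of these measures, edge density, compared across groups with a Mann-Whitney U test. The Canny thresholds and file paths are illustrative assumptions:

```python
# Compute per-image Canny edge density and compare the two groups.
import glob
import cv2
import numpy as np
from scipy.stats import mannwhitneyu

def edge_density(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, 100, 200)   # binary edge map
    return np.mean(edges > 0)           # fraction of edge pixels

infected = [edge_density(p) for p in glob.glob("cell_images/clean/train/infected_processed/*.png")]
uninfected = [edge_density(p) for p in glob.glob("cell_images/clean/train/uninfected_processed/*.png")]

stat, p_value = mannwhitneyu(infected, uninfected)
print(f"Mann-Whitney U p-value: {p_value:.3g}")
```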
Statistical Analysis - p-values

| EDA Metric | Mann-Whitney U Test (p-value) |
|---|---|
| Average Red Channel Distribution | 1.60e-11 |
| Average Green Channel Distribution | 4.63e-84 |
| Average Blue Channel Distribution | 1.30e-17 |
| Average Edge Density | 1.20e-302 |
| Average Blob Size | 2.01e-17 |
| Max Blob Size | 1.65e-15 |
We used two methods to perform PCA on our image dataset.
- Approach 1: Performing PCA over the image characteristics and features mentioned below. Results are displayed below.
- RGB Channel Distribution
- Max/Mean Blob
- Edge Density of the image
- Approach 2: We performed PCA on our raw image dataset, following the steps below, to see if there is a split between class labels (a minimal sketch follows this list).
- Read images
- Flatten images
- Process in PCA
- Plot on a 2D map, colored by class label
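A minimal sketch of Approach 2, assuming the processed 25x25 images from the preprocessing step; the plotting details are our own choices:

```python
# Flatten raw images, reduce to 2 components with PCA, and color by class label.
import glob
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from sklearn.decomposition import PCA

def load_flat(pattern):
    return np.array([np.asarray(Image.open(p).convert("RGB")).ravel()
                     for p in glob.glob(pattern)])

X_inf = load_flat("cell_images/clean/train/infected_processed/*.png")
X_uninf = load_flat("cell_images/clean/train/uninfected_processed/*.png")
X = np.vstack([X_inf, X_uninf])
y = np.array([1] * len(X_inf) + [0] * len(X_uninf))   # 1 = infected

coords = PCA(n_components=2).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=y, cmap="coolwarm", s=8, alpha=0.6)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA of flattened cell images")
plt.show()
```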
The results of the various analyses were integrated into four DataFrames, two for the training data (Dataset/eda_train_infect.csv, Dataset/eda_train_uninfect.csv) and two for the testing data (Dataset/eda_test_infect.csv, Dataset/eda_test_uninfect.csv), for further analysis and for use by the ML model.
Step 6: Data Export Data was exported to our website using a SQLite database consisting of the predictions from our ML model, while our raw data (the image dataset) was hosted in an S3 bucket.
Step 7: Building the Machine Learning Models We tried several machine learning models to determine which achieved the best accuracy for our end goal.
Models trained on EDA features:
| Model | Accuracy |
|---|---|
| Random Forest (RF) | 83% |
| Random Forest + hyperparameter tuning | 89% |
| RF + Gradient Boosting | 83% |
| Linear Regression Model | 32% |
| SVC Model | 79% |
| SVC + hyperparameter tuning | 85% |
| SVC + PCA | 79% |
| SVC+RF+NN | 81% |
| Decision Tree | 74% |
Models trained on raw images:
| Model | Accuracy |
|---|---|
| CNN | 94% |
| K-NN | 60% |
| Xception | 80% |
| Xception Optimized | 79% |
A) CNN:
The CNN model was designed to classify cell images into two categories: uninfected (0) and infected (1). It aims to assist in the automated detection of infected cells, a task of significance in diagnosing malaria.
- Training Dataset:
- Dataset Size: 1750 cell images.
- Features: Each row in the dataset represents an image, with each pixel of the image treated as a feature.
- Target Variable: The "Target" column indicates the class label, where 0 represents uninfected cells, and 1 represents infected cells.
- Data Preprocessing:
- Image Resizing: All images were resized to a consistent 25x25 pixels (see the preprocessing step above) to ensure uniform input dimensions for the CNN.
- Normalization: Pixel values were scaled to a range of [0, 1] by dividing by the maximum pixel value (e.g., 255 for 8-bit images). This standardization helps improve convergence during training.
- Model Architecture:
- The CNN model architecture used for cell image classification is as follows:
- Input Layer: Accepts 25x25 RGB images, i.e. batches of shape (32, 25, 25, 3)
- Convolutional Layers: Three convolutional layers were employed with varying numbers of filters and filter sizes.
- Max-Pooling Layers: Max-pooling layers followed each convolutional layer to reduce spatial dimensions.
- Flatten Layer: The output from the final max-pooling layer was flattened into a 1D vector of length 16.
- Dense Layers: Two dense (fully connected) layers were used.
- Dropout Layer: A dropout layer with a dropout rate of 0.5 was added after the first dense layer to prevent overfitting.
- Model Training:
- Loss Function: Binary cross-entropy.
- Optimizer: Adam
- Batch Size: 32
- Epochs: 50
- Validation Split: 20% of the training data was used for validation during training to monitor model performance.
- Model Evaluation:
- The model's performance was evaluated using common binary classification metrics, including:
- Accuracy: Measures the overall correctness of predictions. For this model, the accuracy was 94%.
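The following Keras sketch is consistent with the architecture described above; the filter counts and dense-layer sizes are assumptions, since the notebook's exact values are not listed here:

```python
# Minimal CNN sketch: three conv/max-pool blocks, a dense head with dropout,
# and a sigmoid output for the binary uninfected (0) / infected (1) task.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(25, 25, 3)),             # 25x25 RGB cell images
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                         # guard against overfitting
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=32, epochs=50, validation_split=0.2)
```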
B) Random Forest:
The EDA data was analyzed by a Random Forest model to predict if a cell was infected or not. The Random Forest model was tuned with hyperparameters, and the important features were identified.
- Testing Data:
- A held-out testing dataset of 750 images, unseen during training, was used to evaluate the model. The model was used to predict infected and uninfected cells.
- Results:
- The Random Forest with hyperparameter tuning gave an accuracy of 89%.
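As a rough illustration of the hyperparameter tuning, here is a hedged sklearn sketch on the exported EDA CSVs; the parameter grid and the `Target` column name are assumptions:

```python
# Tune a Random Forest on the EDA feature tables with a small grid search.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

train = pd.concat([pd.read_csv("Dataset/eda_train_infect.csv"),
                   pd.read_csv("Dataset/eda_train_uninfect.csv")])
X_train, y_train = train.drop(columns="Target"), train["Target"]

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
print(grid.best_estimator_.feature_importances_)   # important features
```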
C) Xception:
- Model Architecture:
  - Input Layer: Accepts images with dimensions (25, 25, 3)
  - Base Model: Xception with pre-trained weights (accepts variable input dimensions), comprising:
    - Convolutional and separable convolutional layers
    - Batch normalization layers
    - Activation layers
    - Max-pooling layers
  - Global Average Pooling Layer
  - Two Dense (Fully Connected) Layers
- Output Layer: Dense layer with 1 neuron and sigmoid activation
- Some layers in the base model were frozen to prevent them from being trained.
- Model Training:
- Loss Function: Binary cross-entropy.
- Optimizer: Adam
- Batch Size: 32
- Epochs: 20 for base model and 10 for top model
- Validation Split: 20% of the training data was used for validation during training to monitor model performance.
- Model Evaluation:
- The model's performance was evaluated using common binary classification metrics, including:
- Accuracy: 78-80% for the base model and 77% for the top model.
- This is a model pre-trained on the popular image dataset `imagenet`. We built our base model using the pre-trained model, added a layer of our own, and then trained and tested on our dataset to see how it performs. We got an accuracy of 79% over 20 epochs.
- The model was fine-tuned by creating a `top_model` on top of the `base_model`, feeding the output from the base model into the top model. It was interesting to notice that the accuracy did not change much, but the loss showed a significant difference.
- Model Summary:
- Model architecture with detailed layer information for top model.
- Total parameters: 22,960,681 (87.59 MB)
- Trainable parameters: 2,099,201 (8.01 MB)
- Non-trainable parameters: 20,861,480 (79.58 MB)
- The graphs in `xception.ipynb` show that the model was overfitting at certain points, but the validation set consistently performed better than the training data.
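A minimal transfer-learning sketch in the spirit of the setup above. Note one assumption: Keras's stock Xception requires inputs of at least 71x71 pixels, so this sketch adds a `Resizing` layer to upsample the 25x25 images; the size of the dense head is also an assumption:

```python
# Frozen Xception base with a small trainable head for binary classification.
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import Xception

base_model = Xception(weights="imagenet", include_top=False,
                      input_shape=(71, 71, 3))
base_model.trainable = False                     # freeze pre-trained layers

inputs = keras.Input(shape=(25, 25, 3))
x = layers.Resizing(71, 71)(inputs)              # Xception needs >= 71x71 inputs
x = base_model(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

top_model = keras.Model(inputs, outputs)
top_model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
# top_model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.2)
```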
| Type | Library |
|---|---|
| Data Handling & Processing | Numpy, Pandas |
| Web Framework | Flask |
| Storage & AWS Interaction | Boto3 |
| File Handling & Compression | Zipfile, IO |
| Randomization | Random |
| Database & ORM | SQLAlchemy, csv |
- To interact with our AWS storage, we generated pre-signed URLs from our S3 bucket name and key; this gives Flask an API for retrieving image files (see the sketch after this section).
- The essential `aws_access_key` is fetched from our config file, keeping it secure by never committing it to GitHub.
- Flask plays a pivotal role in capturing user input data from our web game, which is temporarily stored in a global variable.
- After processing this data through our game logic, Jinja2 templating helps pass the variables to the frontend, making them accessible to various functions.
- With the combination of SQLAlchemy and Flask, we've set up API routes that output data in JSON format.
These intricacies, woven together, create a robust and interactive platform tailored to our users' needs.
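As referenced above, here is a hedged sketch of generating a pre-signed S3 URL with boto3; the bucket name, object key, and config variable names are illustrative assumptions:

```python
# Create a short-lived, shareable URL for an image stored in S3.
import boto3
from config import AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY  # kept out of GitHub

s3 = boto3.client(
    "s3",
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "malaria-cell-images", "Key": "test/infected/cell_001.png"},
    ExpiresIn=3600,   # URL valid for one hour
)
print(url)   # Flask can hand this URL to the frontend to display the image
```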
- To run Flask: app.py
- SQLite Database: predictions.db
- SQL Database Generation: sqlite.ipynb
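For context, a minimal sketch of a Flask route that reads `predictions.db` with SQLAlchemy and returns JSON, in the style of the API routes described above; the table and column names are assumptions:

```python
# Serve model predictions from the SQLite database as JSON.
from flask import Flask, jsonify
from sqlalchemy import create_engine, text

app = Flask(__name__)
engine = create_engine("sqlite:///predictions.db")

@app.route("/api/predictions")
def predictions():
    with engine.connect() as conn:
        rows = conn.execute(text("SELECT image_name, prediction FROM predictions"))
        return jsonify([dict(row._mapping) for row in rows])

if __name__ == "__main__":
    app.run(debug=True)
```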
Plugins used include Plotly, Bootstrap, and Google Fonts.
- HTML and CSS have been employed to design the visuals and effects. For user interactions, including an engaging game to showcase our machine learning model, we've used JavaScript.
- Plotly was instrumental in creating graphical representations like pie charts.
- Bootstrap and Google Fonts enhanced the website's aesthetics and readability.
- With Flask serving as our backend framework, we're efficiently reading data from our database.
- AWS S3 has been our choice for storage. It allows us to select image names from the database and subsequently extract and display the relevant image files on the website.
- Users can select infected cells on the platform. Once they submit their selections, the data is sent to our backend for processing. By seamlessly integrating these tools, we've been able to craft a dynamic and interactive platform for our users.
- Byeon, E. (2020, September 11). Exploratory data analysis ideas for image classification. Medium. https://towardsdatascience.com/exploratory-data-analysis-ideas-for-image-classification-d3fc6bbfb2d2
- Centers for Disease Control and Prevention. (2019, October 18). CDC - Malaria - diagnosis & treatment (United States). Centers for Disease Control and Prevention. https://www.cdc.gov/malaria/diagnosis_treatment/index.html
- Google. (n.d.). Google fonts. https://fonts.google.com/
- Microscope photos, download the best free microscope stock ... - pexels. (n.d.). https://www.pexels.com/search/microscope/
- Keras Team. (n.d.). Keras documentation: Transfer learning & fine-tuning. https://keras.io/guides/transfer_learning/
- Tian, Y. (2020, June 17). Integrating image and tabular data for Deep Learning. Medium. https://towardsdatascience.com/integrating-image-and-tabular-data-for-deep-learning-9281397c7318









