A prototype application for healthcare analysis.
A capstone project in the Master of Engineering program at the University of California, Berkeley
- Landscape
- Database Schema
- Preprocessing
- Architectures and Results
4.1. Lung Disease Prediction
4.2. Brain Cancer Prediction
4.3. Alzheimer Prediction
4.4. CNN Model Scores
4.5. Gene Test
4.6. Phenotype Model - High-Level Prototype in Figma
- User Manual
- Team
The healthcare industry is shifting rapidly, with AI and machine learning driving demand for precision medicine and personalized care. Providers face pressure to reduce errors, optimize workflows, and improve outcomes, while startups challenge traditional methods. Complex medical data remains underutilized without efficient analysis tools. This application bridges that gap, offering actionable insights that enhance clinicians' decision-making. It augments, NOT replaces, doctors by providing predictions based on patient data to improve accuracy and reduce false positives. Unlike generalized databases, it delivers specialized insights tailored to clinical needs.
To construct the database effectively, a schema diagram was drawn first so that the code implementation could reference it.
Figure 1: Database schema of the user records
In Figure 1, arrows with a star at one end indicate a one-to-many relationship between each test and user. For example, while each user can have multiple brain tests, one brain test can only belong to one user.
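This one-to-many relationship can be sketched with foreign keys. The table and column names below are illustrative only (the actual schema is the one shown in Figure 1), using Python's built-in sqlite3 module:

```python
import sqlite3

# In-memory database; names are hypothetical stand-ins for the Figure 1 schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    )
""")
# Each brain test row points back to exactly one user (one-to-many).
conn.execute("""
    CREATE TABLE brain_tests (
        test_id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(user_id),
        result  TEXT
    )
""")

conn.execute("INSERT INTO users VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO brain_tests VALUES (?, ?, ?)",
    [(10, 1, "glioma"), (11, 1, "no tumor")],
)

# One user can own many brain tests, but each test has exactly one user_id:
rows = conn.execute(
    "SELECT COUNT(*) FROM brain_tests WHERE user_id = 1"
).fetchone()
print(rows[0])  # → 2
```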
Figure 2: Feature sets of phenotype model
In addition, Figure 3 shows the amount of data used for each class in the different MRI and X-ray models:
Figure 3: The amount of data used to build the model for each disease prediction.
Figure 2 shows the hierarchical view of the database. The fields were separated by disease, and the leaf nodes indicate the prefixes of the fields' identification codes. The same fields may be related to multiple diseases.
Key datasets used in our analysis include:
● DEMO (Demographic Data)
● BMX (Body Measures)
● BPX (Blood Pressure)
● LBX (Laboratory Values: Glucose, Insulin, HbA1c)
● DIQ (Diabetes Questionnaire)
These datasets provided variables such as age, gender, race/ethnicity, BMI, fasting glucose,
insulin levels, blood pressure, and HbA1c—all of which are established markers relevant to
diabetes risk.
Because the class distributions varied within each dataset, class weights had to be generated so that the model treats every class equally. If one class has more data than the others, the model will predict that class more often than the rest; in other words, there will be a bias toward the classes with more training data.
Figure 4: Class frequencies of every CNN model. It shows a high variation, which can cause biased predictions
To solve this problem, a normalization formula was used to compute the class weights. In addition, because each image is processed independently, the images can be preprocessed in multiple threads in parallel instead of in one thread (the normal program flow). By doing that, we reduced the time spent on preprocessing by 83%.
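One common normalization, shown below as an assumption (the report does not state the exact formula used), weights each class by the inverse of its frequency so that rare classes count proportionally more in the loss. The class counts are illustrative, not the real figures from Figure 4:

```python
def class_weights(counts):
    """weight_c = n_samples / (n_classes * count_c): an inverse-frequency
    normalizer (assumed here) so minority classes get weights > 1."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Hypothetical class counts for an imbalanced image dataset:
counts = {"normal": 1200, "pneumonia": 400, "covid": 400}
weights = class_weights(counts)
print(weights)  # minority classes receive the larger weights
```

Passing such a dictionary as `class_weight` during training makes misclassifying a rare class cost more, counteracting the bias shown in Figure 4.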
Figure 5: The amount of time spent in preprocessing images with and without threads.
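The parallel preprocessing step can be sketched with a thread pool from the standard library. The `preprocess` function below is a placeholder for the real image pipeline (resize, normalize, etc.):

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(path):
    """Stand-in for the real per-image preprocessing step."""
    return path.upper()  # placeholder transformation

paths = [f"img_{i}.png" for i in range(8)]

# Sequential version (the normal program flow):
sequential = [preprocess(p) for p in paths]

# Parallel version: each image is independent, so a thread pool can
# preprocess several at once; map() preserves the input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(preprocess, paths))

assert parallel == sequential  # same results, less wall-clock time
```

Note that for pure-Python CPU work the GIL limits the speedup; the gains reported above come when the underlying image operations (e.g., in NumPy or PIL) release the GIL or wait on disk I/O.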
Figure 6: The model architecture for the brain cancer prediction
Figure 7: The architecture of the model for the Alzheimer prediction
The architecture used in the Alzheimer's prediction model differs somewhat because of the nature of the data: given the size of the dataset, the model should be less complex so that it learns the underlying trends instead of memorizing the data.
Figure 8: The model architecture for the lung disease prediction
The model architecture for lung disease prediction is almost identical to the one used in the brain tumor prediction model. The only difference is the last fully connected layer, which in the lung disease model contains 3 nodes instead of 4, matching the number of possible outcomes.
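The effect of that final layer can be illustrated with plain NumPy: the width of the last dense layer fixes the length of the softmax probability vector, so it must equal the number of outcome classes. The feature size and weights below are arbitrary, not the real trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Convert raw scores into a probability distribution over classes."""
    e = np.exp(z - z.max())
    return e / e.sum()

features = rng.standard_normal(64)      # output of the shared earlier layers

W_brain = rng.standard_normal((4, 64))  # brain model: 4 possible outcomes
W_lung = rng.standard_normal((3, 64))   # lung model: 3 possible outcomes

p_brain = softmax(W_brain @ features)
p_lung = softmax(W_lung @ features)

print(p_brain.shape, p_lung.shape)  # (4,) (3,)
```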
Each model was trained multiple times in order to reach the highest accuracy. The test and training data were kept separate so that, at the end of training, each model could be evaluated on data it had never seen before, which tests its generalizability. Test results of the three image-based models are as follows:
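The held-out evaluation described above amounts to a shuffled train/test split. A minimal sketch using only the standard library (the 80/20 ratio and seed are assumptions, not values stated in the report):

```python
import random

def train_test_split(samples, test_ratio=0.2, seed=42):
    """Shuffle, then hold out a fraction of the data that the model
    never sees during training, to measure generalizability."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))       # 80 20
assert not set(train) & set(test)  # no leakage between the splits
```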
Figure 9: The CNN model's test results
Simulated Polygenic Risk Scores (PRS) added approximately 12% improvement in
AUC (from 0.75 to 0.84), confirming the added predictive power of genetic information.
● The TCF7L2 variant (Transcription Factor 7-Like 2), a well-established diabetes risk
gene, was identified as significantly associated with increased risk, showing a 1.4x
higher risk in modeled populations.
● Additional key genes identified via differential expression analysis included:
○ INSR (insulin receptor): central to the insulin signaling pathway
○ IRS1 (insulin receptor substrate 1): modulates insulin response
○ PPARG (peroxisome proliferator-activated receptor gamma): involved in
adipocyte differentiation and glucose metabolism
○ SLC2A4 (GLUT4): glucose transporter gene regulating cellular uptake
● KEGG enrichment analysis revealed overrepresentation of insulin signaling, AMPK
pathway, and type 2 diabetes mellitus pathways, further supporting biological relevance.
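The AUC comparison behind the PRS result can be reproduced in miniature. The function below computes AUC via the rank-sum (Mann-Whitney) identity; the labels and scores are toy values for illustration, not the study's data:

```python
def auc(labels, scores):
    """AUC = probability that a randomly chosen positive case is
    scored above a randomly chosen negative case (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1, 0, 1, 1]
clinical_only = [0.2, 0.4, 0.3, 0.5, 0.8, 0.6, 0.4, 0.9]   # toy risk scores
with_prs      = [0.1, 0.3, 0.2, 0.7, 0.9, 0.4, 0.6, 0.95]  # + genetic info

print(auc(labels, clinical_only), auc(labels, with_prs))
```

Comparing the two AUC values in this way is how the jump from 0.75 to 0.84 reported above would be measured.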
Figure 10: The percentage of each target gene
● Some individuals with low-risk clinical profiles were predicted as high-risk due to their genetic load, underscoring the importance of genomic screening.
● Ethnic disparities were observed in healthcare access and risk exposure, with certain minority populations showing underrepresentation in available genetic reference data, which may affect risk calibration.
Figure 11: The AUC curve of each target gene
Figure 12: ROC curves comparing classification performance across different machine learning models
This plot shows Receiver Operating Characteristic (ROC) curves, which visualize the trade-off between the True Positive Rate (TPR, also known as sensitivity or recall: the proportion of actual positives correctly identified by the model) and the False Positive Rate (FPR: the proportion of actual negatives incorrectly classified as positive) across different classification thresholds.
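Each point on such a curve is one (FPR, TPR) pair at a particular threshold, and sweeping the threshold traces out the curve. A small sketch with made-up labels and scores:

```python
def roc_point(labels, scores, threshold):
    """TPR and FPR when everything scored >= threshold is called positive."""
    tp = sum(y == 1 and s >= threshold for y, s in zip(labels, scores))
    fp = sum(y == 0 and s >= threshold for y, s in zip(labels, scores))
    p = sum(labels)
    n = len(labels) - p
    return tp / p, fp / n  # (TPR, FPR)

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]  # toy model outputs

# Lowering the threshold catches more true positives but admits
# more false positives, which is the trade-off the ROC curve shows:
for t in (0.8, 0.5, 0.2):
    tpr, fpr = roc_point(labels, scores, t)
    print(f"threshold={t}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```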
● Some individuals with non-obese BMI and normal glucose levels still exhibited
high predicted risk due to the presence of multiple coexisting social and clinical
risk indicators.
● Ethnic disparities were observed, particularly among Mexican American and
non-Hispanic Black subgroups, where risks were elevated even after adjusting
for lifestyle factors. This highlights a potential intersection of genetic susceptibility
and healthcare access.
Because changing the design after implementation has started would be challenging, a Figma prototype was developed first to settle on the application's user interface. After carefully considering human-computer interaction heuristics, we decided on the following user interface:
Figure 13: The first Figma prototype of the desktop application models
The application itself is ready to use: clone the repository locally, then execute the test_panels.py file in the App/main/ directory.
...User Manual is Coming...
Cagin Tunc: UC Berkeley, Master of Engineering / Bioengineering
Haoyu Zhao: UC Berkeley, Master of Engineering / Bioengineering
Bikramjeet Singh: UC Berkeley, Master of Engineering / Industrial Engineering and Operations Research
Shuo Li: UC Berkeley, Master of Engineering / Bioengineering
Jiachen Xi: UC Berkeley, Master of Engineering / Bioengineering