
Conversation

@roshankern (Member) commented Jun 20, 2023

This PR is ready for review!

In this PR, the validate module is refactored. Now, the Cell Health classification profiles (phenotypic class predictions averaged across perturbation) are derived in cell-health-data and simply loaded into this repo. Correlations between these profiles and Cell Health labels are derived for all model types and feature types, both across all cell lines and per cell line, using both Pearson and CCC correlation methods.

These correlations are also briefly viewed in this new version of the validate module.

There are about 475 lines to review, sorry for the longer PR 😿
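
A minimal sketch of how these profile-to-label correlations could be computed is below. The function and column names (`correlate_profiles`, `phenotypic_class`, etc.) are illustrative assumptions, not the repo's actual identifiers; `ccc` implements Lin's concordance correlation coefficient.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def ccc(x: np.ndarray, y: np.ndarray) -> float:
    """Lin's concordance correlation coefficient between two vectors."""
    covariance = np.cov(x, y, ddof=0)[0, 1]
    return 2 * covariance / (np.var(x) + np.var(y) + (np.mean(x) - np.mean(y)) ** 2)


def correlate_profiles(profiles: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Correlate every phenotypic-class column with every Cell Health label column,
    returning one long-format row per (class, label) pair."""
    rows = []
    for phenotype in profiles.columns:
        for label in labels.columns:
            x, y = profiles[phenotype].to_numpy(), labels[label].to_numpy()
            rows.append({
                "phenotypic_class": phenotype,
                "cell_health_label": label,
                "pearson": pearsonr(x, y)[0],
                "ccc": ccc(x, y),
            })
    return pd.DataFrame(rows)
```

In this sketch the profiles and labels are assumed to be aligned on the same perturbation index before correlating.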

@review-notebook-app commented

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@gwaybio (Member) left a comment

LGTM!

A couple of discussion items that need not delay merging:

  1. It's a bit concerning to see the shuffled models outputting such high correlations. Why do you think that is?
  2. I would also be interested in quickly seeing how each cell line performed, and how each feature set performed (CP alone, DP alone). Are you planning on expanding the notebook to include more clustermaps?
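
One way the requested per-cell-line / per-feature-set clustermaps could look is sketched below, assuming a long-format correlation table with `cell_line`, `feature_type`, `phenotypic_class`, `cell_health_label`, and `pearson` columns; these names are assumptions, not the repo's confirmed layout.

```python
import seaborn as sns


def plot_clustermap(correlations, cell_line: str, feature_type: str):
    """Cluster phenotypic classes against Cell Health labels for one
    cell line / feature set combination (e.g. "A549" and "CP")."""
    subset = correlations.query(
        "cell_line == @cell_line and feature_type == @feature_type"
    )
    # Pivot the long-format correlations into a class-by-label matrix
    matrix = subset.pivot(
        index="phenotypic_class", columns="cell_health_label", values="pearson"
    )
    sns.clustermap(matrix, cmap="RdBu_r", center=0)
```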

@roshankern (Member, Author) commented Jun 30, 2023

Hmmmm, I am really struggling to interpret the cell health data classification profile correlations, both:

  1. Understanding why the shuffled baseline models have such high correlations
  2. Judging the model's effectiveness in being applied to the cell health dataset (from the correlations)

For 1), I figured I would look at the raw classification numbers (at https://github.com/roshankern/phenotypic_profiling_model/blob/add-classifications-preview/5.validate_model/preview_classifications.ipynb), but I am also not sure what insight these can give. The main difference I noticed is that the final models seem to give a much higher probability for interphase. This seems reasonable, as most of the nuclei I visually checked in Cell Health Data (from the IDR Stream previewer) looked like interphase to me. Do you have any thoughts on how to approach 1) above?
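
For reference, the kind of per-class probability comparison described above could look like this sketch (`predictions` and its columns are assumptions about the preview notebook's data, not its actual code):

```python
import pandas as pd


def mean_class_probabilities(predictions: pd.DataFrame, class_columns: list) -> pd.DataFrame:
    """Average each phenotypic class's predicted probability per model type,
    e.g. to compare final vs shuffled_baseline interphase probabilities."""
    return predictions.groupby("model_type")[class_columns].mean().T
```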

For 2), it is clear that some correlations we would expect are there (e.g. the apoptosis classification profile and cc_percent_dead). But these correlations can vary drastically across cell line and model type, which makes it difficult for me to say with confidence that these correlations show the model's ability to extract useful classification info from cell health data. Also, in some cases the correlations can be in the opposite direction of what is expected (like a negative correlation for the apoptosis classification profile and cc_percent_dead example). I'm thinking it might be worth coming up with a way to judge the correlations en masse, though my idea may be a bit out of the scope of the project. This is what I am imagining we could do:

  1. Create an "expected correlations matrix". Here we could manually annotate correlations we would expect to see and the direction we would expect them in (although I am not sure it would be viable to include magnitude, just direction). Some example correlations we could annotate:
  • apoptosis and cc_percent_dead - positive correlation expected
  • large and cc_nucleus_area_mean - positive correlation expected
  • apoptosis and vb_percent_live - negative correlation expected
  2. Evaluate each cell health data correlation matrix against our "expected correlations matrix" to derive some kind of validation_score
  3. See if the final models have statistically significantly higher validation_scores than the shuffled_baseline models
  4. See which feature types and cell lines have higher validation_scores

I think this idea may involve significant scope creep and be unnecessary for our purposes with the model, but I am not sure how else to holistically review the correlation performance across cell lines and models (see the sketch below for what this could look like). What do you think? Is there a better way to answer 2)?
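
A minimal sketch of the proposed validation_score, assuming the long-format correlation table from earlier; the expected-sign annotations and column names are illustrative assumptions, not repo conventions:

```python
import pandas as pd

# Manually annotated expected correlation directions (+1 / -1),
# taken from the examples in the list above.
EXPECTED_SIGNS = {
    ("apoptosis", "cc_percent_dead"): 1,
    ("large", "cc_nucleus_area_mean"): 1,
    ("apoptosis", "vb_percent_live"): -1,
}


def validation_score(correlations: pd.DataFrame) -> float:
    """Fraction of annotated (phenotype, label) pairs whose observed
    correlation sign matches the expected direction."""
    matches = []
    for (phenotype, label), expected_sign in EXPECTED_SIGNS.items():
        observed = correlations.loc[
            (correlations["phenotypic_class"] == phenotype)
            & (correlations["cell_health_label"] == label),
            "pearson",
        ]
        if not observed.empty:
            matches.append(float(observed.iloc[0] * expected_sign > 0))
    return sum(matches) / len(matches)
```

Final vs shuffled_baseline validation_scores could then be compared with a paired test (e.g. scipy.stats.wilcoxon) across cell line / feature type combinations.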

@roshankern (Member, Author) commented

A preview_CH_correlation_differences.ipynb notebook has been added in c18edd2 that makes the differences between the final and shuffled_baseline models clearer.

The correlation differences for the all-cell-lines, pearson, CP_and_DP combination seem to be as we would expect if the final model performs better than the shuffled baseline model. This is supported by:

  • OutOfFocus and cell roundness pearson correlations higher for final model than shuffled baseline model
  • Large and area mean pearson correlations higher for final model than shuffled baseline model
  • Apoptosis and percent dead pearson correlations higher for final model than shuffled baseline model
  • Apoptosis and percent live pearson correlations lower for final model than shuffled baseline model

For now we will not pursue the validation_score idea mentioned above in #33 (comment).
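
A sketch of how the final-minus-shuffled difference heatmaps could be drawn; the pivoted class-by-label matrices passed in are assumptions about the notebook's data, not its actual code:

```python
import seaborn as sns


def plot_correlation_difference(final_matrix, shuffled_matrix):
    """Positive cells mean the final model correlates more strongly than
    the shuffled baseline for that (phenotype, label) pair."""
    sns.heatmap(final_matrix - shuffled_matrix, cmap="RdBu_r", center=0)
```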

@roshankern roshankern merged commit 2024a27 into WayScience:cp-feature-refactor Jul 6, 2023
@roshankern roshankern deleted the refactor-validate-module branch July 6, 2023 00:40
roshankern added a commit that referenced this pull request Jul 6, 2023
* Refactor Download Module (#18)

* refactor module

* remove training data file

* Update 0.download_data/scripts/nbconverted/download_data.py

Co-authored-by: Erik Serrano <31600622+axiomcura@users.noreply.github.com>

* eric suggestions

---------

Co-authored-by: Erik Serrano <31600622+axiomcura@users.noreply.github.com>

* Refactor Split Data Module (#19)

* refactor module

* greg suggestions

* Train module refactor (#20)

* refactor format module

* use stratify function

* rerun train module

* black formatting

* docs, nbconvert

* nbconvert

* rerun pipeline, rename model

* fix typo

* Update 2.train_model/README.md

Co-authored-by: Gregory Way <gregory.way@gmail.com>

* Update 2.train_model/README.md

Co-authored-by: Gregory Way <gregory.way@gmail.com>

* Update 2.train_model/README.md

Co-authored-by: Gregory Way <gregory.way@gmail.com>

* notebook run

---------

Co-authored-by: Gregory Way <gregory.way@gmail.com>

* Refactor evaluate module (#21)

* refactor class pr curves

* refactor confusion matrix

* refactor F1 scores

* refactor model predictions

* documentation

* dave suggestions

* erik suggestions, reconvert

* Refactor interpret module (#22)

* refactor interpret notebook

* docs, reconvert script

* greg suggestions

* Get Leave One Image Out Probabilities (#23)

* add LOIO notebook

* LOIO notebook

* update notebook

* download and split data with cell UUIDs

* move LOIO

* finish LOIO

* black formatting

* rerun notebook

* rerun notebook, dave suggestions

* greg comment

* Train single class models (#25)

* move multiclass models

* rename files, fix sh

* single class models notebook

* run notebook

* binarize labels

* train single class models

* reconvert notebooks

* update readme

* rename sh file

* remove models

* eric readme suggestions

* rerun notebook, eric suggestions

* Add Single Class Model PR Curves (#26)

* get SCM PR curves

* shuffled baseline

* retrain single class models with correct kernel

* rerun pr curves notebook

* remove nones

* rerun multiclass model

* rerun notebook

* move file

* docs, black formatting

* format notebook

* Update 3.evaluate_model/README.md

Co-authored-by: Dave Bunten <ekgto445@gmail.com>

* dave suggestions

* reconvert notebook

---------

Co-authored-by: Dave Bunten <ekgto445@gmail.com>

* Add SCM confusion matrices and F1 scores (#27)

* get SCM PR curves

* shuffled baseline

* retrain single class models with correct kernel

* rerun pr curves notebook

* remove nones

* rerun multiclass model

* rerun notebook

* move file

* create SCM confusion matrix

* rerun notebook

* add changes from last PR

* rerun notebook

* add SCM F1, update SCM confusion matrices

* documentation

* rerun notebook

* Update utils/evaluate_utils.py

Co-authored-by: Dave Bunten <ekgto445@gmail.com>

* Update utils/evaluate_utils.py

Co-authored-by: Dave Bunten <ekgto445@gmail.com>

* Update 3.evaluate_model/scripts/nbconverted/F1_scores.py

Co-authored-by: Dave Bunten <ekgto445@gmail.com>

* dave suggestions

---------

Co-authored-by: Dave Bunten <ekgto445@gmail.com>

* Get SCM Predictions and LOIO Probabilities (#29)

* get SCM LOIO probas

* reconvert notebook

* get model predictions

* rerun LOIO

* reconvert notebook

* save and reconvert notebook

* eric suggestions

* Add SCM Interpretations (#30)

* add scm coefficients

* rerun interpret multi-class model

* compare model coefficients

* nbconvert

* readme

* make all correlations negative

* rerun training

* rerun evaluate

* rerun interpret

* docs

* newline

* rerun LOIO

* Remove unused cp features (#31)

* rerun download/split modules

* rerun multiclass models

* rerun single class model

* rerun evaluate module

* get LOIO probas

* rerun interpret module

* rerun download data

* Adding CP features to ggplot visualization (#24)

* set colors for model types

* visualize precision recall with CP and DP+CP

* add F1 score barchart visualization

* minor tweak of f1 score print

* ignore mac files

* merge main and rerun viz

* change color scheme for increased contrast

* add f1 score of the top model, and rerun with updated colors

* nrow = 3 in facet

* change name of weighted f1 score

* update single cell images module (#32)

* Refactor validate module (#33)

* update validate module

* refactor validation

* get correlations

* convert notebook

* update readme

* formatting, documentation

* reset index

* add view notebook

* docs, black formatting

* ccc credit

* show all correlations

* add notebook

* remove preview notebook

* convert notebook

* add differences heatmaps

* preview correlation differences

* add docs

* black formatting

---------

Co-authored-by: Erik Serrano <31600622+axiomcura@users.noreply.github.com>
Co-authored-by: Gregory Way <gregory.way@gmail.com>
Co-authored-by: Dave Bunten <ekgto445@gmail.com>