Create correlation matrices for RSEM TPM polya and stranded datasets #323
Conversation
Basic structure of analysis files for correlation matrix; loads .rds files into Python using pyreadr (added dependency to Dockerfile)
Added bulk of code for this pull. Seems to work, but not yet validated on known input files. Details:
- added scikit-learn dependency
- correlation_matrix code implemented; split into two functions, prepare_expression and calculate_correlation
- run on both TPM gene datasets, pbta-gene-expression-rsem-tpm.polya.rds and pbta-gene-expression-rsem-tpm.stranded.rds
- utils: read_rds and write_rds now interact with files specifically in ../../data, ../../scratch, and results *only*
- Added to CI
- Renamed correlation_matrix to step 01
- Rolled run.py into step 01; we'll split it out at the end
- Bugfix on write_rds: need to store the index manually
- Removed results .gitignore; maybe we do want to store results?
Upon re-reading the output expectations, this file does not "provide outputs intended for tables, figures, or supplementary tables or figures".
Shall I go ahead and merge your current master into my branch, or is that something one of you prefers to do? (The contributing page suggests I keep my fork synced with master only after the first pull is accepted, which is why I haven't done it already.) In any case, the current CI failure appears to be because my download-ci-files.sh script is out of date.
Hi @e-t-k, thanks for filing this! Please feel free to pull master into this branch. I also took a quick look at the files changed here and it seems appropriate to add both files in a single pull request, so you are all set to mark as ready for review when you’re ready.
Conflicts: .circleci/config.yml
@jaclyn-taroni thanks for your guidance! It's ready to review now (and thanks for restarting the CI, if that was you who gave it a kick).
Happy to help! I've requested a review from @jashapiro.
Thank you for this contribution @e-t-k. The code looks great, and I really appreciate the good comments/documentation! I have a couple of minor questions which occurred to me as I looked it over.
The biggest general comment is about where some of the file paths and options are hard-coded into the script, which could make things a bit more brittle if anything changes down the line. If you are using util.py in other scripts, it can be nice to have some of the main paths centralized in that file, but it might also be a potential source of hidden errors, in that it might not be immediately clear when looking at later PRs where files are being written or read from.
My other comments are mostly just informational, so feel free to reply but whether your code changes at all is entirely up to you!
Thanks again for contributing, and I am sure we will get this merged quickly!
# Rank transform - values are now the rank of that gene within each sample
print_v("Rank transforming")
filtered_rank_transformed_df = np.apply_along_axis(scipy.stats.rankdata, 0, expr_var_filtered_df)

# Run Pearson metric on ranked genes. For pairwise_distances we need samples=rows so transpose.
print_v("Running pairwise distances")
x_corr = 1 - sklp.pairwise_distances(X=filtered_rank_transformed_df.transpose(), metric="correlation")
I'm curious here if there is a reason not to use scipy.stats.spearmanr()? I would think it should have the same result, so it shouldn't really matter, but I'm curious if there might be some issue that this method addresses that led you to use it here?
Yeah, I also expect the result should be the same. This part of the code is copied pretty directly from our UCSC TumorMap library so that we can potentially generate TumorMaps from these data.
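That expected equivalence is easy to check numerically; a minimal sketch on random data (variable names here are illustrative, not from the PR's code):

```python
import numpy as np
import scipy.stats
import sklearn.metrics.pairwise as sklp

rng = np.random.default_rng(0)
expr = rng.random((50, 10))  # toy matrix: 50 genes x 10 samples

# TumorMap-style approach: rank within each sample (column), then Pearson on ranks
ranked = np.apply_along_axis(scipy.stats.rankdata, 0, expr)
corr_ranked = 1 - sklp.pairwise_distances(X=ranked.T, metric="correlation")

# scipy's one-call version: with a 2-D input and axis=0 (the default),
# spearmanr treats each column as a variable, i.e. correlates the samples
corr_spearman, _ = scipy.stats.spearmanr(expr)

# both give the same 10x10 sample-by-sample Spearman matrix
assert np.allclose(corr_ranked, corr_spearman)
```

Both paths use average ranks for ties, so they should agree even on tied expression values.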
Apply changes suggested by @jashapiro's comments as appropriate:
- changed input filenames to arguments rather than hardcoded (resulting in two CircleCI run commands for now)
- enable verbose in CI
- stopped using hardcoded filepaths; input (data) and scratch dirs are now provided as command line options (results not used yet); removed utils.get_filepath as a consequence
- moved print_v below docstring (oops)
- add proportion_unexpressed, variance_filter_level as command line options
- use os.path.join for all filepaths vs forward slashes
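The option handling described in that commit might look roughly like the following sketch; the option names, defaults, and help strings are assumptions for illustration, not the script's actual interface:

```python
import argparse
import os

parser = argparse.ArgumentParser(
    description="Correlation matrix for an RSEM TPM expression dataset")
parser.add_argument("input_file",
                    help="expression .rds filename within the data directory")
parser.add_argument("--data-dir", default=os.path.join("..", "..", "data"))
parser.add_argument("--scratch-dir", default=os.path.join("..", "..", "scratch"))
parser.add_argument("--proportion-unexpressed", type=float, default=0.8)
parser.add_argument("--variance-filter-level", type=float, default=0.2)
parser.add_argument("--verbose", action="store_true")

# parse a sample invocation (normally this would read sys.argv)
args = parser.parse_args(["pbta-gene-expression-rsem-tpm.polya.rds", "--verbose"])

# build paths with os.path.join rather than hardcoded forward slashes
input_path = os.path.join(args.data_dir, args.input_file)
```

Passing the filename as a positional argument is what makes two separate CI run commands necessary, one per dataset.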
Someone else already installed scikit-learn, so I don't need to install it again.
Thank you @jashapiro ! I believe I've addressed your comments in b208c8b.
Just some little formatting changes, with a few rearrangements of arguments for consistency.
Thanks for your updates @e-t-k! This looks great. I made a few little format changes to standardize argument orders and clean up long lines, but nothing substantive.
This should go in today!
For comparative-rnaseq-analysis: update command lines in CI file; add dependency to Dockerfile; add new version of outlier results tsv.gz file with sample filters applied.
…s); convert scratch files to .feather (#544)
* update README (TODO: proofread markdown)
* implement pull 3: sample filters and general cleanup
  - add tsv and feather IO functions to utils
  - add sample filters to step 01: check MEND QC status, and use pbta-histologies to filter to only tumor samples
  - for dropped genes, note if dropped by expression filter or variance filter
  - all internal .rds files converted to .feather
  - all path handling in step 01 moved to main()
* [#323] update CI file, Dockerfile, results file for comparative-rnaseq-analysis: update command lines in CI file; add dependency to Dockerfile; add new version of outlier results tsv.gz file with sample filters applied.
* [#229] tidy up README markdown.
* [#229] analyses/README update comparative-rna-seq
* [#229] check MEND QC files to confirm they parsed OK. Per jashapiro's comments, double-check that MEND QC files parsed into PASS or FAIL and crash if they didn't. Also call out any samples that didn't have MEND QC files (these will be treated as fails and not included in the dataset).
* [#229] maintain consistent column order in outlier results tsv. Due to filtering columns on a set() in step 01 filter_samples, the column order of intermediate .feather files, and therefore the outlier results file, changed between script reruns. To improve reproducibility, pin column order (i.e., sample order) to the order of the original .Rds expression matrix. Includes new results file with data identical to previous, but columns in correct order.
I'm opening this as a draft request because I'm adding two files: 01-correlation-matrix.py and a utils.py module with shared functions. Does it make sense to merge the utils as standalone, even though it's a helper that doesn't run anything by itself? Or is it reasonable to review both these files together? Thank you!

Purpose/implementation Section
What scientific question is your analysis addressing?
What is the correlation between samples using RSEM TPM gene expression? We are interested in correlation within the polyA and ribodeplete (stranded) data sets; we don't calculate correlation between the datasets as they are not comparable.
What was your approach?
Then, apply the TumorMap-creation process used in Vaske et al. (2019) "Comparative Tumor RNA Sequencing Analysis...", which comprises the following steps:
Apply filters to the expression matrix to reduce noise that obscures cohort-wide patterns.
Calculate the Spearman correlations between each pair of samples.
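A rough sketch of those two steps on a toy matrix (the filter thresholds and variable names below are placeholders for illustration, not the analysis's actual settings):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# toy expression matrix: 100 genes (rows) x 8 samples (columns)
expr = pd.DataFrame(rng.poisson(2, size=(100, 8)).astype(float))

# Filter 1: drop genes with zero expression in too many samples (threshold assumed)
proportion_unexpressed = 0.8
expr = expr[(expr == 0).mean(axis=1) < proportion_unexpressed]

# Filter 2: keep only the most variable genes (fraction assumed)
variance_filter_level = 0.2
n_keep = int(len(expr) * variance_filter_level)
top_genes = expr.var(axis=1).sort_values(ascending=False).index[:n_keep]
expr = expr.loc[top_genes]

# All-by-all Spearman correlation between samples (columns)
corr = expr.corr(method="spearman")
```

The resulting `corr` is a symmetric samples-by-samples matrix with 1.0 on the diagonal, which is what gets written to scratch for each dataset.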
What GitHub issue does your pull request address?
#229 (This is the first of 4+ pulls planned for this issue)
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
I was able to test the filtering and Spearman correlation steps in prepare_expression() and calculate_correlation() on Treehouse datasets, and the results were identical to the output generated by our internal pipeline.
Is there anything that you want to discuss further?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Not yet - no figures/tables generated
Results
What types of results are included (e.g., table, figure)?
No result files yet; not sure if all-by-all file is worthwhile to include as a supplementary table.
Scratch files include:
All-by-all Spearman correlation matrix for each of the polya and stranded RSEM TPM datasets:
List of genes surviving the filters for each dataset:
What is your summary of the results?
No analysis has been performed on these matrices yet; they will also be used as inputs for the next step of this analysis.
Reproducibility Checklist