Create correlation matrices for RSEM TPM polya and stranded datasets #323

e-t-k · 2019-12-09T22:23:33Z

I'm pulling this as a draft request because I'm adding two files, 01-correlation-matrix.py and autils.py module with shared functions. Does it make sense to merge the utils as standalone, even though it's a helper that doesn't run anything by itself? Or is it reasonable to review both these files together? Thank you!

Purpose/implementation Section

What scientific question is your analysis addressing?

What is the correlation between samples using RSEM TPM gene expression? We are interested in correlation within the polyA and ribodeplete (stranded) data sets; we don't calculate correlation between the datasets as they are not comparable.

What was your approach?

Normalize the RSEM TPM gene expression values by log₂(TPM + 1). (This is not strictly necessary for the correlation step but will be used in later steps of the analysis.)

Then, apply the tumormap-creation process used in Vaske et al. (2019) "Comparative Tumor RNA Sequencing Analysis...", which comprises the following steps:

Apply filters to the expression matrix to reduce noise that obscures cohort-wide patterns.
- Drop all genes that are expressed in fewer than 20% of samples
- Sort the genes by variance and filter out the 20% least varying genes.
Calculate the Spearman correlations between each pair of samples.

What GitHub issue does your pull request address?

#229 (This is the first of 4+ pulls planned for this issue)

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Any pull request instructions I've missed or appropriate organization of code/output files.
RDS file input/output (in utils.py) and the log2 conversion step

I was able to test the filtering and spearman correlations steps in prepare_expression() and calculate_correlation() on Treehouse data sets and results were identical to the output generated by our internal pipeline.

Is there anything that you want to discuss further?

Is RDS an appropriate format for output/scratch files? do you have a standardized format?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Not yet - no figures/tables generated

Results

What types of results are included (e.g., table, figure)?

No result files yet; not sure if all-by-all file is worthwhile to include as a supplementary table.
Scratch files include:

All-by-all Spearman correlation matrix for each of the polya and stranded RSEM TPM datasets:

rsem-tpm-polya-all_by_all_correlations.rds,
rsem-tpm-stranded-all_by_all_correlations.rds

List of genes surviving the filters for each dataset:

rsem-tpm-polya-filtered_genes_to_keep.rds
rsem-tpm-stranded-filtered_genes_to_keep.rds

What is your summary of the results?

No analysis performed on these matrices yet - they will also be used as inputs for next step of this analysis

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

basic structure of analysis files for correlation matrix loads rds files into python using pyreadr (added dependency to dockerfile)

Added bulk of code for this pull. Seems to work, but not yet validated on known input files. details: - added scikit-learn dependency - correlation_matrix code implemented. Split into two functions, prepare_expression and calculate_correlation - run on both tpm gene datasets, pbta-gene-expression-rsem-tpm.polya.rds and pbta-gene-expression-rsem-tpm.stranded.rds - utils: read_rds, write_rds now interact with files specifically in ../../data. ../../scratch, and results *only*.

- Added to CI - Rename correlation_matrix to step 01 - Roll run.py into step 01 - we'll split it out at the end - Bugfix on write_rds - need to store the index manually - removed results gitignore - maybe we do want to store results?

upon re-reading of output expectations this file does not "provide outputs intended for tables, figures, or supplementary tables or figures "

e-t-k · 2019-12-09T22:49:56Z

Shall I go ahead and merge your current master into my branch, or is that something one of you prefers to do? (The contributing page suggests I keep my fork synced with master only after the first pull is accepted, which is why I haven't done it already.)

In any case, the current CI fail appears to be because my download-ci-files.sh script is out of date.

jaclyn-taroni · 2019-12-09T23:08:11Z

Hi @e-t-k, thanks for filing this! Please feel free to pull master into this branch. I also took a quick look at the files changed here and it seems appropriate to add both files in a single pull request, so you are all set to mark as ready for review when you’re ready.

Conflicts: .circleci/config.yml

e-t-k · 2019-12-09T23:50:05Z

@jaclyn-taroni thanks for your guidance! It's ready to review now (and thanks for restarting the CI, if that was you who gave it a kick)

jaclyn-taroni · 2019-12-10T00:32:40Z

Happy to help! I've requested a review from @jashapiro.

jashapiro

Thank you for this contribution @e-t-k. The code looks great, and I really appreciate the good comments/documentation! I have a couple of minor questions which occurred to me as I looked it over.

The biggest general comment is mainly about where some of the file paths and options are hard-coded into the script which could make things a bit more brittle if anything changes down the line. If you are using util.py in other scripts, it can be nice to have some of the main paths centralized in that file, but it might also be a potential source of hidden errors, in that it might not be immediately clear when looking at later PRs where files are being written or read from.

My other comments are mostly just informational, so feel free to reply but whether your code changes at all is entirely up to you!

Thanks again for contributing, and I am sure we will get this merged quickly!

analyses/comparative-RNASeq-analysis/01-correlation-matrix.py

.circleci/config.yml

analyses/comparative-RNASeq-analysis/utils.py

analyses/comparative-RNASeq-analysis/01-correlation-matrix.py

jashapiro · 2019-12-10T16:08:44Z

analyses/comparative-RNASeq-analysis/01-correlation-matrix.py

+    # Rank transform - values are now that rank of that gene within each sample
+    print_v("Rank transforming")
+    filtered_rank_transformed_df = np.apply_along_axis(scipy.stats.rankdata,0,expr_var_filtered_df)
+
+    # Run pearson metric on ranked genes. For pairwise_distances we need samples=rows so transpose.
+    print_v("Running pairwise distances")
+    x_corr = 1 - sklp.pairwise_distances(X=filtered_rank_transformed_df.transpose(), metric="correlation")


I'm curious here if there is a reason not to use scipy.stats.spearmanr()? I would think it should have the same result, so it shouldn't really matter, but I'm curious if there might be some issue that this method addresses that led you to use it here?

Yeah, I also expect the result should be the same. This part of the code is copied pretty directly from our UCSC TumorMap library so that we can potentially generate TumorMaps from these data.

analyses/comparative-RNASeq-analysis/01-correlation-matrix.py

@jashapiro

apply changes suggested by @jashapiro comments as appropriate: - changed input filenames to arguments rather than hardcoded (resulting in two circleci run commands for now) - enable verbose in ci - stopped using hardcoded filepaths - input (data) and scratch dirs are now provided as command line options (results not used yet). (removed utils.get_filepath as a consequence) - moved print_v below docstring (oops) - add proportion_unexpressed, variance_filter_level as command line options - use os.path.join for all filepaths vs forward slashes

someone else already installed scikit-learn so i don't need to install it again

e-t-k · 2019-12-12T05:34:08Z

Thank you @jashapiro ! I believe I've addressed your comments in b208c8b.

Just some little formatting changes, with a few rearrangements of arguments for consistency.

jashapiro

Thanks for your updates @e-t-k! This looks great. I mad a few little format changes to standardize argument orders and clean up long lines, but nothing substantive.
This should go in today!

for comparative-rnaseq-analysis: - update command lines in CI file; add dependency to dockerfile; add new version of outlier results tsv.gz file with sample filters applied.

…s); convert scratch files to .feather (#544) * update README (TODO: proofread markdown) * implement pull 3: sample filters and general cleanup - add tsv and feather IO functions to utils - add sample filters to step 01: check MEND qc status, and use pbta-histologies to filter to only tumor samples - For dropped genes, note if dropped by Expression filter or Variance filter - all internal .rds files converted to .feather - all path handling in step 01 moved to main() * [#323] update CI file, dockerfile, results file for comparative-rnaseq-analysis: - update command lines in CI file; add dependency to dockerfile; add new version of outlier results tsv.gz file with sample filters applied. * [#229] tidy up README markdown. * [#229] analyses/README update comparative-rna-seq * [#229] check MEND QC files to confirm parsed ok Per jashapiro's comments, double-check that MEND QC files parsed into PASS or FAIL and crash if they didn't. Also call out any samples that didn't have MEND QC files (these will be treated as fails and not included in the dataset). * [#229] maintain consistent column order in outlier results tsv Due to filtering columns on a set() in step 01 filter_samples, the column order of intermediate .feather files and therefore outlier results file changed between script reruns. To improve reproducibility, pin column order (ie samples order) to the order of the original .Rds expression matrix. Includes new results file which has data identical to previous, but columns in correct order.

e-t-k added 4 commits November 21, 2019 14:17

initial commit

dd70367

basic structure of analysis files for correlation matrix loads rds files into python using pyreadr (added dependency to dockerfile)

Bugfix + cleanup to conform to templates

5be907c

- Added to CI - Rename correlation_matrix to step 01 - Roll run.py into step 01 - we'll split it out at the end - Bugfix on write_rds - need to store the index manually - removed results gitignore - maybe we do want to store results?

all-by-all matrix moved to scratch

1294461

upon re-reading of output expectations this file does not "provide outputs intended for tables, figures, or supplementary tables or figures "

Merge branch 'master' into create-correlation-matrices

2c2b146

Conflicts: .circleci/config.yml

e-t-k marked this pull request as ready for review December 9, 2019 23:38

jaclyn-taroni requested a review from jashapiro December 10, 2019 00:27

jaclyn-taroni mentioned this pull request Dec 10, 2019

Add section about stale branches to contributing guidelines #324

Merged

jashapiro reviewed Dec 10, 2019

View reviewed changes

e-t-k added 3 commits December 11, 2019 16:34

Merge branch 'master' into create-correlation-matrices

378b263

comparative rna - update dockerfile deps

26644ba

someone else already installed scikit-learn so i don't need to install it again

Formatting updates

76dd476

Just some little formatting changes, with a few rearrangements of arguments for consistency.

jashapiro approved these changes Dec 12, 2019

View reviewed changes

jaclyn-taroni merged commit fd2a85b into AlexsLemonade:master Dec 12, 2019

e-t-k mentioned this pull request Feb 6, 2020

Generate outlier thresholds #513

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create correlation matrices for RSEM TPM polya and stranded datasets #323

Create correlation matrices for RSEM TPM polya and stranded datasets #323

e-t-k commented Dec 9, 2019

e-t-k commented Dec 9, 2019

jaclyn-taroni commented Dec 9, 2019

e-t-k commented Dec 9, 2019

jaclyn-taroni commented Dec 10, 2019 •

edited

Loading

jashapiro left a comment

jashapiro Dec 10, 2019

e-t-k Dec 11, 2019

e-t-k commented Dec 12, 2019

jashapiro left a comment

Create correlation matrices for RSEM TPM polya and stranded datasets #323

Create correlation matrices for RSEM TPM polya and stranded datasets #323

Conversation

e-t-k commented Dec 9, 2019

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

e-t-k commented Dec 9, 2019

jaclyn-taroni commented Dec 9, 2019

e-t-k commented Dec 9, 2019

jaclyn-taroni commented Dec 10, 2019 • edited Loading

jashapiro left a comment

Choose a reason for hiding this comment

jashapiro Dec 10, 2019

Choose a reason for hiding this comment

e-t-k Dec 11, 2019

Choose a reason for hiding this comment

e-t-k commented Dec 12, 2019

jashapiro left a comment

Choose a reason for hiding this comment

jaclyn-taroni commented Dec 10, 2019 •

edited

Loading