Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Create correlation matrices for RSEM TPM polya and stranded datasets #323

Merged

Conversation

e-t-k
Copy link
Contributor

@e-t-k e-t-k commented Dec 9, 2019

I'm pulling this as a draft request because I'm adding two files, 01-correlation-matrix.py and autils.py module with shared functions. Does it make sense to merge the utils as standalone, even though it's a helper that doesn't run anything by itself? Or is it reasonable to review both these files together? Thank you!

Purpose/implementation Section

What scientific question is your analysis addressing?

What is the correlation between samples using RSEM TPM gene expression? We are interested in correlation within the polyA and ribodeplete (stranded) data sets; we don't calculate correlation between the datasets as they are not comparable.

What was your approach?

  • Normalize the RSEM TPM gene expression values by log2(TPM + 1). (This is not strictly necessary for the correlation step but will be used in later steps of the analysis.)

Then, apply the tumormap-creation process used in Vaske et al. (2019) "Comparative Tumor RNA Sequencing Analysis...", which comprises the following steps:

  • Apply filters to the expression matrix to reduce noise that obscures cohort-wide patterns.

    • Drop all genes that are expressed in fewer than 20% of samples
    • Sort the genes by variance and filter out the 20% least varying genes.
  • Calculate the Spearman correlations between each pair of samples.

What GitHub issue does your pull request address?

#229 (This is the first of 4+ pulls planned for this issue)

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

  • Any pull request instructions I've missed or appropriate organization of code/output files.
  • RDS file input/output (in utils.py) and the log2 conversion step

I was able to test the filtering and spearman correlations steps in prepare_expression() and calculate_correlation() on Treehouse data sets and results were identical to the output generated by our internal pipeline.

Is there anything that you want to discuss further?

  • Is RDS an appropriate format for output/scratch files? do you have a standardized format?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Not yet - no figures/tables generated

Results

What types of results are included (e.g., table, figure)?

No result files yet; not sure if all-by-all file is worthwhile to include as a supplementary table.
Scratch files include:

All-by-all Spearman correlation matrix for each of the polya and stranded RSEM TPM datasets:

rsem-tpm-polya-all_by_all_correlations.rds,
rsem-tpm-stranded-all_by_all_correlations.rds

List of genes surviving the filters for each dataset:

rsem-tpm-polya-filtered_genes_to_keep.rds
rsem-tpm-stranded-filtered_genes_to_keep.rds

What is your summary of the results?

No analysis performed on these matrices yet - they will also be used as inputs for next step of this analysis

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

basic structure of analysis files for correlation matrix
loads rds files into python using pyreadr (added dependency to dockerfile)
Added bulk of code for this pull.

Seems to work, but not yet validated on known input files.

details:

- added scikit-learn dependency

- correlation_matrix code implemented. Split into two functions, prepare_expression and calculate_correlation

- run on both tpm gene datasets, pbta-gene-expression-rsem-tpm.polya.rds and pbta-gene-expression-rsem-tpm.stranded.rds

- utils: read_rds, write_rds now interact with files specifically in ../../data. ../../scratch, and results *only*.
- Added to CI
- Rename correlation_matrix to step 01
- Roll run.py into step 01 - we'll split it out at the end
- Bugfix on write_rds - need to store the index manually
- removed results gitignore - maybe we do want to store results?
upon re-reading of output expectations this file does not "provide outputs intended for tables, figures, or supplementary tables or figures "
@e-t-k
Copy link
Contributor Author

e-t-k commented Dec 9, 2019

Shall I go ahead and merge your current master into my branch, or is that something one of you prefers to do? (The contributing page suggests I keep my fork synced with master only after the first pull is accepted, which is why I haven't done it already.)

In any case, the current CI fail appears to be because my download-ci-files.sh script is out of date.

@jaclyn-taroni
Copy link
Member

Hi @e-t-k, thanks for filing this! Please feel free to pull master into this branch. I also took a quick look at the files changed here and it seems appropriate to add both files in a single pull request, so you are all set to mark as ready for review when you’re ready.

@e-t-k e-t-k marked this pull request as ready for review December 9, 2019 23:38
@e-t-k
Copy link
Contributor Author

e-t-k commented Dec 9, 2019

@jaclyn-taroni thanks for your guidance! It's ready to review now (and thanks for restarting the CI, if that was you who gave it a kick)

@jaclyn-taroni
Copy link
Member

jaclyn-taroni commented Dec 10, 2019

Happy to help! I've requested a review from @jashapiro.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @e-t-k. The code looks great, and I really appreciate the good comments/documentation! I have a couple of minor questions which occurred to me as I looked it over.

The biggest general comment is mainly about where some of the file paths and options are hard-coded into the script which could make things a bit more brittle if anything changes down the line. If you are using util.py in other scripts, it can be nice to have some of the main paths centralized in that file, but it might also be a potential source of hidden errors, in that it might not be immediately clear when looking at later PRs where files are being written or read from.

My other comments are mostly just informational, so feel free to reply but whether your code changes at all is entirely up to you!

Thanks again for contributing, and I am sure we will get this merged quickly!

Comment on lines 71 to 77
# Rank transform - values are now that rank of that gene within each sample
print_v("Rank transforming")
filtered_rank_transformed_df = np.apply_along_axis(scipy.stats.rankdata,0,expr_var_filtered_df)

# Run pearson metric on ranked genes. For pairwise_distances we need samples=rows so transpose.
print_v("Running pairwise distances")
x_corr = 1 - sklp.pairwise_distances(X=filtered_rank_transformed_df.transpose(), metric="correlation")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious here if there is a reason not to use scipy.stats.spearmanr()? I would think it should have the same result, so it shouldn't really matter, but I'm curious if there might be some issue that this method addresses that led you to use it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I also expect the result should be the same. This part of the code is copied pretty directly from our UCSC TumorMap library so that we can potentially generate TumorMaps from these data.

apply changes suggested by @jashapiro comments as appropriate:

- changed input filenames to arguments rather than hardcoded
(resulting in two circleci run commands for now)

- enable verbose in ci

- stopped using hardcoded filepaths - input (data) and scratch dirs are now provided as command line options (results not used yet). (removed utils.get_filepath as a consequence)

- moved print_v below docstring (oops)

- add proportion_unexpressed, variance_filter_level as command line options

- use os.path.join for all filepaths vs forward slashes
someone else already installed scikit-learn so i don't need to install it again
@e-t-k
Copy link
Contributor Author

e-t-k commented Dec 12, 2019

Thank you @jashapiro ! I believe I've addressed your comments in b208c8b.

Just some little formatting changes, with a few rearrangements of arguments for consistency.
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your updates @e-t-k! This looks great. I mad a few little format changes to standardize argument orders and clean up long lines, but nothing substantive.
This should go in today!

@jaclyn-taroni jaclyn-taroni merged commit fd2a85b into AlexsLemonade:master Dec 12, 2019
@e-t-k e-t-k mentioned this pull request Feb 6, 2020
5 tasks
e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this pull request Feb 19, 2020
for comparative-rnaseq-analysis:
- update command lines in CI file; add dependency to dockerfile;
add new version of outlier results tsv.gz file with sample filters applied.
jaclyn-taroni pushed a commit that referenced this pull request Feb 20, 2020
…s); convert scratch files to .feather (#544)

* update README (TODO: proofread markdown)

* implement pull 3: sample filters and general cleanup

- add tsv and feather IO functions to utils

- add sample filters to step 01: check MEND qc status, and use pbta-histologies to filter to only tumor samples
- For dropped genes, note if dropped by Expression filter or Variance filter
- all internal .rds files converted to .feather
- all path handling in step 01 moved to main()

* [#323] update CI file, dockerfile, results file

for comparative-rnaseq-analysis:
- update command lines in CI file; add dependency to dockerfile;
add new version of outlier results tsv.gz file with sample filters applied.

* [#229] tidy up README markdown.

* [#229] analyses/README update comparative-rna-seq

* [#229] check MEND QC files to confirm parsed ok

Per jashapiro's comments, double-check that MEND QC files parsed into PASS or FAIL and
crash if they didn't.
Also call out any samples that didn't have MEND QC files (these will be treated as
fails and not included in the dataset).

* [#229] maintain consistent column order in outlier results tsv

Due to filtering columns on a set() in step 01 filter_samples, the column order of intermediate .feather files and therefore outlier results file changed between script reruns. To improve reproducibility, pin column order (ie samples order) to the order of the original .Rds expression matrix.

Includes new results file which has data identical to previous, but columns in correct order.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants