Name		Name	Last commit message	Last commit date
parent directory ..
results		results
00-repeated-samples.Rmd		00-repeated-samples.Rmd
00-repeated-samples.nb.html		00-repeated-samples.nb.html
01-generate-independent-specimens.R		01-generate-independent-specimens.R
02-generate-independent-rnaseq.R		02-generate-independent-rnaseq.R
README.md		README.md
independent-samples.R		independent-samples.R
independent_rna_samples.R		independent_rna_samples.R
run-independent-samples.sh		run-independent-samples.sh

README.md

Independent Samples

Summary

Many analyses that involve mutation frequencies or co-occurence require that all samples be independent. However, the PBTA data set includes many cases where multiple speciments were taken from a single individual. This analysis creates lists of samples such that there are no cases where more than one specimen is included from each individual.

As different analyses may require different sets of data, we actually generate a few different sets, stored in the results subdirectory:

Primary specimens only with whole genome sequence (WGS):
independent-specimens.wgs.primary.tsv
Primary and secondary specimens with WGS:
independent-specimens.wgs.primary-plus.tsv
Primary specimens only with either WGS or whole exome sequence (WXS):
independent-specimens.wgswxs.primary.tsv
Primary and secondary specimens with WGS or WXS:
independent-specimens.wgswxs.primary-plus.tsv
Primary and secondary specimens matching WGS and WXS independent sample_ids plus only-RNA-Seq for polya samples independent-specimens.rnaseq.primary-plus-polya.tsv
Primary and secondary specimens matching WGS and WXS independent sample_ids plus only-RNA-Seq for stranded samples independent-specimens.rnaseq.primary-plus-stranded.tsv

Generating sample lists

To generate the independent sample lists and associated analysis of redundancies in the overall data set, run the following script from the project root directory:

Use OPENPBTA_BASE_RELEASE=1 to run this module using the pbta-histologies-base.tsv from data folder while preparing analysis files for release:

OPENPBTA_BASE_RELEASE=1 ../analyses/independent-samples/run-independent-samples.sh

OR by default uses pbta-histologies.tsv from data folder

bash analyses/independent-samples/run-independent-samples.sh

Methods

When presented with more than one specimen from a given individual, the script randomly selects one specimen to include, with preference for primary tumors and whole genome sequences where available. There is also a preference for the earliest collected samples, but as this data is not currently available, that code is not currently relevant.

When multiple RNA-Seq samples exist per participant, the script matches the independent whole genome or whole exome sample_ids to gather matched RNA-Seq sample. If participant has onle RNA-Seq sample then a primary (and secondary if applicable) sample is randomly selected per participant

Relevant links

The methods are described in the manuscript here: https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#selection-of-independent-samples

Output data files are also described in the main README here: https://github.com/AlexsLemonade/OpenPBTA-analysis#data-formats

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

independent-samples

independent-samples

README.md

Independent Samples

Summary

Generating sample lists

Methods

Relevant links

Files

independent-samples

Directory actions

More options

Directory actions

More options

Latest commit

History

independent-samples

Folders and files

parent directory

README.md

Independent Samples

Summary

Generating sample lists

Methods

Relevant links