Analyses including multiple samples from the same individual #155
Description
Question/issue
Many of the proposed analyses are sensitive to the fact that we have multiple tumor samples from the same individual, taken at different time points or from different tumors. For example, biospecimens BS_K07KNTFY and BS_AQMKA8NC are both tumor WGS data (initial and recurrence, respectively) from participant PT_00G007DM. While this is extremely useful data, it raises questions for many analyses, which I would like to discuss in this issue.
In particular, analyses of mutation prevalence, variant allele frequency distributions, classification accuracy, etc. are likely to be affected by these non-independent samples. In some cases, simple awareness of the issue will be sufficient, and analyses can be written to account for or take advantage of the redundancy in the data. For many analyses, however, we will need to decide which samples to include or exclude, and it would be good to have an agreed-upon set of standards and procedures.
For a specific example, in the analysis of mutation co-occurrence (#13), including all samples would result in many spurious reports of co-occurrence, as it is quite common for two samples from the same individual to have the same sets of mutations. Similarly, analyses of recurrent fusions (#10), distribution of tumor mutation burden (#3), etc. will likely be affected.
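To make the co-occurrence concern concrete, here is a minimal sketch of how counting per sample rather than per individual can inflate a gene pair's count. All participant IDs, sample IDs, and gene names are made-up placeholders, and collapsing each participant to the union of their mutations is only one possible way to de-duplicate, not a settled procedure.

```python
from collections import defaultdict
from itertools import combinations

# Toy records: (participant, sample, mutated genes). Every value is hypothetical.
samples = [
    ("PT_A", "S1", {"GENE1", "GENE2"}),  # initial tumor
    ("PT_A", "S2", {"GENE1", "GENE2"}),  # recurrence from the same individual
    ("PT_B", "S3", {"GENE1"}),
    ("PT_C", "S4", {"GENE2"}),
]


def pair_counts(gene_sets):
    """Count how many gene sets contain each unordered pair of genes."""
    counts = defaultdict(int)
    for genes in gene_sets:
        for pair in combinations(sorted(genes), 2):
            counts[pair] += 1
    return dict(counts)


# Treat every sample as independent: PT_A contributes the pair twice.
per_sample = pair_counts(genes for _, _, genes in samples)

# Collapse to one mutation set per participant (union across their samples).
by_participant = defaultdict(set)
for participant, _, genes in samples:
    by_participant[participant] |= genes
per_individual = pair_counts(by_participant.values())

print(per_sample)
print(per_individual)
```

In this toy data, the GENE1/GENE2 pair is counted twice when samples are treated as independent but only once per individual, even though only one person carries both mutations.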
One potential solution is to use only primary tumors and/or the earliest sampled tumor from each individual in analyses such as these. However, this would miss some potential co-occurrence patterns that may be important in progression and recurrence, which might suggest that the latest tumor from each individual would be better. Doing both is of course an option as well, but I am curious to hear what others think is most appropriate. Ultimately, we may want to add a recommendation to the documentation for future analyses.
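As a starting point for discussion, the "earliest" versus "latest" options could be sketched roughly as below. The field names (`participant_id`, `biospecimen_id`, `age_at_collection_days`) and the helper `select_one_per_participant` are hypothetical, as are the collection ages and the `PT_EXAMPLE`/`BS_EXAMPLE` record; only the two biospecimen IDs and the participant ID from the example above come from the actual data.

```python
def select_one_per_participant(records, which="earliest"):
    """Return {participant_id: biospecimen_id}, keeping one sample per participant.

    `which` is "earliest" or "latest", judged by the (hypothetical)
    age_at_collection_days field. Ties keep the first record encountered.
    """
    if which not in ("earliest", "latest"):
        raise ValueError("which must be 'earliest' or 'latest'")
    pick = min if which == "earliest" else max

    # Group sample records by participant.
    grouped = {}
    for rec in records:
        grouped.setdefault(rec["participant_id"], []).append(rec)

    return {
        participant: pick(recs, key=lambda r: r["age_at_collection_days"])["biospecimen_id"]
        for participant, recs in grouped.items()
    }


# Collection ages are invented for illustration only.
records = [
    {"participant_id": "PT_00G007DM", "biospecimen_id": "BS_K07KNTFY", "age_at_collection_days": 100},
    {"participant_id": "PT_00G007DM", "biospecimen_id": "BS_AQMKA8NC", "age_at_collection_days": 400},
    {"participant_id": "PT_EXAMPLE", "biospecimen_id": "BS_EXAMPLE", "age_at_collection_days": 250},
]

earliest = select_one_per_participant(records, which="earliest")
latest = select_one_per_participant(records, which="latest")
```

Running each analysis on both `earliest` and `latest` would be one way to implement the "do both" option; a shared helper like this would at least make every analysis apply the same selection rule.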