Generally high tumor proportion from TCGA data #28

Open
SBaek613 opened this issue Jun 16, 2022 · 1 comment

Comments

@SBaek613

SBaek613 commented Jun 16, 2022

Hi, again.

I was able to solve issues with running BayesPrism thanks to your help.

I have now been using both CIBERSORTx and BayesPrism to analyze various TCGA data with a single-cell matrix of my own.

The most striking difference between the two tools was that BayesPrism would end up with a very high proportion of tumor cells (70-90%), while CIBERSORTx usually gave 20-30% using the same samples and single-cell reference.

I have tried to re-scale the non-tumor cells by removing the tumor proportion and scaling each sample's remaining proportions to sum to 1. However, with other CD45- cells such as fibroblasts and endothelial cells present, I was unable to retrieve immune cell proportions, since most of the immune cell types had proportions of around 10^-6 to 10^-3. I could have removed all CD45- cell types, but with such low proportions of CD45+ cell types, there was too much fluctuation between samples.
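The re-scaling described above can be sketched as follows. This is only an illustration with NumPy: the function name `renormalize_without`, the cell type names, and the fraction values are all made up for the example and are not BayesPrism or CIBERSORTx output.

```python
import numpy as np

def renormalize_without(fractions, cell_types, drop):
    """Drop the named cell types and rescale each sample's remaining fractions to sum to 1."""
    keep = [i for i, name in enumerate(cell_types) if name not in drop]
    kept = np.asarray(fractions, dtype=float)[:, keep]
    totals = kept.sum(axis=1, keepdims=True)
    # Samples with nothing left after dropping are returned as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        rescaled = np.where(totals > 0, kept / totals, np.nan)
    return rescaled, [cell_types[i] for i in keep]

# Hypothetical fractions: rows are bulk samples, columns are cell types.
cell_types = ["tumor", "fibroblast", "endothelial", "T cell", "B cell"]
fractions = np.array([
    [0.80, 0.12, 0.06, 0.015, 0.005],
    [0.70, 0.20, 0.05, 0.040, 0.010],
])

# Remove tumor only, then remove tumor plus the CD45- stromal types.
non_tumor, non_tumor_names = renormalize_without(fractions, cell_types, {"tumor"})
immune, immune_names = renormalize_without(
    fractions, cell_types, {"tumor", "fibroblast", "endothelial"}
)
```

Note that when the immune fractions are very small to begin with, dividing by a small total amplifies any noise in them, which is consistent with the between-sample fluctuation described above.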

While the actual tumor cell proportion might vary between samples and tumor types, I would think that the tumor proportion is probably not as high as ~80% but probably not as low as ~25%. In your paper I observed a similar pattern of high tumor cell proportions. I am curious about your interpretation of why different deconvolution tools give such a wide range of tumor cell proportions.

  • I am using fairly detailed cell type annotations for immune cells. Maybe that's why it was difficult to compare their proportions between tumor types (with many outliers and fluctuations)? I would appreciate any comments or general feedback. Thanks!
@tinyi
Collaborator

tinyi commented Jun 18, 2022

Thank you for your feedback. A few potential reasons are as follows.

First, the fraction inferred by BayesPrism represents the fraction of reads (rather than the cell count) of each cell type in each bulk. As a result, cells with a low total transcription level will have a lower fraction of reads. Tumor cells may have a higher amount of total transcription than other cells, such as T cells. This may contribute to the seemingly over-estimated tumor fractions. On the other hand, CIBERSORTx uses a signature matrix and performs deconvolution over the signature genes, so the fraction inferred by CIBERSORTx is over the selected signature genes, which may also cause the difference between the two methods. You may also try running BayesPrism over the signature genes selected by CIBERSORTx and then compare the results. That being said, when we compared BayesPrism and CIBERSORTx with the tumor purity estimated by other methods, including IHC, ABSOLUTE and ESTIMATE, we did not seem to detect systematic overestimation for the cancer types we tested (see Supplementary Fig. 2 of the paper).
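To make the difference between read fractions and cell fractions concrete, here is a minimal worked sketch. It assumes that each cell type's share of reads is proportional to (number of cells) x (total mRNA per cell). The per-cell mRNA ratios in `rna_per_cell` are hypothetical numbers chosen for illustration, not measured values, and the conversion is not something BayesPrism does for you.

```python
import numpy as np

def reads_to_cell_fractions(read_fractions, rna_per_cell):
    """Convert read fractions to cell-count fractions.

    Assumes reads_i is proportional to cells_i * rna_per_cell_i,
    so cells_i is proportional to reads_i / rna_per_cell_i.
    """
    weights = np.asarray(read_fractions, dtype=float) / np.asarray(rna_per_cell, dtype=float)
    return weights / weights.sum()

# Hypothetical example: order is [tumor, T cell].
read_fractions = [0.8, 0.2]   # fraction of reads per cell type
rna_per_cell = [4.0, 1.0]     # assumed: a tumor cell carries 4x the mRNA of a T cell

cell_fractions = reads_to_cell_fractions(read_fractions, rna_per_cell)
```

Under these assumed numbers, a sample in which 80% of reads come from tumor contains equal numbers of tumor cells and T cells, so a high read fraction does not by itself imply an equally high cell-count fraction.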

The second potential cause is that when there are too few non-tumor cells in the reference, non-tumor cells will have a sparser representation than tumor cells, so reads in the bulk will be assigned to tumor for those genes with zero expression in the non-tumor cells. We also observed similar effects in T cells of GBM (see Supplementary Fig. 1e of our BayesPrism paper). Under such circumstances, although the absolute fraction will be underestimated for cell types with too few cells, the relative fractions are still accurate. We recommend that users represent each cell type with a sufficient number of cells, say >20 or even >50.
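A quick way to check a reference against the cell-number recommendation above is to count the cells per annotated type. The sketch below uses only the Python standard library; the function name `underrepresented_types` and the label list are hypothetical, and in practice the labels would come from your own single-cell annotation.

```python
from collections import Counter

def underrepresented_types(labels, threshold=20):
    """Return cell types whose cell count does not exceed the threshold."""
    counts = Counter(labels)
    return {name: n for name, n in counts.items() if n <= threshold}

# Hypothetical per-cell annotations from a single-cell reference.
labels = ["tumor"] * 500 + ["T cell"] * 35 + ["pDC"] * 8

sparse_at_20 = underrepresented_types(labels, threshold=20)
sparse_at_50 = underrepresented_types(labels, threshold=50)
```

Cell types flagged here are candidates for merging with a related type or for adding more reference cells before deconvolution.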

The third reason might be related to the high granularity of the cell type definitions in your reference. In one spatial transcriptomics dataset we tested, when the reference cell types were too similar/co-linear, the quality of the reference, e.g. the number of cells representing each cell type, might have a higher impact, causing some cell types to be estimated close to zero (due to the weak/sparse prior). In fact, high co-linearity will also cause linear regression to be unstable (higher standard errors in the regression coefficients). If that is the case, users may merge the cell types to a granularity of higher confidence, or simply treat them as cell states, which will be summed up by BayesPrism.
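One rough way to spot reference cell types that are too similar is to correlate their mean expression profiles and list the highly correlated pairs as candidates for merging. The sketch below is an assumption-laden illustration: the function `collinear_pairs`, the 0.95 cutoff, the cell type names, and the toy expression values are all hypothetical, and Pearson correlation is just one possible similarity measure.

```python
import numpy as np
from itertools import combinations

def collinear_pairs(profiles, names, threshold=0.95):
    """Return pairs of cell types whose profiles have Pearson correlation above threshold."""
    corr = np.corrcoef(np.asarray(profiles, dtype=float))  # rows are cell types
    return [
        (names[i], names[j])
        for i, j in combinations(range(len(names)), 2)
        if corr[i, j] > threshold
    ]

# Toy mean expression profiles (rows: cell types, columns: genes).
names = ["CD8 Tem", "CD8 Trm", "fibroblast"]
profiles = [
    [10, 8, 0, 1, 5],
    [11, 8, 0, 1, 6],
    [0, 1, 12, 9, 2],
]

pairs = collinear_pairs(profiles, names)
```

Pairs reported this way could then be merged into one cell type, or kept as separate cell states of a single cell type, as suggested above.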

I hope this clarifies things. Let me know if there are any other questions.

Best,

Tinyi
