Skip to content

Seurat FindMarkers calculates wrong log2 fold change #6202

Closed
@yimingsun12138

Description

@yimingsun12138

related issue: #4178

I discovered great difference between log2fc calculated by Seurat FindMarkers function and the script I wrote myself. Usually, the log2fc is underestimated as mentioned in issue #4178.

I didn't find the source code of FindMarkers function, but I guess you use exp install of expm1, or add the pseudocount 1 when convert the log1p transformed data to depth normalized counts. scRNA-seq expression matrix is usually sparse and the pseudocount 1 may greatly change the original distribution.

See the case below, when i use exp to process the data slot, I get the exact log2fc calculated by FindMarkers. But if I use expm1 to process the data slot or just use the count slot, I get great larger log2fc (1.7 vs 2.8).

Also, I guess it is a better way to add the pseudocount: log2(sum(expm1(data)[gene,cells.1])+pseudocount) - log2(sum(expm1(data)[gene,cells.2])+pseudocount) +log2(cells.2 number) - log2(cells.1 number)

So, please check the log2fc calculated by FindMarkers. Also, I am not quite familiar with R, so I would be very grateful if you can show me the link to your FindMarkers source code.

Thank you very much, and thank you for your amazing work.
Yiming

> #load data
> setwd('/data/User/sunym/temp/')
> .libPaths('/data/User/sunym/software/R_lib/R_4.1.3/')
> library(Seurat)
Attaching SeuratObject
Attaching sp
> library(SeuratData)
── Installed datasets ────────────────────────────────────────────────────────────────────────────────────── SeuratData v0.2.2 ──
✔ ifnb         3.1.0pbmc3k       3.1.4panc8        3.0.2pbmcMultiome 0.1.2

────────────────────────────────────────────────────────────── Key ──────────────────────────────────────────────────────────────
✔ Dataset loaded successfullyDataset built with a newer version of Seurat than installedUnknown version of Seurat installed

> data('pbmc3k')
> 
> #normalize
> pbmc3k <- NormalizeData(object = pbmc3k,assay = 'RNA',normalization.method = 'LogNormalize',scale.factor = 10000)
Performing log-normalization
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> 
> #subset
> temp_T <- pbmc3k[,pbmc3k$seurat_annotations == 'CD8 T']
> temp_NK <- pbmc3k[,pbmc3k$seurat_annotations == 'NK']
> 
> #find marker using Seurat findmarkers
> FindMarkers(object = pbmc3k,ident.1 = 'CD8 T',ident.2 = 'NK',group.by = 'seurat_annotations',assay = 'RNA',slot = 'data',features = 'CD8A')
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s  
            p_val avg_log2FC pct.1 pct.2    p_val_adj
CD8A 9.280891e-18   1.755037 0.506 0.084 1.272781e-13
> 
> #the process I guess how seurat calculate log2fc
> log2(rowMeans(exp(temp_T@assays$RNA@data['CD8A',,drop = FALSE]))/rowMeans(exp(temp_NK@assays$RNA@data['CD8A',,drop = FALSE])))
    CD8A 
1.755037 
> 
> #calculate log2fc myself
> #use slot data
> log2(rowMeans(expm1(temp_T@assays$RNA@data['CD8A',,drop = FALSE]))/rowMeans(expm1(temp_NK@assays$RNA@data['CD8A',,drop = FALSE])))
    CD8A 
2.840186 
> #use slot count
> log2(mean(temp_T@assays$RNA@counts['CD8A',]/temp_T$nCount_RNA)/mean(temp_NK@assays$RNA@counts['CD8A',]/temp_NK$nCount_RNA))
[1] 2.840186

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions