Description
related issue: #4178
I discovered great difference between log2fc calculated by Seurat FindMarkers function and the script I wrote myself. Usually, the log2fc is underestimated as mentioned in issue #4178.
I didn't find the source code of FindMarkers function, but I guess you use exp install of expm1, or add the pseudocount 1 when convert the log1p transformed data to depth normalized counts. scRNA-seq expression matrix is usually sparse and the pseudocount 1 may greatly change the original distribution.
See the case below, when i use exp to process the data slot, I get the exact log2fc calculated by FindMarkers. But if I use expm1 to process the data slot or just use the count slot, I get great larger log2fc (1.7 vs 2.8).
Also, I guess it is a better way to add the pseudocount: log2(sum(expm1(data)[gene,cells.1])+pseudocount) - log2(sum(expm1(data)[gene,cells.2])+pseudocount) +log2(cells.2 number) - log2(cells.1 number)
So, please check the log2fc calculated by FindMarkers. Also, I am not quite familiar with R, so I would be very grateful if you can show me the link to your FindMarkers source code.
Thank you very much, and thank you for your amazing work.
Yiming
> #load data
> setwd('/data/User/sunym/temp/')
> .libPaths('/data/User/sunym/software/R_lib/R_4.1.3/')
> library(Seurat)
Attaching SeuratObject
Attaching sp
> library(SeuratData)
── Installed datasets ────────────────────────────────────────────────────────────────────────────────────── SeuratData v0.2.2 ──
✔ ifnb 3.1.0 ✔ pbmc3k 3.1.4
✔ panc8 3.0.2 ✔ pbmcMultiome 0.1.2
────────────────────────────────────────────────────────────── Key ──────────────────────────────────────────────────────────────
✔ Dataset loaded successfully
❯ Dataset built with a newer version of Seurat than installed
❓ Unknown version of Seurat installed
> data('pbmc3k')
>
> #normalize
> pbmc3k <- NormalizeData(object = pbmc3k,assay = 'RNA',normalization.method = 'LogNormalize',scale.factor = 10000)
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
>
> #subset
> temp_T <- pbmc3k[,pbmc3k$seurat_annotations == 'CD8 T']
> temp_NK <- pbmc3k[,pbmc3k$seurat_annotations == 'NK']
>
> #find marker using Seurat findmarkers
> FindMarkers(object = pbmc3k,ident.1 = 'CD8 T',ident.2 = 'NK',group.by = 'seurat_annotations',assay = 'RNA',slot = 'data',features = 'CD8A')
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s
p_val avg_log2FC pct.1 pct.2 p_val_adj
CD8A 9.280891e-18 1.755037 0.506 0.084 1.272781e-13
>
> #the process I guess how seurat calculate log2fc
> log2(rowMeans(exp(temp_T@assays$RNA@data['CD8A',,drop = FALSE]))/rowMeans(exp(temp_NK@assays$RNA@data['CD8A',,drop = FALSE])))
CD8A
1.755037
>
> #calculate log2fc myself
> #use slot data
> log2(rowMeans(expm1(temp_T@assays$RNA@data['CD8A',,drop = FALSE]))/rowMeans(expm1(temp_NK@assays$RNA@data['CD8A',,drop = FALSE])))
CD8A
2.840186
> #use slot count
> log2(mean(temp_T@assays$RNA@counts['CD8A',]/temp_T$nCount_RNA)/mean(temp_NK@assays$RNA@counts['CD8A',]/temp_NK$nCount_RNA))
[1] 2.840186