Accurate and fast cell marker gene identification with COSG
COSG is a cosine similarity-based method for more accurate and scalable marker gene identification.
- COSG is a general method for cell marker gene identification across different data modalities, e.g., scRNA-seq, scATAC-seq and spatially resolved transcriptome data.
- Marker genes or genomic regions identified by COSG are more indicative and have greater cell-type specificity.
- COSG is ultrafast for large-scale datasets, and is capable of identifying marker genes for one million cells in less than two minutes.
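To give a feel for what "cosine similarity-based" means here, the following base-R sketch computes the cosine similarity between each gene's expression vector and a one-hot indicator vector for each cell group. It is only an illustration of the underlying idea under simple assumptions: the toy matrix, the group labels and the helper `cosine_sim` are made up for this example, and the code is not COSG's actual scoring function.

```r
# Toy expression matrix: rows are genes, columns are cells (made-up values).
expr <- matrix(
  c(5, 4, 6, 0, 0, 0,   # geneA: expressed only in group "a"
    1, 0, 2, 3, 4, 2,   # geneB: expressed in both groups
    0, 0, 0, 2, 3, 1),  # geneC: expressed only in group "b"
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:6))
)
groups <- factor(c("a", "a", "a", "b", "b", "b"))

# Hypothetical helper: cosine similarity between two numeric vectors.
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# One-hot indicator matrix (cells x groups): 1 if the cell belongs to the group.
indicator <- sapply(levels(groups), function(g) as.numeric(groups == g))

# Genes x groups matrix of cosine similarities.
sim <- sapply(levels(groups), function(g) {
  apply(expr, 1, cosine_sim, y = indicator[, g])
})
print(round(sim, 3))
```

A gene expressed only in one group has a similarity close to 1 for that group and 0 for the other, whereas a gene expressed in both groups has intermediate values for each.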
The method and benchmarking results are described in Dai et al. (2022). The preprint is available on bioRxiv.
This is the R version of COSG; the Python version is hosted at https://github.com/genecell/COSG.
# install.packages('remotes')
remotes::install_github(repo = 'genecell/COSGR')
Please check out the vignette and the PBMC10K tutorial to get started.
suppressMessages(library(Seurat))
library(COSG)
data('pbmc_small',package='Seurat')
# Check cell groups:
table(Idents(pbmc_small))
#>
#> 0 1 2
#> 36 25 19
#######
# Run COSG:
marker_cosg <- cosg(
  pbmc_small,
  groups='all',
  assay='RNA',
  slot='data',
  mu=1,
  n_genes_user=100)
#######
# Check the marker genes:
head(marker_cosg$names)
#> 0 1 2
#> 1 CD7 S100A8 MS4A1
#> 2 CCL5 TYMP CD79A
#> 3 GNLY S100A9 TCL1A
#> 4 LAMP1 FCGRT NT5C
#> 5 GZMA IFITM3 CD79B
#> 6 LCK LST1 FCER2
head(marker_cosg$scores)
#> 0 1 2
#> 1 0.6391917 0.8954042 0.6922908
#> 2 0.6391267 0.8312083 0.5832425
#> 3 0.6328148 0.8120045 0.5757478
#> 4 0.6164937 0.7755955 0.5533107
#> 5 0.5846589 0.7413060 0.5163446
#> 6 0.5795238 0.7380483 0.5115180
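The printed output suggests that the result is a list with two data frames, names and scores, each with one column per cell group. Assuming that structure, the sketch below pairs genes with their scores for a single group. To stay self-contained, it works on a mock object (`marker_cosg_mock`) filled with the values printed above; `top_markers` is a hypothetical helper, not part of the package.

```r
# Mock of the result, assuming a list of two data frames with one column per
# group, as the printed output above suggests. Values are copied from that output.
marker_cosg_mock <- list(
  names = data.frame(
    `0` = c("CD7", "CCL5", "GNLY", "LAMP1", "GZMA", "LCK"),
    `1` = c("S100A8", "TYMP", "S100A9", "FCGRT", "IFITM3", "LST1"),
    `2` = c("MS4A1", "CD79A", "TCL1A", "NT5C", "CD79B", "FCER2"),
    check.names = FALSE, stringsAsFactors = FALSE
  ),
  scores = data.frame(
    `0` = c(0.6391917, 0.6391267, 0.6328148, 0.6164937, 0.5846589, 0.5795238),
    `1` = c(0.8954042, 0.8312083, 0.8120045, 0.7755955, 0.7413060, 0.7380483),
    `2` = c(0.6922908, 0.5832425, 0.5757478, 0.5533107, 0.5163446, 0.5115180),
    check.names = FALSE
  )
)

# Hypothetical helper: top n genes and their scores for one group, in long format.
top_markers <- function(res, group, n = 3) {
  data.frame(
    group = group,
    gene  = head(res$names[[group]], n),
    score = head(res$scores[[group]], n),
    stringsAsFactors = FALSE
  )
}

top1 <- top_markers(marker_cosg_mock, "1", n = 3)
print(top1)
```

With a real result, the same helper could be called as top_markers(marker_cosg, '1'), provided the returned object has the structure assumed here.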
#######
# Run COSG for selected groups, i.e., '0' and '2':
marker_cosg <- cosg(
  pbmc_small,
  groups=c('0', '2'),
  assay='RNA',
  slot='data',
  mu=1,
  n_genes_user=100)
- If you would like to identify more specific marker genes, you could set mu to a larger value, such as mu=10 or mu=100.
- You could set the parameter remove_lowly_expressed to TRUE to exclude genes that are expressed at very low levels in the target cell group, and use the parameter expressed_pct to adjust the threshold on the percentage of expressing cells. For example:
# Here, seo is assumed to be a Seurat object containing a 'peaks' assay:
marker_region <- cosg(
  seo,
  groups='all',
  assay='peaks',
  slot='data',
  mu=100,
  n_genes_user=100,
  remove_lowly_expressed=TRUE,
  expressed_pct=0.1
)
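To illustrate the kind of filtering that remove_lowly_expressed and expressed_pct describe, the base-R sketch below computes the fraction of cells in a target group that express each feature and keeps only features above a threshold. It is a conceptual example on made-up data, not COSG's internal code; in particular, whether the package compares with >= or > is an assumption here.

```r
# Toy expression matrix for 20 cells of one target group (made-up values).
expr_target <- rbind(
  geneLow  = c(1, rep(0, 19)),          # expressed in 1 of 20 cells
  geneMid  = c(rep(2, 4), rep(0, 16)),  # expressed in 4 of 20 cells
  geneHigh = rep(3, 20)                 # expressed in all 20 cells
)

# Fraction of target-group cells with non-zero expression for each feature.
frac_expressed <- rowMeans(expr_target > 0)

# Same value as in the example above; the >= comparison is an assumption.
expressed_pct <- 0.1
keep <- frac_expressed >= expressed_pct

print(frac_expressed)
print(names(keep)[keep])
```

In this toy example, the feature expressed in only one of the twenty cells falls below the threshold and would be left out of the candidate markers for that group.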
If COSG is useful for your research, please consider citing Dai, M., Pei, X., Wang, X.-J., 2022. Accurate and fast cell marker gene identification with COSG. Brief. Bioinform. bbab579.