Commit fa143e0

Merge pull request #9 from CahanLab/pc_revamp
Pc revamp
2 parents 1aded0c + b5a158b commit fa143e0

25 files changed

+3095
-447
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@ inProgressFunctions/
33
build/
44
dist/
55
pySingleCellNet.egg-info/
6+
__pycache__/

.readthedocs.yaml

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
version: 2
2+
3+
sphinx:
4+
  configuration: docs/source/conf.py
5+
  fail_on_warning: false
6+
7+
python:
8+
  version: 3.8
9+
  install:
10+
    - requirements: docs/requirements.txt
11+
12+
submodules:
13+
  include: all
14+
15+
build:
16+
  image: latest
17+

README.md

Lines changed: 5 additions & 237 deletions
@@ -1,241 +1,9 @@
1-
21
[![Documentation Status](https://readthedocs.org/projects/pysinglecellnet/badge/?version=latest)](https://pysinglecellnet.readthedocs.io/en/latest/?badge=latest)
32

4-
# pySingleCellNet
5-
6-
### Introduction
7-
SingleCellNet (SCN) is a tool to perform 'cell typing', or classification of single cell RNA-Seq data. Two nice features of SCN are that it works (1) across species and (2) across platforms. See [the original paper](https://doi.org/10.1016/j.cels.2019.06.004) for more details. This repository contains the Python version of SCN. The [original code](https://github.com/pcahan1/SingleCellNet/) was written in R.
8-
9-
### Prerequisites
10-
11-
```python
12-
pip install pandas numpy scikit-learn scanpy statsmodels scipy matplotlib seaborn umap-learn
13-
```
14-
15-
### Installation
16-
17-
```python
18-
!pip install git+https://github.com/pcahan1/PySingleCellNet/
19-
```
20-
21-
#### Summary
22-
23-
Below is a brief tutorial that shows you how to use SCN. In this example, we train a classifier on mouse lung cells, assess its performance on held-out data, and then apply it to independent mouse lung data.
24-
25-
#### Training data
26-
SCN has to be trained on well-annotated reference data. In this example, we use data generated as part of the Tabula Muris Senis project. Specifically, we use the droplet lung data. We have compiled several other training data sets, listed below.
27-
28-
[Lung training data](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/adLung_TabSen_100920.h5ad)
29-
30-
#### Query data
31-
To illustrate how you might use SCN to perform cell typing, we apply it to another dataset from mouse lung:
32-
33-
>Angelidis I, Simon LM, Fernandez IE, Strunz M et al. An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics. Nat Commun 2019 Feb 27;10(1):963. PMID: 30814501
34-
35-
36-
[Query expression data](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/query/GSE124872_raw_counts_single_cell.mtx.gz) <- You will need to decompress this file prior to loading it.
37-
38-
[Query meta-data](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/query/GSE124872_Angelidis_2018_metadata.csv)
39-
40-
[Query gene list](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/query/genes.csv)
41-
42-
##### Initialize session
43-
44-
```python
45-
import pandas as pd
46-
import matplotlib
47-
import matplotlib.pyplot as plt
48-
import seaborn as sns
49-
import scanpy as sc
50-
import scipy as sp
51-
import numpy as np
52-
# Loompy is only needed if using loom files
53-
# import loompy
54-
import anndata
55-
56-
sc.settings.verbosity = 3
57-
sc.logging.print_header()
58-
59-
import pySingleCellNet as pySCN
60-
```
61-
62-
##### Load the training data.
63-
You should always start with the raw counts.
64-
```python
65-
adTrain = sc.read("adLung_TabSen_100920.h5ad")
66-
adTrain
67-
# AnnData object with n_obs × n_vars = 14813 × 21969 ...
68-
69-
```
70-
71-
##### Load the query data.
72-
You should always start with the raw counts. If your expression is stored as a numpy array, you can convert it with the check_adX(adata) function.
73-
```python
74-
qDatT = sc.read_mtx("GSE124872_raw_counts_single_cell.mtx")
75-
qDat = qDatT.T
76-
77-
genes = pd.read_csv("genes.csv")
78-
qDat.var_names = genes.x
79-
80-
qMeta = pd.read_csv("GSE124872_Angelidis_2018_metadata.csv")
81-
qMeta.columns.values[0] = "cellid"
82-
83-
qMeta.index = qMeta["cellid"]
84-
qDat.obs = qMeta.copy()
85-
86-
# If your expression data is stored as a numpy array, convert it
87-
# type(qDat.X)
88-
# <class 'numpy.ndarray'>
89-
# pySCN.check_adX(qDat)
90-
```
91-
92-
##### Find common genes
93-
When you train the classifier, you should ensure that the query data and the reference data are limited to a common set of genes. In this case, we also limit the query data to those cells with at least 500 genes.
94-
```python
95-
genesTrain = adTrain.var_names
96-
genesQuery = qDat.var_names
97-
98-
cgenes = genesTrain.intersection(genesQuery)
99-
len(cgenes)
100-
# 16543
101-
102-
adTrain1 = adTrain[:,cgenes]
103-
adQuery = qDat[:,cgenes].copy()
104-
adQuery = adQuery[adQuery.obs["nGene"]>=500,:].copy()
105-
adQuery
106-
# AnnData object with n_obs × n_vars = 4240 × 16543
107-
```
108-
109-
##### Split the reference data into training data and data held out for later assessment.
110-
Ideally, we would assess performance on independent data. The dLevel parameter indicates the label used to group cells into categories or classes. Set this argument as appropriate for your training data.
111-
```python
112-
expTrain, expVal = pySCN.splitCommonAnnData(adTrain1, ncells=200,dLevel="cell_ontology_class")
113-
```
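
As a rough illustration of what such a split does, the sketch below draws up to `ncells` cells per class for training and holds out the rest. It is not the `splitCommonAnnData` implementation: the helper `split_by_class`, the toy labels, and the assumption that `ncells` is applied per class are all hypothetical.

```python
import numpy as np
import pandas as pd


def split_by_class(labels: pd.Series, ncells: int, seed: int = 0):
    """Return (train_ids, heldout_ids), drawing up to `ncells` ids per class for training."""
    rng = np.random.default_rng(seed)
    train_ids = []
    # Group cell ids by their class label and sample within each group.
    for _, members in labels.groupby(labels):
        ids = members.index.to_numpy()
        take = min(ncells, len(ids))  # small classes contribute all their cells
        train_ids.extend(rng.choice(ids, size=take, replace=False))
    train_ids = pd.Index(train_ids)
    # Everything not drawn for training is held out for assessment.
    heldout_ids = labels.index.difference(train_ids)
    return train_ids, heldout_ids


# Toy labels standing in for adata.obs["cell_ontology_class"] (hypothetical values).
labels = pd.Series(
    ["B cell"] * 5 + ["T cell"] * 3,
    index=[f"cell{i}" for i in range(8)],
)
train_ids, heldout_ids = split_by_class(labels, ncells=2)
```

Sampling within each class keeps rare cell types represented in the training set, which is the usual motivation for a per-class cap.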
114-
115-
##### Train the classifier.
116-
This can take several minutes.
117-
```python
118-
[cgenesA, xpairs, tspRF] = pySCN.scn_train(expTrain, nTopGenes = 100, nRand = 100, nTrees = 1000 ,nTopGenePairs = 100, dLevel = "cell_ontology_class", stratify=True, limitToHVG=True)
119-
```
120-
121-
##### Classify the held-out data and visualize.
122-
Rows indicate class labels as defined in the dLevel argument. Columns represent cells, which are grouped by the class with the maximum score.
123-
``` python
124-
adVal = pySCN.scn_classify(expVal, cgenesA, xpairs, tspRF, nrand = 0)
125-
126-
ax = sc.pl.heatmap(adVal, adVal.var_names.values, groupby='SCN_class', cmap='viridis', dendrogram=False, swap_axes=True)
127-
```
128-
![png](md_img/HM_Val_Lung_100920.png)
129-
130-
131-
##### Determine how well the classifier predicts cell type of held out data and plot
132-
The assessment object holds other evaluation metrics including multiLogLoss, Kappa, and accuracy.
133-
```python
134-
assessment = pySCN.assess_comm(expTrain, adVal, resolution = 0.005, nRand = 0, dLevelSID = "cell", classTrain = "cell_ontology_class", classQuery = "cell_ontology_class")
135-
136-
pySCN.plot_PRs(assessment)
137-
plt.show()
138-
```
139-
140-
![png](md_img/PR_curves_Lung_100920.png)
141-
142-
143-
##### Classify the independent query data and visualize the results.
144-
The heatmap groups the cells according to the cell type with the maximum SCN classification score. Cells in the 'rand' SCN_class or category have a higher SCN score in the 'random' SCN_class than any cell type from the training data.
145-
```python
146-
adQlung = pySCN.scn_classify(adQuery, cgenesA, xpairs, tspRF, nrand = 0)
147-
148-
ax = sc.pl.heatmap(adQlung, adQlung.var_names.values, groupby='SCN_class', cmap='viridis', dendrogram=False, swap_axes=True)
149-
```
150-
151-
![png](md_img/HM_Val_Other_softmax_100920.png)
152-
153-
##### Visualize again.
154-
Now group cells according to the annotation provided by the associated study.
155-
```python
156-
ax = sc.pl.heatmap(adQlung, adQlung.var_names.values, groupby='celltype', cmap='viridis', dendrogram=False, swap_axes=True)
157-
```
158-
159-
![png](md_img/HM_Val_Other_100920.png)
160-
161-
##### Add the classification result to the query AnnData object
162-
We add the SCN scores as well as the softmax classification (i.e. a label corresponding to the cell type with the maximum SCN score -- this goes in adata.obs["SCN_class"]).
163-
```python
164-
pySCN.add_classRes(adQuery, adQlung)
165-
```
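
The labelling rule described above amounts to taking, for each cell, the class with the highest score. A minimal pandas sketch with made-up scores follows; the `scores` table and its values are hypothetical, and this is not how `add_classRes` is implemented.

```python
import pandas as pd

# Rows are cells, columns are classes; values are made-up classification scores.
scores = pd.DataFrame(
    {
        "B cell": [0.7, 0.1, 0.2],
        "T cell": [0.2, 0.8, 0.3],
        "rand": [0.1, 0.1, 0.5],
    },
    index=["cell1", "cell2", "cell3"],
)

# For each cell, pick the column (class) holding the maximum score.
scn_class = scores.idxmax(axis=1)
```

In this toy table the third cell scores highest in the `rand` column, so it would be labelled `rand` rather than any cell type from the training data.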
166-
167-
##### Now, you can run your typical Scanpy pipeline to find cell clusters.
168-
###### Normalization and PCA
169-
```python
170-
adM1Norm = adQuery.copy()
171-
sc.pp.filter_genes(adM1Norm, min_cells=5)
172-
sc.pp.normalize_per_cell(adM1Norm, counts_per_cell_after=1e4)
173-
sc.pp.log1p(adM1Norm)
174-
175-
sc.pp.highly_variable_genes(adM1Norm, min_mean=0.0125, max_mean=4, min_disp=0.5)
176-
177-
adM1Norm.raw = adM1Norm
178-
sc.pp.scale(adM1Norm, max_value=10)
179-
sc.tl.pca(adM1Norm, n_comps=100)
180-
181-
sc.pl.pca_variance_ratio(adM1Norm, 100)
182-
```
183-
184-
![png](md_img/pca_lung_101120.png)
185-
186-
187-
###### KNN, Leiden, UMAP
188-
```python
189-
npcs = 20
190-
sc.pp.neighbors(adM1Norm, n_neighbors=10, n_pcs=npcs)
191-
sc.tl.leiden(adM1Norm, resolution=0.1)
192-
sc.tl.umap(adM1Norm, min_dist=0.5)
193-
sc.pl.umap(adM1Norm, color=["leiden", "SCN_class"], alpha=.9, s=15, legend_loc='on data')
194-
```
195-
196-
![png](md_img/UMAP_Lung_Other_101120.png)
197-
198-
##### To plot a heatmap with the clustering information, you need to add this annotation to the AnnData object that is returned from scn_classify()
199-
```python
200-
adQlung.obs['leiden'] = adM1Norm.obs['leiden'].copy()
201-
adQlung.uns['leiden_colors'] = adM1Norm.uns['leiden_colors']
202-
ax = sc.pl.heatmap(adQlung, adQlung.var_names.values, groupby='leiden', cmap='viridis', dendrogram=False, swap_axes=True)
203-
```
204-
205-
![png](md_img/HM_Lung_Leiden_101120.png)
206-
207-
##### You can also overlay SCN score on UMAP embedding.
208-
```python
209-
sc.pl.umap(adM1Norm, color=["epithelial cell", "stromal cell", "B cell"], alpha=.9, s=15, legend_loc='on data', wspace=.3)
210-
```
211-
212-
![png](md_img/UMAP_Lung_SCN_101120.png)
213-
214-
##### Cross-species classification
215-
Cross-species classification depends on an ortholog table. [Mouse-to-human ortholog table](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/oTab.csv)
216-
217-
To use this, you need to convert the query gene symbols to the names of their orthologs in the species of the training data.
218-
```python
219-
oTab = pd.read_csv("oTab.csv")
220-
[adQuery,adTrain] = pySCN.csRenameOrth(adQuery, adTrain, oTab)
221-
````
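
Conceptually, the renaming maps each query gene symbol to its ortholog and drops genes that have none. The sketch below shows that idea on a toy table. The column names `human` and `mouse` and the gene lists are assumptions for illustration; they may not match `oTab.csv` or the behaviour of `csRenameOrth`.

```python
import pandas as pd

# Toy ortholog table (hypothetical column names and contents).
ortho = pd.DataFrame(
    {
        "human": ["ACTB", "GAPDH", "TP53"],
        "mouse": ["Actb", "Gapdh", "Trp53"],
    }
)

# Gene symbols of a hypothetical human query dataset.
query_genes = pd.Index(["ACTB", "TP53", "XIST"])

# Look-up from query-species symbol to training-species symbol.
mapping = ortho.set_index("human")["mouse"]

# Keep only genes that have an ortholog, then rename them.
kept = query_genes[query_genes.isin(mapping.index)]
renamed = pd.Index(mapping.loc[kept].to_numpy())
```

Genes without an entry in the table (here `XIST`) are dropped, so the query and training data end up sharing a single gene namespace.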
222-
223-
Then you can proceed with the same training and analysis steps as above, starting with the call to splitCommonAnnData.
224-
225-
226-
### Training data (currently only from Tabula Muris Senis)
227-
228-
1. [Bladder](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/adBladder_TabSen_101320.h5ad)
229-
2. [Fat](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/adFat_TabSen_101320.h5ad)
230-
3. [Heart](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/adHeart_TabSen_101320.h5ad)
231-
4. [Kidney](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/adKidney_TabSen_101320.h5ad)
232-
5. [Large Intestine](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/adL_Intestine_TabSen_101320.h5ad)
233-
6. [Lung](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/adLung_TabSen_100920.h5ad)
234-
7. [Mammary Gland](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/adMammary_Gland_TabSen_101320.h5ad)
235-
8. [Marrow](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/adMarrow_TabSen_101320.h5ad)
236-
9. [Pancreas](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/adPancreas_TabSen_101320.h5ad)
237-
10. [Skeletal Muscle](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/adSkel_Muscle_TabSen_101320.h5ad)
238-
11. [Skin](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/adSkin_TabSen_101320.h5ad)
239-
12. [Trachea](https://cnobjects.s3.amazonaws.com/singleCellNet/pySCN/training/adTrachea_TabSen_101320.h5ad)
3+
# PySingleCellNet: classify scRNAseq data in Python
4+
PySingleCellNet (PySCN) predicts the 'cell type' of query scRNA-seq data by random forest multi-class classification. See [Tan & Cahan 2019] for more details. PySCN includes functionality to aid in the analysis of engineered cell populations (i.e. cells derived via directed differentiation of pluripotent stem cells or via direct conversion).
2405

6+
[Tan & Cahan 2019]: https://doi.org/10.1016/j.cels.2019.06.004
7+
[github]: https://github.com/pcahan1/PySingleCellNet
8+
[original version]: https://github.com/pcahan1/PySingleCellNet
2419

docs/Makefile

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
# Minimal makefile for Sphinx documentation
2+
#
3+
4+
# You can set these variables from the command line, and also
5+
# from the environment for the first two.
6+
SPHINXOPTS ?=
7+
SPHINXBUILD ?= sphinx-build
8+
SOURCEDIR = source
9+
BUILDDIR = build
10+
11+
# Put it first so that "make" without argument is like "make help".
12+
help:
13+
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
14+
15+
.PHONY: help Makefile
16+
17+
# Catch-all target: route all unknown targets to Sphinx using the new
18+
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
19+
%: Makefile
20+
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

docs/make.bat

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
@ECHO OFF
2+
3+
pushd %~dp0
4+
5+
REM Command file for Sphinx documentation
6+
7+
if "%SPHINXBUILD%" == "" (
8+
set SPHINXBUILD=sphinx-build
9+
)
10+
set SOURCEDIR=source
11+
set BUILDDIR=build
12+
13+
%SPHINXBUILD% >NUL 2>NUL
14+
if errorlevel 9009 (
15+
echo.
16+
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
17+
echo.installed, then set the SPHINXBUILD environment variable to point
18+
echo.to the full path of the 'sphinx-build' executable. Alternatively you
19+
echo.may add the Sphinx directory to PATH.
20+
echo.
21+
echo.If you don't have Sphinx installed, grab it from
22+
echo.https://www.sphinx-doc.org/
23+
exit /b 1
24+
)
25+
26+
if "%1" == "" goto help
27+
28+
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
29+
goto end
30+
31+
:help
32+
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
33+
34+
:end
35+
popd

docs/requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
myst-nb
2+
sphinx-copybutton
3+
sphinx-book-theme
