-
Notifications
You must be signed in to change notification settings - Fork 5
Description
Dear all,
Thank you for this great tool. I am very interested in exploring the combination of milo and miloDE in a variety of datasets.
However, analyzing datasets of 200k cells and above, I just haven´t been able to make miloDE fast enough, or at times prevent it from crashing.
Therefore, I have been attempting to bring at least some functionality to python, using the pertpy-implementation of milo.
I am not computationally trained by background learning everything on the go, so I have questions regarding algorithms and implementation and would appreciate any help you can give me.
My first questions regards assign_neighbourhoods:
To my understanding, the function builds the milo-graph at k_init for sampling vertices, rebuilds the graph at k, and then refines the vertices with the vertices previously sampled at k_init. Are these steps not the same, but similar, to building a graph with miloR and running makeNhoods with graph-refinement?
I had significant issues calling RcppGreedySetCover with rpy2, so I use the greedy set cover approach from https://www.researchgate.net/publication/362015207_ML2R_Coding_Nuggets_Greedy_Set_Cover_with_Native_Python_Data_Types
import pertpy as pt
def assign_neighbourhoods(
adata,
k = 25,
reducedDim_name = "X_pca",
prop = 0.1,
):
milo = pt.tl.Milo()
mdata = milo.load(adata)
# from sklearn_ann.kneighbors.annoy import AnnoyTransformer
# transformer = AnnoyTransformer(n_neighbors=k)
sc.pp.neighbors(mdata["rna"], use_rep=reducedDim_name,# , transformer = transformer
n_neighbors = k
)
milo.make_nhoods(mdata["rna"], prop=prop)
# nhood_size = np.array(mdata["rna"].obsm["nhoods"].sum(0)).ravel()
# plt.hist(nhood_size, bins=100)
# plt.xlabel("# cells in nhood")
# plt.ylabel("# nhoods");
# plt.show()
return mdata
def greedySetCoverV2(U, S):
# build inverted index
I = {x : [i for i in S if x in S[i]] for x in U}
# pre-select necessary sets Sj into C
C = {j : S[j] for j in [I[x][0] for x in I if len(I[x]) == 1]}
U = U - set().union(*[C[j] for j in C])
while U:
j = max(S.keys(), key = lambda k: len(S[k] & U))
C[j] = S[j]
U = U - S[j]
return C
def _filter_neighbourhoods(mdata):
nhood_df = pd.DataFrame()
for nhood_idx in range(mdata["rna"].obsm["nhoods"].shape[1]):
cur_nhoods = mdata["rna"].obsm["nhoods"][:, nhood_idx].toarray().ravel()
cur_cells = adata.obs_names[cur_nhoods.astype(bool)]
cur_df = pd.DataFrame({
"set": nhood_idx,
"elements": cur_cells
})
nhood_df = pd.concat([nhood_df, cur_df], axis = 0)
S = []
for s in nhood_df.set.unique():
S.append(set(nhood_df.query('set == @s').elements.unique()))
S = dict(enumerate(S))
U = nhood_df.elements.unique()
U = set(U)
gsc_result = greedySetCoverV2(U, S)
nhs_to_use = list(gsc_result.keys())
nhs_to_use = np.asarray(nhs_to_use)
index_for_refined = adata.obs[adata.obs.nhood_ixs_refined == 1].index[nhs_to_use.astype(int)]
mdata["rna"].obs["nhood_ixs_refined"] = 0
mdata["rna"].obs.loc[index_for_refined, "nhood_ixs_refined"] = 1
mdata["rna"].obsm["nhoods"] = mdata["rna"].obsm["nhoods"][:, nhs_to_use.astype(int)]
As the python implementation does not include graph refinement, given that my assumptions above are true, my python implementation first builds a kNN-graph with scanpy/annoy. Then we run make_nhoods, which uses (non-graph) sampling for refined neighbourhoods as in the default miloR/milopy workflow. Refined sampling is (I think) performed on the adjacency matrix.
I am aware this is quite a deviation from the approach in R, but for the time being I would like to build upon that. Or do you think the non-graph refinement is ill-suited for this task? One thing I noticed when running the python implementation is, that neighbourhood sizes are a lot smaller when compared to your original package for the same values of k on the embryo dataset from the tutorial (about 10-fold, see example below). This also leads to a higher number neighbourhoods after filtering. But is this plausible?
I tested graph building in R and making the graph in python, so I found out that the higher neighbourhood sizes stem from the R implementation of buildGraph.
Is it sufficient to simply set k higher to achieve the desired neighbourhood sizes?
For calculating the AUC, pertpy provides an implementation of Augur. However, in your implementation, calculate_auc is called per neighbourhood with the parameter n_subsamples = 0. This is not possible in the python version. I do not fully understand what exactly this option does, except being a special case within Augur. Could you explain why n_subsamples = 0, or if other values might be possible?
Subsequently, I would post a feature request on the repository.
However, some tests I did suggest, that AUC values are generally high for most, if not all neighbourhoods when using python.
Then the DE testing per neighbourhood. For this, I call edgeR using an adapted version of the EdgeR-class in pertpy. Even parallelization worked well in my tests.
And finally, the spatial adjustment of p-values across neighbourhoods. Below I am posting my python-implementation of your code. This is my own translation, so I would appreciate any help to correct it.
import numpy as np
def get_weights(nhoods_x):
intersect_mat = np.dot(nhoods_x.T, nhoods_x)
t_connect = np.sum(intersect_mat, axis=1)
weights = 1 / t_connect
return weights
def _return_null_df():
null_df = pd.DataFrame(index = adata.var_names)
null_df["log_fc"] = np.nan
null_df["logCPM"] = np.nan
null_df["F"] = np.nan
null_df["p_value"] = np.nan
null_df["adj_p_value"] = np.nan
null_df = null_df.reset_index(names = ["variable"])
return null_df
def _run_edger(ad, design, contrast, model):
try:
mod = model(ad, design)
mod.fit()
res_df = mod._test_single_contrast(contrast)
return res_df
except:
return _return_null_df()
def de_stat_neighbourhoods(
mdata,
sample_col = "patient",
design = "~condition",
covariates = "condition",
contrast = None,
# subset_nhoods = stat_auc$Nhood[!is.na(stat_auc$auc)],
layout = "X_umap",
n_jobs = 1,
):
def get_cells_in_nhoods(mdata, nhood_ids):
'''
Get cells in neighbourhoods of interest '''
in_nhoods = np.array(mdata["rna"].obsm['nhoods'][:,nhood_ids.astype('int')].sum(1))
ad = mdata["rna"][in_nhoods.astype(bool), :].copy()
return ad
covs = covariates
from itertools import repeat
from joblib import Parallel, delayed
nhoods = mdata["rna"].obsm["nhoods"]
n_nhoods = nhoods.shape[1]
# n_nhoods = 100
# get generator with all nhoods
func = "sum"
layer = "counts"
all_nhoods = (get_cells_in_nhoods(mdata, nhood_ids = np.asarray([i])) for i in range(n_nhoods))
aggregated_nhoods = (
sc.get.aggregate(ad, by = [sample_col, covs], func = func, layer=layer)
for ad in all_nhoods
)
def _func_to_X(ad, func):
ad.X = ad.layers[func].copy()
return ad
aggregated_nhoods = (
_func_to_X(ad, func)
for ad in aggregated_nhoods
)
import diff_exp
if contrast is None:
from formulaic import model_matrix
mm = model_matrix("~condition", mdata["rna"].obs)
n_coefs = mm.shape[1]
contrast = np.zeros(n_coefs)
contrast[n_coefs-1] = 1
contrast = list(contrast)
del mm
args = zip(
aggregated_nhoods,
repeat(design),
repeat(contrast),
repeat(diff_exp.EdgeR)
)
start = time.time()
pval_df_list = Parallel(n_jobs=n_jobs)(delayed(_run_edger)(ad, design, contrast, model) for ad, design, contrast, model in args)
end = time.time()
print("edger done in")
print(end-start)
print("seconds")
# return pval_df_list
pval_by_nhood = pd.concat([df.set_index("variable")[["p_value"]] for df in pval_df_list], axis = 1)
FDR_across_genes = pd.concat([df.set_index("variable")[["adj_p_value"]] for df in pval_df_list], axis = 1)
logfc_nhoods = pd.concat([df.set_index("variable")[["log_fc"]] for df in pval_df_list], axis = 1)
print("Calculating FDR across nhoods")
n_genes = pval_by_nhood.shape[0]
FDR_across_nhoods = pval_by_nhood.copy()
for i in range(n_genes):
pvalues = pval_by_nhood.iloc[i, :].values
idx_not_nan = np.isnan(pvalues)
_weights = get_weights(mdata["rna"].obsm["nhoods"][:, ~idx_not_nan])
_weights = np.asarray(_weights)
_weights = _weights.ravel()
# o = order_matrix(pvalues)
pvalues = pvalues[~idx_not_nan]
o = np.argsort(pvalues)
weights = _weights[o]
pvalues = pvalues[o]
_weights
# weights = weights[o]
adjp = np.zeros(len(o))
adjp[o] = np.flip(np.minimum.accumulate(np.flip(np.sum(weights) * pvalues / np.cumsum(weights))))
adjp = np.minimum(adjp, 1)
FDR_across_nhoods.iloc[i, :] = np.nan
FDR_across_nhoods.iloc[i, ~idx_not_nan] = adjp
return pval_by_nhood, logfc_nhoods, FDR_across_genes, FDR_across_nhoods
I would appreciate any help to make this work, and thank you in advance.
Best,
max
Here an example using the miloDE tutorial data:
r('''
library(miloDE)
# library containing toy data
suppressMessages(library(MouseGastrulationData))
# analysis libraries
library(scuttle)
suppressMessages(library(miloR))
suppressMessages(library(uwot))
library(scran)
suppressMessages(library(dplyr))
library(reshape2)
# packages for gene cluster analysis
# library(scWGCNA)
#>
# suppressMessages(library(Seurat))
# plotting libraries
library(ggplot2)
library(viridis)
#> Loading required package: viridisLite
library(ggpubr)
''')
r('''
library(BiocParallel)
ncores = 4
mcparam = MulticoreParam(workers = ncores)
register(mcparam)
# load chimera Tal1
sce = suppressMessages(MouseGastrulationData::Tal1ChimeraData())
# downsample to few selected cell types
cts = c("Spinal cord" , "Haematoendothelial progenitors", "Endothelium" , "Blood progenitors 1" , "Blood progenitors 2")
sce = sce[, sce$celltype.mapped %in% cts]
# let's rename Haematoendothelial progenitors
sce$celltype.mapped[sce$celltype.mapped == "Haematoendothelial progenitors"] = "Haem. prog-s."
# delete row for tomato
sce = sce[!rownames(sce) == "tomato-td" , ]
# add logcounts
sce = logNormCounts(sce)
# update tomato field to be more interpretable
sce$tomato = sapply(sce$tomato , function(x) ifelse(isTRUE(x) , "Tal1_KO" , "WT"))
# for this exercise, we focus on 3000 highly variable genes (for computational efficiency)
dec.sce = modelGeneVar(sce)
hvg.genes = getTopHVGs(dec.sce, n = 3000)
sce = sce[hvg.genes , ]
# change rownames to symbol names
rowdata = as.data.frame(rowData(sce))
rownames(sce) = rowdata$SYMBOL
# add UMAPs
set.seed(32)
umaps = as.data.frame(uwot::umap(reducedDim(sce , "pca.corrected")))
# let's store UMAPs - we will use them for visualisation
reducedDim(sce , "UMAP") = umaps
# plot UMAPs, colored by cell types
umaps = cbind(as.data.frame(colData(sce)) , reducedDim(sce , "UMAP"))
names(EmbryoCelltypeColours)[names(EmbryoCelltypeColours) == "Haematoendothelial progenitors"] = "Haem. prog-s."
# cols_ct = EmbryoCelltypeColours[names(EmbryoCelltypeColours) %in% unique(umaps$celltype.mapped)]
''')
# _r_to_py is a self-written helper function for rpy conversion
counts = _r_to_py(r('counts(sce)'))
counts = counts.T
counts
obs = _r_to_py(r('as.data.frame(colData(sce))'))
var = _r_to_py(r('as.data.frame(rowData(sce))'))
X = _r_to_py(r('logcounts(sce)'))
X = X.T
pca_corrected = _r_to_py(r('reducedDim(sce, "pca.corrected")'))
X_umap = _r_to_py(r('reducedDim(sce, "UMAP")'))
adata = anndata.AnnData(
X = X,
obs = obs,
var = var
)
adata.layers["counts"] = counts
adata.obsm["pca_corrected"] = pca_corrected
adata.obsm["X_umap"] = X_umap
adata.obsm["X_umap"] = adata.obsm["X_umap"].values
sc.pl.umap(adata,
color = "celltype.mapped")
%%time
plot = True
LATENT_KEY = "pca_corrected"
k_grid = list(range(10, 40, 5))
# k_grid.append(100)
prop = 0.5
filtering = True
size_df = pd.DataFrame()
for k in k_grid:
print(k)
mdata = assign_neighbourhoods(adata, k = k, prop = prop)
if filtering:
print("filtering")
_filter_neighbourhoods(mdata)
nhood_size = np.array(mdata["rna"].obsm["nhoods"].sum(0)).ravel()
_size_df = pd.DataFrame(dict(nhood_size= nhood_size))
_size_df = _size_df.assign(k = str(k))
size_df = pd.concat([size_df, _size_df], axis = 0)
if plot:
sns.boxplot(
size_df,
x = "k",
y = "nhood_size",
palette = sns.color_palette("Paired", len(k_grid))
)
sns.despine()
%%time
pval, logfc, FDR_genes, FDR_nhoods = de_stat_neighbourhoods(
mdata,
sample_col = "sample",
design = "0+tomato",
covariates = "tomato",
contrast=[-1, 1],
n_jobs = 5
)
nhood_size = np.array(mdata["rna"].obsm["nhoods"].sum(0)).ravel()
mdata["milo"].var["Nhood_size"] = nhood_size
mdata["milo"].var["nhood_adjusted"] = np.sum(FDR_nhoods <= 0.05, axis = 0).values
mdata["milo"].var["gene_adjusted"] = np.sum(FDR_genes <= 0.05, axis = 0).values
sc.pl.embedding(mdata["milo"].T,
basis = "X_milo_graph",
size = mdata["milo"].var["Nhood_size"],
color = "nhood_adjusted"
)
sc.pl.embedding(mdata["milo"].T,
basis = "X_milo_graph",
size = mdata["milo"].var["Nhood_size"],
color = "gene_adjusted"
)
