
Benchmark on number of cells and runtime #3

Open
Rohit-Satyam opened this issue Oct 31, 2023 · 10 comments

@Rohit-Satyam

Hi @yjgeno @jamesjcai @guanxunli

Thanks for developing GenKI. I was excited to use it to knock out a gene from the Malaria Cell Atlas and see whether it gives better results, in terms of biology, than scTenifoldKnk. Since the MCA atlas is large (nearly 29K cells), I split it by cell stage and then tried running GenKI. The current run has 7K cells and uses 12 cores, but it's been an entire day and it seems to be stuck at the DataLoader step.

Do you have any benchmark of runtime against the number of cells?

@Rohit-Satyam
Author

Hi, it's been two days now and GenKI appears to be stuck!!

@yjgeno
Owner

yjgeno commented Nov 1, 2023 via email

@Rohit-Satyam
Author

Rohit-Satyam commented Dec 24, 2023

Hi @yjgeno, I went to the paper's supplementary material and saw that, according to the benchmark, a run should take around 200 minutes for up to 5K genes. I started a gene knockout run yesterday at 9 PM; it has been more than 12 hours and it's still running without any output, even though this time I used all available cores via multiprocessing.cpu_count() instead of the 12 CPUs I used previously. I have 5K genes and 3.8K cells, which shouldn't take this long. I also observe that scTenifoldKnk, the R package your lab developed, is much faster on the same data (it finishes a KO within 45 minutes to an hour).

```
use all the cells (3801) in data
build GRN
2023-12-23 20:55:23,942	INFO worker.py:1642 -- Started a local Ray instance.
ray init, using 112 CPUs
```

@yjgeno
Owner

yjgeno commented Jan 2, 2024

Hi @Rohit-Satyam, it looks like you used more than 100 CPUs, and I suspect the delay is caused by the interaction between them, which is handled by Ray. If you still have the problem, could you try reducing the CPU count to a lower number, for example 8, when initializing?
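(For readers hitting the same issue: a minimal sketch of that change, using the `n_cpus` argument of GenKI's `DataLoader` exactly as it appears in the full script posted later in this thread; the file name and gene here are placeholders.)

```python
from GenKI.preprocesing import build_adata  # note: module name is spelled this way in GenKI
from GenKI.dataLoader import DataLoader

adata = build_adata("your_data.h5ad")  # placeholder path

data_wrapper = DataLoader(
    adata,
    target_gene=["YOUR_GENE"],   # placeholder KO gene name
    target_cell=None,            # use all cells
    obs_label="ident",
    GRN_file_dir="GRNs",
    rebuild_GRN=True,            # pcNet GRN building is the multiprocessing-heavy step
    pcNet_name="pcNet_example",
    verbose=True,
    n_cpus=8,                    # cap the Ray workers at 8 instead of using every core
)
```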

@Rohit-Satyam
Author

Hi @yjgeno. I was using just 12 CPUs previously, but it still went on forever; that's when I decided to use all the CPUs. Besides, since there is no progress bar, it's difficult to know whether it is proceeding at all or is just stuck!!

@Rohit-Satyam
Author

@yjgeno I tried 8 CPUs as well, and it's been two days. Instead of running the code in a Jupyter notebook, I converted your code into an executable file, execute.py (see the attachment below), and ran it on the command line, but it is still running. I have tried the following numbers of CPUs: 120, 112, 12, and 8.

```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1141836 subudhak  35  15   43.6g 892200  97584 R 100.7   0.3 674:44.59 ray::pc_net_par
```

I am attaching execute.py, which is nothing but a command-line version of your notebook, along with the test data. It can be executed as `python execute.py --h5ad_path late_troph.h5ad --gene_id PF3D7-0420300 --result_name PF3D7-0420300_result` (a rough sketch of the wrapper follows the attachment link below).
test.zip
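(The attachment itself isn't reproduced here; a hypothetical sketch of such a wrapper, with the flag names taken from the command above and the GenKI calls and hyperparameters mirroring the full script posted later in this thread, might look like the following. The `obs_label` value is a placeholder.)

```python
# execute.py -- hypothetical command-line wrapper around the GenKI notebook workflow
import argparse

import GenKI as gk
from GenKI.preprocesing import build_adata
from GenKI.dataLoader import DataLoader
from GenKI.train import VGAE_trainer
from GenKI import utils

parser = argparse.ArgumentParser(description="Run a GenKI in-silico knockout from the command line.")
parser.add_argument("--h5ad_path", required=True, help="input .h5ad file")
parser.add_argument("--gene_id", required=True, help="gene to knock out")
parser.add_argument("--result_name", required=True, help="basename for the output CSV")
args = parser.parse_args()

adata = build_adata(args.h5ad_path)
data_wrapper = DataLoader(
    adata,
    target_gene=[args.gene_id],
    target_cell=None,              # use all cells
    obs_label="ident",             # placeholder: column in adata.obs holding cell labels
    GRN_file_dir="GRNs",
    rebuild_GRN=True,
    pcNet_name="pcNet_" + args.result_name,
    verbose=True,
    n_cpus=8,
)
data_wt = data_wrapper.load_data()
data_ko = data_wrapper.load_kodata()

# hyperparameters copied from the script posted later in this thread
sensei = VGAE_trainer(data_wt, epochs=2, lr=7e-2, log_dir=None, beta=1e-4, seed=1, verbose=False)
sensei.train()

z_mu_wt, z_std_wt = sensei.get_latent_vars(data_wt)
z_mu_ko, z_std_ko = sensei.get_latent_vars(data_ko)
dis = gk.utils.get_distance(z_mu_ko, z_std_ko, z_mu_wt, z_std_wt, by="KL")

res = utils.get_generank(data_wt, dis, rank=True)
res.to_csv(args.result_name + ".csv")
```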

@yjgeno
Owner

yjgeno commented Apr 14, 2024

@Rohit-Satyam Hi, have you finished your run? I didn't encounter any problems handling data at this scale. If you're still facing issues, feel free to send your data my way, and I'll take care of the run for you.

@LPH-BIG

LPH-BIG commented Aug 24, 2024

Hi @yjgeno @jamesjcai @guanxunli

I have the same runtime issue using GenKI. I have been using GenKI with the 10X PBMC3k scRNA-seq dataset, which contains 2,698 cells and 1,865 highly variable genes after filtering. I am attempting to simulate the knockout of a single gene. However, even after running the computation on 60 CPUs for an entire week, I have not yet obtained any results.
Here is the code I am using:
```python
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc

sc.settings.verbosity = 0

import GenKI as gk
from GenKI.preprocesing import build_adata
from GenKI.dataLoader import DataLoader
from GenKI.train import VGAE_trainer
from GenKI import utils

adata = build_adata("pbmc3k_10X_filtered_scaled.h5ad")

data_wrapper = DataLoader(
    adata,                        # AnnData object
    target_gene=["SLC19A1"],      # KO gene name
    target_cell=None,             # obs name for cell type; if None, use all cells
    obs_label="ident",            # column in adata.obs holding cell labels
    GRN_file_dir="GRNs",          # folder name for GRNs
    rebuild_GRN=True,             # whether to build the GRN with pcNet
    pcNet_name="pcNet_example",   # GRN file name
    verbose=True,                 # whether to be verbose
    n_cpus=60,                    # multiprocessing
)

data_wt = data_wrapper.load_data()
data_ko = data_wrapper.load_kodata()

hyperparams = {"epochs": 2,
               "lr": 7e-2,
               "beta": 1e-4,
               "seed": 1}
log_dir = None

sensei = VGAE_trainer(data_wt,
                      epochs=hyperparams["epochs"],
                      lr=hyperparams["lr"],
                      log_dir=log_dir,
                      beta=hyperparams["beta"],
                      seed=hyperparams["seed"],
                      verbose=False,
                      )
sensei.train()

z_mu_wt, z_std_wt = sensei.get_latent_vars(data_wt)
z_mu_ko, z_std_ko = sensei.get_latent_vars(data_ko)
dis = gk.utils.get_distance(z_mu_ko, z_std_ko, z_mu_wt, z_std_wt, by="KL")
print(dis.shape)

z_mu_wt = pd.DataFrame(z_mu_wt)
z_std_wt = pd.DataFrame(z_std_wt)

res_raw = utils.get_generank(data_wt, dis, rank=True)
res_raw.head()
res_raw.to_csv("pbmc_res_raw.csv")

null = sensei.pmt(data_ko, n=5, by="KL")
res = utils.get_generank(data_wt, dis, null, save_significant_as="gene_list_pbmc_5epoch")
res
res.to_csv("pbmc_res.csv")
```
The only output is `2024-08-15 07:31:34,189 INFO worker.py:1781 -- Started a local Ray instance`.

Could you please provide any guidance on whether this computation time is expected, or if there are any optimizations I can apply to speed up the process? Any help would be greatly appreciated.

Thank you for your assistance!

@yjgeno
Owner

yjgeno commented Sep 17, 2024

Hi @LPH-BIG, it seems that your single-cell dataset is relatively small, so the GRN calculation shouldn't be this slow. Could you try reducing the number of CPUs to something like `n_cpus = 4` and see if that improves performance? The multiprocessing might not be running correctly on your cluster.

And if you wish, you could share your data with me and I'll run it for you, provided it is not sensitive.
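(That is, in the script above the only change would be the `n_cpus` argument of `DataLoader`:)

```python
data_wrapper = DataLoader(
    adata,
    target_gene=["SLC19A1"],
    target_cell=None,
    obs_label="ident",
    GRN_file_dir="GRNs",
    rebuild_GRN=True,
    pcNet_name="pcNet_example",
    verbose=True,
    n_cpus=4,  # reduced from 60, per the suggestion above
)
```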

@Rohit-Satyam
Author

@LPH-BIG Were you able to resolve it using 4 CPUs?
