
Benchmark on number of cells and runtime #3

Open
Rohit-Satyam opened this issue Oct 31, 2023 · 10 comments

@Rohit-Satyam

Hi @yjgeno @jamesjcai @guanxunli

Thanks for developing GenKI. I was excited to use it to knock out a gene from the Malaria Cell Atlas and see whether it gives better results, in terms of biology, than scTenifoldKnk. Since the MCA atlas is large (nearly 29K cells), I split it by cell stage and then tried running GenKI. The current run has 7K cells and uses 12 cores, but it's been an entire day and it seems to be stuck at the DataLoader step.

Do you have any benchmark of runtime against the number of cells?

@Rohit-Satyam
Author

Hi, it's been two days now and GenKI appears to be stuck!!

@yjgeno
Owner

yjgeno commented Nov 1, 2023 via email

@Rohit-Satyam
Author

Rohit-Satyam commented Dec 24, 2023

Hi @yjgeno, I went to the paper's supplementary material and saw that, according to the benchmark, a run should take around 200 minutes for up to 5K genes. I started a gene knockout run yesterday at 9 PM; it has been more than 12 hours and it's still running without any output, even though this time I used all available cores via multiprocessing.cpu_count() instead of the 12 CPUs I used previously. I have 5K genes and 3.8K cells, which shouldn't take this long. I also observe that scTenifoldKnk, the R package your lab developed, is much faster on the same data (it finishes a KO within 45 minutes to an hour).

```
use all the cells (3801) in data
build GRN
2023-12-23 20:55:23,942	INFO worker.py:1642 -- Started a local Ray instance.
ray init, using 112 CPUs
```

@yjgeno
Owner

yjgeno commented Jan 2, 2024

Hi @Rohit-Satyam, it looks like you used more than 100 CPUs, and I suspect the delay is caused by the interaction between them, which is handled by Ray. If you still have the problem, could you try reducing the CPU count to a lower number, for example 8, when initializing?
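(For readers hitting the same issue: a minimal sketch of that change, using the `n_cpus` argument of GenKI's `DataLoader` exactly as it appears in the full script posted later in this thread; the file name and gene here are placeholders.)

```python
from GenKI.preprocesing import build_adata  # note: module name is spelled this way in GenKI
from GenKI.dataLoader import DataLoader

adata = build_adata("your_data.h5ad")  # placeholder path

data_wrapper = DataLoader(
    adata,
    target_gene=["YOUR_GENE"],   # placeholder KO gene name
    target_cell=None,            # use all cells
    obs_label="ident",
    GRN_file_dir="GRNs",
    rebuild_GRN=True,            # pcNet GRN building is the multiprocessing-heavy step
    pcNet_name="pcNet_example",
    verbose=True,
    n_cpus=8,                    # cap the Ray workers at 8 instead of using every core
)
```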

@Rohit-Satyam
Author

Hi @yjgeno. I was using just 12 CPUs previously, but it still went on forever; that's when I decided to use all the CPUs. Besides, since there is no progress bar, it's difficult to know whether it is proceeding at all or is just stuck!!

@Rohit-Satyam
Author

@yjgeno I tried 8 CPUs as well, and it's been two days. Instead of running the code in a Jupyter notebook, I converted your code into an executable file, execute.py (see the attachment below), and ran it on the command line, but it is still running. I have tried the following numbers of CPUs: 120, 112, 12, and 8.

```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1141836 subudhak  35  15   43.6g 892200  97584 R 100.7   0.3 674:44.59 ray::pc_net_par
```

I am attaching execute.py, which is nothing but a command-line version of your notebook, along with the test data. It can be executed as `python execute.py --h5ad_path late_troph.h5ad --gene_id PF3D7-0420300 --result_name PF3D7-0420300_result` (a rough sketch of the wrapper follows the attachment link below).
test.zip
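(The attachment itself isn't reproduced here; a hypothetical sketch of such a wrapper, with the flag names taken from the command above and the GenKI calls and hyperparameters mirroring the full script posted later in this thread, might look like the following. The `obs_label` value is a placeholder.)

```python
# execute.py -- hypothetical command-line wrapper around the GenKI notebook workflow
import argparse

import GenKI as gk
from GenKI.preprocesing import build_adata
from GenKI.dataLoader import DataLoader
from GenKI.train import VGAE_trainer
from GenKI import utils

parser = argparse.ArgumentParser(description="Run a GenKI in-silico knockout from the command line.")
parser.add_argument("--h5ad_path", required=True, help="input .h5ad file")
parser.add_argument("--gene_id", required=True, help="gene to knock out")
parser.add_argument("--result_name", required=True, help="basename for the output CSV")
args = parser.parse_args()

adata = build_adata(args.h5ad_path)
data_wrapper = DataLoader(
    adata,
    target_gene=[args.gene_id],
    target_cell=None,              # use all cells
    obs_label="ident",             # placeholder: column in adata.obs holding cell labels
    GRN_file_dir="GRNs",
    rebuild_GRN=True,
    pcNet_name="pcNet_" + args.result_name,
    verbose=True,
    n_cpus=8,
)
data_wt = data_wrapper.load_data()
data_ko = data_wrapper.load_kodata()

# hyperparameters copied from the script posted later in this thread
sensei = VGAE_trainer(data_wt, epochs=2, lr=7e-2, log_dir=None, beta=1e-4, seed=1, verbose=False)
sensei.train()

z_mu_wt, z_std_wt = sensei.get_latent_vars(data_wt)
z_mu_ko, z_std_ko = sensei.get_latent_vars(data_ko)
dis = gk.utils.get_distance(z_mu_ko, z_std_ko, z_mu_wt, z_std_wt, by="KL")

res = utils.get_generank(data_wt, dis, rank=True)
res.to_csv(args.result_name + ".csv")
```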

@yjgeno
Owner

yjgeno commented Apr 14, 2024

@Rohit-Satyam Hi, have you finished your run? I didn't encounter any problems handling data at this scale. If you're still facing issues, feel free to send your data my way, and I'll take care of the run for you.

@LPH-BIG

LPH-BIG commented Aug 24, 2024

Hi @yjgeno @jamesjcai @guanxunli

I have the same runtime issue using GenKI. I have been using GenKI with the 10X PBMC3k scRNA-seq dataset, which contains 2,698 cells and 1,865 highly variable genes after filtering. I am attempting to simulate the knockout of a single gene. However, even after running the computation on 60 CPUs for an entire week, I have not yet obtained any results.
Here is the code I am using:
```python
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc

sc.settings.verbosity = 0

import GenKI as gk
from GenKI.preprocesing import build_adata
from GenKI.dataLoader import DataLoader
from GenKI.train import VGAE_trainer
from GenKI import utils

adata = build_adata("pbmc3k_10X_filtered_scaled.h5ad")

data_wrapper = DataLoader(
    adata,                        # AnnData object
    target_gene=["SLC19A1"],      # KO gene name
    target_cell=None,             # obs name for cell type; if None, use all cells
    obs_label="ident",            # column in adata.obs holding cell labels
    GRN_file_dir="GRNs",          # folder name for GRNs
    rebuild_GRN=True,             # whether to build the GRN with pcNet
    pcNet_name="pcNet_example",   # GRN file name
    verbose=True,                 # whether to be verbose
    n_cpus=60,                    # multiprocessing
)

data_wt = data_wrapper.load_data()
data_ko = data_wrapper.load_kodata()

hyperparams = {"epochs": 2,
               "lr": 7e-2,
               "beta": 1e-4,
               "seed": 1}
log_dir = None

sensei = VGAE_trainer(data_wt,
                      epochs=hyperparams["epochs"],
                      lr=hyperparams["lr"],
                      log_dir=log_dir,
                      beta=hyperparams["beta"],
                      seed=hyperparams["seed"],
                      verbose=False,
                      )
sensei.train()

z_mu_wt, z_std_wt = sensei.get_latent_vars(data_wt)
z_mu_ko, z_std_ko = sensei.get_latent_vars(data_ko)
dis = gk.utils.get_distance(z_mu_ko, z_std_ko, z_mu_wt, z_std_wt, by="KL")
print(dis.shape)

z_mu_wt = pd.DataFrame(z_mu_wt)
z_std_wt = pd.DataFrame(z_std_wt)

res_raw = utils.get_generank(data_wt, dis, rank=True)
res_raw.head()
res_raw.to_csv("pbmc_res_raw.csv")

null = sensei.pmt(data_ko, n=5, by="KL")
res = utils.get_generank(data_wt, dis, null, save_significant_as="gene_list_pbmc_5epoch")
res
res.to_csv("pbmc_res.csv")
```
The only output is `2024-08-15 07:31:34,189 INFO worker.py:1781 -- Started a local Ray instance`.

Could you please provide any guidance on whether this computation time is expected, or if there are any optimizations I can apply to speed up the process? Any help would be greatly appreciated.

Thank you for your assistance!

@yjgeno
Owner

yjgeno commented Sep 17, 2024

Hi @LPH-BIG, it seems that your single-cell dataset is relatively small, so the GRN calculation shouldn't be this slow. Could you try reducing the number of CPUs to something like `n_cpus = 4` and see if that improves performance? The multiprocessing might not be running correctly on your cluster.

And if you wish, you could share your data with me and I'll run it for you, provided it is not sensitive.
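(That is, in the script above the only change would be the `n_cpus` argument of `DataLoader`:)

```python
data_wrapper = DataLoader(
    adata,
    target_gene=["SLC19A1"],
    target_cell=None,
    obs_label="ident",
    GRN_file_dir="GRNs",
    rebuild_GRN=True,
    pcNet_name="pcNet_example",
    verbose=True,
    n_cpus=4,  # reduced from 60, per the suggestion above
)
```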

@Rohit-Satyam
Author

@LPH-BIG Were you able to resolve it using 4 CPUs?
