Merged
Changes from 1 commit
75 commits
78954c0
preproc script
dorien-er Mar 6, 2024
9068e7a
preproc script
dorien-er Mar 6, 2024
dbe5204
tokenize and pad script
dorien-er Mar 6, 2024
89a9c6a
tokenize and pad script
dorien-er Mar 6, 2024
9e446f8
embedding script
dorien-er Mar 6, 2024
94dd10c
test resourcers and evaluation script
dorien-er Mar 11, 2024
3edf3c0
cross check gene set
dorien-er Mar 11, 2024
085cdc4
pad_tokenize module
dorien-er Mar 12, 2024
724427e
updat image
dorien-er Mar 12, 2024
f9aadfa
remove test resources, update inputs
dorien-er Mar 13, 2024
33c9ffe
use pytorch image
dorien-er Mar 13, 2024
0c6316d
remove integration component
dorien-er Mar 13, 2024
47f5dda
remove nvidia reqs
dorien-er Mar 13, 2024
9d2ffd0
Merge branch 'main' of github.com:openpipelines-bio/openpipeline into…
jakubmajercik Mar 15, 2024
0f74ebd
remove load_model option
dorien-er Mar 18, 2024
52fb38c
Fix retag for viash-hub not using correct namespace separator (#745)
DriesSchaumont Mar 15, 2024
accf980
CI - Build: Fix second occurance of namespace separator (#746)
DriesSchaumont Mar 15, 2024
b1dd6ce
script to download scgpt test data
dorien-er Mar 18, 2024
18db6d6
remove test resources script
dorien-er Mar 18, 2024
6c3fec0
adjust preprocessing script
dorien-er Mar 19, 2024
acd3600
add scgpt full preproc module
dorien-er Mar 19, 2024
3e31204
integration submodule
dorien-er Mar 19, 2024
b5d1970
integration submodule and add normalize_total flag
dorien-er Mar 19, 2024
ec326f8
add params
dorien-er Mar 19, 2024
2dddc1c
Merge pull request #751 from openpipelines-bio/scgpt-preprocessor
dorien-er Mar 19, 2024
dbb0ea5
Add script to download scgpt test resources (#750)
dorien-er Mar 20, 2024
adcd6f0
embedding module
dorien-er Mar 20, 2024
bd7a32f
Merge pull request #755 from openpipelines-bio/scgpt-dev
dorien-er Mar 20, 2024
154ef26
add unit tests
dorien-er Mar 21, 2024
a7e08bc
undo subsampling test data
dorien-er Mar 21, 2024
1fe1386
update tests
dorien-er Mar 22, 2024
bfea411
update tests
dorien-er Mar 22, 2024
4eee70b
update memory requirements
dorien-er Mar 22, 2024
283de5b
update tests
dorien-er Mar 22, 2024
0496513
update changelog
dorien-er Mar 22, 2024
21be79a
update component name
dorien-er Mar 22, 2024
b7587ee
fix tests, update changelog
dorien-er Mar 22, 2024
045126a
run tests on subsampled data
dorien-er Mar 22, 2024
cf9da6e
adjust shm size
dorien-er Mar 22, 2024
c72575a
update test
dorien-er Mar 22, 2024
779006a
update memory requirements nextflow
dorien-er Mar 22, 2024
1e12613
update test
dorien-er Mar 22, 2024
b460c17
update test
dorien-er Mar 22, 2024
3cb4682
update test
dorien-er Mar 22, 2024
992cae7
expand unit tests, update script with loggers and todo
dorien-er Mar 24, 2024
9ccc4a3
Add ATAC demux (#726)
VladimirShitov Mar 25, 2024
5a2822a
run tests with subsampled data
dorien-er Mar 25, 2024
ab9a182
use specific model input files instead of directory
dorien-er Mar 26, 2024
418687a
update test data
dorien-er Mar 26, 2024
41b60be
Remove muon as test dependency for concatenate_h5mu. (#773)
DriesSchaumont Mar 27, 2024
7ec3ba4
scGPT binning component (#765)
dorien-er Mar 28, 2024
5f2e092
Merge branch 'develop' into scgpt
DriesSchaumont Mar 28, 2024
9e1d35a
update embedding dependencies and gene name layer handling
dorien-er Mar 28, 2024
4e9d916
update input handling
dorien-er Apr 3, 2024
c11db8c
include dsbn logic
dorien-er Apr 4, 2024
e3faf4b
update unit tests
dorien-er Apr 4, 2024
0ba4e9c
update config
dorien-er Apr 4, 2024
350ba33
expand unit tests, fix dsbn
dorien-er Apr 5, 2024
86ff4ef
Update CHANGELOG.md
dorien-er Apr 19, 2024
8fb4a68
Update src/scgpt/embedding/config.vsh.yaml
dorien-er Apr 19, 2024
e12b2e4
update required, remove shared memory docker
dorien-er Apr 19, 2024
b6083f0
Merge branch 'scgpt' into embed
dorien-er Apr 19, 2024
5bec37a
Add scGPT padding and tokenization component (#754)
dorien-er Apr 19, 2024
832d754
enable gpu device option
dorien-er Apr 19, 2024
e0ee58c
update dsbn
dorien-er Apr 19, 2024
5d6ef32
Merge branch 'scgpt' into embed
dorien-er Apr 19, 2024
d224787
remove temporary, unused components
dorien-er Apr 19, 2024
c3e159a
update error messages, remove device param
dorien-er Apr 24, 2024
2daa6f6
remove dropout param
dorien-er Apr 24, 2024
6ddd7c1
fix typo
dorien-er Apr 24, 2024
0ae8cdb
fix typo
dorien-er Apr 24, 2024
a5977ea
Merge pull request #761 from openpipelines-bio/embed
dorien-er Apr 25, 2024
638ac50
Generate scgpt cross check genes module (#758)
jakubmajercik Apr 25, 2024
94b955c
Merge branch 'main' into scgpt
dorien-er Jun 14, 2024
49285b0
undo concat changes
dorien-er Jun 14, 2024
update unit tests
dorien-er committed Apr 4, 2024
commit e3faf4bdbe30f32fded338ef36251b718c0c0d33
19 changes: 12 additions & 7 deletions src/scgpt/embedding/config.vsh.yaml
@@ -42,27 +42,32 @@ functionality:
example: args.json
description: |
Path to model config file.
- name: "--input_obsm_gene_tokens"
- name: "--obsm_gene_tokens"
required: true
type: string
default: "gene_id_tokens"
description: |
The key of the .obsm array containing the gene token ids
example: values.pt
- name: "--input_obsm_tokenized_values"
- name: "--obsm_tokenized_values"
type: string
required: true
default: values_tokenized
description: |
The key of the .obsm array containing the count values of the tokenized genes
- name: "--input_obsm_padding_mask"
- name: "--obsm_padding_mask"
type: string
required: true
default: padding_mask
description: |
The key of the .obsm array containing the padding mask.
- name: "--input_var_gene_names"
- name: "--var_gene_names"
type: string
required: true
description: |
The name of the .var column containing gene names. When not provided, the .var index will be used.
- name: "--input_obs_batch_label"
- name: "--obs_batch_label"
required: true
type: string
description: |
The name of the adata.obs column containing the batch labels.
@@ -83,7 +88,7 @@ functionality:
choices: ["gzip", "lzf"]
description: |
The compression algorithm to use for the output h5mu file.
- name: "--embedding_layer_key"
- name: "--obsm_embeddings"
type: string
default: "X_scGPT"
required: false
@@ -113,7 +118,7 @@ functionality:
type: boolean
default: true
description: |
Apply domain-specific batch normalization
Apply domain-specific batch normalization. When set to True, 'obs_batch_label' must be set as well.
- name: "--batch_size"
type: integer
default: 64
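
The renames in this file drop the `input_` prefix and put the AnnData slot (`obsm_`, `var_`, `obs_`) at the front of each argument name. For callers that still hold a parameter dictionary with the old keys, a minimal migration sketch could look like the following. `RENAMED_ARGS` and `migrate_par` are hypothetical names that are not part of this PR; only the old and new argument names are taken from the diff.

```python
from typing import Any, Dict

# Old -> new argument names, as they appear in the config diff.
RENAMED_ARGS: Dict[str, str] = {
    "input_obsm_gene_tokens": "obsm_gene_tokens",
    "input_obsm_tokenized_values": "obsm_tokenized_values",
    "input_obsm_padding_mask": "obsm_padding_mask",
    "input_var_gene_names": "var_gene_names",
    "input_obs_batch_label": "obs_batch_label",
    "embedding_layer_key": "obsm_embeddings",
}


def migrate_par(par: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of `par` with old keys renamed."""
    migrated: Dict[str, Any] = {}
    for key, value in par.items():
        new_key = RENAMED_ARGS.get(key, key)  # unknown keys pass through unchanged
        if new_key in migrated:
            # Both the old and the new spelling were supplied; refuse to guess.
            raise KeyError(f"Argument '{new_key}' was given twice")
        migrated[new_key] = value
    return migrated


old_par = {
    "input_obsm_gene_tokens": "gene_id_tokens",
    "embedding_layer_key": "X_scGPT",
    "batch_size": 64,
}
new_par = migrate_par(old_par)
print(new_par)
```

Raising on a duplicate rather than letting one spelling silently win is a design choice of this sketch, not something the component itself does.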
36 changes: 15 additions & 21 deletions src/scgpt/embedding/script.py
@@ -9,29 +9,23 @@
## VIASH START
par = {
"input": "resources_test/scgpt/test_resources/Kim2020_Lung_tokenized.h5mu",
"input_obsm_gene_tokens": 'gene_id_tokens',
"input_obsm_tokenized_values": 'values_tokenized',
"input_obsm_padding_mask": 'padding_mask',
"obsm_gene_tokens": 'gene_id_tokens',
"obsm_tokenized_values": 'values_tokenized',
"obsm_padding_mask": 'padding_mask',
"model": "resources_test/scgpt/source/best_model.pt",
"model_config": "resources_test/scgpt/source/args.json",
"model_vocab": "resources_test/scgpt/source/vocab.json",
"output": "Kim2020_Lung_embedded.h5ad",
"input_var_gene_names": "gene_name",
"input_obs_batch_label": "sample",
"embedding_layer_key": "X_scGPT",
"var_gene_names": "gene_name",
"obs_batch_label": "sample",
"obsm_embeddings": "X_scGPT",
"pad_token": "<pad>",
"pad_value": -2,
"batch_size": 64,
"modality": "rna",
"dropout": 0.2,
"GEPC": True,
"DSBN": True,
"n_input_bins": 51,
"ecs_threshold": 0.8,
"explicit_zero_prob": True,
"use_fast_transformer": False,
"pre_norm": False,
"batch_size": 64,
"output_compression": None
}
## VIASH END

@@ -71,16 +65,16 @@ def setup_logger():
input_adata = mdata.mod[par["modality"]]
adata = input_adata.copy()

all_gene_ids = adata.obsm[par["input_obsm_gene_tokens"]]
all_values = adata.obsm[par["input_obsm_tokenized_values"]]
padding_mask = adata.obsm[par["input_obsm_padding_mask"]]
all_gene_ids = adata.obsm[par["obsm_gene_tokens"]]
all_values = adata.obsm[par["obsm_tokenized_values"]]
padding_mask = adata.obsm[par["obsm_padding_mask"]]

# Fetch batch ids for domain-specific batch normalization
if par["DSBN"]:
if not par["input_obs_batch_label"]:
if not par["obs_batch_label"]:
raise ValueError("When DSBN is set to True, you are required to provide batch labels (obs_batch_label).")
else:
batch_id_cats = adata.obs[par["input_obs_batch_label"]].astype("category")
batch_id_cats = adata.obs[par["obs_batch_label"]].astype("category")
batch_id_labels = batch_id_cats.cat.codes.values
batch_ids = batch_id_labels.tolist()
batch_ids = np.array(batch_ids)
@@ -86,10 +86,10 @@ def setup_logger():
special_tokens = [pad_token, "<cls>", "<eoc>"]

# Fetching gene names
if not par["input_var_gene_names"]:
if not par["var_gene_names"]:
genes = adata.var.index.astype(str).tolist()
else:
genes = adata.var[par["input_var_gene_names"]].astype(str).tolist()
genes = adata.var[par["var_gene_names"]].astype(str).tolist()

logger.info("Loading model, vocab and configs")
# Model files
@@ -178,6 +172,6 @@ def setup_logger():

logger.info("Writing output data")
# Write output
adata.obsm[par["embedding_layer_key"]] = cell_embeddings
adata.obsm[par["obsm_embeddings"]] = cell_embeddings
mdata.mod[par["modality"]] = adata
mdata.write(par["output"], compression=par["output_compression"])
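
The DSBN branch in the script checks that a batch column was supplied and then turns the labels into integer codes through a pandas categorical. The sketch below reproduces that logic with the standard library only, so it runs without AnnData or pandas. It assumes string labels, for which sorting the unique values gives the same codes as `.astype("category").cat.codes`. The function `encode_batch_labels` and the plain-dict stand-in for `adata.obs` are illustrative assumptions, not code from this PR.

```python
from typing import Dict, List, Optional, Sequence


def encode_batch_labels(
    obs: Dict[str, Sequence[str]], batch_col: Optional[str], dsbn: bool
) -> Optional[List[int]]:
    """Return one integer batch id per cell, or None when DSBN is disabled.

    `obs` is a plain dict of columns standing in for `adata.obs` (assumption).
    """
    if not dsbn:
        return None
    if not batch_col:
        raise ValueError(
            "When DSBN is set to True, batch labels (obs_batch_label) are required."
        )
    if batch_col not in obs:
        raise KeyError(f"Column '{batch_col}' not found in .obs")
    labels = list(obs[batch_col])
    # Sorted unique labels play the role of the pandas categories.
    categories = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(categories)}
    return [code_of[label] for label in labels]


obs = {"sample": ["s2", "s1", "s2", "s3"]}
batch_ids = encode_batch_labels(obs, "sample", dsbn=True)
print(batch_ids)  # [1, 0, 1, 2]
```

The extra `KeyError` check for a missing column is an addition of this sketch; the script in the diff only checks that the parameter itself is set.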
8 changes: 4 additions & 4 deletions src/scgpt/embedding/test.py
@@ -123,10 +123,10 @@ def test_integration_embedding(run_component, tmp_path):
"--model", model_file,
"--model_vocab", vocab_file,
"--model_config", model_config_file,
"--input_obs_batch_label", "sample",
"--input_obsm_gene_tokens", "gene_id_tokens",
"--input_obsm_tokenized_values", "values_tokenized",
"--input_obsm_padding_mask", "padding_mask",
"--obs_batch_label", "sample",
"--obsm_gene_tokens", "gene_id_tokens",
"--obsm_tokenized_values", "values_tokenized",
"--obsm_padding_mask", "padding_mask",
"--output", output_embedding_file
])
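
The argument list passed to `run_component` is a flat sequence of `--name value` pairs using the renamed flags. One way to keep a test's flags in sync with a `par`-style dictionary is to generate the list from it, as in this sketch. `build_cli_args` is a hypothetical helper and is not part of the test file or of Viash; it only handles string-like values.

```python
from typing import Dict, List, Optional


def build_cli_args(par: Dict[str, Optional[str]]) -> List[str]:
    """Hypothetical helper: flatten {"name": value} into ["--name", "value", ...]."""
    args: List[str] = []
    for name, value in par.items():
        if value is None:
            continue  # unset optional arguments are left out entirely
        args.extend([f"--{name}", str(value)])
    return args


args = build_cli_args({
    "obs_batch_label": "sample",
    "obsm_gene_tokens": "gene_id_tokens",
    "obsm_tokenized_values": "values_tokenized",
    "obsm_padding_mask": "padding_mask",
    "var_gene_names": None,  # omitted, mirroring the test above
})
print(args)
```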
