Merged
Commits
239 commits
78954c0
preproc script
dorien-er Mar 6, 2024
9068e7a
preproc script
dorien-er Mar 6, 2024
dbe5204
tokenize and pad script
dorien-er Mar 6, 2024
89a9c6a
tokenize and pad script
dorien-er Mar 6, 2024
9e446f8
embedding script
dorien-er Mar 6, 2024
94dd10c
test resourcers and evaluation script
dorien-er Mar 11, 2024
3edf3c0
cross check gene set
dorien-er Mar 11, 2024
085cdc4
pad_tokenize module
dorien-er Mar 12, 2024
724427e
updat image
dorien-er Mar 12, 2024
f9aadfa
remove test resources, update inputs
dorien-er Mar 13, 2024
33c9ffe
use pytorch image
dorien-er Mar 13, 2024
0c6316d
remove integration component
dorien-er Mar 13, 2024
47f5dda
remove nvidia reqs
dorien-er Mar 13, 2024
9d2ffd0
Merge branch 'main' of github.com:openpipelines-bio/openpipeline into…
jakubmajercik Mar 15, 2024
0f74ebd
remove load_model option
dorien-er Mar 18, 2024
52fb38c
Fix retag for viash-hub not using correct namespace separator (#745)
DriesSchaumont Mar 15, 2024
accf980
CI - Build: Fix second occurance of namespace separator (#746)
DriesSchaumont Mar 15, 2024
b1dd6ce
script to download scgpt test data
dorien-er Mar 18, 2024
18db6d6
remove test resources script
dorien-er Mar 18, 2024
6c3fec0
adjust preprocessing script
dorien-er Mar 19, 2024
acd3600
add scgpt full preproc module
dorien-er Mar 19, 2024
3e31204
integration submodule
dorien-er Mar 19, 2024
b5d1970
integration submodule and add normalize_total flag
dorien-er Mar 19, 2024
ec326f8
add params
dorien-er Mar 19, 2024
2dddc1c
Merge pull request #751 from openpipelines-bio/scgpt-preprocessor
dorien-er Mar 19, 2024
dbb0ea5
Add script to download scgpt test resources (#750)
dorien-er Mar 20, 2024
adcd6f0
embedding module
dorien-er Mar 20, 2024
bd7a32f
Merge pull request #755 from openpipelines-bio/scgpt-dev
dorien-er Mar 20, 2024
154ef26
add unit tests
dorien-er Mar 21, 2024
a7e08bc
undo subsampling test data
dorien-er Mar 21, 2024
1fe1386
update tests
dorien-er Mar 22, 2024
bfea411
update tests
dorien-er Mar 22, 2024
4eee70b
update memory requirements
dorien-er Mar 22, 2024
283de5b
update tests
dorien-er Mar 22, 2024
0496513
update changelog
dorien-er Mar 22, 2024
21be79a
update component name
dorien-er Mar 22, 2024
b7587ee
fix tests, update changelog
dorien-er Mar 22, 2024
045126a
run tests on subsampled data
dorien-er Mar 22, 2024
cf9da6e
adjust shm size
dorien-er Mar 22, 2024
c72575a
update test
dorien-er Mar 22, 2024
779006a
update memory requirements nextflow
dorien-er Mar 22, 2024
1e12613
update test
dorien-er Mar 22, 2024
b460c17
update test
dorien-er Mar 22, 2024
3cb4682
update test
dorien-er Mar 22, 2024
992cae7
expand unit tests, update script with loggers and todo
dorien-er Mar 24, 2024
9ccc4a3
Add ATAC demux (#726)
VladimirShitov Mar 25, 2024
5a2822a
run tests with subsampled data
dorien-er Mar 25, 2024
ab9a182
use specific model input files instead of directory
dorien-er Mar 26, 2024
418687a
update test data
dorien-er Mar 26, 2024
41b60be
Remove muon as test dependency for concatenate_h5mu. (#773)
DriesSchaumont Mar 27, 2024
32446bb
minimal workflow
dorien-er Mar 27, 2024
57cbccd
subworkflow for cross-checking genes and binning
dorien-er Mar 27, 2024
ee58ddf
add required modules
dorien-er Mar 27, 2024
525cc41
add zero shot integration modules
dorien-er Mar 27, 2024
1239e02
update workflow
dorien-er Mar 27, 2024
7ec3ba4
scGPT binning component (#765)
dorien-er Mar 28, 2024
5f2e092
Merge branch 'develop' into scgpt
DriesSchaumont Mar 28, 2024
7491832
update tokenize pad dependencies and layer handling
dorien-er Mar 28, 2024
ab9a6ba
update embedding dependencies and gene name layer handling
dorien-er Mar 28, 2024
9e1d35a
update embedding dependencies and gene name layer handling
dorien-er Mar 28, 2024
299bd2f
update embedding dependencies and gene name layer handling
dorien-er Mar 28, 2024
8a8ea7a
update rna_scgpt nextflow.config
dorien-er Mar 28, 2024
a7e1353
explicitly set scanpy version
dorien-er Mar 29, 2024
3c7197a
update scgpt preproc
dorien-er Apr 2, 2024
cbc89e1
remove niceview
dorien-er Apr 2, 2024
4e9d916
update input handling
dorien-er Apr 3, 2024
a6d8d4e
update modules
dorien-er Apr 4, 2024
09cdb09
update input output handling
dorien-er Apr 4, 2024
8d2b9d1
expand integration pipeline
dorien-er Apr 4, 2024
8294fcf
add nextflow label directives
dorien-er Apr 4, 2024
c11db8c
include dsbn logic
dorien-er Apr 4, 2024
785bca0
include dsbn logic
dorien-er Apr 4, 2024
e3faf4b
update unit tests
dorien-er Apr 4, 2024
0ba4e9c
update config
dorien-er Apr 4, 2024
350ba33
expand unit tests, fix dsbn
dorien-er Apr 5, 2024
4b5cfa4
update parameters, temp workaround for troubleshoot
dorien-er Apr 8, 2024
6a5d457
temporary workaround test
dorien-er Apr 8, 2024
0e60d64
update workflow without new state output
dorien-er Apr 9, 2024
11ec0cd
retry workflow_output
dorien-er Apr 9, 2024
03969de
fetch workflow output from state
dorien-er Apr 9, 2024
98e1d60
remove temporary workflow workaround
dorien-er Apr 9, 2024
d86e602
basic integration test added
jakubmajercik Apr 12, 2024
1247039
tests pass + scgpt test resources updated
jakubmajercik Apr 12, 2024
b5e321e
add neighbors and umap to scgpt workflow
dorien-er Apr 16, 2024
1e3a858
Merge pull request #789 from openpipelines-bio/scgpt-integration-qc
dorien-er Apr 16, 2024
0f668ea
update descriptions to avoid shortcuts in launchpad
dorien-er Apr 17, 2024
592eeb4
add clustering to scgpt integration workflow
dorien-er Apr 17, 2024
7d1852b
add leiden clustering to scgpt integration workflow
dorien-er Apr 18, 2024
86ff4ef
Update CHANGELOG.md
dorien-er Apr 19, 2024
8fb4a68
Update src/scgpt/embedding/config.vsh.yaml
dorien-er Apr 19, 2024
e12b2e4
update required, remove shared memory docker
dorien-er Apr 19, 2024
b6083f0
Merge branch 'scgpt' into embed
dorien-er Apr 19, 2024
5bec37a
Add scGPT padding and tokenization component (#754)
dorien-er Apr 19, 2024
832d754
enable gpu device option
dorien-er Apr 19, 2024
e0ee58c
update dsbn
dorien-er Apr 19, 2024
5d6ef32
Merge branch 'scgpt' into embed
dorien-er Apr 19, 2024
d224787
remove temporary, unused components
dorien-er Apr 19, 2024
c3e159a
update error messages, remove device param
dorien-er Apr 24, 2024
2daa6f6
remove dropout param
dorien-er Apr 24, 2024
6ddd7c1
fix typo
dorien-er Apr 24, 2024
0ae8cdb
fix typo
dorien-er Apr 24, 2024
a5977ea
Merge pull request #761 from openpipelines-bio/embed
dorien-er Apr 25, 2024
f76eb03
remove temporary components
dorien-er Apr 25, 2024
e8227a2
Build(deps): Bump nf-core/setup-nextflow from 1.5.2 to 2.0.0 (#725)
dependabot[bot] Mar 1, 2024
444f654
Cellranger Multi: better test for absolute path (#727)
DriesSchaumont Mar 4, 2024
79bb4a1
grep_annotation_column: fix calculating fraction for observation with…
DriesSchaumont Mar 4, 2024
dd75b50
grep_annotation_column: fix fractions for sparse input data /w low bi…
DriesSchaumont Mar 5, 2024
3d4a072
Revert "Cellranger multi: better test for absolute path" (#732)
DriesSchaumont Mar 6, 2024
73527ef
Update nextflow resource labels for cellbender (#736)
DriesSchaumont Mar 6, 2024
9834714
Fix --output arguments in workflows (#740)
DriesSchaumont Mar 11, 2024
dac5a52
Prepare CHANGELOG for 1.0.0rc2 [ci skip]
DriesSchaumont Mar 11, 2024
45221d8
Use correct semver notation for 1.0.0-rc2 tag [ci skip].
DriesSchaumont Mar 11, 2024
493698e
rna_singlesample: fix 'obs_name_mitochondrial_fraction' (#743)
DriesSchaumont Mar 12, 2024
2401925
Remove unused CI step and improve input checks. (#744)
DriesSchaumont Mar 14, 2024
8ba8fdc
CI: fix ternary operator for concurrency groups
DriesSchaumont Mar 14, 2024
864e981
Update tests prebuilt asserters (#735)
jakubmajercik Mar 14, 2024
ed27333
Add t-SNE component (#742)
jakubmajercik Mar 14, 2024
2c6b6d4
Change namespace separator (#712)
rcannood Mar 15, 2024
5b0e04e
Trigger Build
DriesSchaumont Mar 15, 2024
f8ad965
Fix retag for viash-hub not using correct namespace separator (#745)
DriesSchaumont Mar 15, 2024
82413a6
CI - Build: Fix second occurance of namespace separator (#746)
DriesSchaumont Mar 15, 2024
f36dc8e
Add script to download scgpt test resources (#750)
dorien-er Mar 20, 2024
68800d3
script to download scgpt test data
dorien-er Mar 18, 2024
a198b79
remove test resources script
dorien-er Mar 18, 2024
e7359c6
Fix missing 'ps' in container images (#756)
DriesSchaumont Mar 20, 2024
c99b113
Fix publishing in process_samples and process_batches (#759)
DriesSchaumont Mar 21, 2024
9c2cacb
Update CI resources
DriesSchaumont Mar 21, 2024
a56e111
Typo
DriesSchaumont Mar 21, 2024
c80b8d5
Update CHANGELOG
DriesSchaumont Mar 21, 2024
98e7806
CI: Fix tests in release build [ci skip]
DriesSchaumont Mar 21, 2024
177de2a
Subset scGPT resources (#764)
dorien-er Mar 25, 2024
27556cc
Add ATAC demux (#726)
VladimirShitov Mar 25, 2024
218ae41
Remove muon as test dependency for concatenate_h5mu. (#773)
DriesSchaumont Mar 27, 2024
b57882d
scGPT binning component (#765)
dorien-er Mar 28, 2024
a3f4740
Add scGPT padding and tokenization component (#754)
dorien-er Apr 19, 2024
9becb83
embedding module
dorien-er Mar 20, 2024
2098509
add unit tests
dorien-er Mar 21, 2024
8314fd0
undo subsampling test data
dorien-er Mar 21, 2024
49c96c5
update tests
dorien-er Mar 22, 2024
cb597fc
update tests
dorien-er Mar 22, 2024
8094c75
update memory requirements
dorien-er Mar 22, 2024
a93fba0
update tests
dorien-er Mar 22, 2024
56b7938
update changelog
dorien-er Mar 22, 2024
61df081
update component name
dorien-er Mar 22, 2024
cdea404
fix tests, update changelog
dorien-er Mar 22, 2024
4c512ed
run tests on subsampled data
dorien-er Mar 22, 2024
ad9188a
adjust shm size
dorien-er Mar 22, 2024
00dea46
update test
dorien-er Mar 22, 2024
9929f02
update memory requirements nextflow
dorien-er Mar 22, 2024
09dbf0c
update test
dorien-er Mar 22, 2024
ba8b543
update test
dorien-er Mar 22, 2024
bb05d47
update test
dorien-er Mar 22, 2024
1a01803
expand unit tests, update script with loggers and todo
dorien-er Mar 24, 2024
42d07ec
run tests with subsampled data
dorien-er Mar 25, 2024
f0042c2
use specific model input files instead of directory
dorien-er Mar 26, 2024
a7f3c51
update test data
dorien-er Mar 26, 2024
cd0301a
update embedding dependencies and gene name layer handling
dorien-er Mar 28, 2024
8ffe9f5
update input handling
dorien-er Apr 3, 2024
d785de6
include dsbn logic
dorien-er Apr 4, 2024
8d6eb4d
update unit tests
dorien-er Apr 4, 2024
80252a6
update config
dorien-er Apr 4, 2024
0b984c6
expand unit tests, fix dsbn
dorien-er Apr 5, 2024
9cce61b
Update CHANGELOG.md
dorien-er Apr 19, 2024
07f3623
Update src/scgpt/embedding/config.vsh.yaml
dorien-er Apr 19, 2024
5c19bcc
update required, remove shared memory docker
dorien-er Apr 19, 2024
c568af1
enable gpu device option
dorien-er Apr 19, 2024
224e297
update dsbn
dorien-er Apr 19, 2024
75d47b8
update error messages, remove device param
dorien-er Apr 24, 2024
eefab85
remove dropout param
dorien-er Apr 24, 2024
14a70b0
fix typo
dorien-er Apr 24, 2024
4bb2f3f
fix typo
dorien-er Apr 24, 2024
ade7b09
rebase scgpt branch
dorien-er Mar 27, 2024
0366176
resolve merge conflicts
dorien-er Apr 2, 2024
638ac50
Generate scgpt cross check genes module (#758)
jakubmajercik Apr 25, 2024
078fc52
resolve merge conflicts
dorien-er Apr 4, 2024
43775bc
resolve merge conflicts
dorien-er Apr 4, 2024
ea65c80
resolve merge conflicts
dorien-er Apr 8, 2024
ffe6228
remove temporary components
dorien-er Apr 25, 2024
70d9677
resolve merge
dorien-er Apr 25, 2024
01c8926
remove resources test script
dorien-er Apr 25, 2024
78c96ac
Merge branch 'scgpt' into scgpt-integration
dorien-er Apr 25, 2024
a42f11c
remove changes
dorien-er Apr 25, 2024
2d04d3f
remove changes
dorien-er Apr 25, 2024
0cea2c2
remove outdated workflow
dorien-er Apr 25, 2024
79a53c0
make workflow more compact
dorien-er Apr 25, 2024
5d5288a
update broken configs
dorien-er Apr 25, 2024
6ae66c4
update parameter var gene names
dorien-er Apr 25, 2024
bd035a2
Merge branch 'scgpt-integration' into 784-integration-test-scgpt-inte…
dorien-er Apr 30, 2024
8095eb7
update nextflow labels
dorien-er Apr 30, 2024
211650c
fix dsbn typo
dorien-er Apr 30, 2024
bf64d29
Merge pull request #797 from openpipelines-bio/784-integration-test-s…
dorien-er Apr 30, 2024
e4b82f4
update integration test
dorien-er Apr 30, 2024
8821d99
fix small inconsistencies
dorien-er Apr 30, 2024
8754ac7
change test params to reduce resource requirements
dorien-er Apr 30, 2024
1c24ee8
view workflow output
dorien-er May 5, 2024
3ff5a8b
remove integration metrics
dorien-er May 5, 2024
8e5b442
remove integrationqc workflow
dorien-er May 5, 2024
f74e250
add niceviews to integration tests
dorien-er May 5, 2024
c8a57d3
fix typo
dorien-er May 5, 2024
611023a
fix integration test
dorien-er May 5, 2024
0c76878
update changelog
dorien-er May 6, 2024
2c3dc77
remove scgpt resources script
dorien-er May 6, 2024
179ac44
add scgpt resources script
dorien-er May 6, 2024
b849242
add scgpt resources script
dorien-er May 6, 2024
dd720d7
Update src/feature_annotation/highly_variable_features_scanpy/config.…
dorien-er May 8, 2024
746d9e6
Update src/filter/do_filter/config.vsh.yaml
dorien-er May 8, 2024
9e2a444
Update src/scgpt/binning/config.vsh.yaml
dorien-er May 8, 2024
f45f6ce
Update src/workflows/integration/scgpt_leiden/config.vsh.yaml
dorien-er May 8, 2024
f2ffbb3
pr fixes
dorien-er May 8, 2024
8e26cfe
work with single channel
dorien-er May 8, 2024
e86b0e2
update channels and remove niceview
dorien-er May 8, 2024
82b1cb0
add input layer to hvg
dorien-er May 8, 2024
6f8e0b9
add input layer to hvg
dorien-er May 8, 2024
cb89ac1
fix arguments in workflow
dorien-er May 8, 2024
28d1078
merge main into scgpt-integration
dorien-er Jun 14, 2024
4328b3a
reintroduce scgpt files
dorien-er Jun 14, 2024
a557850
update changelog
dorien-er Jun 14, 2024
6aa66b2
update changelog
dorien-er Jun 14, 2024
7823fea
remove resources test
dorien-er Jun 19, 2024
d1a7d32
add scgpt resources script
dorien-er Jun 19, 2024
731d019
add scgpt resources script
dorien-er Jun 19, 2024
d3e862f
Update CHANGELOG.md
dorien-er Jul 1, 2024
7ecdbb7
allow for finetuned models
dorien-er Jul 17, 2024
454a32e
enable finetuned models in workflow
dorien-er Jul 17, 2024
e84ff78
add variable for integrated obsm
dorien-er Jul 24, 2024
3f25c3b
Merge remote-tracking branch 'origin/main' into scgpt-integration
DriesSchaumont Aug 2, 2024
cd7410e
update authorship, fix parsing error
dorien-er Aug 14, 2024
9f06e34
remove compression
dorien-er Aug 15, 2024
21bcbf7
Apply suggestions from code review
dorien-er Aug 15, 2024
0742316
Merge branch 'main' into scgpt-integration
dorien-er Aug 15, 2024
9d0225e
fix unit test
dorien-er Aug 15, 2024
1a40003
fix unit test
dorien-er Aug 15, 2024
1cbe506
Merge remote-tracking branch 'origin/main' into scgpt-integration
DriesSchaumont Aug 23, 2024
8be4858
Merge branch 'main' into scgpt-integration
dorien-er Aug 30, 2024
c9fc39d
update to viash 9
dorien-er Aug 30, 2024
c2b11a2
fix test
dorien-er Aug 30, 2024
9871631
Update src/scgpt/embedding/script.py
dorien-er Sep 4, 2024
c194078
update integration_test.sh
dorien-er Sep 4, 2024
72df593
update integration test
dorien-er Sep 6, 2024
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -36,6 +36,8 @@

 * `scgpt/binning` component: Added a scGPT pre-processing binning component (PR #765).

+* `workflows/integration/scgpt_leiden` workflow: Added a workflow that integrates data with scGPT, followed by Leiden clustering (PR #794).
+
 * `transform/clr` component: Added the option to set the `axis` along which to apply CLR. Possible to override
   on workflow level as well (PR #767).

Empty file modified resources_test_scripts/scgpt.sh
100755 → 100644
Empty file.
50 changes: 29 additions & 21 deletions src/scgpt/embedding/config.vsh.yaml
@@ -25,27 +25,6 @@ functionality:
           type: string
           default: "rna"
           required: false
-        - name: "--model"
-          type: file
-          direction: input
-          required: true
-          example: best_model.pt
-          description: |
-            Path to scGPT model file.
-        - name: "--model_vocab"
-          type: file
-          direction: input
-          required: true
-          example: vocab.json
-          description: |
-            Path to scGPT model vocabulary file.
-        - name: "--model_config"
-          type: file
-          direction: input
-          required: true
-          example: args.json
-          description: |
-            Path to scGPT model config file.
         - name: "--obsm_gene_tokens"
           type: string
           default: "gene_id_tokens"
@@ -70,6 +49,35 @@ functionality:
           type: string
           description: |
             The name of the adata.obs column containing the batch labels. Must be provided when 'dsbn' is set to True.
+    - name: Model
+      arguments:
+        - name: "--model"
+          type: file
+          direction: input
+          required: true
+          example: best_model.pt
+          description: |
+            Path to scGPT model file.
+        - name: "--model_vocab"
+          type: file
+          direction: input
+          required: true
+          example: vocab.json
+          description: |
+            Path to scGPT model vocabulary file.
+        - name: "--model_config"
+          type: file
+          direction: input
+          required: true
+          example: args.json
+          description: |
+            Path to scGPT model config file.
+        - name: "--finetuned_checkpoints_key"
+          type: string
+          required: false
+          example: model_state_dict
+          description: |
+            Key in the model file containing the pretrained checkpoints. Only relevant for fine-tuned models.

     - name: Outputs
       arguments:
15 changes: 14 additions & 1 deletion src/scgpt/embedding/script.py
@@ -145,9 +145,22 @@ def setup_logger():
     pre_norm=False #TODO: Parametrize when GPU-based machine types are supported
 )

+
+logger.info("Loading model")
+model_file = par["model"]
+model_dict = torch.load(model_file, map_location=device)
+
+# Ensure the provided model has the correct architecture
+if par["finetuned_checkpoints_key"]:
+    if par["finetuned_checkpoints_key"] not in model_dict.keys():
+        finetuned_checkpoints_key = par["finetuned_checkpoints_key"]
+        raise KeyError(f"The key '{finetuned_checkpoints_key}' provided for '--finetuned_checkpoints_key' could not be found in the provided --model file. The finetuned model file for cell type annotation requires valid keys for the checkpoints and the label mapper.")
+    model_dict = model_dict[par["finetuned_checkpoints_key"]]
+
+# Load model
 load_pretrained(
     model,
-    torch.load(model_file, map_location=device),
+    model_dict,
     verbose=False
 )

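The checkpoint-selection logic in this hunk can be illustrated without PyTorch. The following is a minimal sketch, assuming the checkpoint has already been loaded into a plain dictionary; `select_state_dict` is a hypothetical helper name and the toy dictionaries are placeholders, not real model weights.

```python
from typing import Any, Dict, Optional


def select_state_dict(checkpoint: Dict[str, Any], key: Optional[str]) -> Dict[str, Any]:
    """Return the weights to load from an already-loaded checkpoint.

    Assumption: foundation-model files hold the state dict at the top level,
    while fine-tuned files nest it under `key`.
    """
    if not key:
        # No key given: treat the whole checkpoint as the state dict.
        return checkpoint
    if key not in checkpoint:
        raise KeyError(f"The key '{key}' could not be found in the model file.")
    return checkpoint[key]


# Toy stand-ins for the two layouts (values are placeholders, not tensors).
foundation = {"encoder.weight": [0.1, 0.2]}
finetuned = {"model_state_dict": foundation, "id_to_class": {0: "0"}}

assert select_state_dict(foundation, None) == foundation
assert select_state_dict(finetuned, "model_state_dict") == foundation
```

Failing early on a missing key, as the component does, gives a clearer error than letting the model loader complain about mismatched parameter names.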
72 changes: 72 additions & 0 deletions src/scgpt/embedding/test.py
@@ -1,5 +1,6 @@
 import pytest
 import subprocess
+import torch
 import re
 import sys
 import mudata as mu
@@ -20,10 +21,23 @@

 input = f"{meta['resources_dir']}/Kim2020_Lung_subset.h5mu"
 model_file = f"{meta['resources_dir']}/source/best_model.pt"
+ft_model = f'{meta["resources_dir"]}/ft_best_model.pt'
 vocab_file = f"{meta['resources_dir']}/source/vocab.json"
 model_config_file = f"{meta['resources_dir']}/source/args.json"
 input_file = mu.read(input)

+def scgpt_to_ft_scgpt(scgpt_path, ft_scgpt_path, state_dict_key, mapper_key):
+    f_model_dict = torch.load(scgpt_path, map_location="cpu")
+    model_dict = {}
+    model_dict[state_dict_key] = f_model_dict
+    model_dict[mapper_key] = {k: str(k) for k in range(15)}
+    torch.save(model_dict, ft_scgpt_path)
+
+# Convert foundation model into fine-tuned model architecture:
+# To be able to do a cell type label mapping, the model architecture needs to contain a class to label mapper dictionary
+scgpt_to_ft_scgpt(model_file, ft_model, "model_state_dict", "id_to_class")
+
+
 ## START TEMPORARY WORKAROUND DATA PREPROCESSING
 #TODO: Remove this workaround once full scGPT preprocessing workflow is implemented
 # Read in data
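The `scgpt_to_ft_scgpt` helper in this hunk nests the foundation-model weights under one key and adds a dummy class-to-label mapping under another. A minimal sketch of the resulting file layout follows; it uses `pickle` purely as a stand-in for `torch.save` / `torch.load`, and `wrap_as_finetuned` and the toy weights are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def wrap_as_finetuned(state_dict, state_dict_key, mapper_key, n_classes=15):
    """Nest a foundation-model state dict inside a fine-tuned-style checkpoint."""
    return {
        state_dict_key: state_dict,
        # Dummy class-id -> label mapping, mirroring the test helper.
        mapper_key: {k: str(k) for k in range(n_classes)},
    }


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ft_best_model.pkl"
    wrapped = wrap_as_finetuned(
        {"encoder.weight": [0.1, 0.2]}, "model_state_dict", "id_to_class"
    )
    # pickle is used here only as a stand-in for torch.save / torch.load.
    path.write_bytes(pickle.dumps(wrapped))
    reloaded = pickle.loads(path.read_bytes())

assert set(reloaded) == {"model_state_dict", "id_to_class"}
assert reloaded["id_to_class"][14] == "14"
```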
@@ -259,5 +273,63 @@ def test_integration_embedding_non_existing_keys(run_component, tmp_path):
         err.value.stdout.decode('utf-8'))


+def test_finetuned_model(run_component, tmp_path):
+    output_embedding_file = tmp_path / "Kim2020_Lung_subset_embedded.h5mu"
+
+    run_component([
+        "--input", tokenized_data_path,
+        "--modality", "rna",
+        "--model", ft_model,
+        "--model_vocab", vocab_file,
+        "--model_config", model_config_file,
+        "--dsbn", "True",
+        "--obs_batch_label", "sample",
+        "--obsm_gene_tokens", "gene_id_tokens",
+        "--obsm_tokenized_values", "values_tokenized",
+        "--obsm_padding_mask", "padding_mask",
+        "--finetuned_checkpoints_key", "model_state_dict",
+        "--output", output_embedding_file
+    ])
+
+    # Read output file
+    output_mdata = mu.read(output_embedding_file)
+    output_adata = output_mdata.mod["rna"]
+
+    # check that embedding obsm is present
+    assert 'X_scGPT' in output_adata.obsm.keys(), "X_scGPT is not present in anndata obsm keys"
+
+    # check embedding size
+    assert output_adata.obsm["X_scGPT"].shape[1] == 512, "Embedding size does not equal 512"
+
+    # check embedding value range
+    assert not all(np.isnan(output_adata.obsm["X_scGPT"][0])), "Embedding values are nan"
+    assert all([all(i > -1) & all(i < 1) for i in output_adata.obsm["X_scGPT"]]), "Range of embedding values is outside of [-1, 1]"
+
+
+def test_finetuned_model_architecture(run_component, tmp_path):
+    output_embedding_file = tmp_path / "Kim2020_Lung_subset_embedded.h5mu"
+
+    args = [
+        "--input", tokenized_data_path,
+        "--modality", "rna",
+        "--model", ft_model,
+        "--model_vocab", vocab_file,
+        "--model_config", model_config_file,
+        "--dsbn", "True",
+        "--obs_batch_label", "sample",
+        "--obsm_gene_tokens", "gene_id_tokens",
+        "--obsm_tokenized_values", "values_tokenized",
+        "--obsm_padding_mask", "padding_mask",
+        "--finetuned_checkpoints_key", "dummy_checkpoints_key",
+        "--output", output_embedding_file
+    ]
+
+    with pytest.raises(subprocess.CalledProcessError) as err:
+        run_component(args)
+    assert re.search(
+        r'KeyError: "The key \'dummy_checkpoints_key\' provided for \'--finetuned_checkpoints_key\' could not be found in the provided --model file. The finetuned model file for cell type annotation requires valid keys for the checkpoints and the label mapper."',
+        err.value.stdout.decode('utf-8'))
+
+
 if __name__ == '__main__':
     sys.exit(pytest.main([__file__]))
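The regular expression in `test_finetuned_model_architecture` expects the error message wrapped in double quotes. This follows from how Python renders a `KeyError`: its string form is the `repr` of the message, and a string that contains single quotes but no double quotes is shown in double quotes. A small sketch of that behaviour, using a shortened hypothetical message and `re.escape` (which the test itself does not use) to match it literally:

```python
import re

# Shortened, hypothetical message; the component's real message is longer.
message = "The key 'dummy_checkpoints_key' could not be found in the provided --model file."

# str(KeyError(msg)) is repr(msg), so the message appears quoted in tracebacks.
rendered = f"KeyError: {KeyError(message)}"

# re.escape makes '.', '-' and similar characters match literally.
pattern = re.escape(f'KeyError: "{message}"')

assert re.search(pattern, rendered) is not None
```

Escaping the pattern avoids accidental matches from the unescaped `.` characters in a hand-written regular expression.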
180 changes: 180 additions & 0 deletions src/workflows/integration/scgpt_leiden/config.vsh.yaml
@@ -0,0 +1,180 @@
+functionality:
+  name: "scgpt_leiden"
+  namespace: "workflows/integration"
+  description: "Run scGPT integration (cell embedding generation) followed by neighbour calculation, Leiden clustering and UMAP on the result."
+  authors:
+    - __merge__: /src/authors/dorien_roosen.yaml
+      roles: [ author, maintainer ]
+  argument_groups:
+    - name: "Inputs"
+      arguments:
+        - name: "--id"
+          required: true
+          type: string
+          description: ID of the sample.
+          example: foo
+        - name: "--input"
+          type: file
+          required: true
+          description: Path to the input file.
+          example: input.h5mu
+        - name: "--modality"
+          type: string
+          default: "rna"
+          required: false
+        - name: "--input_layer"
+          type: string
+          required: False
+          description: |
+            Mudata layer (key from layers) to use as input data for hvg subsetting and binning; if not specified, X is used.
+        - name: "--var_gene_names"
+          type: string
+          required: false
+          description: |
+            The name of the adata var column containing gene names; when not provided, the var index will be used.
+        - name: "--obs_batch_label"
+          type: string
+          description: |
+            The name of the adata obs column containing the batch labels.
+    - name: Model
+      arguments:
+        - name: "--model"
+          type: file
+          required: true
+          example: resources_test/scgpt/best_model.pt
+          description: |
+            Path to scGPT model file.
+        - name: "--model_vocab"
+          type: file
+          direction: input
+          required: true
+          example: resources_test/scgpt/vocab.json
+          description: |
+            Path to scGPT model vocabulary file.
+        - name: "--model_config"
+          type: file
+          direction: input
+          required: true
+          example: args.json
+          description: |
+            Path to scGPT model config file.
+        - name: "--finetuned_checkpoints_key"
+          type: string
+          required: false
+          example: model_state_dict
+          description: |
+            Key in the model file containing the pretrained checkpoints. Only relevant for fine-tuned models.
+    - name: "Outputs"
+      arguments:
+        - name: "--output"
+          type: file
+          required: true
+          direction: output
+          description: Output file path
+          example: output.h5mu
+        - name: "--obsm_integrated"
+          type: string
+          default: "X_scgpt"
+          required: false
+          description: "In which .obsm slot to store the resulting integrated embedding."
+        - name: "--output_compression"
+          type: string
+          example: "gzip"
+          required: false
+          choices: ["gzip", "lzf"]
+          description: |
+            The compression algorithm to use for the output h5mu file.
+    - name: "Padding arguments"
+      arguments:
+        - name: "--pad_token"
+          type: string
+          default: "<pad>"
+          required: false
+          description: |
+            Token used for padding.
+        - name: "--pad_value"
+          type: integer
+          default: -2
+          required: false
+          description: |
+            The value of the padding token.
+    - name: "HVG subset arguments"
+      arguments:
+        - name: "--n_hvg"
+          type: integer
+          default: 1200
+          description: |
+            Number of highly variable genes to subset for.
+    - name: "Tokenization arguments"
+      arguments:
+        - name: "--max_seq_len"
+          type: integer
+          required: false
+          description: |
+            The maximum sequence length of the tokenized data.
+    - name: "Embedding arguments"
+      arguments:
+        - name: --dsbn
+          type: boolean
+          default: true
+          description: |
+            Apply domain-specific batch normalization
+        - name: "--batch_size"
+          type: integer
+          default: 64
+          description: |
+            The batch size to be used for embedding inference.
+    - name: "Binning arguments"
+      arguments:
+        - name: "--n_input_bins"
+          type: integer
+          default: 51
+          required: False
+          min: 1
+          description: |
+            The number of bins to discretize the data into; when no value is provided, data won't be binned.
+        - name: "--seed"
+          type: integer
+          required: false
+          description: |
+            Seed for random number generation used for binning. If not set, no seed is used.
+    - name: "Clustering arguments"
+      arguments:
+        - name: "--leiden_resolution"
+          type: double
+          description: Control the coarseness of the clustering. Higher values lead to more clusters.
+          default: [1]
+          multiple: true
+
+  resources:
+    - type: nextflow_script
+      path: main.nf
+      entrypoint: run_wf
+
+  dependencies:
+    - name: scgpt/cross_check_genes
+    - name: scgpt/binning
+    - name: feature_annotation/highly_variable_features_scanpy
+    - name: filter/do_filter
+    - name: scgpt/pad_tokenize
+    - name: scgpt/embedding
+    - name: dimred/umap
+    - name: neighbors/find_neighbors
+    - name: cluster/leiden
+    - name: metadata/move_obsm_to_obs
+
+  test_resources:
+    - type: nextflow_script
+      path: test.nf
+      entrypoint: test_wf
+    - type: nextflow_script
+      path: test.nf
+      entrypoint: test_wf2
+    - path: /resources_test/scgpt
+
+platforms:
+  - type: nextflow
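To illustrate how the `--pad_token`, `--pad_value` and `--max_seq_len` arguments relate, here is a minimal sketch of padding one cell's gene tokens and binned values to a fixed length. It shows the general technique under stated assumptions and is not the `scgpt/pad_tokenize` implementation: `pad_cell`, the toy vocabulary, the simple truncation of long sequences, and the convention that `True` marks padded positions are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def pad_cell(
    genes: Sequence[str],
    values: Sequence[int],
    vocab: Dict[str, int],
    max_seq_len: int,
    pad_token: str = "<pad>",
    pad_value: int = -2,
) -> Tuple[List[int], List[int], List[bool]]:
    """Truncate or right-pad one cell to exactly max_seq_len entries."""
    # Assumption: over-long sequences are simply truncated.
    gene_ids = [vocab[g] for g in genes][:max_seq_len]
    vals = list(values)[:max_seq_len]
    n_pad = max_seq_len - len(gene_ids)
    gene_ids += [vocab[pad_token]] * n_pad
    vals += [pad_value] * n_pad
    # Assumption: True marks padded positions that a model should ignore.
    padding_mask = [False] * (max_seq_len - n_pad) + [True] * n_pad
    return gene_ids, vals, padding_mask


vocab = {"<pad>": 0, "GENE_A": 1, "GENE_B": 2}
ids, vals, mask = pad_cell(["GENE_A", "GENE_B"], [3, 7], vocab, max_seq_len=4)
assert ids == [1, 2, 0, 0]
assert vals == [3, 7, -2, -2]
assert mask == [False, False, True, True]
```

A negative padding value such as `-2` cannot collide with a bin index when bins are non-negative integers, which is presumably why the default lies outside the range of `--n_input_bins`.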