Merged
Commits
75 commits
78954c0
preproc script
dorien-er Mar 6, 2024
9068e7a
preproc script
dorien-er Mar 6, 2024
dbe5204
tokenize and pad script
dorien-er Mar 6, 2024
89a9c6a
tokenize and pad script
dorien-er Mar 6, 2024
9e446f8
embedding script
dorien-er Mar 6, 2024
94dd10c
test resourcers and evaluation script
dorien-er Mar 11, 2024
3edf3c0
cross check gene set
dorien-er Mar 11, 2024
085cdc4
pad_tokenize module
dorien-er Mar 12, 2024
724427e
updat image
dorien-er Mar 12, 2024
f9aadfa
remove test resources, update inputs
dorien-er Mar 13, 2024
33c9ffe
use pytorch image
dorien-er Mar 13, 2024
0c6316d
remove integration component
dorien-er Mar 13, 2024
47f5dda
remove nvidia reqs
dorien-er Mar 13, 2024
9d2ffd0
Merge branch 'main' of github.com:openpipelines-bio/openpipeline into…
jakubmajercik Mar 15, 2024
0f74ebd
remove load_model option
dorien-er Mar 18, 2024
52fb38c
Fix retag for viash-hub not using correct namespace separator (#745)
DriesSchaumont Mar 15, 2024
accf980
CI - Build: Fix second occurance of namespace separator (#746)
DriesSchaumont Mar 15, 2024
b1dd6ce
script to download scgpt test data
dorien-er Mar 18, 2024
18db6d6
remove test resources script
dorien-er Mar 18, 2024
6c3fec0
adjust preprocessing script
dorien-er Mar 19, 2024
acd3600
add scgpt full preproc module
dorien-er Mar 19, 2024
3e31204
integration submodule
dorien-er Mar 19, 2024
b5d1970
integration submodule and add normalize_total flag
dorien-er Mar 19, 2024
ec326f8
add params
dorien-er Mar 19, 2024
2dddc1c
Merge pull request #751 from openpipelines-bio/scgpt-preprocessor
dorien-er Mar 19, 2024
dbb0ea5
Add script to download scgpt test resources (#750)
dorien-er Mar 20, 2024
adcd6f0
embedding module
dorien-er Mar 20, 2024
bd7a32f
Merge pull request #755 from openpipelines-bio/scgpt-dev
dorien-er Mar 20, 2024
154ef26
add unit tests
dorien-er Mar 21, 2024
a7e08bc
undo subsampling test data
dorien-er Mar 21, 2024
1fe1386
update tests
dorien-er Mar 22, 2024
bfea411
update tests
dorien-er Mar 22, 2024
4eee70b
update memory requirements
dorien-er Mar 22, 2024
283de5b
update tests
dorien-er Mar 22, 2024
0496513
update changelog
dorien-er Mar 22, 2024
21be79a
update component name
dorien-er Mar 22, 2024
b7587ee
fix tests, update changelog
dorien-er Mar 22, 2024
045126a
run tests on subsampled data
dorien-er Mar 22, 2024
cf9da6e
adjust shm size
dorien-er Mar 22, 2024
c72575a
update test
dorien-er Mar 22, 2024
779006a
update memory requirements nextflow
dorien-er Mar 22, 2024
1e12613
update test
dorien-er Mar 22, 2024
b460c17
update test
dorien-er Mar 22, 2024
3cb4682
update test
dorien-er Mar 22, 2024
992cae7
expand unit tests, update script with loggers and todo
dorien-er Mar 24, 2024
9ccc4a3
Add ATAC demux (#726)
VladimirShitov Mar 25, 2024
5a2822a
run tests with subsampled data
dorien-er Mar 25, 2024
ab9a182
use specific model input files instead of directory
dorien-er Mar 26, 2024
418687a
update test data
dorien-er Mar 26, 2024
41b60be
Remove muon as test dependency for concatenate_h5mu. (#773)
DriesSchaumont Mar 27, 2024
7ec3ba4
scGPT binning component (#765)
dorien-er Mar 28, 2024
5f2e092
Merge branch 'develop' into scgpt
DriesSchaumont Mar 28, 2024
9e1d35a
update embedding dependencies and gene name layer handling
dorien-er Mar 28, 2024
4e9d916
update input handling
dorien-er Apr 3, 2024
c11db8c
include dsbn logic
dorien-er Apr 4, 2024
e3faf4b
update unit tests
dorien-er Apr 4, 2024
0ba4e9c
update config
dorien-er Apr 4, 2024
350ba33
expand unit tests, fix dsbn
dorien-er Apr 5, 2024
86ff4ef
Update CHANGELOG.md
dorien-er Apr 19, 2024
8fb4a68
Update src/scgpt/embedding/config.vsh.yaml
dorien-er Apr 19, 2024
e12b2e4
update required, remove shared memory docker
dorien-er Apr 19, 2024
b6083f0
Merge branch 'scgpt' into embed
dorien-er Apr 19, 2024
5bec37a
Add scGPT padding and tokenization component (#754)
dorien-er Apr 19, 2024
832d754
enable gpu device option
dorien-er Apr 19, 2024
e0ee58c
update dsbn
dorien-er Apr 19, 2024
5d6ef32
Merge branch 'scgpt' into embed
dorien-er Apr 19, 2024
d224787
remove temporary, unused components
dorien-er Apr 19, 2024
c3e159a
update error messages, remove device param
dorien-er Apr 24, 2024
2daa6f6
remove dropout param
dorien-er Apr 24, 2024
6ddd7c1
fix typo
dorien-er Apr 24, 2024
0ae8cdb
fix typo
dorien-er Apr 24, 2024
a5977ea
Merge pull request #761 from openpipelines-bio/embed
dorien-er Apr 25, 2024
638ac50
Generate scgpt cross check genes module (#758)
jakubmajercik Apr 25, 2024
94b955c
Merge branch 'main' into scgpt
dorien-er Jun 14, 2024
49285b0
undo concat changes
dorien-er Jun 14, 2024
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -18,6 +18,12 @@

* `reference/cellranger_mkgtf` component: Added cellranger mkgtf as a standalone component (PR #771).

* `scgpt/cross_check_genes` component: Added a gene-model cross check component for scGPT (PR #758).

* `scgpt/embedding` component: Added scGPT embedding component (PR #761).

* `scgpt/tokenize_pad` component: Added scGPT padding and tokenization component (PR #754).

* `scgpt/binning` component: Added a scGPT pre-processing binning component (PR #765).

## MINOR CHANGES
87 changes: 87 additions & 0 deletions src/scgpt/cross_check_genes/config.vsh.yaml
@@ -0,0 +1,87 @@
functionality:
name: cross_check_genes
namespace: "scgpt"
description: |
Cross-check genes with pre-trained scGPT model.
authors:
- __merge__: /src/authors/jakub_majercik.yaml
roles: [ maintainer, author ]
- __merge__: /src/authors/dorien_roosen.yaml
roles: [ maintainer, author ]

argument_groups:
- name: Inputs
arguments:
- name: "--input"
type: file
direction: input
required: true
example: input.h5mu
description: |
The input h5mu file containing pre-processed data.
- name: "--modality"
type: string
default: "rna"
required: false
description: |
The modality key of the MuData object containing the RNA AnnData object.
- name: "--vocab_file"
type: file
direction: input
required: true
example: resources_test/scgpt/vocab.json
description: |
Model vocabulary file path.
- name: "--input_var_gene_names"
type: string
example: "gene_name"
required: false
description: |
The name of the adata.var column containing gene names. By default the .var index will be used.
- name: Outputs
arguments:
- name: "--output"
type: file
direction: output
required: true
example: output.h5mu
description: |
The output h5mu file containing the cross-checked genes.
- name: "--output_compression"
type: string
choices: ["gzip", "lzf"]
required: false
example: "gzip"
- name: Arguments
arguments:
- name: "--pad_token"
type: string
default: "<pad>"
required: false
description: |
The padding token used in the model.
resources:
- type: python_script
path: script.py
- path: /src/utils/setup_logger.py
test_resources:
- type: python_script
path: test.py
- path: /resources_test/scgpt/test_resources/Kim2020_Lung_subset.h5mu
- path: /resources_test/scgpt/source/vocab.json

platforms:
- type: docker
image: nvcr.io/nvidia/pytorch:23.09-py3
setup:
- type: python
__merge__: [ /src/base/requirements/anndata_mudata.yaml, /src/base/requirements/scanpy.yaml, .]
- type: python
packages:
- scgpt==0.2.1
test_setup:
- type: python
__merge__: [ /src/base/requirements/python_test_setup.yaml, .]
- type: nextflow
directives:
label: [ lowmem, lowcpu ]
68 changes: 68 additions & 0 deletions src/scgpt/cross_check_genes/script.py
@@ -0,0 +1,68 @@
import mudata as mu
import numpy as np
from scgpt.tokenizer.gene_tokenizer import GeneVocab

## VIASH START
par = {
"input": "resources_test/scgpt/test_resources/Kim2020_Lung_subset.h5mu",
"output": "output.h5mu",
"modality": "rna",
"input_var_gene_names": None,
"pad_token": "<pad>",
"vocab_file": "resources_test/scgpt/source/vocab.json"
}
## VIASH END

# START TEMPORARY WORKAROUND setup_logger
# reason: resources aren't available when using Nextflow fusion
# from setup_logger import setup_logger
def setup_logger():
import logging
from sys import stdout

logger = logging.getLogger()
logger.setLevel(logging.INFO)
console_handler = logging.StreamHandler(stdout)
logFormatter = logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")
console_handler.setFormatter(logFormatter)
logger.addHandler(console_handler)

return logger
# END TEMPORARY WORKAROUND setup_logger
logger = setup_logger()
# Read in data
logger.info(f"Reading {par['input']}")
mudata = mu.read_h5mu(par["input"])
adata = mudata.mod[par["modality"]].copy()

pad_token = par["pad_token"]
special_tokens = [pad_token, "<cls>", "<eoc>"]

# Fetching gene names
if not par["input_var_gene_names"]:
genes = adata.var.index.astype(str).tolist()
elif par["input_var_gene_names"] not in adata.var.columns:
raise ValueError(f"Gene name column '{par['input_var_gene_names']}' not found in .mod['{par['modality']}'].var.")
else:
genes = adata.var[par["input_var_gene_names"]].astype(str).tolist()

# Cross-check genes with pre-trained model
logger.info(f"Loading model vocab from {par['vocab_file']}")
vocab_file = par["vocab_file"]
vocab = GeneVocab.from_file(vocab_file)
for token in special_tokens:
    if token not in vocab:
        vocab.append_token(token)

logger.info("Filtering genes based on model vocab")
adata.var["id_in_vocab"] = [1 if gene in vocab else -1 for gene in genes]

gene_ids_in_vocab = np.array(adata.var["id_in_vocab"])

logger.info("Subsetting input data based on genes present in model vocab")
adata = adata[:, adata.var["id_in_vocab"] >= 0]

mudata.mod[par["modality"]] = adata

logger.info(f"Writing to {par['output']}")
mudata.write_h5mu(par["output"], compression=par["output_compression"])
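The core of the cross-check step above — flagging genes by vocabulary membership and subsetting to the genes the model knows — can be sketched without scGPT or MuData. The vocabulary and gene names below are made up for illustration; the real component checks membership against a `GeneVocab` loaded from `vocab.json`:

```python
import numpy as np

# Hypothetical minimal vocabulary mimicking GeneVocab membership checks.
vocab = {"TP53": 0, "BRCA1": 1, "EGFR": 2, "<pad>": 3, "<cls>": 4, "<eoc>": 5}

genes = ["TP53", "FAKE1", "EGFR", "FAKE2"]

# 1 if the gene is in the vocab, -1 otherwise (same convention as the script above).
id_in_vocab = np.array([1 if g in vocab else -1 for g in genes])

# Keep only genes known to the model, mirroring adata[:, id_in_vocab >= 0].
kept = [g for g, flag in zip(genes, id_in_vocab) if flag >= 0]
print(kept)  # → ['TP53', 'EGFR']
```

In the component the same flags are stored in `adata.var["id_in_vocab"]`, which is what the unit tests below inspect.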
55 changes: 55 additions & 0 deletions src/scgpt/cross_check_genes/test.py
@@ -0,0 +1,55 @@
import pytest
import subprocess
from mudata import read_h5mu
import re
import sys

## VIASH START
meta = {
'executable': './target/docker/scgpt/cross_check_genes/cross_check_genes',
'resources_dir': './resources_test/scgpt/',
'config': './src/scgpt/cross_check_genes/config.vsh.yaml'
}
## VIASH END

input_path = meta["resources_dir"] + "Kim2020_Lung_subset.h5mu"
vocab_path = meta["resources_dir"] + "vocab.json"

def test_cross_check(run_component, random_path):
output_path = random_path(extension="h5mu")
args = [
"--input", input_path,
"--output", output_path,
"--modality", "rna",
"--vocab_file", vocab_path,
"--output_compression", "gzip"
]
run_component(args)

output_mudata = read_h5mu(output_path)
input_mudata = read_h5mu(input_path)

# Check added columns
assert {"gene_name", "id_in_vocab"}.issubset(set(output_mudata.mod["rna"].var.columns)), "Gene columns were not added."
# Check if genes were filtered
assert all(output_mudata.mod["rna"].var["id_in_vocab"] == 1), "Genes were not filtered."
# Check if number of observations is the same
assert output_mudata.mod["rna"].n_obs == input_mudata.mod["rna"].n_obs, "Number of observations changed."
assert output_mudata.n_obs == input_mudata.n_obs, "Number of observations changed."

def test_cross_check_invalid_gene_layer_raises(run_component, random_path):
output_path = random_path(extension="h5mu")
args = [
"--input", input_path,
"--output", output_path,
"--vocab_file", vocab_path,
"--input_var_gene_names", "dummy_var",
]

with pytest.raises(subprocess.CalledProcessError) as err:
run_component(args)
assert re.search(r"ValueError: Gene name column 'dummy_var' not found",
err.value.stdout.decode('utf-8'))

if __name__ == '__main__':
sys.exit(pytest.main([__file__]))
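The embedding component defined next consumes padded token ids and a padding mask (`--obsm_gene_tokens`, `--obsm_padding_mask`), which the `scgpt/tokenize_pad` component from the changelog produces. A toy sketch of that padding convention, with made-up token ids and a hypothetical pad id:

```python
import numpy as np

pad_id = 0  # hypothetical id of the "<pad>" token
seqs = [[5, 9, 3], [7, 2], [4]]  # per-cell gene token ids of unequal length

# Right-pad every sequence to the longest length in the batch.
max_len = max(len(s) for s in seqs)
tokens = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
for i, s in enumerate(seqs):
    tokens[i, : len(s)] = s

# True where a position is padding, so the model can ignore it.
padding_mask = tokens == pad_id

print(tokens.tolist())        # → [[5, 9, 3], [7, 2, 0], [4, 0, 0]]
print(padding_mask.tolist())  # → [[False, False, False], [False, False, True], [False, True, True]]
```

The actual component stores these arrays in `.obsm` of the h5mu file under the keys configured below.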
143 changes: 143 additions & 0 deletions src/scgpt/embedding/config.vsh.yaml
@@ -0,0 +1,143 @@
functionality:
name: embedding
namespace: scgpt
description: |
Generation of cell embeddings for the integration of single cell transcriptomic count data using scGPT.
authors:
- __merge__: /src/authors/dorien_roosen.yaml
roles: [ maintainer, author ]

argument_groups:
- name: Inputs
arguments:
- name: "--input"
type: file
direction: input
required: true
example: input.h5mu
description: |
The input h5mu file containing tokenized gene and count data.
- name: "--modality"
type: string
default: "rna"
required: false
- name: "--model"
type: file
direction: input
required: true
example: best_model.pt
description: |
Path to scGPT model file.
- name: "--model_vocab"
type: file
direction: input
required: true
example: vocab.json
description: |
Path to scGPT model vocabulary file.
- name: "--model_config"
type: file
direction: input
required: true
example: args.json
description: |
Path to scGPT model config file.
- name: "--obsm_gene_tokens"
type: string
default: "gene_id_tokens"
description: |
The key of the .obsm array containing the gene token ids.
- name: "--obsm_tokenized_values"
type: string
default: values_tokenized
description: |
The key of the .obsm array containing the count values of the tokenized genes
- name: "--obsm_padding_mask"
type: string
default: padding_mask
description: |
The key of the .obsm array containing the padding mask.
- name: "--var_gene_names"
type: string
description: |
The name of the .var column containing gene names. When not provided, the .var index will be used.
- name: "--obs_batch_label"
type: string
description: |
The name of the adata.obs column containing the batch labels. Must be provided when 'dsbn' is set to True.

- name: Outputs
arguments:
- name: "--output"
type: file
required: true
description: |
Path to the output h5mu file containing the pre-processed data as well as the scGPT embeddings.
direction: output
example: output.h5mu
- name: "--output_compression"
type: string
example: "gzip"
required: false
choices: ["gzip", "lzf"]
description: |
The compression algorithm to use for the output h5mu file.
- name: "--obsm_embeddings"
type: string
default: "X_scGPT"
description: |
The name of the adata.obsm array to which scGPT embeddings will be written.

- name: Arguments
arguments:
- name: "--pad_token"
type: string
default: "<pad>"
description: |
The token to be used for padding.
- name: "--pad_value"
type: integer
default: -2
description: |
The value of the padding token.
- name: "--batch_size"
type: integer
default: 64
description: |
The batch size to be used for inference.
- name: "--dsbn"
type: boolean
default: true
description: |
Whether to apply domain-specific batch normalization for generating embeddings. When set to True, 'obs_batch_label' must be set as well.

resources:
- type: python_script
path: script.py
test_resources:
- type: python_script
path: test.py
- path: /resources_test/scgpt/source
- path: /resources_test/scgpt/test_resources/Kim2020_Lung_subset.h5mu

platforms:
- type: docker
image: nvcr.io/nvidia/pytorch:23.09-py3
setup:
- type: python
__merge__: [ /src/base/requirements/anndata_mudata.yaml, /src/base/requirements/scanpy.yaml ]
- type: python
packages:
- scgpt==0.2.1
test_setup:
- type: python
__merge__: [ /src/base/requirements/viashpy.yaml ]
- type: nextflow
directives:
label: [ midmem ]
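The `--dsbn` flag above toggles domain-specific batch normalization during embedding: conceptually, each batch label (the values of `--obs_batch_label`) is normalized with its own statistics rather than one global set. A toy NumPy sketch of the idea — not the scGPT implementation, and with made-up data:

```python
import numpy as np

# Two "domains" (batches) with very different scales.
x = np.array([[1.0], [2.0], [3.0], [10.0], [20.0], [30.0]])
domains = np.array([0, 0, 0, 1, 1, 1])  # e.g. values of --obs_batch_label

# Normalize each domain with its own mean and std.
out = np.empty_like(x)
for d in np.unique(domains):
    mask = domains == d
    mu = x[mask].mean(axis=0)
    sigma = x[mask].std(axis=0)
    out[mask] = (x[mask] - mu) / (sigma + 1e-8)

# Each domain now has approximately zero mean and unit variance,
# removing the per-batch scale difference before embedding.
```

This is why the config requires `--obs_batch_label` whenever `--dsbn` is true: without batch labels there is no grouping to normalize over.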