JIND-Multi is an extension of the JIND framework, designed to automate the annotation of single-cell RNA sequencing (scRNA-Seq) data. The framework, originally introduced by Goyal et al. (2022), now supports the transfer of cell-type labels from multiple annotated datasets, which improves the accuracy and reliability of annotations. JIND-Multi can also annotate scATAC-Seq data and can flag cells as "unassigned" if predictions fall below predefined thresholds.
By leveraging multiple annotated datasets, such as those in an atlas, JIND-Multi improves the precision of annotations on unlabeled datasets while reducing rejection rates (the fraction of unassigned cells). We offer a robust and efficient implementation of JIND-Multi, available to the scientific community.
- Operating System: Linux or macOS
- Environment Manager: Miniconda
- Programming Language: Python 3.6.8 or higher
- Hardware: A CPU is sufficient, but using an NVIDIA GPU with CUDA and cuDNN is recommended for better performance.
To run JIND-Multi, whether you're training from scratch or using pre-trained models, you need to provide a .h5ad file containing your single-cell data and define a few key parameters. Below is a list of required and optional arguments, their types, and what they represent.
| Argument | Type | Required | Description |
|---|---|---|---|
| `PATH` | string | ✅ | Path to the input .h5ad file. This file must contain your annotated single-cell dataset (AnnData object). |
| `BATCH_COL` | string | ✅ | Name of the column in `adata.obs` that contains batch or donor identifiers. |
| `LABELS_COL` | string | ✅ | Name of the column in `adata.obs` that contains cell type labels. |
| `TARGET_DATASET_NAME` | string | ✅ | Name of the target batch (from `BATCH_COL`) to which the cell type annotations will be transferred. |
| `SOURCE_DATASET_NAME` | string | ❌ | (Optional) Name of the source batch used for training. If not specified, JIND-Multi will automatically select the best source batch based on rejection rate. |
| `OUTPUT_PATH` | string | ✅ | Path to the directory where output results (metrics, predictions, etc.) will be saved. |
| `PRETRAINED_MODEL_PATH` | string | ❌ | (Optional) Path to a directory with pre-trained .pt model files and a .json with thresholds. If provided, the model will skip training and proceed directly to inference. |
| `INTER_DATASETS_NAMES` | string | ❌ | (Optional) Comma-separated list of intermediate batch names (from `BATCH_COL`) used in multi-step training. Do not include the source batch. |
| `EXCLUDE_DATASETS_NAMES` | string | ❌ | (Optional) Comma-separated list of dataset names to exclude from training. Avoid duplicating entries used in SOURCE, TARGET, or INTER. |
| `NUM_FEATURES` | int | ❌ | (Optional) Number of genes to include in the model. Default: 5000. |
| `MIN_CELL_TYPE_POPULATION` | int | ❌ | (Optional) Minimum number of cells per cell type per batch required for training. Default: 100. |
| `USE_GPU` | bool | ❌ | (Optional, but recommended) Set to True to train using GPU. Default: True. |
- If `PRETRAINED_MODEL_PATH` is provided, JIND-Multi skips training and uses the given models for inference on the target batch.
- If `SOURCE_DATASET_NAME` is not specified, the method automatically selects the source batch that minimizes cell rejection when predicting on the target batch.
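
The constraints in the table above can be checked before launching a run. The following is a minimal sketch, not part of the JIND-Multi package: `check_arguments` and `split_names` are hypothetical helpers, and the arguments are assumed to be collected in a plain dictionary keyed by the names listed in the table.

```python
# Hypothetical pre-flight check; these helpers are NOT part of JIND-Multi.
REQUIRED = ("PATH", "BATCH_COL", "LABELS_COL", "TARGET_DATASET_NAME", "OUTPUT_PATH")


def split_names(value):
    """Split a comma-separated string of batch names into a clean list."""
    if not value:
        return []
    return [name.strip() for name in value.split(",") if name.strip()]


def check_arguments(args):
    """Return a list of problems found in a dict of argument values."""
    problems = [
        f"missing required argument: {key}" for key in REQUIRED if not args.get(key)
    ]
    source = args.get("SOURCE_DATASET_NAME")
    target = args.get("TARGET_DATASET_NAME")
    inter = split_names(args.get("INTER_DATASETS_NAMES"))
    exclude = split_names(args.get("EXCLUDE_DATASETS_NAMES"))

    # The intermediate list must not repeat the source batch.
    if source and source in inter:
        problems.append("INTER_DATASETS_NAMES must not include the source batch")

    # Excluded batches should not also appear as source, target, or intermediate.
    used = {name for name in [source, target, *inter] if name}
    for name in exclude:
        if name in used:
            problems.append(f"excluded batch also used elsewhere: {name}")
    return problems
```

An empty list means the dictionary passed these basic checks; it does not verify that the batch names actually exist in the `.h5ad` file.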
After running JIND-Multi, all outputs are stored in the specified OUTPUT_PATH directory. These results include both prediction files and detailed performance metrics for model evaluation.
- `predicted_label_test_data.xlsx`

  This Excel file contains prediction results for each cell in the target batch. For each cell, it includes:
  - Probabilities assigned by the model for each cell type.
  - `raw_predictions`: The cell type with the highest probability (before applying thresholds).
  - `predictions`: The final predicted label after applying cell type-specific thresholds (low-confidence predictions may be marked as Unknown).

- Trained Model Files
  - Trained models for each annotated batch are saved in `.pt` format inside the `trained_models/` directory.
  - A separate `target.pth` file contains the model trained on the target batch after fine-tuning.
  - The file `val_stats_trained_model.json` contains the predictions on the validation set used to calculate threshold values per cell type.

- JIND-Multi evaluates and records the classification performance at every key step:
  - Source batch: Confusion matrices before and after fine-tuning.
  - Intermediate batches: Performance at three stages:
    - `initial`: Before alignment.
    - `adapt`: After alignment to the source batch.
    - `finetuned`: After final fine-tuning of encoder and classifier.
  - Target batch (if labels are available): Confusion matrices showing:
    - Before alignment.
    - After adaptation.
    - After classifier fine-tuning using confident predictions.

- The matrices display:
  - Number of cells per cell type.
  - How many were predicted as Unknown.
  - Accuracy before (raw) and after (effective) thresholding.
  - Misclassified cells.
  - Mean Average Precision (mAP) per cell type.

- Training history summary

  A PDF file named `train[SOURCE_BATCH, INTER_COUNT]-test[TARGET_BATCH].pdf` is also generated. It includes a visual history of confusion matrices across training stages, making it easier to interpret performance changes through each step of the pipeline.
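
The relationship between `raw_predictions`, `predictions`, and the raw and effective accuracies described above can be illustrated with a small sketch. It is not the JIND-Multi implementation: the function names, the toy probabilities, and the threshold values are made up, the comparison is assumed to be "probability at or above the threshold", and effective accuracy is assumed to be computed only over cells that keep a label.

```python
def apply_thresholds(probabilities, cell_types, thresholds, unknown="Unknown"):
    """Return (raw_predictions, predictions) for per-cell probability rows."""
    raw_predictions, predictions = [], []
    for row in probabilities:
        # Raw prediction: the cell type with the highest probability.
        best = max(range(len(row)), key=row.__getitem__)
        label = cell_types[best]
        raw_predictions.append(label)
        # Final prediction: keep the label only if it clears its own threshold.
        predictions.append(label if row[best] >= thresholds[label] else unknown)
    return raw_predictions, predictions


def summarize(true_labels, raw_predictions, predictions, unknown="Unknown"):
    """Compute raw accuracy, effective accuracy, and rejection rate."""
    n = len(true_labels)
    raw_acc = sum(t == r for t, r in zip(true_labels, raw_predictions)) / n
    kept = [(t, p) for t, p in zip(true_labels, predictions) if p != unknown]
    eff_acc = sum(t == p for t, p in kept) / len(kept) if kept else float("nan")
    return {"raw_acc": raw_acc, "eff_acc": eff_acc, "rejection": 1 - len(kept) / n}


# Toy data (illustrative values only).
cell_types = ["alpha", "beta"]
thresholds = {"alpha": 0.8, "beta": 0.6}
probabilities = [[0.9, 0.1], [0.7, 0.3], [0.35, 0.65], [0.45, 0.55]]
true_labels = ["alpha", "beta", "beta", "alpha"]

raw, final = apply_thresholds(probabilities, cell_types, thresholds)
stats = summarize(true_labels, raw, final)
print(final)  # ['alpha', 'Unknown', 'beta', 'Unknown']
print(stats)  # {'raw_acc': 0.5, 'eff_acc': 1.0, 'rejection': 0.5}
```

In this toy case the two rejected cells are exactly the two that were misclassified, so accuracy rises from 0.5 to 1.0 at the cost of rejecting half of the cells.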
The datasets used to reproduce the results presented in the manuscript are publicly available at the following Zenodo link:
🔗 https://doi.org/10.5281/zenodo.14000644
⚠️ Important:
If you're using any of the datasets from Zenodo, please refer to the Input Argument Reference below to correctly configure the input arguments when running the method.
⚠️ High Resource Requirement:
The following datasets require a High Performance Computing system (HPC) due to their large size:
- `All_human_brain.h5ad`
- `data_multiome_annotated_BMMC_ATAC.h5ad`
| Dataset | Type | Filename | BATCH_COL | LABELS_COL | SOURCE_DATASET_NAME | TARGET_DATASET_NAME | INTER_DATASETS_NAMES | MIN_CELL_TYPE_POPULATION |
|---|---|---|---|---|---|---|---|---|
| Pancreas | scRNA-seq | pancreas.h5ad | `batch` | `celltype` | `0` | `3` | `['1', '2']` | 5 |
| NSCLC Lung | scRNA-seq | NSCLC_lung_NORMALIZED_FILTERED.h5ad | `Donor` | `predicted_labels_majority` | `Donor5` | `Donor2` | `['Donor0', 'Donor1', 'Donor3', 'Donor4', 'Donor6']` | 20 |
| Neurips Brain | scRNA-seq | All_human_brain.h5ad | `batch` | `label` | `C4` | `C7` | `['AD2', 'ADx1', 'ADx2', 'ADx4']` | 100 |
| BMMC | scATAC-seq | data_multiome_annotated_BMMC_ATAC.h5ad | `batch` | `cell_type` | `s4d8` | `s3d3` | `['s1d1', 's1d2', 's1d3', 's2d1', 's2d4', 's2d5', 's3d10', 's4d1']` | 18 |
| Fetal Heart | scATAC-seq | heart_sample_norm_scaled_data_annotated.h5ad | `batch` | `celltype` | `heart_sample_39` | `heart_sample_14` | `['heart_sample_32']` | 100 |
| Fetal Kidney | scATAC-seq | kidney_sample_norm_scaled_data_annotated.h5ad | `batch` | `celltype` | `kidney_sample_3` | `kidney_sample_67` | `['kidney_sample_34', 'kidney_sample_65']` | 100 |
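
As an illustration of how one row of this table maps onto the input arguments, the Pancreas row can be written as a Python dictionary. This is only a sketch: the file location and `OUTPUT_PATH` value are placeholders, `to_argument_strings` is a hypothetical helper, and how the values are finally passed to JIND-Multi (command line, notebook, or web interface) is covered by the execution instructions rather than assumed here.

```python
# Pancreas row from the table, expressed as a dictionary of argument values.
pancreas = {
    "PATH": "pancreas.h5ad",             # placeholder: adjust to where the file was downloaded
    "BATCH_COL": "batch",
    "LABELS_COL": "celltype",
    "SOURCE_DATASET_NAME": "0",
    "TARGET_DATASET_NAME": "3",
    "INTER_DATASETS_NAMES": ["1", "2"],
    "MIN_CELL_TYPE_POPULATION": 5,
    "OUTPUT_PATH": "results/pancreas",   # placeholder output directory
}


def to_argument_strings(row):
    """Join list values with commas and convert every other value to a string."""
    return {
        key: ",".join(value) if isinstance(value, list) else str(value)
        for key, value in row.items()
    }


pancreas_args = to_argument_strings(pancreas)
print(pancreas_args["INTER_DATASETS_NAMES"])  # 1,2
```

The conversion reflects that `INTER_DATASETS_NAMES` is documented as a comma-separated list, while the table shows the same batches in list notation.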
- `README_execution.md`: Located in the same directory as this README. It provides step-by-step instructions on how to run the method:
  - ✅ Locally
  - 🚀 On an HPC system
  - 🌐 Via the web interface
- Inside the `./jind_multi/` folder:
  - `README.md`: Contains detailed documentation about each Python script included in the `jind_multi` package.
  - `README_execution.md`: Contains detailed documentation about how to run the models.
For questions, feedback, or support, please contact:
Joseba Sancho-Zamora
Email: jsanchoz@unav.es
