JIND-Multi

JIND-Multi is an advanced extension of the JIND framework, designed to automate the annotation of single-cell RNA sequencing (scRNA-Seq). This framework, originally introduced by Goyal et al., 2022, now supports the transfer of cell-type labels from multiple annotated datasets, enhancing the accuracy and reliability of annotations. Additionally, JIND-Multi is applicable for annotating scATAC-Seq data and can flag cells as "unassigned" if predictions fall below predefined thresholds.

Leveraging multiple annotated datasets, such as those in an atlas, JIND-Multi improves the precision of unlabeled dataset annotations while reducing rejection rates (unassigned cells). We offer a robust and efficient implementation of JIND-Multi, available for the scientific community.

Prerequisites for Running Locally

Operating System: Linux or macOS
Environment Manager: Miniconda
Programming Language: Python 3.6.8 or higher
Hardware: A CPU is sufficient, but using an NVIDIA GPU with CUDA and cuDNN is recommended for better performance.

📁 Required Inputs & Configuration Options

To run JIND-Multi, whether you're training from scratch or using pre-trained models, you need to provide a .h5ad file containing your single-cell data and define a few key parameters. Below is a list of required and optional arguments, their types, and what they represent.

Argument	Type	Required	Description
`PATH`	`string`	✅	Path to the input `.h5ad` file. This file must contain your annotated single-cell dataset (AnnData object).
`BATCH_COL`	`string`	✅	Name of the column in `adata.obs` that contains batch or donor identifiers.
`LABELS_COL`	`string`	✅	Name of the column in `adata.obs` that contains cell type labels.
`TARGET_DATASET_NAME`	`string`	✅	Name of the target batch (from `BATCH_COL`) to which the cell type annotations will be transferred.
`SOURCE_DATASET_NAME`	`string`	❌	(Optional) Name of the source batch used for training. If not specified, JIND-Multi will automatically select the best source batch based on rejection rate.
`OUTPUT_PATH`	`string`	✅	Path to the directory where output results (metrics, predictions, etc.) will be saved.
`PRETRAINED_MODEL_PATH`	`string`	❌	(Optional) Path to a directory with pre-trained `.pt` model files and a `.json` with thresholds. If provided, the model will skip training and proceed directly to inference.
`INTER_DATASETS_NAMES`	`string`	❌	(Optional) Comma-separated list of intermediate batch names (from `BATCH_COL`) used in multi-step training. Do not include the source batch.
`EXCLUDE_DATASETS_NAMES`	`string`	❌	(Optional) Comma-separated list of dataset names to exclude from training. Avoid duplicating entries used in `SOURCE`, `TARGET`, or `INTER`.
`NUM_FEATURES`	`int`	❌	(Optional) Number of genes to include in the model. Default: `5000`.
`MIN_CELL_TYPE_POPULATION`	`int`	❌	(Optional) Minimum number of cells per cell type per batch required for training. Default: `100`.
`USE_GPU`	`bool`	❌	(Optional, but recommended) Set to `True` to train using GPU. Default: `True`.

📌 Notes

If PRETRAINED_MODEL_PATH is provided, JIND-Multi skips training and uses the given models for inference on the target batch.
If SOURCE_DATASET_NAME is not specified, the method automatically selects the source batch that minimizes cell rejection when predicting on the target batch.

📤 Output Files and Results Overview

After running JIND-Multi, all outputs are stored in the specified OUTPUT_PATH directory. These results include both prediction files and detailed performance metrics for model evaluation.

🧪 Prediction Results

predicted_label_test_data.xlsx
This Excel file contains prediction results for each cell in the target batch. For each cell, it includes:
- Probabilities assigned by the model for each cell type.
- raw_predictions: The cell type with the highest probability (before applying thresholds).
- predictions: The final predicted label after applying cell type-specific thresholds (low-confidence predictions may be marked as Unknown).
Trained Model Files
- Trained models for each annotated batch are saved in .pt format inside the trained_models/ directory.
- A separate target.pth file contains the model trained on the target batch after fine-tuning.
- The file val_stats_trained_model.json contains the predictions on the validation set used to calculate threshold values per cell type.

📊 Performance Metrics and Confusion Matrices

JIND-Multi evaluates and records the classification performance at every key step:
- Source batch: Confusion matrices before and after fine-tuning.
- Intermediate batches: Performance at three stages:
  - initial: Before alignment.
  - adapt: After alignment to the source batch.
  - finetuned: After final fine-tuning of encoder and classifier.
- Target batch (if labels are available): Confusion matrices showing:
  - Before alignment.
  - After adaptation.
  - After classifier fine-tuning using confident predictions.
The matrices display:
- Number of cells per cell type.
- How many were predicted as Unknown.
- Accuracy before (raw) and after (effective) thresholding.
- Misclassified cells.
- Mean Average Precision (mAP) per cell type.
Training history summary
A PDF file named train[SOURCE_BATCH, INTER_COUNT]-test[TARGET_BATCH].pdf is also generated. It includes a visual history of confusion matrices across training stages, making it easier to interpret performance changes through each step of the pipeline.

📂 Datasets and Input Argument Guide

The datasets used to reproduce the results presented in the manuscript are publicly available at the following Zenodo link:
🔗 https://doi.org/10.5281/zenodo.14000644

⚠️ Important:
If you're using any of the datasets from Zenodo, please refer to the Input Argument Reference below to correctly configure the input arguments when running the method.

📥 Input Argument Reference

⚠️ High Resource Requirement:
The following datasets require a High Performance Computing system (HPC) due to their large size:

All_human_brain.h5ad
data_multiome_annotated_BMMC_ATAC.h5ad

Dataset	Type	Filename	BATCH_COL	LABELS_COL	SOURCE_DATASET_NAME	TARGET_DATASET_NAME	INTER_DATASETS_NAMES	MIN_CELL_TYPE_POPULATION
Pancreas	scRNA-seq	`pancreas.h5ad`	`batch`	`celltype`	`0`	`3`	`['1', '2']`	5
NSCLC Lung	scRNA-seq	`NSCLC_lung_NORMALIZED_FILTERED.h5ad`	`Donor`	`predicted_labels_majority`	`Donor5`	`Donor2`	`['Donor0', 'Donor1', 'Donor3', 'Donor4', 'Donor6']`	20
Neurips Brain	scRNA-seq	`All_human_brain.h5ad`	`batch`	`label`	`C4`	`C7`	`['AD2', 'ADx1', 'ADx2', 'ADx4']`	100
BMMC	scATAC-seq	`data_multiome_annotated_BMMC_ATAC.h5ad`	`batch`	`cell_type`	`s4d8`	`s3d3`	`['s1d1', 's1d2', 's1d3', 's2d1', 's2d4', 's2d5', 's3d10', 's4d1']`	18
Fetal Heart	scATAC-seq	`heart_sample_norm_scaled_data_annotated.h5ad`	`batch`	`celltype`	`heart_sample_39`	`heart_sample_14`	`['heart_sample_32']`	100
Fetal Kidney	scATAC-seq	`kidney_sample_norm_scaled_data_annotated.h5ad`	`batch`	`celltype`	`kidney_sample_3`	`kidney_sample_67`	`['kidney_sample_34', 'kidney_sample_65']`	100

📁 Additional Documentation

README_execution.md: Located in the same directory as this README.
It provides step-by-step instructions on how to run the method:
- ✅ Locally
- 🚀 On an HPC system
- 🌐 Via the web interface
Inside the ./jind_multi/ folder:
- README.md: Contains detailed documentation about each Python script included in the jind_multi package.

Inside the `./jind_multi/` folder:

README_execution.md: Contains detailed documentation about how run the models.

Contact

For questions, feedback, or support, please contact:
Joseba Sancho-Zamora
Email: jsanchoz@unav.es

Name		Name	Last commit message	Last commit date
Latest commit History 109 Commits
__pycache__		__pycache__
build/lib/jind_multi		build/lib/jind_multi
jind_multi		jind_multi
notebooks		notebooks
slurm		slurm
static		static
templates		templates
.gitignore		.gitignore
Dockerfile		Dockerfile
JIND.png		JIND.png
LICENSE.txt		LICENSE.txt
README.md		README.md
README_execution.md		README_execution.md
config.json		config.json
environment.yml		environment.yml
index.py		index.py
setup.py		setup.py
start.sh		start.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

JIND-Multi

Prerequisites for Running Locally

📁 Required Inputs & Configuration Options

📌 Notes

📤 Output Files and Results Overview

🧪 Prediction Results

📊 Performance Metrics and Confusion Matrices

📂 Datasets and Input Argument Guide

📥 Input Argument Reference

📁 Additional Documentation

Inside the `./jind_multi/` folder:

Contact

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors 4

Uh oh!

Languages

License

ML4BM-Lab/JIND-Multi

Folders and files

Latest commit

History

Repository files navigation

JIND-Multi

Prerequisites for Running Locally

📁 Required Inputs & Configuration Options

📌 Notes

📤 Output Files and Results Overview

🧪 Prediction Results

📊 Performance Metrics and Confusion Matrices

📂 Datasets and Input Argument Guide

📥 Input Argument Reference

📁 Additional Documentation

Inside the ./jind_multi/ folder:

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors 4

Uh oh!

Languages

Inside the `./jind_multi/` folder:

Packages