UniMap is a multi-functional tool that leverages expert-curated scRNA-seq datasets as references to integrate, annotate, and conduct interpretable analyses on unlabeled query data.
The repository is organised as follows:
- analysis/ contains the specific analysis process of all cases;
- benchmark_models/ contains the replication of all benchmark models;
- data/ contains all preprocessed data;
- raw_data/ contains all raw data;
- results/ contains the output of UniMap;
- data_process1.py contains simple initial processing of raw data;
- data_process2.py contains the necessary preprocessing of the data;
- data_list.py contains steps to read the data;
- loss.py contains each loss function of UniMap;
- network.py contains different modules of UniMap;
- train_unimap.py contains the main function for UniMap;
- utils.py contains the necessary processing subroutines.
- Clone the repository from GitHub and install the environment dependencies for UniMap:
git clone git@github.com:Huahuatii/Reproducing-UniMap.git
cd Reproducing-UniMap
conda env update --file env.yml
conda activate unimap
The benchmark datasets are hosted on Google Drive and must be manually downloaded and extracted into the data/ folder. We strongly recommend starting with the PBMC CVID dataset because it is relatively small.
Before training, please ensure that you have downloaded the datasets and placed them in the correct path.
unzip data/pbmc9.zip
Then run the following commands to test UniMap:
python train_unimap.py --data_type pbmc9 --max_epoch 50
Only this one parameter needs to be changed for different datasets:
--data_type:
- PBMC CVID dataset: pbmc9
- PBMC COVID-19: pbmc40
- PBMC MG: mg
- Cross-species: cross_species
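As a convenience, the mapping between datasets and `--data_type` values can be wrapped in a small helper that assembles the training command. This is only a sketch: `DATA_TYPES` and `build_train_command` are hypothetical names that are not part of the repository, and the only flags used are the two shown in this README (`--data_type` and `--max_epoch`).

```python
# Hypothetical helper, not part of the UniMap repository.
# Maps the dataset names listed in this README to their --data_type values.
DATA_TYPES = {
    "PBMC CVID": "pbmc9",
    "PBMC COVID-19": "pbmc40",
    "PBMC MG": "mg",
    "Cross-species": "cross_species",
}


def build_train_command(data_type, max_epoch=50):
    """Return the train_unimap.py command as an argument list."""
    if data_type not in DATA_TYPES.values():
        valid = ", ".join(DATA_TYPES.values())
        raise ValueError(f"unknown data_type {data_type!r}; expected one of: {valid}")
    return [
        "python", "train_unimap.py",
        "--data_type", data_type,
        "--max_epoch", str(max_epoch),
    ]


if __name__ == "__main__":
    # Print one command per dataset so it can be copied into a shell.
    for name, data_type in DATA_TYPES.items():
        print(f"# {name}")
        print(" ".join(build_train_command(data_type)))
```

The argument list can be passed directly to `subprocess.run` from the repository root once the corresponding dataset has been extracted into data/.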
Training has completed successfully if you see output like the following:
results/pbmc9/unimap/2023 created!
Feature in Source and Target are aligned!
Current config is:
Namespace(model='unimap', method='union', var_name='highly_variable', need_umap=1, seed=2024, batch_size=128, latent_feature=128, tolerance=10, max_epoch=10, lr=0.0001, data_type='pbmc9', drop=0.1, conf_thres=0.9, trans_loss_w=0.5, t_loss_w=0.5, margin_w=1, epoch=25000, focal_alpha=1, focal_gamma=2, device='cuda', save_folder='results/pbmc9/unimap/2024', in_feature=1815, ce=<utils.Label_Encoder object at 0x7fa47038f670>, be=LabelEncoder(), num_classes=8, num_batches=6)
epoch:00 total_loss:-0.3942 s_loss:0.2297 t_loss:0.0003 transfer_loss:0.4793 margin_loss:-0.0063 mean_ent:1.8265 best_idx:0
epoch:01 total_loss:-0.6194 s_loss:0.0647 t_loss:0.0003 transfer_loss:0.4836 margin_loss:-0.0065
......
epoch:09 total_loss:-0.7485 s_loss:0.0153 t_loss:0.0001 transfer_loss:0.4325 margin_loss:-0.0068 mean_ent:0.7218 best_idx:9
Calculating UMAP...
This may take a few minutes...
╭──────────── Unimap PBMC9 Train Finished ────────────╮
│ All results are saved in: results/pbmc9/unimap/2024 │
│ 1. st_result.csv                                    │
│ 2. history.csv                                      │
│ 3. best_model.pth                                   │
│ 4. st_z_result.csv                                  │
│ 5. t_prob_result.csv                                │
│ 6. st_umap_result.csv                               │
╰─────────────────────────────────────────────────────╯
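A quick way to sanity-check a finished run is to confirm that the six files listed in the summary box exist, and to turn the per-epoch log lines into numbers for inspection. The sketch below assumes only what the sample output above shows: the six file names and the space-separated `key:value` format of the epoch lines. The function names `missing_outputs` and `parse_epoch_line` are hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the summary box printed at the end of training.
EXPECTED_OUTPUTS = [
    "st_result.csv",
    "history.csv",
    "best_model.pth",
    "st_z_result.csv",
    "t_prob_result.csv",
    "st_umap_result.csv",
]


def missing_outputs(save_folder):
    """Return the expected result files that are absent from save_folder."""
    folder = Path(save_folder)
    return [name for name in EXPECTED_OUTPUTS if not (folder / name).is_file()]


def parse_epoch_line(line):
    """Parse a log line such as 'epoch:01 total_loss:-0.6194 ...' into a dict."""
    metrics = {}
    for token in line.split():
        key, _, value = token.partition(":")
        # epoch and best_idx are integer counters; everything else is a float.
        metrics[key] = int(value) if key in ("epoch", "best_idx") else float(value)
    return metrics


if __name__ == "__main__":
    # The folder below is the save_folder from the sample output above.
    print("missing files:", missing_outputs("results/pbmc9/unimap/2024"))
    print(parse_epoch_line("epoch:01 total_loss:-0.6194 s_loss:0.0647"))
```

An empty list from `missing_outputs` means every file named in the summary box is present in the save folder.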
We provide source code for reproducing the experiments in the UniMap paper. For this part you do not need to download the datasets or retrain the model, but you do need to manually download the results and extract them into the results/ folder. The exception is the PBMC CVID results, which are already included in the results/ folder. We therefore recommend using the PBMC CVID dataset for reproduction.
The results can be downloaded from the following Google Drive links:
- PBMC CVID results (only UniMap)
- PBMC COVID-19 results (only UniMap)
- PBMC MG results (only UniMap)
- Cross-species results (only UniMap)
The reproduction code is provided below:
- Benchmarking of PBMC CVID datasets.
- Integration and annotation of PBMC COVID-19 datasets.
- Integration and annotation of PBMC MG datasets.
- Integration and annotation of Cross-species datasets.
This framework was developed by Haitao Hu (22260236@zju.edu.cn).
