Description of bug
When training a model with sei-framework (DeepSEA) using train.yml and your data (hg38_UCSC.fa, sei_chromatin_profiles.txt, sorted_sei_data.bed.gz), the log file reports:
2025-04-29 15:49:07,977 - INFO - validation roc_auc: None
2025-04-29 15:49:07,995 - INFO - validation average_precision: None
However, the training loss and the validation loss are computed normally (both are non-negative numbers).
As a consequence, following Selene's behaviour, no final best model is saved; the output directory contains only partial training results (data.pkl, version, and a "data" folder in binary format).
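A possible explanation, which I have not verified against Selene's source: per-feature metrics such as ROC AUC are undefined when a feature has only one class among the validation labels, and the `report_gt_feature_n_positives: 5` setting in the config suggests that features with too few positives are skipped. If every feature is skipped, there is nothing left to average, which would be consistent with a `None` score. The sketch below is plain Python and not Selene code; the function name, the strict "greater than" comparison, and the toy label matrix are assumptions made for illustration.

```python
from typing import List, Sequence


def features_with_enough_positives(
    targets: Sequence[Sequence[int]], min_positives: int
) -> List[int]:
    """Return the column indices whose positive count exceeds min_positives.

    Assumption: a feature is only scored when it has strictly more than
    `min_positives` positive labels (mirroring the "gt" in the config key).
    """
    if not targets:
        return []
    n_features = len(targets[0])
    counts = [0] * n_features
    for row in targets:
        for j, value in enumerate(row):
            if value:
                counts[j] += 1
    return [j for j, count in enumerate(counts) if count > min_positives]


# Hypothetical validation labels: 8 examples x 3 features.
toy_targets = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]

scorable = features_with_enough_positives(toy_targets, min_positives=5)
print(scorable)  # only feature 0 has more than 5 positives
```

Running the same count on the real validation targets would show whether any of the features clears the threshold; an empty result would point to the data or holdout setup rather than to the metric functions.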
I also ran experiments on sub-sampled versions of the chromatin profile file and the bed file you provide. In each subset I kept 5 bins per chromosome in the bed file and made sure that every chromatin profile listed in the .txt file is represented by bins in the bed file. The behaviour during training was always the same: validation roc_auc None, validation average_precision None, and no final best model detected and saved.
- Below is the configuration file I used for training on your data (hg38_UCSC.fa, sei_chromatin_profiles.txt, sorted_sei_data.bed.gz). Compared to your version of train.yml, I only changed the number of workers:
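One check that may help with the sub-sampled files is to count, for each chromatin profile, how many bed intervals fall on the validation chromosome (`validation_holdout: [chr10]` in the config). With only 5 bins per chromosome, a profile could easily have no intervals on chr10 at all. The sketch below is a minimal, self-contained example and not part of Selene; it assumes a tab-separated bed file with the profile name in the fourth column, and the function name and toy data are hypothetical.

```python
import io
from typing import Dict, Iterable, Set


def count_holdout_intervals(
    bed_lines: Iterable[str],
    features: Set[str],
    holdout_chroms: Set[str],
    feature_col: int = 3,
) -> Dict[str, int]:
    """Count bed intervals per feature that lie on the holdout chromosomes.

    Assumption: the feature (profile) name is in column `feature_col`
    (0-based) of a tab-separated bed file.
    """
    counts = {feature: 0 for feature in features}
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= feature_col:
            continue  # skip malformed or empty lines
        chrom, feature = fields[0], fields[feature_col]
        if chrom in holdout_chroms and feature in counts:
            counts[feature] += 1
    return counts


# Hypothetical bed content standing in for the real file.
toy_bed = io.StringIO(
    "chr1\t100\t300\tprofile_A\n"
    "chr10\t500\t700\tprofile_A\n"
    "chr10\t900\t1100\tprofile_A\n"
    "chr2\t100\t300\tprofile_B\n"
)

counts = count_holdout_intervals(toy_bed, {"profile_A", "profile_B"}, {"chr10"})
for name in sorted(counts):
    print(name, counts[name])
```

For the real data, the same function could be fed the lines of the gzipped bed file opened in text mode (for example with `gzip.open(path, "rt")`). Profiles with a count of 0 would have no positive labels in the validation set.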
...
ops: [train]
model: {
    path: /nfsd/bcb/bcbg/EleonoraSignor/sei-framework/model/sei.py,
    class: Sei,
    class_args: {
        sequence_length: 4096,
        n_genomic_features: 21907,
    },
    non_strand_specific: mean
}
sampler: !obj:selene_sdk.samplers.MultiSampler {
    train_sampler: !obj:selene_sdk.samplers.dataloader.SamplerDataLoader {
        sampler: !obj:selene_sdk.samplers.RandomPositionsSampler {
            target_path: /nfsd/bcb/bcbg/EleonoraSignor/sei-framework/train/data/sorted_sei_data.bed.gz,
            reference_sequence: !obj:selene_sdk.sequences.Genome {
                input_path: /nfsd/bcb/bcbg/EleonoraSignor/sei-framework/resources/hg38_UCSC.fa,
                blacklist_regions: hg38
            },
            features: !obj:selene_sdk.utils.load_features_list {
                input_path: /nfsd/bcb/bcbg/EleonoraSignor/sei-framework/train/data/sei_chromatin_profiles.txt
            },
            test_holdout: [chr8, chr9],
            validation_holdout: [chr10],
            sequence_length: 4096,
            center_bin_to_predict: [2048, 2049],
            feature_thresholds: null,
            save_datasets: []
        },
        num_workers: 1,
        batch_size: 64,
    },
    validate_sampler: !obj:selene_sdk.samplers.RandomPositionsSampler {
        target_path: /nfsd/bcb/bcbg/EleonoraSignor/sei-framework/train/data/sorted_sei_data.bed.gz,
        reference_sequence: !obj:selene_sdk.sequences.Genome {
            input_path: /nfsd/bcb/bcbg/EleonoraSignor/sei-framework/resources/hg38_UCSC.fa,
            blacklist_regions: hg38
        },
        features: !obj:selene_sdk.utils.load_features_list {
            input_path: /nfsd/bcb/bcbg/EleonoraSignor/sei-framework/train/data/sei_chromatin_profiles.txt
        },
        test_holdout: [chr8, chr9],
        validation_holdout: [chr10],
        sequence_length: 4096,
        center_bin_to_predict: [2048, 2049],
        mode: validate,
        save_datasets: []
    },
    features: !obj:selene_sdk.utils.load_features_list {
        input_path: /nfsd/bcb/bcbg/EleonoraSignor/sei-framework/train/data/sei_chromatin_profiles.txt
    }
}
train_model: !obj:selene_sdk.TrainModel {
    batch_size: 64,
    report_stats_every_n_steps: 5000,
    n_validation_samples: 12800,
    n_test_samples: 1600000,
    use_cuda: True,
    data_parallel: True,  # we recommend multi-GPU training only on NVLink-enabled GPUs
    cpu_n_threads: 19,
    report_gt_feature_n_positives: 5,
    use_scheduler: False,
    max_steps: 1000000000,
    metrics: {
        roc_auc: !import sklearn.metrics.roc_auc_score,
        average_precision: !import sklearn.metrics.average_precision_score
    },
}
output_dir: /nfsd/bcb/bcbg/EleonoraSignor/sei-framework/train/models
random_seed: 1447
...
- I am also attaching the log from an experiment that used 10 chromatin profiles (SUBSET-sei_chromatin_profiles.txt), 5 bins per chromosome in the bed file (SUBSET-sorted_sei_data_filtered_bins_exact_profiles.bed.gz), and the reference genome hg38_UCSC.fa. Each of the 10 chromatin profiles has at least one corresponding bin in the bed file:
2025-04-29 13:43:46,142 - INFO - Setting deterministic = True for reproducibility.
2025-04-29 13:43:46,142 - INFO - Training parameters set: batch size 8, number of steps per 'epoch': 10, maximum number of steps: 100
2025-04-29 13:43:49,454 - DEBUG - Wrapped model in DataParallel
2025-04-29 13:43:49,455 - DEBUG - Set modules to use CUDA
2025-04-29 13:43:49,457 - INFO - Creating validation dataset.
2025-04-29 13:43:52,973 - INFO - 3.5153815746307373 s to load 1000 validation examples (125 validation batches) to evaluate after each training step.
2025-04-29 13:43:53,042 - DEBUG - [BATCH] Time to sample 8 examples: 0.06795096397399902 s.
2025-04-29 13:43:59,475 - DEBUG - [TRAIN] 0: Saving model state to file.
2025-04-29 13:44:02,013 - DEBUG - [BATCH] Time to sample 8 examples: 0.02274632453918457 s.
2025-04-29 13:44:03,295 - DEBUG - [BATCH] Time to sample 8 examples: 0.040410518646240234 s.
2025-04-29 13:44:04,532 - DEBUG - [BATCH] Time to sample 8 examples: 0.02966022491455078 s.
2025-04-29 13:44:05,770 - DEBUG - [BATCH] Time to sample 8 examples: 0.03045344352722168 s.
2025-04-29 13:44:06,997 - DEBUG - [BATCH] Time to sample 8 examples: 0.023934364318847656 s.
2025-04-29 13:44:08,224 - DEBUG - [BATCH] Time to sample 8 examples: 0.02252674102783203 s.
2025-04-29 13:44:09,461 - DEBUG - [BATCH] Time to sample 8 examples: 0.03370785713195801 s.
2025-04-29 13:44:10,690 - DEBUG - [BATCH] Time to sample 8 examples: 0.023950815200805664 s.
2025-04-29 13:44:11,922 - DEBUG - [BATCH] Time to sample 8 examples: 0.024319171905517578 s.
2025-04-29 13:44:13,155 - DEBUG - [BATCH] Time to sample 8 examples: 0.026258230209350586 s.
2025-04-29 13:44:14,361 - INFO - [STEP 10] average number of steps per second: 0.6
2025-04-29 13:44:14,362 - INFO - training loss: 0.5449446114626798
2025-04-29 13:46:30,425 - INFO - validation roc_auc: None
2025-04-29 13:46:30,426 - INFO - validation average_precision: None
2025-04-29 13:46:30,426 - DEBUG - [TRAIN] 10: Saving model state to file.
2025-04-29 13:46:36,665 - DEBUG - Updating best_model.pth.tar
2025-04-29 13:46:36,665 - INFO - validation loss: 0.4264565827846527
2025-04-29 13:46:36,719 - DEBUG - [BATCH] Time to sample 8 examples: 0.05398201942443848 s.
2025-04-29 13:46:37,989 - DEBUG - [BATCH] Time to sample 8 examples: 0.02141737937927246 s.
2025-04-29 13:46:39,248 - DEBUG - [BATCH] Time to sample 8 examples: 0.03598308563232422 s.
2025-04-29 13:46:40,495 - DEBUG - [BATCH] Time to sample 8 examples: 0.03802609443664551 s.
2025-04-29 13:46:41,758 - DEBUG - [BATCH] Time to sample 8 examples: 0.03577780723571777 s.
2025-04-29 13:46:43,007 - DEBUG - [BATCH] Time to sample 8 examples: 0.028432846069335938 s.
2025-04-29 13:46:44,240 - DEBUG - [BATCH] Time to sample 8 examples: 0.025448083877563477 s.
2025-04-29 13:46:45,500 - DEBUG - [BATCH] Time to sample 8 examples: 0.038803815841674805 s.
2025-04-29 13:46:46,748 - DEBUG - [BATCH] Time to sample 8 examples: 0.04081869125366211 s.
2025-04-29 13:46:48,026 - DEBUG - [BATCH] Time to sample 8 examples: 0.04925346374511719 s.
2025-04-29 13:46:49,241 - INFO - [STEP 20] average number of steps per second: 0.8
2025-04-29 13:46:49,241 - INFO - training loss: 0.238083927705884
2025-04-29 13:49:06,118 - INFO - validation roc_auc: None
2025-04-29 13:49:06,119 - INFO - validation average_precision: None
2025-04-29 13:49:06,120 - DEBUG - [TRAIN] 20: Saving model state to file.
2025-04-29 13:49:11,677 - DEBUG - Updating best_model.pth.tar
2025-04-29 13:49:11,677 - INFO - validation loss: 0.04189149260520935
2025-04-29 13:49:11,711 - DEBUG - [BATCH] Time to sample 8 examples: 0.033258676528930664 s.
2025-04-29 13:49:12,984 - DEBUG - [BATCH] Time to sample 8 examples: 0.030499935150146484 s.
2025-04-29 13:49:14,246 - DEBUG - [BATCH] Time to sample 8 examples: 0.03310084342956543 s.
2025-04-29 13:49:15,509 - DEBUG - [BATCH] Time to sample 8 examples: 0.039209842681884766 s.
2025-04-29 13:49:16,784 - DEBUG - [BATCH] Time to sample 8 examples: 0.053952693939208984 s.
2025-04-29 13:49:18,034 - DEBUG - [BATCH] Time to sample 8 examples: 0.03648209571838379 s.
2025-04-29 13:49:19,293 - DEBUG - [BATCH] Time to sample 8 examples: 0.0339815616607666 s.
2025-04-29 13:49:20,549 - DEBUG - [BATCH] Time to sample 8 examples: 0.03137612342834473 s.
2025-04-29 13:49:21,822 - DEBUG - [BATCH] Time to sample 8 examples: 0.0253298282623291 s.
2025-04-29 13:49:23,097 - DEBUG - [BATCH] Time to sample 8 examples: 0.03963041305541992 s.
2025-04-29 13:49:24,349 - INFO - [STEP 30] average number of steps per second: 0.8
2025-04-29 13:49:24,350 - INFO - training loss: 0.005656527276369161
2025-04-29 13:51:40,170 - INFO - validation roc_auc: None
2025-04-29 13:51:40,171 - INFO - validation average_precision: None
2025-04-29 13:51:40,172 - DEBUG - [TRAIN] 30: Saving model state to file.
2025-04-29 13:51:46,322 - DEBUG - Updating best_model.pth.tar
2025-04-29 13:51:46,323 - INFO - validation loss: 7.712587952846661e-05
2025-04-29 13:51:46,365 - DEBUG - [BATCH] Time to sample 8 examples: 0.042008399963378906 s.
2025-04-29 13:51:47,666 - DEBUG - [BATCH] Time to sample 8 examples: 0.025139808654785156 s.
2025-04-29 13:51:48,942 - DEBUG - [BATCH] Time to sample 8 examples: 0.04321646690368652 s.
2025-04-29 13:51:50,227 - DEBUG - [BATCH] Time to sample 8 examples: 0.03827023506164551 s.
2025-04-29 13:51:51,509 - DEBUG - [BATCH] Time to sample 8 examples: 0.03984642028808594 s.
2025-04-29 13:51:52,800 - DEBUG - [BATCH] Time to sample 8 examples: 0.04982876777648926 s.
2025-04-29 13:51:54,080 - DEBUG - [BATCH] Time to sample 8 examples: 0.04528355598449707 s.
2025-04-29 13:51:55,355 - DEBUG - [BATCH] Time to sample 8 examples: 0.029361724853515625 s.
2025-04-29 13:51:56,617 - DEBUG - [BATCH] Time to sample 8 examples: 0.03145432472229004 s.
2025-04-29 13:51:57,896 - DEBUG - [BATCH] Time to sample 8 examples: 0.03281450271606445 s.
2025-04-29 13:51:59,126 - INFO - [STEP 40] average number of steps per second: 0.8
2025-04-29 13:51:59,127 - INFO - training loss: 5.87201489565814e-06
2025-04-29 13:54:15,643 - INFO - validation roc_auc: None
2025-04-29 13:54:15,660 - INFO - validation average_precision: None
2025-04-29 13:54:15,661 - DEBUG - [TRAIN] 40: Saving model state to file.
2025-04-29 13:54:21,838 - DEBUG - Updating best_model.pth.tar
2025-04-29 13:54:21,838 - INFO - validation loss: 2.4526438555767528e-06
2025-04-29 13:54:21,876 - DEBUG - [BATCH] Time to sample 8 examples: 0.037668466567993164 s.
2025-04-29 13:54:23,178 - DEBUG - [BATCH] Time to sample 8 examples: 0.0375819206237793 s.
2025-04-29 13:54:24,446 - DEBUG - [BATCH] Time to sample 8 examples: 0.04212832450866699 s.
2025-04-29 13:54:25,709 - DEBUG - [BATCH] Time to sample 8 examples: 0.043633460998535156 s.
2025-04-29 13:54:26,974 - DEBUG - [BATCH] Time to sample 8 examples: 0.0326848030090332 s.
2025-04-29 13:54:28,239 - DEBUG - [BATCH] Time to sample 8 examples: 0.03873395919799805 s.
2025-04-29 13:54:29,526 - DEBUG - [BATCH] Time to sample 8 examples: 0.043430328369140625 s.
2025-04-29 13:54:30,794 - DEBUG - [BATCH] Time to sample 8 examples: 0.04188942909240723 s.
2025-04-29 13:54:32,062 - DEBUG - [BATCH] Time to sample 8 examples: 0.034316062927246094 s.
2025-04-29 13:54:33,331 - DEBUG - [BATCH] Time to sample 8 examples: 0.035858154296875 s.
2025-04-29 13:54:34,573 - INFO - [STEP 50] average number of steps per second: 0.8
2025-04-29 13:54:34,574 - INFO - training loss: 2.4952012864787323e-07
2025-04-29 13:56:50,736 - INFO - validation roc_auc: None
2025-04-29 13:56:50,738 - INFO - validation average_precision: None
2025-04-29 13:56:50,738 - DEBUG - [TRAIN] 50: Saving model state to file.
2025-04-29 13:56:55,443 - DEBUG - Updating best_model.pth.tar
2025-04-29 13:56:55,443 - INFO - validation loss: 6.21278103608347e-07
2025-04-29 13:56:55,480 - DEBUG - [BATCH] Time to sample 8 examples: 0.036336660385131836 s.
2025-04-29 13:56:56,783 - DEBUG - [BATCH] Time to sample 8 examples: 0.026140689849853516 s.
2025-04-29 13:56:58,054 - DEBUG - [BATCH] Time to sample 8 examples: 0.02702927589416504 s.
2025-04-29 13:56:59,327 - DEBUG - [BATCH] Time to sample 8 examples: 0.026160240173339844 s.
2025-04-29 13:57:00,596 - DEBUG - [BATCH] Time to sample 8 examples: 0.02651524543762207 s.
2025-04-29 13:57:01,868 - DEBUG - [BATCH] Time to sample 8 examples: 0.028634309768676758 s.
2025-04-29 13:57:03,139 - DEBUG - [BATCH] Time to sample 8 examples: 0.02731490135192871 s.
2025-04-29 13:57:04,414 - DEBUG - [BATCH] Time to sample 8 examples: 0.029621124267578125 s.
2025-04-29 13:57:05,689 - DEBUG - [BATCH] Time to sample 8 examples: 0.0329890251159668 s.
2025-04-29 13:57:06,972 - DEBUG - [BATCH] Time to sample 8 examples: 0.04100394248962402 s.
2025-04-29 13:57:08,216 - INFO - [STEP 60] average number of steps per second: 0.8
2025-04-29 13:57:08,217 - INFO - training loss: 8.381905161058967e-08
2025-04-29 13:59:25,848 - INFO - validation roc_auc: None
2025-04-29 13:59:25,849 - INFO - validation average_precision: None
2025-04-29 13:59:25,849 - DEBUG - [TRAIN] 60: Saving model state to file.
2025-04-29 13:59:30,052 - DEBUG - Updating best_model.pth.tar
2025-04-29 13:59:30,052 - INFO - validation loss: 3.77053402189631e-07
2025-04-29 13:59:30,102 - DEBUG - [BATCH] Time to sample 8 examples: 0.049765586853027344 s.
2025-04-29 13:59:31,400 - DEBUG - [BATCH] Time to sample 8 examples: 0.03403210639953613 s.
2025-04-29 13:59:32,690 - DEBUG - [BATCH] Time to sample 8 examples: 0.045079946517944336 s.
2025-04-29 13:59:33,970 - DEBUG - [BATCH] Time to sample 8 examples: 0.036760807037353516 s.
2025-04-29 13:59:35,245 - DEBUG - [BATCH] Time to sample 8 examples: 0.04293107986450195 s.
2025-04-29 13:59:36,517 - DEBUG - [BATCH] Time to sample 8 examples: 0.030341625213623047 s.
2025-04-29 13:59:37,781 - DEBUG - [BATCH] Time to sample 8 examples: 0.03646421432495117 s.
2025-04-29 13:59:39,034 - DEBUG - [BATCH] Time to sample 8 examples: 0.027781963348388672 s.
2025-04-29 13:59:40,310 - DEBUG - [BATCH] Time to sample 8 examples: 0.028576374053955078 s.
2025-04-29 13:59:41,577 - DEBUG - [BATCH] Time to sample 8 examples: 0.036710262298583984 s.
2025-04-29 13:59:42,824 - INFO - [STEP 70] average number of steps per second: 0.8
2025-04-29 13:59:42,825 - INFO - training loss: 5.364419237707807e-08
2025-04-29 14:01:58,912 - INFO - validation roc_auc: None
2025-04-29 14:01:58,914 - INFO - validation average_precision: None
2025-04-29 14:01:58,915 - DEBUG - [TRAIN] 70: Saving model state to file.
2025-04-29 14:02:04,477 - DEBUG - Updating best_model.pth.tar
2025-04-29 14:02:04,477 - INFO - validation loss: 3.156426239456778e-07
2025-04-29 14:02:04,513 - DEBUG - [BATCH] Time to sample 8 examples: 0.03553056716918945 s.
2025-04-29 14:02:05,803 - DEBUG - [BATCH] Time to sample 8 examples: 0.029157638549804688 s.
2025-04-29 14:02:07,060 - DEBUG - [BATCH] Time to sample 8 examples: 0.03611922264099121 s.
2025-04-29 14:02:08,308 - DEBUG - [BATCH] Time to sample 8 examples: 0.02735614776611328 s.
2025-04-29 14:02:09,577 - DEBUG - [BATCH] Time to sample 8 examples: 0.031862497329711914 s.
2025-04-29 14:02:10,832 - DEBUG - [BATCH] Time to sample 8 examples: 0.033113718032836914 s.
2025-04-29 14:02:12,099 - DEBUG - [BATCH] Time to sample 8 examples: 0.03315114974975586 s.
2025-04-29 14:02:13,357 - DEBUG - [BATCH] Time to sample 8 examples: 0.030545711517333984 s.
2025-04-29 14:02:14,614 - DEBUG - [BATCH] Time to sample 8 examples: 0.03906106948852539 s.
2025-04-29 14:02:15,887 - DEBUG - [BATCH] Time to sample 8 examples: 0.03446364402770996 s.
2025-04-29 14:02:17,112 - INFO - [STEP 80] average number of steps per second: 0.8
2025-04-29 14:02:17,113 - INFO - training loss: 4.5895586708866174e-08
2025-04-29 14:04:32,665 - INFO - validation roc_auc: None
2025-04-29 14:04:32,683 - INFO - validation average_precision: None
2025-04-29 14:04:32,684 - DEBUG - [TRAIN] 80: Saving model state to file.
2025-04-29 14:04:38,517 - DEBUG - Updating best_model.pth.tar
2025-04-29 14:04:38,517 - INFO - validation loss: 2.9552006458288817e-07
2025-04-29 14:04:38,544 - DEBUG - [BATCH] Time to sample 8 examples: 0.02707815170288086 s.
2025-04-29 14:04:39,843 - DEBUG - [BATCH] Time to sample 8 examples: 0.0443272590637207 s.
2025-04-29 14:04:41,098 - DEBUG - [BATCH] Time to sample 8 examples: 0.031562089920043945 s.
2025-04-29 14:04:42,367 - DEBUG - [BATCH] Time to sample 8 examples: 0.03975653648376465 s.
2025-04-29 14:04:43,618 - DEBUG - [BATCH] Time to sample 8 examples: 0.028406381607055664 s.
2025-04-29 14:04:44,870 - DEBUG - [BATCH] Time to sample 8 examples: 0.03281235694885254 s.
2025-04-29 14:04:46,154 - DEBUG - [BATCH] Time to sample 8 examples: 0.03940463066101074 s.
2025-04-29 14:04:47,435 - DEBUG - [BATCH] Time to sample 8 examples: 0.041507720947265625 s.
2025-04-29 14:04:48,701 - DEBUG - [BATCH] Time to sample 8 examples: 0.03107142448425293 s.
2025-04-29 14:04:49,970 - DEBUG - [BATCH] Time to sample 8 examples: 0.03574061393737793 s.
2025-04-29 14:04:51,213 - INFO - [STEP 90] average number of steps per second: 0.8
2025-04-29 14:04:51,214 - INFO - training loss: 4.418195231892241e-08
2025-04-29 14:07:06,887 - INFO - validation roc_auc: None
2025-04-29 14:07:06,888 - INFO - validation average_precision: None
2025-04-29 14:07:06,889 - DEBUG - [TRAIN] 90: Saving model state to file.
2025-04-29 14:07:10,999 - DEBUG - Updating best_model.pth.tar
2025-04-29 14:07:10,999 - INFO - validation loss: 2.8867147921118884e-07
2025-04-29 14:07:11,028 - DEBUG - [BATCH] Time to sample 8 examples: 0.02870011329650879 s.
2025-04-29 14:07:12,342 - DEBUG - [BATCH] Time to sample 8 examples: 0.03154802322387695 s.
2025-04-29 14:07:13,627 - DEBUG - [BATCH] Time to sample 8 examples: 0.03596901893615723 s.
2025-04-29 14:07:14,911 - DEBUG - [BATCH] Time to sample 8 examples: 0.0320584774017334 s.
2025-04-29 14:07:16,221 - DEBUG - [BATCH] Time to sample 8 examples: 0.03128671646118164 s.
2025-04-29 14:07:17,499 - DEBUG - [BATCH] Time to sample 8 examples: 0.03123188018798828 s.
2025-04-29 14:07:18,774 - DEBUG - [BATCH] Time to sample 8 examples: 0.03404068946838379 s.
2025-04-29 14:07:20,052 - DEBUG - [BATCH] Time to sample 8 examples: 0.02485179901123047 s.
2025-04-29 14:07:21,326 - DEBUG - [BATCH] Time to sample 8 examples: 0.028725624084472656 s.
Environment
I observed the same behaviour with sei-framework installed locally and remotely on a cluster.
Locally: Python 3.6.13, PyTorch 1.9.0, Selene 0.5.1
Cluster: Python 3.9.21, PyTorch 1.13.1, Selene 0.6.0 (for compatibility with GPUs)