Description
I've trained a model on COCO for 50k iterations (I know the paper says 100k, but that takes a long time, and I wanted to verify that I'm on the right track), and now I'm trying to fine-tune it on HABBOF+CEPDOF. COCO training seemed to progress well; I used one split of CEPDOF (Lunch1) as my validation data:
0.9 AP50 on a CEPDOF split seems great, given that the model has only been trained on COCO images so far. (Ignore the little zigzag at the start; I forgot to reset the logs after a misconfigured run.)
I had to do the following steps to get it to run:
- Convert the HABBOF annotation text files into a CEPDOF-style json annotation file (see the sketch after this list). This was pretty straightforward, although it did require enforcing the CEPDOF restriction that r<90 (HABBOF appears to allow r<=90). I just swapped all r=90 instances to r=-90.
- Rename a few CEPDOF folders to match the names in train.py. In particular, CEPDOF has "High_activity" whereas train.py has "Activity", and CEPDOF has "Edge_Cases" whereas train.py has "Edge_cases".
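In case it's relevant, here's roughly what my conversion does. This is a simplified sketch, not my exact script: the JSON keys ("images", "annotations", "bbox" as [cx, cy, w, h, angle]) and the assumption that each HABBOF line reads "person cx cy w h angle" are my reading of the two datasets, and the paths are just examples.

```python
import json
from pathlib import Path

def habbof_to_cepdof(txt_dir, out_json):
    """Convert a folder of HABBOF annotation .txt files into one
    CEPDOF-style JSON file (field names are my best guess from the
    CEPDOF annotation files, not an official spec)."""
    images, annotations = [], []
    ann_id = 0
    for txt_path in sorted(Path(txt_dir).glob('*.txt')):
        img_name = txt_path.stem + '.jpg'   # assumes frames are stored as .jpg
        images.append({'id': img_name, 'file_name': img_name})
        for line in txt_path.read_text().splitlines():
            parts = line.split()
            if not parts:
                continue
            # assumed HABBOF line format: "person cx cy w h angle"
            cx, cy, w, h, angle = map(float, parts[1:6])
            if angle == 90:                 # CEPDOF requires r < 90,
                angle = -90                 # HABBOF allows r <= 90
            annotations.append({
                'id': ann_id,
                'image_id': img_name,
                'bbox': [cx, cy, w, h, angle],
                'category_id': 1,           # single 'person' class
            })
            ann_id += 1
    with open(out_json, 'w') as f:
        json.dump({'images': images, 'annotations': annotations}, f)

# example usage (paths are placeholders for my local layout)
habbof_to_cepdof('HABBOF/Lab1', 'HABBOF/annotations/Lab1.json')
```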
I then ran train.py again, pointing it to my previous COCO checkpoint:
python train.py --dataset=H1H2 --batch_size=8 --checkpoint=rapid_pL1_dark53_COCO608_Oct15_52000.ckpt
However, I'm getting a crash at the end of the first validation section:
Total time: 0:00:09.982568, iter: 0:00:09.982568, epoch: 2:33:45.084843
[Iteration -1] [learning rate 4e-10] [Total loss 373.04] [img size 672]
level_21 total 7 objects: xy/gt 1.204, wh/gt 0.024, angle/gt 0.337, conf 44.211
level_42 total 24 objects: xy/gt 1.111, wh/gt 0.026, angle/gt 0.268, conf 78.706
level_84 total 35 objects: xy/gt 1.319, wh/gt 0.028, angle/gt 0.464, conf 142.954
Max GPU memory usage: 10.145342826843262 GigaBytes
Using PIL.Image format
100%|███████████████████████████████████████████████| 1792/1792 [06:36<00:00, 4.52it/s]
accumulating results
Traceback (most recent call last):
File "train.py", line 228, in <module>
str_0 = val_set.evaluate_dtList(dts, metric='AP')
File "/home/ubuntu/RAPiD_clean/RAPiD/utils/MWtools.py", line 77, in evaluate_dtList
self._accumulate(**kwargs)
File "/home/ubuntu/RAPiD_clean/RAPiD/utils/MWtools.py", line 204, in _accumulate
assert ((tp_sum[:,-1] + fp_sum[:,-1]) == num_dt).all()
IndexError: index -1 is out of bounds for dimension 1 with size 0
After adding a few print statements in _accumulate in utils/MWtools.py, I think the problem is that I'm getting no detections at all:
Total time: 0:00:09.720600, iter: 0:00:09.720600, epoch: 2:29:42.951354
[Iteration -1] [learning rate 4e-10] [Total loss 188.62] [img size 672]
level_21 total 10 objects: xy/gt 1.033, wh/gt 0.013, angle/gt 0.350, conf 17.059
level_42 total 18 objects: xy/gt 1.259, wh/gt 0.015, angle/gt 0.209, conf 45.737
level_84 total 15 objects: xy/gt 1.318, wh/gt 0.016, angle/gt 0.229, conf 62.061
Max GPU memory usage: 10.145342826843262 GigaBytes
Using PIL.Image format
100%|███████████████████████████████████████████████| 1792/1792 [06:07<00:00, 4.88it/s]
accumulating results
NUM_GT: 6917
TPS tensor([], size=(10, 0), dtype=torch.bool)
FPS tensor([], size=(10, 0), dtype=torch.bool)
NUM_DT 0
TP_SUM tensor([], size=(10, 0))
FP_SUM tensor([], size=(10, 0))
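As a stopgap I can guard the evaluation call so training doesn't die, but that obviously doesn't explain why dts comes back empty. Something like the following (a hypothetical edit around the evaluate_dtList call in train.py; dts and val_set are the names from the traceback):

```python
# Hypothetical guard around the evaluate_dtList call in train.py.
# It only avoids the crash; it doesn't address the empty detection list.
if len(dts) == 0:
    print('WARNING: no detections above the confidence threshold; skipping AP evaluation')
else:
    str_0 = val_set.evaluate_dtList(dts, metric='AP')
```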
Any advice on what might be going wrong here? Have I missed a step in setting up fine-tuning?
Also, would it be possible to make your pre-finetuning COCO checkpoint available? It'd save me a lot of training time.