Trouble with fine-tuning #13

Open

@nemtiax

I've trained a model on COCO for 50k iterations (I know the paper says 100k, but that takes a long time, and I wanted to verify that I'm on the right track), and now I'm trying to fine-tune it on HABBOF+CEPDOF. COCO training seemed to progress well; I used one split of CEPDOF (Lunch1) as my validation data:

[Screenshot: validation AP curve on the Lunch1 split during COCO training]

0.9 AP50 on a CEPDOF split seems great, given that the model has only been trained on COCO images so far. (Ignore the little zigzag at the start; I forgot to reset the logs after a misconfigured run.)

I had to take the following steps to get it running:

- Convert the HABBOF annotation text files into a CEPDOF-style JSON annotation file. This was pretty straightforward, although it did require enforcing the CEPDOF restriction that r < 90 (HABBOF appears to allow r <= 90), so I swapped all r = 90 instances to r = -90. A rough sketch of the conversion follows this list.
- Rename a few CEPDOF folders to match the names in train.py. In particular, CEPDOF has "High_activity" whereas train.py has "Activity", and CEPDOF has "Edge_Cases" whereas train.py has "Edge_cases".
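For reference, my conversion script does roughly the following. This is a minimal sketch: the HABBOF per-line layout (`person cx cy w h angle`), the `.jpg` frame extension, and the exact CEPDOF JSON structure are assumptions from memory, so check them against the real files before reusing this.

```python
import json
from pathlib import Path

def convert_habbof_to_cepdof(habbof_dir, out_json):
    """Fold HABBOF per-frame .txt annotations into one CEPDOF-style JSON.
    Assumes each .txt line reads 'person cx cy w h angle' and that the
    CEPDOF JSON is COCO-like with a 5-number bbox [cx, cy, w, h, degree];
    both assumptions should be verified against the actual datasets."""
    images, annotations = [], []
    ann_id = 0
    for img_id, txt in enumerate(sorted(Path(habbof_dir).glob('*.txt'))):
        # Assumed frame naming: same stem as the annotation file, .jpg extension.
        images.append({'id': img_id, 'file_name': txt.stem + '.jpg'})
        for line in txt.read_text().splitlines():
            parts = line.split()
            if not parts:
                continue
            cx, cy, w, h, ang = map(float, parts[1:6])
            # CEPDOF requires -90 <= angle < 90; HABBOF allows angle == 90,
            # which describes the same box as angle == -90, so swap it.
            if ang >= 90:
                ang = -90
            annotations.append({'id': ann_id, 'image_id': img_id,
                                'bbox': [cx, cy, w, h, ang],
                                'category_id': 1})
            ann_id += 1
    with open(out_json, 'w') as f:
        json.dump({'images': images, 'annotations': annotations,
                   'categories': [{'id': 1, 'name': 'person'}]}, f)
```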

I then ran train.py again, pointing it to my previous COCO checkpoint:

python train.py --dataset=H1H2 --batch_size=8 --checkpoint=rapid_pL1_dark53_COCO608_Oct15_52000.ckpt

However, I'm getting a crash at the end of the first validation section:

Total time: 0:00:09.982568, iter: 0:00:09.982568, epoch: 2:33:45.084843
[Iteration -1] [learning rate 4e-10] [Total loss 373.04] [img size 672]
level_21 total 7 objects: xy/gt 1.204, wh/gt 0.024, angle/gt 0.337, conf 44.211
level_42 total 24 objects: xy/gt 1.111, wh/gt 0.026, angle/gt 0.268, conf 78.706
level_84 total 35 objects: xy/gt 1.319, wh/gt 0.028, angle/gt 0.464, conf 142.954
Max GPU memory usage: 10.145342826843262 GigaBytes
Using PIL.Image format
100%|███████████████████████████████████████████████| 1792/1792 [06:36<00:00,  4.52it/s]
accumulating results
Traceback (most recent call last):
  File "train.py", line 228, in <module>
    str_0 = val_set.evaluate_dtList(dts, metric='AP')
  File "/home/ubuntu/RAPiD_clean/RAPiD/utils/MWtools.py", line 77, in evaluate_dtList
    self._accumulate(**kwargs)
  File "/home/ubuntu/RAPiD_clean/RAPiD/utils/MWtools.py", line 204, in _accumulate
    assert ((tp_sum[:,-1] + fp_sum[:,-1]) == num_dt).all()
IndexError: index -1 is out of bounds for dimension 1 with size 0

After adding a few print statements in _accumulate in utils/MWtools.py, I think the problem is that I'm getting no detections: tp_sum and fp_sum come out with shape (10, 0), so the tp_sum[:,-1] indexing in the assert is what trips the IndexError.
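The prints look roughly like this; tp_sum, fp_sum, and num_dt appear in the assert itself, while num_gt, tps, and fps are my guesses at the neighboring locals in _accumulate:

```python
# Added just above the failing assert in _accumulate() (utils/MWtools.py).
# tp_sum, fp_sum, and num_dt come from the assert; num_gt, tps, and fps
# are guesses at the surrounding local variable names.
print('NUM_GT:', num_gt)    # ground-truth box count for the val split
print('TPS', tps)           # per-threshold TP flags, one column per detection
print('FPS', fps)           # per-threshold FP flags
print('NUM_DT', num_dt)     # number of detections being evaluated
print('TP_SUM', tp_sum)     # cumulative TP counts along the detection axis
print('FP_SUM', fp_sum)     # cumulative FP counts
```

With those in place, the validation run printed: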


Total time: 0:00:09.720600, iter: 0:00:09.720600, epoch: 2:29:42.951354
[Iteration -1] [learning rate 4e-10] [Total loss 188.62] [img size 672]
level_21 total 10 objects: xy/gt 1.033, wh/gt 0.013, angle/gt 0.350, conf 17.059
level_42 total 18 objects: xy/gt 1.259, wh/gt 0.015, angle/gt 0.209, conf 45.737
level_84 total 15 objects: xy/gt 1.318, wh/gt 0.016, angle/gt 0.229, conf 62.061
Max GPU memory usage: 10.145342826843262 GigaBytes
Using PIL.Image format
100%|███████████████████████████████████████████████| 1792/1792 [06:07<00:00,  4.88it/s]
accumulating results
NUM_GT: 6917
TPS tensor([], size=(10, 0), dtype=torch.bool)
FPS tensor([], size=(10, 0), dtype=torch.bool)
NUM_DT 0
TP_SUM tensor([], size=(10, 0))
FP_SUM tensor([], size=(10, 0))

Any advice on what might be going wrong here? Have I missed a step in setting up fine-tuning?

Also, would it be possible to make your pre-finetuning COCO checkpoint available? It'd save me a lot of training time.
