Trouble with fine-tuning #13

Open

@nemtiax

I've trained a model on COCO for 50k iterations (I know the paper says 100k, but that takes a long time, and I wanted to verify that I'm on the right track), and now I'm trying to fine-tune it on HABBOF+CEPDOF. COCO training seemed to progress well; I used one split of CEPDOF (Lunch1) as my validation data:

[Screenshot: validation AP curve on the Lunch1 split during COCO training]

0.9 AP50 on a CEPDOF split seems great, given that the model has only been trained on COCO images so far. (Ignore the little zigzag at the start; I forgot to reset the logs after a misconfigured run.)

I had to take the following steps to get it running:

- Convert the HABBOF annotation text files into a CEPDOF-style JSON annotation file. This was pretty straightforward, although it did require enforcing the CEPDOF restriction that r < 90 (HABBOF appears to allow r <= 90), so I swapped all r = 90 instances to r = -90. A rough sketch of the conversion follows this list.
- Rename a few CEPDOF folders to match the names in train.py. In particular, CEPDOF has "High_activity" whereas train.py has "Activity", and CEPDOF has "Edge_Cases" whereas train.py has "Edge_cases".
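For reference, my conversion script does roughly the following. This is a minimal sketch: the HABBOF per-line layout (`person cx cy w h angle`), the `.jpg` frame extension, and the exact CEPDOF JSON structure are assumptions from memory, so check them against the real files before reusing this.

```python
import json
from pathlib import Path

def convert_habbof_to_cepdof(habbof_dir, out_json):
    """Fold HABBOF per-frame .txt annotations into one CEPDOF-style JSON.
    Assumes each .txt line reads 'person cx cy w h angle' and that the
    CEPDOF JSON is COCO-like with a 5-number bbox [cx, cy, w, h, degree];
    both assumptions should be verified against the actual datasets."""
    images, annotations = [], []
    ann_id = 0
    for img_id, txt in enumerate(sorted(Path(habbof_dir).glob('*.txt'))):
        # Assumed frame naming: same stem as the annotation file, .jpg extension.
        images.append({'id': img_id, 'file_name': txt.stem + '.jpg'})
        for line in txt.read_text().splitlines():
            parts = line.split()
            if not parts:
                continue
            cx, cy, w, h, ang = map(float, parts[1:6])
            # CEPDOF requires -90 <= angle < 90; HABBOF allows angle == 90,
            # which describes the same box as angle == -90, so swap it.
            if ang >= 90:
                ang = -90
            annotations.append({'id': ann_id, 'image_id': img_id,
                                'bbox': [cx, cy, w, h, ang],
                                'category_id': 1})
            ann_id += 1
    with open(out_json, 'w') as f:
        json.dump({'images': images, 'annotations': annotations,
                   'categories': [{'id': 1, 'name': 'person'}]}, f)
```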

I then ran train.py again, pointing it to my previous COCO checkpoint:

python train.py --dataset=H1H2 --batch_size=8 --checkpoint=rapid_pL1_dark53_COCO608_Oct15_52000.ckpt

However, I'm getting a crash at the end of the first validation section:

Total time: 0:00:09.982568, iter: 0:00:09.982568, epoch: 2:33:45.084843
[Iteration -1] [learning rate 4e-10] [Total loss 373.04] [img size 672]
level_21 total 7 objects: xy/gt 1.204, wh/gt 0.024, angle/gt 0.337, conf 44.211
level_42 total 24 objects: xy/gt 1.111, wh/gt 0.026, angle/gt 0.268, conf 78.706
level_84 total 35 objects: xy/gt 1.319, wh/gt 0.028, angle/gt 0.464, conf 142.954
Max GPU memory usage: 10.145342826843262 GigaBytes
Using PIL.Image format
100%|███████████████████████████████████████████████| 1792/1792 [06:36<00:00,  4.52it/s]
accumulating results
Traceback (most recent call last):
  File "train.py", line 228, in <module>
    str_0 = val_set.evaluate_dtList(dts, metric='AP')
  File "/home/ubuntu/RAPiD_clean/RAPiD/utils/MWtools.py", line 77, in evaluate_dtList
    self._accumulate(**kwargs)
  File "/home/ubuntu/RAPiD_clean/RAPiD/utils/MWtools.py", line 204, in _accumulate
    assert ((tp_sum[:,-1] + fp_sum[:,-1]) == num_dt).all()
IndexError: index -1 is out of bounds for dimension 1 with size 0

After adding a few print statements in _accumulate in utils/MWtools.py, I think the problem is that I'm getting no detections: tp_sum and fp_sum come out with shape (10, 0), so the tp_sum[:,-1] indexing in the assert is what trips the IndexError.
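The prints look roughly like this; tp_sum, fp_sum, and num_dt appear in the assert itself, while num_gt, tps, and fps are my guesses at the neighboring locals in _accumulate:

```python
# Added just above the failing assert in _accumulate() (utils/MWtools.py).
# tp_sum, fp_sum, and num_dt come from the assert; num_gt, tps, and fps
# are guesses at the surrounding local variable names.
print('NUM_GT:', num_gt)    # ground-truth box count for the val split
print('TPS', tps)           # per-threshold TP flags, one column per detection
print('FPS', fps)           # per-threshold FP flags
print('NUM_DT', num_dt)     # number of detections being evaluated
print('TP_SUM', tp_sum)     # cumulative TP counts along the detection axis
print('FP_SUM', fp_sum)     # cumulative FP counts
```

With those in place, the validation run printed: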


Total time: 0:00:09.720600, iter: 0:00:09.720600, epoch: 2:29:42.951354
[Iteration -1] [learning rate 4e-10] [Total loss 188.62] [img size 672]
level_21 total 10 objects: xy/gt 1.033, wh/gt 0.013, angle/gt 0.350, conf 17.059
level_42 total 18 objects: xy/gt 1.259, wh/gt 0.015, angle/gt 0.209, conf 45.737
level_84 total 15 objects: xy/gt 1.318, wh/gt 0.016, angle/gt 0.229, conf 62.061
Max GPU memory usage: 10.145342826843262 GigaBytes
Using PIL.Image format
100%|███████████████████████████████████████████████| 1792/1792 [06:07<00:00,  4.88it/s]
accumulating results
NUM_GT: 6917
TPS tensor([], size=(10, 0), dtype=torch.bool)
FPS tensor([], size=(10, 0), dtype=torch.bool)
NUM_DT 0
TP_SUM tensor([], size=(10, 0))
FP_SUM tensor([], size=(10, 0))

Any advice on what might be going wrong here? Have I missed a step in setting up fine-tuning?

Also, would it be possible to make your pre-finetuning COCO checkpoint available? It'd save me a lot of training time.
