Description
I'm trying to train HTC without the semantic branch on my own dataset. Training runs for a number of iterations, then stops with this error: struct.error: 'i' format requires -2147483648 <= number <= 2147483647. The GPU remains in use even after training has stopped.
- What command or script did you run?
python tools/train.py configs/htc/htc_without_semantic_r50_fpn_1x.py
- What dataset did you use?
iSAID dataset (each image is 800x800)
Environment
sys.platform: linux
Python: 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GPU 0: Tesla V100-PCIE-32GB
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.1
- Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.5.0
OpenCV: 4.1.2
MMCV: 0.2.16
MMDetection: 1.0+93bed07
MMDetection Compiler: GCC 5.4
MMDetection CUDA Compiler: 10.1
Error traceback
2020-02-02 12:37:11,502 - mmdet - INFO - Epoch [1][650/18528] lr: 0.00310, eta: 4 days, 18:37:07, time: 0.728, data_time: 0.071, memory: 5590, loss_rpn_cls: 0.1827, loss_rpn_bbox: 0.0775, s0.loss_cls: 0.4941, s0.acc: 88.8125, s0.loss_bbox: 0.1686, s0.loss_mask: 0.5250, s1.loss_cls: 0.1972, s1.acc: 90.7881, s1.loss_bbox: 0.0964, s1.loss_mask: 0.2665, s2.loss_cls: 0.0714, s2.acc: 93.5225, s2.loss_bbox: 0.0229, s2.loss_mask: 0.1330, loss: 2.2353
2020-02-02 12:37:48,581 - mmdet - INFO - Epoch [1][700/18528] lr: 0.00310, eta: 4 days, 18:35:25, time: 0.742, data_time: 0.092, memory: 5590, loss_rpn_cls: 0.1725, loss_rpn_bbox: 0.0657, s0.loss_cls: 0.5495, s0.acc: 86.7148, s0.loss_bbox: 0.2078, s0.loss_mask: 0.5147, s1.loss_cls: 0.2271, s1.acc: 89.3906, s1.loss_bbox: 0.1113, s1.loss_mask: 0.2618, s2.loss_cls: 0.0833, s2.acc: 92.5547, s2.loss_bbox: 0.0228, s2.loss_mask: 0.1314, loss: 2.3478
2020-02-02 12:38:38,882 - mmdet - INFO - Epoch [1][750/18528] lr: 0.00310, eta: 4 days, 21:16:57, time: 1.006, data_time: 0.188, memory: 5590, loss_rpn_cls: 0.1754, loss_rpn_bbox: 0.0763, s0.loss_cls: 0.4764, s0.acc: 86.5312, s0.loss_bbox: 0.1909, s0.loss_mask: 0.5148, s1.loss_cls: 0.2147, s1.acc: 87.8662, s1.loss_bbox: 0.1259, s1.loss_mask: 0.2618, s2.loss_cls: 0.0844, s2.acc: 90.8618, s2.loss_bbox: 0.0328, s2.loss_mask: 0.1273, loss: 2.2806
2020-02-02 12:39:13,072 - mmdet - INFO - Epoch [1][800/18528] lr: 0.00310, eta: 4 days, 20:31:53, time: 0.684, data_time: 0.058, memory: 5590, loss_rpn_cls: 0.1607, loss_rpn_bbox: 0.0589, s0.loss_cls: 0.4762, s0.acc: 88.1055, s0.loss_bbox: 0.1624, s0.loss_mask: 0.5100, s1.loss_cls: 0.2019, s1.acc: 89.7587, s1.loss_bbox: 0.1066, s1.loss_mask: 0.2612, s2.loss_cls: 0.0766, s2.acc: 92.8203, s2.loss_bbox: 0.0282, s2.loss_mask: 0.1265, loss: 2.1691
2020-02-02 12:39:50,342 - mmdet - INFO - Epoch [1][850/18528] lr: 0.00310, eta: 4 days, 20:25:35, time: 0.745, data_time: 0.080, memory: 5590, loss_rpn_cls: 0.1796, loss_rpn_bbox: 0.0810, s0.loss_cls: 0.4945, s0.acc: 87.5430, s0.loss_bbox: 0.1822, s0.loss_mask: 0.4937, s1.loss_cls: 0.2152, s1.acc: 89.1416, s1.loss_bbox: 0.1204, s1.loss_mask: 0.2475, s2.loss_cls: 0.0812, s2.acc: 92.0859, s2.loss_bbox: 0.0298, s2.loss_mask: 0.1242, loss: 2.2493
2020-02-02 12:40:25,651 - mmdet - INFO - Epoch [1][900/18528] lr: 0.00310, eta: 4 days, 19:59:45, time: 0.706, data_time: 0.070, memory: 5590, loss_rpn_cls: 0.1652, loss_rpn_bbox: 0.0869, s0.loss_cls: 0.4785, s0.acc: 87.5156, s0.loss_bbox: 0.1794, s0.loss_mask: 0.4574, s1.loss_cls: 0.2248, s1.acc: 88.1298, s1.loss_bbox: 0.1368, s1.loss_mask: 0.2303, s2.loss_cls: 0.0882, s2.acc: 91.3361, s2.loss_bbox: 0.0415, s2.loss_mask: 0.1157, loss: 2.2046
2020-02-02 12:41:01,211 - mmdet - INFO - Epoch [1][950/18528] lr: 0.00310, eta: 4 days, 19:39:00, time: 0.711, data_time: 0.068, memory: 5590, loss_rpn_cls: 0.1625, loss_rpn_bbox: 0.0865, s0.loss_cls: 0.5250, s0.acc: 86.5234, s0.loss_bbox: 0.2008, s0.loss_mask: 0.4628, s1.loss_cls: 0.2395, s1.acc: 87.7434, s1.loss_bbox: 0.1457, s1.loss_mask: 0.2361, s2.loss_cls: 0.0910, s2.acc: 91.5898, s2.loss_bbox: 0.0393, s2.loss_mask: 0.1149, loss: 2.3042
Traceback (most recent call last):
File "/opt/conda/envs/open-mmlab/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
send_bytes(obj)
File "/opt/conda/envs/open-mmlab/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/opt/conda/envs/open-mmlab/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
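For readers unfamiliar with this failure, here is a minimal sketch of what the last frame of the traceback is doing. The standard-library call struct.pack("!i", n) writes the message length n as a 4-byte signed integer, so any payload larger than 2147483647 bytes (just under 2 GiB) cannot be described by that header. The helper names below are hypothetical and only mirror the call shown in the traceback; they are not part of multiprocessing or MMDetection. Newer Python versions (3.8 and later, as far as I know) use a wider header for large messages, but that has not been checked against this setup.

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the upper bound quoted in the error


def pack_length_header(n: int) -> bytes:
    """Pack a message length the same way the traceback does: '!i'."""
    return struct.pack("!i", n)


def fits_in_header(n: int) -> bool:
    """Return True if a payload of n bytes can be described by the header."""
    try:
        pack_length_header(n)
        return True
    except struct.error:
        return False


if __name__ == "__main__":
    print(fits_in_header(INT32_MAX))      # largest length that still packs
    print(fits_in_header(INT32_MAX + 1))  # one byte more triggers struct.error
```

In other words, the error suggests that a single object put on a multiprocessing queue (most likely a batch produced by a data loader worker) serialized to more than about 2 GiB.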
Trials
img_scale values tried: (1400, 800), (1200, 800), (1000, 800), each with and without using the CPU for gt
GPU memory: 32 GB
Some Changes
Since my dataset has many bboxes in a single image, I set gpu_assign_thr=350 in max_iou_assigner.py. It still throws the same error.
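One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that the per-instance ground-truth masks make a single batch too large to pass between processes. The rough estimate below assumes each instance is stored as a full-image mask with one byte per pixel; the function names are hypothetical and the real in-memory layout may differ.

```python
INT32_MAX = 2**31 - 1  # limit of the 4-byte signed length header


def mask_payload_bytes(num_instances, height, width, bytes_per_pixel=1):
    """Approximate size of dense per-instance masks for one image.

    Assumes one full-resolution mask per instance (an assumption,
    not verified against the actual data pipeline).
    """
    return num_instances * height * width * bytes_per_pixel


def max_instances_under_limit(height, width, bytes_per_pixel=1):
    """Largest instance count whose masks still fit under the header limit."""
    return INT32_MAX // (height * width * bytes_per_pixel)


if __name__ == "__main__":
    # 350 instances on an 800x800 image
    print(mask_payload_bytes(350, 800, 800))
    # instance count at which 800x800 masks alone approach the limit
    print(max_instances_under_limit(800, 800))
```

Under these assumptions, 350 instances on an 800x800 image come to about 224 MB, and the limit is reached at a few thousand instances per batch. If some images contain that many objects, checking the instance count of the samples loaded just before the crash would be a reasonable next step.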
Thanks!