struct.error: 'i' format requires -2147483648 <= number <= 2147483647 #2044

@prateek-77

I'm trying to train HTC without the semantic branch on my own dataset. Training runs for several hundred iterations (the log below reaches iteration 950 of epoch 1), then stops with this error: struct.error: 'i' format requires -2147483648 <= number <= 2147483647. The GPU remains in use even after training stops.

  1. What command or script did you run?
     python tools/train.py configs/htc/htc_without_semantic_r50_fpn_1x.py
  2. What dataset did you use?
     iSAID dataset (each image is 800x800)

Environment

sys.platform: linux
Python: 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GPU 0: Tesla V100-PCIE-32GB
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CUDA Runtime 10.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.3
  • Magma 2.5.1
  • Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.5.0
OpenCV: 4.1.2
MMCV: 0.2.16
MMDetection: 1.0+93bed07
MMDetection Compiler: GCC 5.4
MMDetection CUDA Compiler: 10.1

Error traceback

2020-02-02 12:37:11,502 - mmdet - INFO - Epoch [1][650/18528]	lr: 0.00310, eta: 4 days, 18:37:07, time: 0.728, data_time: 0.071, memory: 5590, loss_rpn_cls: 0.1827, loss_rpn_bbox: 0.0775, s0.loss_cls: 0.4941, s0.acc: 88.8125, s0.loss_bbox: 0.1686, s0.loss_mask: 0.5250, s1.loss_cls: 0.1972, s1.acc: 90.7881, s1.loss_bbox: 0.0964, s1.loss_mask: 0.2665, s2.loss_cls: 0.0714, s2.acc: 93.5225, s2.loss_bbox: 0.0229, s2.loss_mask: 0.1330, loss: 2.2353
2020-02-02 12:37:48,581 - mmdet - INFO - Epoch [1][700/18528]	lr: 0.00310, eta: 4 days, 18:35:25, time: 0.742, data_time: 0.092, memory: 5590, loss_rpn_cls: 0.1725, loss_rpn_bbox: 0.0657, s0.loss_cls: 0.5495, s0.acc: 86.7148, s0.loss_bbox: 0.2078, s0.loss_mask: 0.5147, s1.loss_cls: 0.2271, s1.acc: 89.3906, s1.loss_bbox: 0.1113, s1.loss_mask: 0.2618, s2.loss_cls: 0.0833, s2.acc: 92.5547, s2.loss_bbox: 0.0228, s2.loss_mask: 0.1314, loss: 2.3478
2020-02-02 12:38:38,882 - mmdet - INFO - Epoch [1][750/18528]	lr: 0.00310, eta: 4 days, 21:16:57, time: 1.006, data_time: 0.188, memory: 5590, loss_rpn_cls: 0.1754, loss_rpn_bbox: 0.0763, s0.loss_cls: 0.4764, s0.acc: 86.5312, s0.loss_bbox: 0.1909, s0.loss_mask: 0.5148, s1.loss_cls: 0.2147, s1.acc: 87.8662, s1.loss_bbox: 0.1259, s1.loss_mask: 0.2618, s2.loss_cls: 0.0844, s2.acc: 90.8618, s2.loss_bbox: 0.0328, s2.loss_mask: 0.1273, loss: 2.2806
2020-02-02 12:39:13,072 - mmdet - INFO - Epoch [1][800/18528]	lr: 0.00310, eta: 4 days, 20:31:53, time: 0.684, data_time: 0.058, memory: 5590, loss_rpn_cls: 0.1607, loss_rpn_bbox: 0.0589, s0.loss_cls: 0.4762, s0.acc: 88.1055, s0.loss_bbox: 0.1624, s0.loss_mask: 0.5100, s1.loss_cls: 0.2019, s1.acc: 89.7587, s1.loss_bbox: 0.1066, s1.loss_mask: 0.2612, s2.loss_cls: 0.0766, s2.acc: 92.8203, s2.loss_bbox: 0.0282, s2.loss_mask: 0.1265, loss: 2.1691
2020-02-02 12:39:50,342 - mmdet - INFO - Epoch [1][850/18528]	lr: 0.00310, eta: 4 days, 20:25:35, time: 0.745, data_time: 0.080, memory: 5590, loss_rpn_cls: 0.1796, loss_rpn_bbox: 0.0810, s0.loss_cls: 0.4945, s0.acc: 87.5430, s0.loss_bbox: 0.1822, s0.loss_mask: 0.4937, s1.loss_cls: 0.2152, s1.acc: 89.1416, s1.loss_bbox: 0.1204, s1.loss_mask: 0.2475, s2.loss_cls: 0.0812, s2.acc: 92.0859, s2.loss_bbox: 0.0298, s2.loss_mask: 0.1242, loss: 2.2493
2020-02-02 12:40:25,651 - mmdet - INFO - Epoch [1][900/18528]	lr: 0.00310, eta: 4 days, 19:59:45, time: 0.706, data_time: 0.070, memory: 5590, loss_rpn_cls: 0.1652, loss_rpn_bbox: 0.0869, s0.loss_cls: 0.4785, s0.acc: 87.5156, s0.loss_bbox: 0.1794, s0.loss_mask: 0.4574, s1.loss_cls: 0.2248, s1.acc: 88.1298, s1.loss_bbox: 0.1368, s1.loss_mask: 0.2303, s2.loss_cls: 0.0882, s2.acc: 91.3361, s2.loss_bbox: 0.0415, s2.loss_mask: 0.1157, loss: 2.2046
2020-02-02 12:41:01,211 - mmdet - INFO - Epoch [1][950/18528]	lr: 0.00310, eta: 4 days, 19:39:00, time: 0.711, data_time: 0.068, memory: 5590, loss_rpn_cls: 0.1625, loss_rpn_bbox: 0.0865, s0.loss_cls: 0.5250, s0.acc: 86.5234, s0.loss_bbox: 0.2008, s0.loss_mask: 0.4628, s1.loss_cls: 0.2395, s1.acc: 87.7434, s1.loss_bbox: 0.1457, s1.loss_mask: 0.2361, s2.loss_cls: 0.0910, s2.acc: 91.5898, s2.loss_bbox: 0.0393, s2.loss_mask: 0.1149, loss: 2.3042
Traceback (most recent call last):
  File "/opt/conda/envs/open-mmlab/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/opt/conda/envs/open-mmlab/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/envs/open-mmlab/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
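
For context on the last frame of the traceback: the connection code packs the message length n as a signed 32-bit integer ("!i"), so any single message of 2**31 bytes or more cannot be described by that header. A minimal sketch that reproduces only this packing step is below. The helper name fits_in_int32_header is made up for illustration and is not part of multiprocessing or MMDetection.

```python
import struct

# Largest value a signed 32-bit integer can hold (2147483647).
INT32_MAX = 2**31 - 1


def fits_in_int32_header(n_bytes):
    """Return True if a length of n_bytes can be packed with the '!i' format."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        # Same exception type as in the traceback above.
        return False


if __name__ == "__main__":
    print(fits_in_int32_header(INT32_MAX))      # length just inside the limit
    print(fits_in_int32_header(INT32_MAX + 1))  # one byte past the limit
```

This only shows why the header packing fails. It does not identify which object sent through the multiprocessing queue grew past the limit.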

Trials
img_scale values tried: (1400, 800), (1200, 800), (1000, 800), each with and without using the CPU for gt.
GPU memory: 32 GB

Some Changes
Since my dataset has many bounding boxes per image, I set gpu_assign_thr=350 in max_iou_assigner.py. It still throws the same error.
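
One possible line of investigation, offered as a guess rather than a diagnosis: if each instance mask is passed between processes as a dense per-pixel array, images with very many instances could push a single message past the 2147483647-byte limit in the error. The back-of-envelope sketch below assumes one byte per pixel per instance and full-image masks. Both assumptions, and the function names, are hypothetical and should be checked against the actual data pipeline.

```python
# Upper bound from the error message: largest signed 32-bit integer.
INT32_MAX = 2**31 - 1


def dense_mask_bytes(num_instances, height, width, bytes_per_pixel=1):
    """Size in bytes of num_instances full-image masks (assumed dense layout)."""
    return num_instances * height * width * bytes_per_pixel


def max_instances_under_limit(height, width, bytes_per_pixel=1, limit=INT32_MAX):
    """Largest instance count whose dense masks still fit under the limit."""
    return limit // (height * width * bytes_per_pixel)


if __name__ == "__main__":
    # 800x800 is the image size stated in this issue.
    h, w = 800, 800
    n = max_instances_under_limit(h, w)
    print("instances that fit:", n)
    print("bytes at that count:", dense_mask_bytes(n, h, w))
    print("bytes with one more:", dense_mask_bytes(n + 1, h, w))
```

The estimate covers a single image. A batch that collates several images into one message would reach the limit with proportionally fewer instances per image.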

Thanks!
