onmt_train: an illegal memory access was encountered #1836

Closed
@zhangqianjin

Description

onmt_train -data demo/data -save_model demo-model -layers 6 -rnn_size 64 -word_vec_size 64 -transformer_ff 256 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 20000 -max_generator_batches 2 -batch_size 640 -dropout 0.1 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 1000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 50 -save_checkpoint_steps 500 -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7

When validation begins, the following error occurs:
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCReduceAll.cuh:327
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fd98245c536 in /data/common_tool/anaconda3/envs/dnn/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7fd98269ffbe in /data/common_tool/anaconda3/envs/dnn/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)

Environment: PyTorch 1.5, CUDA 10.2
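CUDA errors are reported asynchronously, so the location in the trace above (THCReduceAll.cuh) may not be the kernel that actually faulted. A minimal sketch for narrowing this down, assuming the crash also reproduces on a single GPU: rerun the same command with CUDA_LAUNCH_BLOCKING=1 and only one rank. The helper names to_single_gpu and debug_env are hypothetical and used only for illustration; -world_size and -gpu_ranks are the flags from the command above.

```python
import os
import shlex


def to_single_gpu(argv):
    """Drop -world_size / -gpu_ranks (and their values), then pin the run to GPU 0."""
    out = []
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok in ("-world_size", "-gpu_ranks"):
            i += 1
            # Skip every value that belongs to this flag (values do not start with "-").
            while i < len(argv) and not argv[i].startswith("-"):
                i += 1
            continue
        out.append(tok)
        i += 1
    return out + ["-world_size", "1", "-gpu_ranks", "0"]


def debug_env(base_env):
    """Copy the environment and make CUDA kernel launches synchronous."""
    env = dict(base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # errors then surface at the launching call
    return env


# Shortened stand-in for the full command in this report (assumption: the
# remaining flags are passed through unchanged).
original = "onmt_train -data demo/data -save_model demo-model -valid_steps 50 -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7"
argv = to_single_gpu(shlex.split(original))
env = debug_env(os.environ)

# Print the command instead of running it, so it can be inspected first.
print("CUDA_LAUNCH_BLOCKING=1 " + " ".join(shlex.quote(a) for a in argv))
```

With synchronous launches the Python stack trace should point at the operation that triggered the illegal access, at the cost of slower training. If the single-GPU run passes validation cleanly, that would suggest the problem is specific to the multi-GPU path.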
