Why does version 1.0 fail with a CUDA out-of-memory error when the released 0.9 version runs without problems? #10545

18liumin · 2024-07-10T08:59:29Z

Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
/usr1/lm/model/transformer/cv/imagenet/compaire_speed_with_torch.py:218: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
warnings.warn('You have chosen a specific GPU. This will completely '
Use GPU: 0 for training
=> creating model
=> Dummy data is used!
Total oneflow model training time: 477.7218196541071
terminate called after throwing an instance of 'oneflow::RuntimeException'
what(): Error: CUDA out of memory. Tried to allocate 32.0 MB
You can set ONEFLOW_DEBUG or ONEFLOW_PYTHON_STACK_GETTER to 1 to get the Python stack of the error.
Stack trace (most recent call last) in thread 1363223:
File "virtual_machine.cpp", line 0, in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(vm::ThreadCtx*, std::function<void (vm::ThreadCtx*)> const&), vm::ThreadCtx*, VirtualMachine::CreateThreadCtx(Symbol, StreamType, unsigned long)::{lambda(vm::ThreadCtx*)#5}> > >::_M_run()
File "virtual_machine.cpp", line 0, in (anonymous namespace)::WorkerLoop(vm::ThreadCtx*, std::function<void (vm::ThreadCtx*)> const&)
Object "/usr1/lm/model/oneflow-test/oneflow-1.0.0/build/liboneflow.so", at 0x7f682a064b07, in vm::ThreadCtx::TryReceiveAndRun()
Object "/usr1/lm/model/oneflow-test/oneflow-1.0.0/build/liboneflow.so", at 0x7f6829ffbf57, in vm::EpStreamPolicyBase::Run(vm::Instruction*) const
Object "/usr1/lm/model/oneflow-test/oneflow-1.0.0/build/liboneflow.so", at 0x7f682a000335, in vm::Instruction::Compute()
Object "/usr1/lm/model/oneflow-test/oneflow-1.0.0/build/liboneflow.so", at 0x7f682a095ae6, in vm::FuseInstructionPolicy::Compute(vm::Instruction*)
Object "/usr1/lm/model/oneflow-test/oneflow-1.0.0/build/liboneflow.so", at 0x7f682a000335, in vm::Instruction::Compute()
Object "/usr1/lm/model/oneflow-test/oneflow-1.0.0/build/liboneflow.so", at 0x7f682a006479, in vm::OpCallInstructionPolicy::Compute(vm::Instruction*)
File "op_call_instruction_policy.cpp", line 0, in vm::OpCallInstructionPolicy::Compute(vm::Instruction*)::{lambda(char const*)#1}::operator()(char const*) const [clone .constprop.0]
File "op_call_instruction_policy.cpp", line 0, in details::Throw::operator=(Error&&) [clone .constprop.0]
File "error.cpp", line 0, in ThrowError(std::shared_ptr const&) [clone .cold]

Aborted (Signal sent by tkill() 1362859 0)
Aborted (core dumped)

18liumin added bug community events from community labels Jul 10, 2024
