Why does version 1.0 fail with an out-of-GPU-memory error when the released 0.9 version runs without problems? #10545

Open
18liumin opened this issue Jul 10, 2024 · 0 comments
Labels
bug community events from community

Comments

@18liumin

Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
/usr1/lm/model/transformer/cv/imagenet/compaire_speed_with_torch.py:218: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
warnings.warn('You have chosen a specific GPU. This will completely '
Use GPU: 0 for training
=> creating model
=> Dummy data is used!
oneflow model training total time: 477.7218196541071
terminate called after throwing an instance of 'oneflow::RuntimeException'
what(): Error: CUDA out of memory. Tried to allocate 32.0 MB
You can set ONEFLOW_DEBUG or ONEFLOW_PYTHON_STACK_GETTER to 1 to get the Python stack of the error.
Stack trace (most recent call last) in thread 1363223:
File "virtual_machine.cpp", line 0, in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void ()(vm::ThreadCtx, std::function<void (vm::ThreadCtx*)> const&), vm::ThreadCtx*, VirtualMachine::CreateThreadCtx(Symbol, StreamType, unsigned long)::{lambda(vm::ThreadCtx*)#5}> > >::_M_run()
File "virtual_machine.cpp", line 0, in (anonymous namespace)::WorkerLoop(vm::ThreadCtx*, std::function<void (vm::ThreadCtx*)> const&)
Object "/usr1/lm/model/oneflow-test/oneflow-1.0.0/build/liboneflow.so", at 0x7f682a064b07, in vm::ThreadCtx::TryReceiveAndRun()
Object "/usr1/lm/model/oneflow-test/oneflow-1.0.0/build/liboneflow.so", at 0x7f6829ffbf57, in vm::EpStreamPolicyBase::Run(vm::Instruction*) const
Object "/usr1/lm/model/oneflow-test/oneflow-1.0.0/build/liboneflow.so", at 0x7f682a000335, in vm::Instruction::Compute()
Object "/usr1/lm/model/oneflow-test/oneflow-1.0.0/build/liboneflow.so", at 0x7f682a095ae6, in vm::FuseInstructionPolicy::Compute(vm::Instruction*)
Object "/usr1/lm/model/oneflow-test/oneflow-1.0.0/build/liboneflow.so", at 0x7f682a000335, in vm::Instruction::Compute()
Object "/usr1/lm/model/oneflow-test/oneflow-1.0.0/build/liboneflow.so", at 0x7f682a006479, in vm::OpCallInstructionPolicy::Compute(vm::Instruction*)
File "op_call_instruction_policy.cpp", line 0, in vm::OpCallInstructionPolicy::Compute(vm::Instruction*)::{lambda(char const*)#1}::operator()(char const*) const [clone .constprop.0]
File "op_call_instruction_policy.cpp", line 0, in details::Throw::operator=(Error&&) [clone .constprop.0]
File "error.cpp", line 0, in ThrowError(std::shared_ptr const&) [clone .cold]

Aborted (Signal sent by tkill() 1362859 0)
Aborted (core dumped)
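
The error message above suggests setting ONEFLOW_DEBUG or ONEFLOW_PYTHON_STACK_GETTER to 1 to get the Python stack of the failing call. A minimal sketch of one way to do that when re-running the script follows. The two variable names come from the message itself; the helper names `oneflow_debug_env` and `rerun_with_python_stack` are hypothetical, and setting both variables at once (the message says "or") is an assumption made for convenience.

```python
import os
import subprocess
import sys


def oneflow_debug_env(base_env=None):
    """Return a copy of an environment with the flags named in the error message set to "1".

    Setting both flags together is an assumption; the message only says
    that one or the other can be set.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ONEFLOW_DEBUG"] = "1"
    env["ONEFLOW_PYTHON_STACK_GETTER"] = "1"
    return env


def rerun_with_python_stack(script_path, *script_args):
    """Re-run a training script in a child process with the debug flags enabled.

    Returns the subprocess.CompletedProcess so the caller can inspect the
    exit code; a crash in the child does not raise here (check=False).
    """
    cmd = [sys.executable, script_path, *script_args]
    return subprocess.run(cmd, env=oneflow_debug_env(), check=False)
```

Calling `rerun_with_python_stack("compaire_speed_with_torch.py")` from the script's directory would then launch the same run with the flags set in the child process only, leaving the current shell environment unchanged.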

@18liumin 18liumin added bug community events from community labels Jul 10, 2024