Issue with 3d-unet #1845
Comments
That looks like a |
Hi @arjunsuresh, thanks for the advice. I tried about 4-5 times, and in 4 out of 5 runs I got the error shown above after ~2000-3000 seconds. I did eventually get some results for other neural networks. Please answer the questions below.
resnet50 [results table not preserved]
How should I interpret these results, and what should I compare them with? I found some tables here: https://mlcommons.org/benchmarks/inference-edge/
Did I understand correctly that Throughput corresponds to "Samples"? And what should I do with "Accuracy"?
I also wanted to run resnet50 in single mode, but I got an error. I took the command from here: https://docs.mlcommons.org/inference/benchmarks/language/bert/#__tabbed_59_3
Log with error: |
@arjunsuresh Hi. I ran the command inside the container. And I got an error |
@Agalakdak Can you please open a separate issue for each model-related query?
For R50, can you try adding this option?
For the 3d-unet failure, can you please add
When running in the "closed" division, accuracy must be above the threshold or else the submission checker will fail. For this reason, accuracy is not reported in the official results. In other words, the accuracy values of all submissions are expected to be very close in the closed division, so only the performance number matters.
For throughput: yes, it is "samples per second" for most benchmarks and "tokens per second" for the LLM ones. |
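To make the two metrics above concrete, here is a minimal sketch of how they relate, assuming throughput is simply samples divided by elapsed time and that the accuracy gate is a fraction of a reference score (the "99" in 3d-unet-99 denotes a 99% target). The function names and the numbers in the usage note are hypothetical and are not taken from the MLPerf submission checker.

```python
def throughput(samples_processed: int, elapsed_seconds: float) -> float:
    """Return samples per second for a measured run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return samples_processed / elapsed_seconds


def meets_accuracy_target(measured: float, reference: float,
                          target_fraction: float = 0.99) -> bool:
    """Return True when the measured accuracy reaches the given
    fraction of the reference accuracy (hypothetical gate, for illustration)."""
    return measured >= reference * target_fraction
```

For example, `throughput(5000, 10.0)` gives 500.0 samples per second, and a run only counts in the closed division if a check like `meets_accuracy_target(measured, reference)` passes, which is why the published tables show performance alone.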
Hi @arjunsuresh, sorry for the late reply. I was busy with other things. I tried your advice and unfortunately got errors again. But this time I can provide the full logs of the first and second steps. The first step is running the command that enters the container. The second step is running the command inside the container itself. 3dunet_full_error_second_step.log |
Hi @Agalakdak The second command you shared is for |
It's working fine for me.
What I suspect is a failed download of the kits19 dataset, as the Nvidia script below skips a redownload if the file already exists, without checking its validity. The command below will give you the path to NVIDIA_SCRATCH, where the data gets downloaded. Manually remove the kits19 data directory from there and then retry the command.
|
Also, the kits19 download is slow and can take several hours to complete. |
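The failure mode suspected above (an existing file is reused without checking that it is valid) can be avoided with a checksum test before skipping a download. The following is a minimal sketch, not the Nvidia script's actual logic; `needs_download` and the expected SHA-256 value are hypothetical, and it assumes a known-good checksum is available for each file.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large dataset files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: Path, expected_sha256: str) -> bool:
    """Return True if the file is missing or fails the checksum.

    A file that exists but does not match is removed, so a later
    download starts from a clean state instead of reusing stale data.
    """
    if not path.is_file():
        return True
    if sha256_of(path) != expected_sha256.lower():
        path.unlink()
        return True
    return False
```

With a check like this, an interrupted or truncated download is detected on the next run rather than silently reused.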
Hello @arjunsuresh. I think I figured out what the problem is. It's a network issue...
--2024-09-17 02:52:32-- http://92dd3d24cf78d07aa31165f90c636d98c4adddcd/vision/medical_imaging/3d-unet-kits19/meta/inference_cases.json
--2024-09-17 02:52:32-- https://raw.githubusercontent.com/mlcommons/inference/486a629ea4d5c5150f452d0b0a196bf71fd2021e
--2024-09-17 02:52:32-- http://92dd3d24cf78d07aa31165f90c636d98c4adddcd/vision/medical_imaging/3d-unet-kits19/meta/calibration_cases.json
Done.
CM error: Portable CM script failed (name = app-mlperf-inference-nvidia, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
https://github.com/mlcommons/cm4mlops/issues
The CM concept is to collaboratively fix such issues inside portable CM scripts |
@Agalakdak Actually, that looks like a problem with the download script, which is creating invalid URLs. It probably worked fine for me because some of the downloaded files were already present. We'll fix this issue in the script. |
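In the wget output quoted earlier, the invalid URLs have a 40-character commit hash where the hostname should be. A minimal sketch of a pre-download sanity check that would catch this pattern is shown below; `looks_malformed` is a hypothetical helper, not part of the CM or Nvidia scripts, and it only targets this specific symptom.

```python
import re
from urllib.parse import urlsplit

# A host made only of hex digits is almost certainly a commit hash
# that was placed where the hostname belongs.
_HEX_ONLY = re.compile(r"[0-9a-f]{7,40}")


def looks_malformed(url: str) -> bool:
    """Return True for URLs with no usable http(s) host or a hex-only host."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.scheme not in ("http", "https") or not host:
        return True
    return _HEX_ONLY.fullmatch(host) is not None
```

Running this over a list of URLs before handing them to a downloader would flag the two `http://92dd3d24...` entries from the log while accepting the `raw.githubusercontent.com` one.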
@arjunsuresh If you need more information about my system, please let me know |
Hi @Agalakdak Can you please do this (inside the container)
And retry the command? |
@arjunsuresh Hi, I tried the advice above. It didn't help. There aren't many logs, so I just copied them below.
cmuser@ccaeef79d72e:$ cd ~~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
CM error: artifact(s) not found!
...
cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ source
CM error: artifact(s) not found! |
Hi @Agalakdak Can you please try |
@arjunsuresh
I tried to run "cm rm cache --tags=_preprocess_data -f" right after entering the container. And the command completed successfully. But it did not give any result. |
Can you retry the original command? No need to do command number 3. |
@arjunsuresh
cm run script --tags=run-mlperf,inference,_r4.1-dev (and getting an error) Full log: |
No worries. I have added some extra checks for existing stale files. Can you please do |
Hi @arjunsuresh , I repeated all the commands as I did above. I got the same result. |
Hello everyone. I have already submitted a bug report here, but that thread accumulated a lot of messages, so I decided to open a new one.
This time I ran 3D-UNet using the command below, taken from this site:
https://docs.mlcommons.org/inference/benchmarks/medical_imaging/3d-unet/
The command
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=3d-unet-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=50
and a brief error report:
0.580 INFO:root: ! cd /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.580 INFO:root: ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-git-repo/run.sh from tmp-run.sh
0.584 /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.585 ******************************************************
0.585 Current directory: /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.585
0.585 Cloning inference from https://github.com/mlcommons/inference
0.585
0.585 git clone -b master https://github.com/mlcommons/inference --depth 5 inference
0.585
0.586 Cloning into 'inference'...
38.68 fatal: the remote end hung up unexpectedly
38.69 fatal: early EOF
38.69 fatal: index-pack failed
38.69 Detected version: 3.8.10
38.69 Detected version: 3.8.10
38.69
38.69 CM error: Portable CM script failed (name = get-git-repo, return code = 256)
38.69
38.69
38.69 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
38.69 Note that it is often a portability issue of a third-party tool or a native script
38.69 wrapped and unified by this CM script (automation recipe). Please re-run
38.69 this script with --repro flag and report this issue with the original
38.69 command line, cm-repro directory and full log here:
38.69
38.69 https://github.com/mlcommons/cm4mlops/issues
38.69
38.69 The CM concept is to collaboratively fix such issues inside portable CM scripts
38.69 to make existing tools and native scripts more portable, interoperable
38.69 and deterministic. Thank you!
Full log with the problem
error_3dunet.log
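The `fatal: early EOF` and `the remote end hung up unexpectedly` messages in the log above typically indicate a dropped connection during the clone, so retrying is a reasonable first step. Below is a minimal sketch of a retry wrapper with exponential backoff; `clone_with_retries` is a hypothetical helper, not something CM provides, and the `run` and `sleep` parameters exist only so the logic can be exercised without network access.

```python
import subprocess
import time
from typing import Callable, Sequence


def _run_command(command: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(command)).returncode


def clone_with_retries(
    command: Sequence[str],
    attempts: int = 3,
    base_delay: float = 5.0,
    run: Callable[[Sequence[str]], int] = _run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry a command until it exits with 0 or attempts run out.

    The wait doubles after each failure: base_delay, 2*base_delay, ...
    """
    for attempt in range(attempts):
        if run(command) == 0:
            return True
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

Using the clone command from the log, a call might look like `clone_with_retries(["git", "clone", "-b", "master", "https://github.com/mlcommons/inference", "--depth", "5", "inference"])`. If every attempt fails, the network path to GitHub from inside the Docker build is the more likely culprit than the script itself.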