
There appear to be 1 leaked semaphore objects to clean up at shutdown #8

Open
oscarnevarezleal opened this issue Dec 2, 2022 · 54 comments

Comments

@oscarnevarezleal

oscarnevarezleal commented Dec 2, 2022

Can't complete the conversion of the models to Core ML.

Chip: Apple M2
Memory: 8GB
OS: 13.0.1 (22A400)
pip list
Package                        Version    Editable project location
------------------------------ ---------- ----------------------------------------------------------
accelerate                     0.15.0
certifi                        2022.9.24
charset-normalizer             2.1.1
coremltools                    6.1
diffusers                      0.9.0
filelock                       3.8.0
huggingface-hub                0.11.1
idna                           3.4
importlib-metadata             5.1.0
mpmath                         1.2.1
numpy                          1.23.5
packaging                      21.3
Pillow                         9.3.0
pip                            21.3.1
protobuf                       3.20.3
psutil                         5.9.4
pyparsing                      3.0.9
python-coreml-stable-diffusion 0.1.0      /Users/....
PyYAML                         6.0
regex                          2022.10.31
requests                       2.28.1
scipy                          1.9.3
setuptools                     60.2.0
sympy                          1.11.1
tokenizers                     0.13.2
torch                          1.12.0
tqdm                           4.64.1
transformers                   4.25.1
typing_extensions              4.4.0
urllib3                        1.26.13
wheel                          0.37.1
zipp                           3.11.0

python -m python_coreml_stable_diffusion.torch2coreml --convert-unet --convert-text-encoder --convert-vae-decoder --convert-safety-checker -o packages

!!! macOS 13.1 and newer or iOS/iPadOS 16.2 and newer is required for best performance !!!
INFO:__main__:Initializing StableDiffusionPipeline with CompVis/stable-diffusion-v1-4..
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 11636.70it/s]
INFO:__main__:Done.
INFO:__main__:Converting vae_decoder
INFO:__main__:`vae_decoder` already exists at packages/Stable_Diffusion_version_CompVis_stable-diffusion-v1-4_vae_decoder.mlpackage, skipping conversion.
INFO:__main__:Converted vae_decoder
INFO:__main__:Converting unet
INFO:__main__:Attention implementation in effect: AttentionImplementations.SPLIT_EINSUM
INFO:__main__:Sample inputs spec: {'sample': (torch.Size([2, 4, 64, 64]), torch.float32), 'timestep': (torch.Size([2]), torch.float32), 'encoder_hidden_states': (torch.Size([2, 768, 1, 77]), torch.float32)}
INFO:__main__:JIT tracing..
/Users/xxx/xxx/apple/ml-stable-diffusion/venv/lib/python3.9/site-packages/torch/nn/functional.py:2515: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  _verify_batch_size([input.size(0) * input.size(1) // num_groups, num_groups] + list(input.size()[2:]))
/Users/xxx/xxx/apple/ml-stable-diffusion/python_coreml_stable_diffusion/layer_norm.py:61: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert inputs.size(1) == self.num_channels
INFO:__main__:Done.
INFO:__main__:Converting unet to CoreML..
WARNING:coremltools:Tuple detected at graph output. This will be flattened in the converted model.
Converting PyTorch Frontend ==> MIL Ops:   0%|                                                                           | 0/7876 [00:00<?, ? ops/s]WARNING:coremltools:Saving value type of int64 into a builtin type of int32, might lose precision!
Converting PyTorch Frontend ==> MIL Ops: 100%|█████████████████████████████████████████████████████████████▉| 7874/7876 [00:01<00:00, 4105.24 ops/s]
Running MIL Common passes: 100%|███████████████████████████████████████████████████████████████████████████████| 39/39 [00:27<00:00,  1.43 passes/s]
Running MIL FP16ComputePrecision pass: 100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:44<00:00, 44.50s/ passes]
Running MIL Clean up passes: 100%|█████████████████████████████████████████████████████████████████████████████| 11/11 [03:00<00:00, 16.40s/ passes]
zsh: killed     python -m python_coreml_stable_diffusion.torch2coreml --convert-unet    -o
/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
@enzyme69

enzyme69 commented Dec 2, 2022

I had the same issue:
#5

@felipebaez

Same thing here for me, and in the end I'm missing the safety_checker Core ML model.

@oscarnevarezleal
Author

Just updated the OS to 13.1 preview, still facing the same error.

@martinlexow

martinlexow commented Dec 3, 2022

Same here.

Apple M1 Pro
16 GB RAM
macOS 13.0.1 (22A400)

Edit: After some investigation it seems like my Mac ran out of memory. It worked well in a later attempt.

Screenshot 2022-12-03 at 14 42 41

@enzyme69

enzyme69 commented Dec 5, 2022

8 GB will cause an out-of-memory issue, as suggested by Yasuhito. Best if you can get a compiled model from someone, or try running again and again with only Terminal open after logging in.

@mariapatulea

(Quoting @martinlexow's comment above.)

I have the same amount of RAM on my Mac. Did you keep trying until it worked eventually?

@oscarnevarezleal
Author

@mariapatulea never worked for me

@bensh

bensh commented Apr 26, 2023

I think this is an issue with tqdm and floating point refs on the progress bar.

I get the same issue and don't have coreml installed.

tqdm    4.65.0

@Siriz23

Siriz23 commented May 23, 2023

Hi there!

Has anybody found a solution to this problem?
I'm facing the same issue on an M1 chip.

@tahuuha

tahuuha commented May 29, 2023

I'm facing the same issue on an M1 chip.
Does anyone have a solution?

@tahuuha

tahuuha commented May 29, 2023

Check the solution: AUTOMATIC1111/stable-diffusion-webui#1890

@AlanZhou2022

I've got the same problem in Stable Diffusion v1.5.1 running on a MacBook M2:
anaconda3/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

@vzsg
Contributor

vzsg commented Aug 17, 2023

The line you quoted is just a warning and does not cause any issues. The most common reason why conversions fail is running out of memory, just like in the OP's case; look for a line that says or contains "Killed".

@gamesbykk

I am using a MacBook Pro with an M2 chip on Ventura and facing the same issue.

@frankl1

frankl1 commented Oct 7, 2023

Problem solved on my side by downgrading Python to 3.10.13

@zhanwenchen

I got this error with PyTorch mps while running tqdm=4.65.0. I was able to remove it and install 4.66.1 which solved it. Not a RAM issue.

@YakDriver

I think it might be RAM related even if package versions help - they may just use memory better. It consistently failed for me, and then I closed everything on my Mac that I could and it ran fine without changing versions. 🤷

@chris-heney

I got this error with PyTorch mps while running tqdm=4.65.0. I was able to remove it and install 4.66.1 which solved it. Not a RAM issue.

I agree it's not a RAM issue. I have 96 GB of RAM on a custom-built M2 model and I'm getting the error. I can guarantee it has nothing to do with RAM.

@42piratas

42piratas commented Nov 7, 2023

+1 with the error.
M1 Max 64GB

@mo-foodbit

Getting the same error when training Dreambooth. Did anyone figure out a solution to this?

loc("mps_add"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/a0876c02-1788-11ed-b9c4-96898e02b808/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":219:0)): error: input types 'tensor<1x1280xf16>' and 'tensor<1280xf32>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
./webui.sh: line 255: 38149 Abort trap: 6           "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"
/opt/homebrew/Cellar/python@3.10/3.10.13_1/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@vzsg
Contributor

vzsg commented Nov 15, 2023

It's not the same error though.
Yours was:

error: input types 'tensor<1x1280xf16>' and 'tensor<1280xf32>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).

The warning about the semaphore, just like in the OP (where the real error was zsh: Killed, due to running out of memory), is just a red herring that gets printed after both successful and failed conversions.

@mossishahi

I have the same error on an M3 model with 36 GB of memory! :(

@LukaVerhoeven

Same issue on an M3 with 128 GB of RAM.

@julien-c
Collaborator

julien-c commented Jan 2, 2024

@LukaVerhoeven nice config^ 🙂

@LukaVerhoeven

@LukaVerhoeven nice config^ 🙂

Was hoping for no memory issues with this setup 😒

@zzingae

zzingae commented Jan 17, 2024

It seems related to device type (Mac mps type). When I move mps type tensor to cpu(), the problem no longer appears.
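For anyone wondering what "move the mps tensor to cpu" looks like in practice, here is a minimal, hypothetical PyTorch sketch (illustrative only, not code from this repo): run the misbehaving op on the CPU and move the result back afterwards.

import torch

# Pick the Apple GPU backend when available, otherwise fall back to the CPU.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

x = torch.randn(2, 4, 64, 64, device=device)

# Hypothetical workaround: perform the op that misbehaves on mps on the CPU instead ...
y = (x.to("cpu") * 2.0).sum(dim=1)

# ... then move the result back to the original device for the rest of the pipeline.
y = y.to(device)
print(y.device)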

@lemonsz15

Same error on an M3 Max with 96 GB while trying to run InvokeAI. Any solution?

@Blenderama

I think this is an issue with tqdm and floating point refs on the progress bar.

I get the same issue and don't have coreml installed.

tqdm    4.65.0

Removing tqdm solved my issue. Thank you!

@yunshiyu11

In my opinion, this happens because you run it in Docker, where the shared-memory (shm) size is very small; you can run df -lh to check it. You need to create the Docker container with --shm-size=2G; after doing that, I was able to run it successfully.

@chenyangkang

chenyangkang commented Apr 17, 2024

Same here on Apple M3 Max 36GB MacBook Pro. Never installed CoreML. Upgrading from tqdm=4.65.0 to 4.66.1 solves the problem.

@LukaVerhoeven

LukaVerhoeven commented Apr 24, 2024

This might be relevant:
conda/conda#9589
conda update --all

What worked for me was just rebooting my device...
(It was not while working with stable-diffusion though, but with the exact same error)

EDIT: It seems restarting does not always fix the issue.

@RayRaytheDivine

It seems related to device type (Mac mps type). When I move mps type tensor to cpu(), the problem no longer appears.

Can you explain how you did this, exactly? I've tried all of the other solutions that people have reported, but nothing has worked yet... running SD in ComfyUI on my M3 Max 64GB.

@tombearx

It seems related to device type (Mac mps type). When I move mps type tensor to cpu(), the problem no longer appears.

Can you explain how you did this, exactly? I've tried all of the other solutions that people have reported, but nothing has worked yet... running SD in ComfyUI on my M3 Max 64GB.

Got the same error after updating to macOS 14.5.

@AlanZhou2022

AlanZhou2022 commented May 23, 2024 via email

@tombearx

I have to say SD is not compatible with macOS for now, and there is no solution to your problem.


I reinstalled Python, all packages and ComfyUI, and it works now.

@larisoncarvalho

larisoncarvalho commented Jun 28, 2024

This might be relevant: conda/conda#9589 conda update --all

What worked for me was just rebooting my device... (It was not while working with stable-diffusion though, but with the exact same error)

EDIT: It seems restarting does not always fix the issue.

Thanks! conda update fixed it for me, running on M3.

@zhouhao27

I think this is an issue with tqdm and floating point refs on the progress bar.
I get the same issue and don't have coreml installed.

tqdm    4.65.0

Removing tqdm solved my issue. Thank you!

How do I remove it? The app is using it, right?

@pratheeshkumar99

(Quoted the original issue report from @oscarnevarezleal above in full.)

Were you able to fix the issue?

@pratheeshkumar99

I am trying to build a transformer from scratch. While trying to train it on mps (GPU), I get this error:

Using device: mps
Device name: <MPS>
Max length of source sentence: 309
Max length of target sentence: 274
Processing Epoch 00:   0%| | 0/3638 [00:00<?, ?it/s]
zsh: bus error  python train_wb.py
/opt/anaconda3/envs/mynewenv/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

However, when I train the model on the CPU, it works absolutely fine. Please help me fix the issue.

@rachelcenter

8 GB will cause an out-of-memory issue, as suggested by Yasuhito. Best if you can get a compiled model from someone, or try running again and again with only Terminal open after logging in.

I have 128 GB of RAM and still get this error.

@pratheeshkumar99

Did you find a solution to this issue?

@rachelcenter

Did you find a solution to this issue?

nope

@rachelcenter

I think it might be RAM related even if package versions help - they may just use memory better. It consistently failed for me, and then I closed everything on my Mac that I could and it ran fine without changing versions. 🤷

I have a 128 GB RAM M2 Mac and I'm running into this issue: "/.pyenv/versions/3.12.4/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown"

@DeadstarIII

Same here, M2 Pro 16 GB. Do we have any fix?

@ZachNagengast
Contributor

Chiming in here to echo @atiorh #349 (comment): from what I can gather, the leaked semaphore log is actually just an artifact of the process being killed; the multiprocessing library can output that log message when it is killed unexpectedly. The root of the issue is that the process is getting killed in the first place, which is generally due to running out of memory, but could be from other issues.

As people here have mentioned, the simplest solution is to make sure your Mac has enough RAM (ideally 2-3x the model size), to disable --check-output-correctness, and to free up as much memory as possible during the conversion by closing any running apps to prevent swap usage. A longer-term fix would be a fine-grained review of memory usage throughout the conversion script, freeing memory that is no longer needed and eliminating unnecessary model copies.

@rachelcenter

rachelcenter commented Aug 29, 2024

(Quoting @ZachNagengast's comment above.)

I have 128 GB of RAM and I'm getting memory leaks. Are you saying 128 GB is still not enough memory? And I don't really have other programs open while I'm doing this.

@ZachNagengast
Contributor

According to this comment, it may be an issue with the Core ML framework itself and not an actual memory "leak", but it appears to be triaged, so hopefully a fix is in the works. The workaround from Toby suggests using skip_model_load=True, which is True when --check-output-correctness is not set. One way to check would be to monitor the memory pressure in Activity Monitor and specifically look for spikes, around the time the script is killed, from a process called ANECompiler.
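For anyone who wants to see where that argument lives, here is a minimal, hypothetical coremltools sketch (illustrative only, not the repo's torch2coreml script); the tiny model and names are made up for demonstration, and skip_model_load is the coremltools option being discussed.

import torch
import coremltools as ct

# A tiny stand-in model, purely for illustration.
class Tiny(torch.nn.Module):
    def forward(self, x):
        return x * 2.0

traced = torch.jit.trace(Tiny().eval(), torch.randn(1, 4, 64, 64))

# skip_model_load=True keeps coremltools from compiling and loading the converted
# model through the Core ML framework (the step where the ANECompiler memory
# spikes were reported). As noted above, torch2coreml ends up with
# skip_model_load=True whenever --check-output-correctness is NOT passed, since
# the output check needs a loaded model.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=(1, 4, 64, 64))],
    convert_to="mlprogram",
    skip_model_load=True,
)

# The package can still be saved without loading it; running predictions would
# require loading the model later.
mlmodel.save("Tiny.mlpackage")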

@rachelcenter

This is all above my pay grade. I'm not sure what skip_model_load=True even is, or how to implement it, without some kind of tutorial or instructions.

@atiorh
Collaborator

atiorh commented Aug 30, 2024

When you avoid using --check-output-correctness as @ZachNagengast suggested, this argument will be True and your memory usage will go down. However, 128GB should have been plenty for most SD models. Which model are you converting? 😅

@rachelcenter

When you avoid using --check-output-correctness as @ZachNagengast suggested, this argument will be True and your memory usage will go down. However, 128GB should have been plenty for most SD models. Which model are you converting? 😅

I don't remember what I was using at the time.

@rachelcenter

When you avoid using --check-output-correctness as @ZachNagengast suggested, this argument will be True and your memory usage will go down. However, 128GB should have been plenty for most SD models. Which model are you converting? 😅

Here's one workflow that goes to 100% and then gives me the semaphore error: https://comfyworkflows.com/workflows/28794c8c-af07-424b-8363-d7e2be237770

@rachelcenter

report.txt

@PrathamLearnsToCode

Same issue on a 2020 Apple M1 chip.

@jakubLangr

same issue here.
