-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Patch for Cambricon MLUs test #1747
Conversation
@@ -464,13 +464,13 @@ def _test_merge_layers_fp16(self, model_id, config_cls, config_kwargs): | |||
if ("gpt2" in model_id.lower()) and (config_cls != LoraConfig): | |||
self.skipTest("Merging GPT2 adapters not supported for IA³ (yet)") | |||
|
|||
model = self.transformers_class.from_pretrained(model_id, torch_dtype=torch.float16) | |||
model = self.transformers_class.from_pretrained(model_id) | |||
config = config_cls( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
torch_dtype=torch.float16
leads to an error.
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I cannot replicate this, whether with or without GPU. The idea of this test is exactly to check that this error does not occur with fp16, so not using this dtype is counter-productive. Is this only occurring with MLU devices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reproduction code is as follows.
The issue can be reproduced using PyTorch 2.1, but it executes normally with PyTorch 2.3.
import torch
a = torch.rand(4,4).to(torch.float16)
b = torch.rand(4,4).to(torch.float16)
a @ b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so instead of changing the dtype, how about skipping the test if an old pytorch version is detected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, maybe we can use fp16 with pt>=2.3, and fp32 with pt<2.3 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We really don't need to test merging with fp32 here, as it's tested extensively in other tests. This test is very specifically for merging with fp16, so if we don't use fp16, we can skip it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, Got it! I will fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I found that it is the device "cpu" leads error.
When device is changed to self.torch_device as I fixed, MLU tests is OK with torch.float16 .
@BenjaminBossan Will this test use "cpu" device? If not, it isn't need to skip test for pt2.1 .
Line 473 in cb0bf07
model = model.to(device="cpu", dtype=torch.float16) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So IIRC, there is an error when using CPU + float16 + old PyTorch. If we change either of those variables, there is no error. On CI, we have a new PyTorch version, so it passes, despite using CPU.
If we switch to self.torch_device
, it depends, because the device is auto-inferred based on what devices are available. So on our CI, this would still be CPU. On yours, it might not, but then we don't really test what was the original intent, namely that float16 works on CPU.
I assume this fails on your CI because it uses an older PyTorch version. This is why I suggested to just skip the test with older PyTorch versions. If you want, you could add a specific test for merging float16 with MLU, which would be skipped if the device is not available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BenjaminBossan
Got it! Skipping test_merge_layers_fp16 for pt2.1 when cpu device, this should be OK.
@BenjaminBossan Hi, could you help to reivew this PR? Thx. |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Thanks @huismiling, could you please run |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these adjustments. Are the results from the tests on MLU visible somewhere?
Overall, the changes LGTM, I just have one concern about the float16 test. Please check my comment.
@@ -464,13 +464,13 @@ def _test_merge_layers_fp16(self, model_id, config_cls, config_kwargs): | |||
if ("gpt2" in model_id.lower()) and (config_cls != LoraConfig): | |||
self.skipTest("Merging GPT2 adapters not supported for IA³ (yet)") | |||
|
|||
model = self.transformers_class.from_pretrained(model_id, torch_dtype=torch.float16) | |||
model = self.transformers_class.from_pretrained(model_id) | |||
config = config_cls( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I cannot replicate this, whether with or without GPU. The idea of this test is exactly to check that this error does not occur with fp16, so not using this dtype is counter-productive. Is this only occurring with MLU devices?
The results from the tests on MLU are as follows. More details is in attachment.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the skipping logic. There's just a small typo there, the rest should be good.
tests/testing_common.py
Outdated
@@ -464,13 +465,16 @@ def _test_merge_layers_fp16(self, model_id, config_cls, config_kwargs): | |||
if ("gpt2" in model_id.lower()) and (config_cls != LoraConfig): | |||
self.skipTest("Merging GPT2 adapters not supported for IA³ (yet)") | |||
|
|||
if (self.torch_device in ["cpu"]) and (version.parse(torch.__version__) <= version.parse(2.1)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (self.torch_device in ["cpu"]) and (version.parse(torch.__version__) <= version.parse(2.1)): | |
if (self.torch_device in ["cpu"]) and (version.parse(torch.__version__) <= version.parse("2.1")): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for that. Fixed !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic, thanks for the PR. Keep us in the loop if tests should start failing for Cambricon MLUs.
Cambricon MLUs has been supported by peft (huggingface/peft/pull/1687).
But the pytest have some failed cases. It's mostly about tolerances.
This patch fix the failed case of tolerances.