-
Notifications
You must be signed in to change notification settings - Fork 518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolve CI testing failure for Lazy Tensor Core #1088
Resolve CI testing failure for Lazy Tensor Core #1088
Conversation
We're still seeing
which is causing a whole bunch of tests to fail. We're unable to reproduce this locally. @silvasean @powderluv How should we proceed? What can be done to try and root cause this issue? |
Did this ever work? If so, can we bisect to find where it failed? |
We've never had it run successfully thru CI before |
Does passing |
erm now it dies without running the test?
|
Oh, that is good and expected. That means that the multiprocessing was swallowing the crash by restarting the worker process. I think from here you can run the testing process under gdb/lldb if it is in the VM image, and have it break on This appears to be a memory corruption issue, so likely running it locally with AddressSanitizer would flag it, even if locally there are no symptoms on a normal build. btw, have you verified that locally you are doing the same Release+Asserts build as CI? This is the command line:
|
When I try to run the same cmake command as CI, I get the following error:
And when I run |
For those following along, we got a stack trace: https://github.com/llvm/torch-mlir/runs/7452705521?check_suite_focus=true
|
Awesome. that's a good first step. It sounds like ASan would catch this bug and give a very clear diagnosis. ASan is probably pretty hard to set up for the full Python/etc. e2e test, but a small .cpp file with just a main() and a few calls into libtorch + LTC might be doable. Let me know how your debugging goes and I can get more hands-on (vs armchair debugging this) if you folks need. |
And by Asan I mean a local ASan -- I suspect that the issue exists in the local build but is somehow not causing any symptoms locally (e.g. your local libc/malloc doesn't do as strict of checking) - asan will see it though. |
Thanks for continuing debugging this. Eventually we can add an ASAN builder (at least for the C++ parts) |
ca6d89e
to
29922a2
Compare
I made a new PR for experimenting with CI so everyone doesn't get spammed with emails: #1095 Once a solution is found, I'll bring it back over here |
Is your -DLLVM_EXTERNAL_TORCH_MLIR_DIALECTS_SOURCE_DIR set correctly? Where is that error coming from? |
Yes, it was set to
Honestly not sure where that error is coming from. Its not something that's encountered when running |
6698bb0
to
005748b
Compare
29922a2
to
23e9bb8
Compare
Here's some updates on the situation. It looks like we forgot to add
The behaviour is the same when running e2e tests, but in this case I'm running a small example model to help with isolating the source. |
23e9bb8
to
c5e8448
Compare
You probably need to call torchMlirRegisterAllDialects: torch-mlir/python/torch_mlir/dialects/torch/importer/jit_ir/csrc/module_builder.cpp Line 118 in e23fbc8
|
Hmm the strange thing is that it is called: https://github.com/llvm/torch-mlir/blob/torch_mlir_ltc_backend/python/torch_mlir/csrc/base_lazy_backend/mlir_lowering_context.cpp#L279 |
I think we might also need the equivalent of torchMlirRegisterRequiredDialects: https://github.com/llvm/torch-mlir/pull/1084/files @ashay I recall that patch got reverted -- is it okay for henry to just copy that function for now? |
Oh now that I think of it, |
c5e8448
to
bfe533f
Compare
From my testing on another branch, this PR should enable e2e CI tests to run successfully; however, we still run into some issues during Build out-of-tree |
we were talking today about the possibility of removing the out of tree build at the developer hour -- @powderluv, this sounds like another data point. |
@silvasean assuming the other 2 tests pass, are there any other changes you want on this PR, or would we be good to merge into |
LGTM |
And btw, thanks for debugging this. I've been on the other end of these "debug iteration goes through GitHub Actions" things and it is incredibly painful. I can't imagine debugging a memory corruption issue like you did here! |
Sorry for being late to the discussion. Henry, feel free to copy that function. Now that macOS builds run in CI, I should be able to fix any redundancy in a subsequent patch. |
I'm going to merge this into the LTC branch now. The failures during source and macos build seem to be outside the scope of this PR, so they'll be addressed separately. |
* Xfail unsupported ops * Register FuncDialect * Include dynamic_ir in build * Code reformat * Enable LTC tests for macOS and Source Build
* Xfail unsupported ops * Register FuncDialect * Include dynamic_ir in build * Code reformat * Enable LTC tests for macOS and Source Build
* Xfail unsupported ops * Register FuncDialect * Include dynamic_ir in build * Code reformat * Enable LTC tests for macOS and Source Build
Signed-off-by: Gong Su <gong_su@hotmail.com> Co-authored-by: Alexandre Eichenberger <alexe@us.ibm.com>
* Add check-onnx-backend to Mac CI. (llvm#1069) * Add check-onnx-backend to Mac CI. Signed-off-by: Ettore Tiotto <etiotto@ca.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Additional Docker help and split README for easier reading (llvm#1084) * initial docker documentation Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * split README with no redundant place for info Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * update Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * update Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * update Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * update Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * update Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * respond to suggestions Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * specify that onnx-mlir.py script generates only code suitable to be exec in Linux and/or Docker env Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * fix checkdocs Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * responded to review suggestion on onnx-mlir --help Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * use ONNX-MLIR everywhere Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * add verify for concat Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * check all inputs Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Support filtering out lit tests based on targets (llvm#1087) Currently we ignore what targets llvm was built for in the lit tests, but recent changes to onnx-mlir explicitly initialize the available targets. This makes the corresponding change to the lit configuration, so that we can filter out the lit tests based on the available targets. Signed-off-by: Stella Stamenova <stilis@microsoft.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Switch URLs to use main instead of master (llvm#1094) Signed-off-by: Charles Volzka <cjvolzka@us.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Fix MacOS build badge (llvm#1092) Signed-off-by: Gong Su <gong_su@hotmail.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * onnx-mlir.py warning about binary output (.so and .jar) (llvm#1090) not directly usable if host is not Linux Signed-off-by: Gong Su <gong_su@hotmail.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Make the doc example obey ONNX_MLIR_BUILD_TESTS (llvm#1083) * Make the doc example obey ONNX_MLIR_BUILD_TESTS Currently, ONNX_MLIR_BUILD_TESTS controls EXCLUDE_FROM_ALL, however, the targets added through add_executable will always build. We follow the llvm pattern and explicitly set EXCLUDE_FROM_ALL in the add_onnx_mlir_executable function if it is set for the directory, so that add_executable targets don't always build. Signed-off-by: Stella Stamenova <stilis@microsoft.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Explicitly install into lib on all systems (llvm#1088) Signed-off-by: Gong Su <gong_su@hotmail.com> Co-authored-by: Alexandre Eichenberger <alexe@us.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * add check (llvm#1098) Signed-off-by: Tong Chen <chentong@us.ibm.com> Co-authored-by: Alexandre Eichenberger <alexe@us.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * fix typos and add ssh-client to dockerfile (llvm#1096) * fix typos and add ssh-client to dockerfile Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * sync doc and script Signed-off-by: Ethan Wang <ywan2928@uwo.ca> Co-authored-by: Alexandre Eichenberger <alexe@us.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Emit print statement only when the verbose option is in effect. (llvm#1097) Signed-off-by: Ettore Tiotto <etiotto@ca.ibm.com> Co-authored-by: Alexandre Eichenberger <alexe@us.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * format & refine code by request Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Support older versions 6, 11, 12 for Clip Op (llvm#1100) Signed-off-by: Tung D. Le <tung@jp.ibm.com> Co-authored-by: Alexandre Eichenberger <alexe@us.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * using front to get first input Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * add 3 lit test for concat verifier Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * add newline Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Add check-onnx-backend to Mac CI. (llvm#1069) * Add check-onnx-backend to Mac CI. Signed-off-by: Ettore Tiotto <etiotto@ca.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Additional Docker help and split README for easier reading (llvm#1084) * initial docker documentation Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * split README with no redundant place for info Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * update Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * update Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * update Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * update Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * update Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * respond to suggestions Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * specify that onnx-mlir.py script generates only code suitable to be exec in Linux and/or Docker env Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * fix checkdocs Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * responded to review suggestion on onnx-mlir --help Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> * use ONNX-MLIR everywhere Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Switch URLs to use main instead of master (llvm#1094) Signed-off-by: Charles Volzka <cjvolzka@us.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Fix MacOS build badge (llvm#1092) Signed-off-by: Gong Su <gong_su@hotmail.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * fix typos and add ssh-client to dockerfile (llvm#1096) * fix typos and add ssh-client to dockerfile Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * sync doc and script Signed-off-by: Ethan Wang <ywan2928@uwo.ca> Co-authored-by: Alexandre Eichenberger <alexe@us.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Update document (llvm#1077) * create Signed-off-by: Tong Chen <chentong@us.ibm.com> * delete HowTOAddAnOperation.md Signed-off-by: Tong Chen <chentong@us.ibm.com> * modify testing Signed-off-by: Tong Chen <chentong@us.ibm.com> * create Signed-off-by: Tong Chen <chentong@us.ibm.com> * delete HowTOAddAnOperation.md Signed-off-by: Tong Chen <chentong@us.ibm.com> * modify testing Signed-off-by: Tong Chen <chentong@us.ibm.com> * fix Signed-off-by: Tong Chen <chentong@us.ibm.com> * create Signed-off-by: Tong Chen <chentong@us.ibm.com> * add comment Signed-off-by: Tong Chen <chentong@us.ibm.com> * delete HowTOAddAnOperation.md Signed-off-by: Tong Chen <chentong@us.ibm.com> * modify testing Signed-off-by: Tong Chen <chentong@us.ibm.com> * fix Signed-off-by: Tong Chen <chentong@us.ibm.com> * create Signed-off-by: Tong Chen <chentong@us.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Update LLVM level (llvm#1095) * Update LLVM level to 700997a Signed-off-by: Ettore Tiotto <etiotto@ca.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Pass a type converter to all ONNX operations. (llvm#1102) Signed-off-by: Ettore Tiotto <etiotto@ca.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Nuke KrnlDummyCastOp now that we use MLIR's UnrealizedConversionCastOp (llvm#1103) * Nuke KrnlDummyCastOp now that we use MLIR's UnrealizedConversionCastOp Signed-off-by: Ettore Tiotto <etiotto@ca.ibm.com> * Remove a dependency in src/Dialect/Krnl/CMakeList.txt. Regenerate docs via 'ninja onnx-mlir-docs'. Signed-off-by: Ettore Tiotto <etiotto@ca.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Add --emitObj option to onnx-mlir (llvm#1104) Signed-off-by: Ettore Tiotto <etiotto@ca.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * fix warnings (llvm#1093) Signed-off-by: Ian Bearman <ianb@microsoft.com> Co-authored-by: Stella Stamenova <stilis@microsoft.com> Co-authored-by: Alexandre Eichenberger <alexe@us.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Add -march option to onnx-mlir (llvm#1107) Signed-off-by: Ettore Tiotto <etiotto@ca.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Fix Doc spelling and broken links, removed warnings about using main (llvm#1106) * removed warning about main vs master in CONTRIBUTING, fixed links and spelling mistakes Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com> Signed-off-by: Ethan Wang <ywan2928@uwo.ca> * Update BuildONNX.md Signed-off-by: Ethan Wang <ywan2928@uwo.ca> Co-authored-by: Ettore Tiotto <etiotto@ca.ibm.com> Co-authored-by: Alexandre Eichenberger <alexe@us.ibm.com> Co-authored-by: Stella Stamenova <stilis@microsoft.com> Co-authored-by: Charles Volzka <42243335+cjvolzka@users.noreply.github.com> Co-authored-by: gongsu832 <gong_su@hotmail.com> Co-authored-by: chentong319 <chentong@us.ibm.com> Co-authored-by: Tung D. Le <tung@jp.ibm.com> Co-authored-by: Ian Bearman <ian.bearman@live.com>
dynamic_ir.cpp
to source list, which resolvedfree()
error in CIFuncDialect
, which resolved MLIR error in CIcc: @ke1337 @antoniojkim