-
Couldn't load subscription status.
- Fork 3.9k
ARROW-4372: [C++] Embed precompiled bitcode in the gandiva library #3484
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@xhochy We're not bundling gandiva generated bytecode files. I'm trying to resolve it, but You might know the solution. |
|
I've managed to bundle |
What do you mean with that? |
|
I think the bitcode path is still being hard-coded, is that right? |
|
@xhochy according to this commit @kou made it configurable. |
|
The change just for creating a configuration for custom path. This is enough just for testing. But it's not suitable for normal use. |
.travis.yml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will reduce the build times. There are nightly tests for all versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou This will resolve the random timeouts of wheel builds.
|
@kou could You elaborate on customizing the default path dynamically? If I understand correctly the second option is to maintain a list of default search paths relative to the gandiva binary (queried via |
The easiest way is using an environment variable: diff --git a/cpp/src/gandiva/configuration.h b/cpp/src/gandiva/configuration.h
index 04e2eed2..fd948400 100644
--- a/cpp/src/gandiva/configuration.h
+++ b/cpp/src/gandiva/configuration.h
@@ -75,7 +75,14 @@ class ConfigurationBuilder {
std::string byte_code_file_path_;
static std::shared_ptr<Configuration> InitDefaultConfig() {
- std::shared_ptr<Configuration> configuration(new Configuration(kByteCodeFilePath));
+ std::string path;
+ auto path_env = std::getenv("GANDIVA_BYTE_CODE_PATH");
+ if (path_env) {
+ path = path_env;
+ } else {
+ path = kByteCodeFilePath;
+ }
+ std::shared_ptr<Configuration> configuration(new Configuration(path));
return configuration;
}But it may have a security problem. |
Right. But it may be difficult. Because Gandiva may be built as a static library. (We may able to assume that Gandiva is built as a shared library in wheel.) This approach is popular on Windows. Many products search their configuration file by relative path from their DLL. |
|
@wesm any thoughts on this? |
ci/travis_script_manylinux.sh
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't we crashy when Tensorflow is loaded in the same process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is supposed to test that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not crash, there is some special safeaguarding code for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it doesn't always work, see https://issues.apache.org/jira/browse/ARROW-4272
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So shouldn't we remove this import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should We? If We want to test that both arrow and tensorflow can live in the same interperter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you prefer... but it will probably fail at one point or the other, because Tensorflow publishes rogue wheels.
|
@pravindra couldn't We embed the bitcode into the library? |
sorry, I didn't understand. can you please elaborate some more ? The bit-code file is generated by a series of clang tool commands during the build process. Is there a way to add such a file into the .so or .a, and extract it later at runtime ? |
|
The easiest and most portable way (though slightly clunky) would probably to generate a C++ file containing static See https://stackoverflow.com/questions/11813271/embed-resources-eg-shader-code-images-into-executable-library-with-cmake for an example, though we'd probably want to write the generator utility in Python. |
|
@pravindra yes, like Antoine's example. Although it might be easier to just configure the right path from the python module (similarly like from java). Not sure which one would be cleaner... |
|
But We still need to keep the ability to load bitcode from files, right? |
|
Apparently the bitcode path is hardcoded as an absolute path currently (see the generated namespace gandiva {
// Path to the byte-code file.
extern const char kByteCodeFilePath[] = "/home/antoine/miniconda3/envs/pyarrow/lib/gandiva/irhelpers.bc";
} // namespace gandiva |
|
So We have a couple of options:
which one should We go with? |
|
I think compiling the bitcode as a string is the most pain-free approach. Other approaches have to deal with locating the files on disk in different environments (Python, Java, ...). |
|
Any downstream libraries would like to provide precompiled gandiva expressions should do the same? |
|
I think that's a separate concern. It would probably be possible to let them choose their preferred way. |
|
Ok, I'll go with the string approach then. Thanks @pitrou ! |
|
I do see the advantages of the embed-as-string approach. Early on, we had plans to support pluggable function registries in gandiva I.e given a new bitcode file and the fn signatures, it would be possible to add new functions to gandiva by doing a module reload (without actually changing arrow/gandiva code). so, passing the location(s) of the bitcode file(s) as a config param made sense. We didn’t get around to implementing it properly, and I think it’s harder to do now since there are cross references from the bitcode files into gandiva/arrow (for date/decimal). Embedding the bc code into the library will mean that we cannot go down that path anymore. |
|
Where do things stand on this? |
|
I can continue to work on this tomorrow, right now I'm busy with buildbot's infra. |
96d2b76 to
b21c88e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that We want to hash the precompiled bitcode, so I've disabled it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe, keep this Hash(), but dummy out the configuration->Hash() (just return 0) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't hash byte_code_file_path anymore, so I plan to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok if I give you a patch for the test ?
…s and crossbow scripts
4d61141 to
3e4ec2a
Compare
|
We need to ensure that the remaining wheels (OSX and Windows) and conda packages are shipping gandiva as well. |
14aa22e to
3b1da30
Compare
|
Follow-up issues:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the gandiva changes LGTM. I skipped over the python changes.
|
Thanks for your review @pravindra! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM
We were not running the pyarrow tests after installing the manylinux wheels, which can lead to uncaught issues, like: https://travis-ci.org/kszucs/crossbow/builds/484284104 Author: Krisztián Szűcs <szucs.krisztian@gmail.com> Closes apache#3484 from kszucs/manylinux_tests and squashes the following commits: 3b1da30 <Krisztián Szűcs> use sudo c573a56 <Krisztián Szűcs> use env variables insude the container fd5e3fe <Krisztián Szűcs> use latest docker image tag d5531d9 <Krisztián Szűcs> test imports inside the wheel container 1aa19f1 <Krisztián Szűcs> reenable travis builds b399496 <Krisztián Szűcs> test py27mu and py36m wheels 71233c7 <Krisztián Szűcs> test 2.7,16 wheel 2372f3d <Krisztián Szűcs> fix requirements path; disable other CI tests 3e4ec2a <Krisztián Szűcs> unterminated llvm:MemoryBuffer; fix check_import.py path 7c88d61 <Krisztián Szűcs> only build python 3.6 wheel 18c5488 <Krisztián Szűcs> install wheel from dist dir 0bb07a7 <Krisztián Szűcs> better bash split 54fc653 <Krisztián Szűcs> don't export d3cb058 <Krisztián Szűcs> fix wheel building script 0d29b31 <Krisztián Szűcs> remove not existing resources from gandiva's pom 5d75adb <Krisztián Szűcs> initialize jni loader 09d829a <Krisztián Szűcs> build wheels for a single python distribution at a time; adjust travis and crossbow scripts 79abc0e <Krisztián Szűcs> mark .cc file as generated af78be2 <Krisztián Szűcs> don't bundle irhelpers in the jar a88cd37 <Krisztián Szűcs> cmake format 7deb359 <Krisztián Szűcs> fix REGEX; remove byteCodeFilePath from java configuration object fa19529 <Krisztián Szűcs> properly construct llvm:StringRef 5841dcd <Krisztián Szűcs> remove commented code 42391b1 <Krisztián Szűcs> don't pass precompiled bitcode all around the constructors d480c83 <Krisztián Szűcs> use const string ref for now b0b1117 <Krisztián Szűcs> conda llvmdev 169f43a <Krisztián Szűcs> build gandiva in cpp docker image cb69625 <Krisztián Szűcs> silent maven download msgs 19200c3 <Krisztián Szűcs> don't run wheel tests twice; cmake format f2205d0 <Krisztián Szűcs> gandiva jni dbf5b1c <Krisztián Szűcs> embed precompiled bitcode as char array; load precompiled IR from string 00d98e0 <Krisztián Szűcs> try to bundle bytecode files 97fe94b <Krisztián Szűcs> fix requirements-test.txt path 86e7e5b <Krisztián Szűcs> run pyarrow tests in manylinux CI build
We were not running the pyarrow tests after installing the manylinux wheels, which can lead to uncaught issues, like: https://travis-ci.org/kszucs/crossbow/builds/484284104