ARROW-1579: [Java] Adding containerized Spark Integration tests #1319
Conversation
This is currently a WIP; the Scala/Java tests are able to run. Left TODO:
I don't have prior experience with Docker, so any suggestions are welcome! @heimir-sverrisson @wesm please take a look when you have a chance. One question I have from looking at the Dask integration tests: are we assuming that pyarrow is already built prior to creating the Docker image?
I think it needs to be built as part of the Docker build. We should try to create some reusable scripts, like our Travis CI setup, to help with automating certain common tasks (setting up the C++ toolchain, building things, etc.).
Small comment about making better use of Docker's ability to cache layers during image creation.
dev/spark_integration/Dockerfile
Outdated
# limitations under the License.
#
FROM maven:3.5.2-jdk-8-slim
ADD . /apache-arrow
Add `apache-arrow` as late as possible. From the point where you have added it, Docker's caching is no longer effective. For example, once you move the `apt-get` line before the ADD, a subsequent build of the image will not run `apt-get` again but will reuse the cached layer on top of the maven image that is already at the state after the `apt-get` call.
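As an illustration of the suggested ordering, a minimal Dockerfile sketch (the package list and paths are placeholders, not the actual contents of this Dockerfile):

```dockerfile
FROM maven:3.5.2-jdk-8-slim

# Install system dependencies first; this layer is cached and only rebuilt
# when this RUN line itself changes (placeholder package list).
RUN apt-get update && \
    apt-get install -y --no-install-recommends build-essential cmake && \
    rm -rf /var/lib/apt/lists/*

# Add the source tree as late as possible, so that source changes only
# invalidate the layers from this point on.
ADD . /apache-arrow
```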
Great tip, thanks @xhochy ! apt-get takes a long time and I didn't know why it only sometimes ended up being cached.
@BryanCutler or @icexelloss could we get this set up? I'd like to look into setting up nightly integration test runs on a server someplace (we have a shared machine that @cpcloud and I have been using for nightly builds, and that the pandas team uses for benchmarks) so we can get warned immediately if something gets broken.
I'll try to pick this up again in the next few days @wesm. One question though: when we make an API-breaking change, like renaming MapVector, this will fail until Spark is patched. Do we need to keep a branch with these kinds of patches until they can be applied upstream?
I think that's fine. The failing nightly integration test will help create some urgency to get the downstream project fixed.
Force-pushed from 5d398e8 to 95eb22a
This is pretty much running for me now, but I'm getting a strange test failure when running pyspark tests
Not sure why because the tests pass normally, maybe something with the conda setup? I'll try some more later.
@BryanCutler I had a look into the build and found some errors with the Python/C++ scripts (will do a PR for that) but I also cannot get
@xhochy that is strange, until I switched the image to be based on
I did follow this Docker post-installation step, maybe that has something to do with it? https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user
I've also been running with my local .m2 repo mounted in the container with
Besides the strange path, what is the exact error that it gives you for
Looks like the errors I'm getting are because
Is there something in the build I need to set to get a valid version? cc @wesm
from
I would expect something like '0.8.***', where is it getting the above number from?
maybe some native toolchain is missing when switching to maven:3.5.2-jdk-8-slim
popd

# Build Spark with Arrow
SPARK_REPO=https://github.com/apache/spark.git
I think we should pull from apache git?
Thanks @felixcheung ! Yeah, it would probably be best to use Apache, I'll change it
@xhochy any pointers on how to get a correct pyarrow version? If you are able to do a PR to this branch for the errors you found, that would be great - I can make you a collaborator if that would be easier.
@BryanCutler I think all committers on GitBox have write access to your branch, so we can push directly.
@BryanCutler The main change I did was to move
Ok, I finally got this to build and pass all tests! There are still a couple of issues to work out though, I'll discuss below. Btw, to get the correct
python/pyarrow/__init__.py
Outdated
@@ -24,7 +24,7 @@
 # package is not installed
 try:
     import setuptools_scm
-    __version__ = setuptools_scm.get_version('../')
+    __version__ = setuptools_scm.get_version(root='../../', relative_to=__file__)
@xhochy and @wesm, I needed to change this because it would only give a version if run under the ARROW_HOME/python directory. So when running the Spark tests, importing pyarrow would return None for the version. Making it relative to `__file__` seemed to fix it for all cases. I can make this a separate JIRA if you think that would be better.
See the suggestion above to get rid of this line.
@xhochy , I could not get Arrow C++ to build with
We'll need to add some flags to
To use the conda toolchain, set `ARROW_CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0`; this switches the code to the correct C++ ABI.
# limitations under the License.
#

# Set up environment and working directory
Call `set -e` here; then a command failure in the script will lead to a failure of the whole script, and you can get rid of the `if [[ $? -ne 0 ]]; then` blocks later on.
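A sketch of what this suggestion amounts to; the build steps shown are placeholders for the actual commands in the script:

```bash
#!/usr/bin/env bash
# Abort the whole script as soon as any command fails.
set -e

# Before, each step needed an explicit check:
#   some_build_step
#   if [[ $? -ne 0 ]]; then
#     exit 1
#   fi
#
# With set -e, a failing step stops the script on its own:
some_build_step     # placeholder, e.g. the Arrow C++ build
another_build_step  # placeholder, e.g. the Spark build
```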
# Build pyarrow and install inplace
pushd arrow/python
python setup.py clean
python setup.py build_ext --build-type=release --inplace
Don't use `--inplace`; rather, run `python setup.py build_ext --build-type=release install` to install the extension into the environment.
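In other words, the block above would become something like this (same working directory assumed):

```bash
# Build pyarrow and install it into the active Python environment
# instead of leaving the built extension in the source tree.
pushd arrow/python
python setup.py clean
python setup.py build_ext --build-type=release install
popd
```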
popd

# Clean up
echo "Cleaning up.."
No need for these two lines; at the end the environment is thrown away anyway.
Thanks @xhochy , I'll try all this out!
I still get the linking error after setting
Looks like that flag needs to be passed through to the ExternalProject declarations. Try using
thanks @wesm , that seemed to do the trick by adding it to the cmake command line (it didn't work as an env var)
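For reference, a rough sketch of passing the flag on the cmake command line; the other options are assumptions about a typical Arrow C++ configure step, not the exact contents of the script:

```bash
# Configure Arrow C++ against the conda toolchain with the pre-C++11 ABI,
# passing ARROW_CXXFLAGS as a cmake definition (an environment variable
# did not work in this thread).
cmake -DCMAKE_BUILD_TYPE=release \
      -DARROW_PYTHON=on \
      -DARROW_CXXFLAGS="-D_GLIBCXX_USE_CXX11_ABI=0" \
      ..
```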
Well, I'm able to run the build with
Now, I get an error with
I don't think this is from an arrow library, but I can't tell where it's coming from, any ideas?
Now you need to set the
…arrow python dir" This reverts commit 3305a35.
This is working using ARROW_BUILD_TOOLCHAIN set to the conda env, and I think this could be merged now. Is somebody else able to verify that it is working before merging?
Sweet, I can take this for a spin later today, or if @cpcloud wants to look that also works.
Build failed for me with
Haven't dug too far into what's wrong. Can we document how to run this someplace other than the Dockerfile?
Yeah, we should have it better documented. I copied how the API docs script was done and it's run with docker-compose, but I'm not quite sure how to run it through that. I just used the regular
As this is part of a docker-compose script, this should be run using
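Something along these lines, assuming the compose file defines a service for this test setup (the service name spark_integration is an assumption based on the directory name, not confirmed here):

```bash
# Build the image and run the Spark integration tests through docker-compose.
# The service name is assumed from the dev/spark_integration directory.
docker-compose build spark_integration
docker-compose run --rm spark_integration
```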
To get past the build issues I had to rebase locally. Once the script has run through I will report back.
I probably should have rebased here earlier; if anyone else is going to try this out, let me know and I can do that first.
Sadly I get the following error on master:
That error is from #1490 being merged yesterday, which is what I was talking about in #1319 (comment), but I was hoping to at least get this running for a little while before breaking Spark! I guess we could either rebase this just before #1490 to ensure it works, or I can provide a patch for Spark to update it for that breaking change in Java?
We definitely need an upstream patch for Spark, but I'm happy with merging this PR as-is as it shows that we can now test Spark with it. @wesm is this ok for you?
Works for me. Merging now.
Thanks @wesm @xhochy and @felixcheung ! Since it can sometimes take a while to get Spark updated, if we get to the point where this is ready to be put in the nightly builds, maybe I could submit a PR to patch Spark and we could configure the docker build to point to that.
This adds configuration files for running containerized integration tests with Spark for Java and Python. The image is based on maven with OpenJDK 8 and will include a copy of Arrow when built from the parent directory of ARROW_HOME. Running the container will do the following:
Successful completion of the running container indicates that there are no breaking changes between the current Spark code and the Arrow code.