ARROW-1579: [Java] Adding containerized Spark Integration tests #1319

Conversation

BryanCutler (Member)

This adds configuration files for running containerized integration tests with Spark for Java and Python. The image is based on maven with OpenJDK 8 and will include a copy of Arrow when the image is built from the parent directory of ARROW_HOME. Running the container will do the following:

  1. Install the Arrow Java artifacts to the m2 repo in the container
  2. Check out the current Spark master branch
  3. Update the Spark POM to use the installed version of Arrow, then package Spark
  4. Run the Arrow-related Java/Scala Spark tests
  5. Run the pyarrow-related Python Spark tests

Successful completion of the running container indicates that there are no breaking changes between the current Spark code and the Arrow code.
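As a rough usage sketch (the docker-compose service name spark_integration and the image name spark-arrow appear later in this thread; presumably the docker-compose.yml lives under arrow/dev, and the exact paths are not spelled out here):

# Build and run through docker-compose, which picks up the options declared in docker-compose.yml
docker-compose build
docker-compose run spark_integration

# Or build and run the image directly; mounting the local Maven repo only speeds up repeated runs
docker build -t spark-arrow .
docker run -v $HOME/.m2:/root/.m2 spark-arrow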

@BryanCutler (Member Author) commented Nov 14, 2017

This is currently a WIP; the Scala/Java tests are able to run.

Left TODO:

  • Run PySpark Tests
  • Verify working with docker-compose and existing volumes in arrow/dev
  • Check why Zinc is unable to run in mvn build, need to enable port 3030?
  • Speed up pyarrow build using conda prefix as toolchain

@BryanCutler (Member Author)

I don't have prior experience with Docker, so any suggestions are welcome! @heimir-sverrisson @wesm please take a look when you have a chance. One question I have from looking at the Dask integration tests: are we assuming that pyarrow is already built prior to creating the Docker image?

@wesm (Member) commented Nov 14, 2017

I think it needs to be built as part of the Docker build. We should try to create some reusable scripts like our Travis CI setup to help with automating certain common tasks (setting up the C++ toolchain, building things, etc.)

@xhochy (Member) left a comment

A small comment about making better use of Docker's ability to cache layers during image creation.

# limitations under the License.
#
FROM maven:3.5.2-jdk-8-slim
ADD . /apache-arrow
@xhochy (Member)

Add apache-arrow as late as possible. From the point where you add it on, Docker's caching is no longer effective. For example, once you move the apt-get line before the ADD, a subsequent build of the image will not run apt-get again but will use the cached layer on top of the maven image that is already at the state after the apt-get call.
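A hedged illustration of that reordering, written as a shell snippet that generates a hypothetical Dockerfile (the apt-get package list is made up for the example):

# Everything above ADD stays cached across rebuilds; everything after it is re-run
# whenever the source tree changes.
cat > Dockerfile.example <<'EOF'
FROM maven:3.5.2-jdk-8-slim
# Slow, rarely-changing steps first so this layer can be reused from cache
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
# Add the Arrow sources as late as possible; only layers from here on are invalidated
ADD . /apache-arrow
EOF
docker build -f Dockerfile.example -t spark-arrow-example .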

@BryanCutler (Member Author)

Great tip, thanks @xhochy! apt-get takes a long time and I didn't know why it only sometimes ended up being cached.

@wesm changed the title from "[WIP] ARROW-1579: [TESTS] Adding containerized Spark Integration tests" to "[WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests" on Nov 18, 2017
@wesm (Member) commented Jan 18, 2018

@BryanCutler or @icexelloss, could we get this set up? I'd like to look into setting up nightly integration test runs on a server someplace (we have a shared machine that @cpcloud and I have been using for nightly builds, and that the pandas team uses for benchmarks) so we can get warned immediately if something gets broken.

@BryanCutler (Member Author)

I'll try to pick this up again in the next few days @wesm. One question though: when we make an API-breaking change, like renaming MapVector, this will fail until Spark is patched. Do we need to keep a branch with these kinds of patches until they can be applied upstream?

@wesm (Member) commented Jan 18, 2018

I think that's fine. The failing nightly integration test will help with creating some urgency to get the downstream project fixed

@BryanCutler force-pushed the docker-spark-integration-ARROW-1579 branch from 5d398e8 to 95eb22a on January 26, 2018
@BryanCutler (Member Author)

This is pretty much running for me now, but I'm getting a strange test failure when running the PySpark tests:

======================================================================
FAIL: test_unsupported_datatype (pyspark.sql.tests.ArrowTests)
----------------------------------------------------------------------
AttributeError: 'LooseVersion' object has no attribute 'version'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/apache-arrow/spark/python/pyspark/sql/tests.py", line 3404, in test_unsupported_datatype
    df.toPandas()
AssertionError: "Unsupported data type" does not match "'LooseVersion' object has no attribute 'version'"

Not sure why, since the tests pass normally; maybe something with the conda setup? I'll try some more later.

@xhochy (Member) commented Jan 28, 2018

@BryanCutler I had a look into the build and found some errors with the Python/C++ scripts (I will do a PR for that), but I also cannot get mvn install to work correctly. Any ideas why it installs to this location?

[INFO] Installing /apache-arrow/arrow/java/tools/target/arrow-tools-0.9.0-SNAPSHOT-jar-with-dependencies.jar to /apache-arrow/arrow/java/?/.m2/repository/org/apache/arrow/arrow-tools/0.9.0-SNAPSHOT/arrow-tools-0.9.0-SNAPSHOT-jar-with-dependencies.jar

@BryanCutler (Member Author)

@xhochy that is strange. Until I switched the image to be based on FROM maven:3.5.2-jdk-8-slim I was seeing similar problems with paths containing ?, but since then it's been working fine and it installs under root:

[INFO] Installing /apache-arrow/arrow/java/tools/target/arrow-tools-0.9.0-SNAPSHOT-jar-with-dependencies.jar to /root/.m2/repository/org/apache/arrow/arrow-tools/0.9.0-SNAPSHOT/arrow-tools-0.9.0-SNAPSHOT-jar-with-dependencies.jar

I did follow this Docker post-installation step; maybe that has something to do with it? https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user

I've also been running with my local .m2 repo mounted in the container with docker run -v /home/bryan/.m2:/root/.m2 spark-arrow to save from always downloading dependencies, but I tried skipping that and still don't get any errors.

Besides the strange path, what is the exact error that it gives you for mvn install?

@BryanCutler (Member Author)

Looks like the errors I'm getting are because pa.__version__ returns None:

>>> import pyarrow as pa
>>> from distutils.version import LooseVersion
>>> LooseVersion(pa.__version__) < LooseVersion('0.8.0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bryan/miniconda2/envs/pyarrow-dev/lib/python2.7/distutils/version.py", line 296, in __cmp__
    return cmp(self.version, other.version)
AttributeError: LooseVersion instance has no attribute 'version'
>>> pa.__version__
>>>

Is there something in the build I need to set to get a valid version? cc @wesm

@BryanCutler (Member Author) commented Jan 29, 2018

From pyarrow/__init__.py it looks like maybe I should install setuptools_scm, but then I get the following version:

>>> pa.__version__
'0.1.1.dev1231+g0a49022'

I would expect something like '0.8.***'; where is it getting the above number from?

@wesm (Member) commented Jan 29, 2018

Not my area of expertise (@xhochy or @cpcloud would know better) -- usually that relates to not having fetched all the tags

@felixcheung (Member) left a comment

maybe some native toolchain is missing when switching to maven:3.5.2-jdk-8-slim

popd

# Build Spark with Arrow
SPARK_REPO=https://github.com/apache/spark.git
@felixcheung (Member)

I think we should pull from apache git?

@BryanCutler (Member Author)

Thanks @felixcheung! Yeah, it would probably be best to use Apache; I'll change it.

@BryanCutler (Member Author)

@xhochy any pointers on how to get a correct pyarrow version? If you are able to do a PR to this branch for the errors you found, that would be great - I can make you a collaborator if that would be easier.

@wesm (Member) commented Jan 30, 2018

@BryanCutler I think all committers on GitBox have write access to your branch, so we can push directly

@xhochy (Member) commented Feb 4, 2018

@BryanCutler The main changes I made were to move source activate pyarrow-dev to the beginning of the file, set export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX, and remove LD_LIBRARY_PATH and PYTHONPATH from the file. This should pick up the conda environment and speed up Arrow builds a lot. That should actually not affect Maven builds.
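Sketched out, the top of the build script would then look something like this (a sketch of the described change, not the exact diff):

# Activate the conda development environment before anything else
source activate pyarrow-dev
# Build Arrow C++ against the packages already in the conda environment
export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX
# The previous LD_LIBRARY_PATH and PYTHONPATH exports are simply removed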

@BryanCutler (Member Author) commented Feb 9, 2018

OK, I finally got this to build and pass all tests! There are still a couple of issues to work out though; I'll discuss them below.

Btw, to get the correct pyarrow.__version__ from the dev env, you do need to have all git tags fetched and to install setuptools_scm from pip or conda. @xhochy, setuptools_scm wasn't listed in any of the developer docs I could find; should it be added to the list of dependent packages for setting up a conda env?
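For reference, a minimal sketch of those two steps (the remote name upstream is an assumption):

# pyarrow's dev version is derived from git tags, so make sure they are all fetched
git fetch --tags upstream
# setuptools_scm must be importable for the fallback version detection to work
conda install -y setuptools_scm   # or: pip install setuptools_scm
python -c "import pyarrow; print(pyarrow.__version__)"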

@@ -24,7 +24,7 @@
 # package is not installed
 try:
     import setuptools_scm
-    __version__ = setuptools_scm.get_version('../')
+    __version__ = setuptools_scm.get_version(root='../../', relative_to=__file__)
@BryanCutler (Member Author)

@xhochy and @wesm, I needed to change this because it would only give a version if run under the ARROW_HOME/python directory, so when running Spark tests, importing pyarrow would return None for the version. Making it relative to __file__ seemed to fix it for all cases. I can make this a separate JIRA if you think that would be better.

@xhochy (Member)

See the suggestion in the review below to get rid of this line.

@BryanCutler (Member Author)

@xhochy, I could not get Arrow C++ to build with export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX; I would get a linking error with gflags like "undefined reference to google::FlagRegisterer::FlagRegisterer". I thought maybe it was because I wasn't using g++ 4.9, but I had no luck trying to get 4.9 installed since the base image I'm using is Ubuntu 16.04. Have you ever run into this? It seemed like it was some kind of template constructor that it couldn't find.

@wesm (Member) commented Feb 9, 2018

We'll need to add some flags to ARROW_CXXFLAGS to disable the gcc5 ABI. I can try to take a look in a bit or this weekend

@xhochy (Member) left a comment

To use the conda toolchain, set "ARROW_CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0"; this then switches the code to the correct C++ ABI.

# limitations under the License.
#

# Set up environment and working directory
@xhochy (Member)

Call set -e here; then a command failure in the script will lead to a failure of the whole script, and you can get rid of the if [[ $? -ne 0 ]]; then blocks later on.
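A minimal sketch of the pattern being suggested (the mvn command is just a placeholder step):

#!/usr/bin/env bash
# With set -e, any command that exits non-zero aborts the whole script,
# so the explicit status checks are no longer needed.
set -e

# Before:
#   mvn -DskipTests clean install
#   if [[ $? -ne 0 ]]; then
#     exit 1
#   fi
# After, the failing step itself stops the script:
mvn -DskipTests clean install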

# Build pyarrow and install inplace
pushd arrow/python
python setup.py clean
python setup.py build_ext --build-type=release --inplace
@xhochy (Member)

Don't use --inplace but rather run python setup.py build_ext --build-type=release install to install the extension into the environment.
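Applied to the quoted hunk, that suggestion would read roughly:

# Build pyarrow and install it into the active conda environment
pushd arrow/python
python setup.py clean
python setup.py build_ext --build-type=release install
popd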

popd

# Clean up
echo "Cleaning up.."
@xhochy (Member)

No need for these two lines; at the end the environment is thrown away anyway.

@BryanCutler (Member Author)

Thanks @xhochy, I'll try all this out!

@BryanCutler (Member Author)

I still get the linking error after setting "ARROW_CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0":

[ 78%] Linking CXX executable ../../../release/json-integration-test
CMakeFiles/json-integration-test.dir/json-integration-test.cc.o: In function `_GLOBAL__sub_I__ZN5arrow4test25MakeRandomInt32PoolBufferElPNS_10MemoryPoolEPSt10shared_ptrINS_10PoolBufferEEj':
json-integration-test.cc:(.text.startup+0x1de): undefined reference to `google::FlagRegisterer::FlagRegisterer<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)'
json-integration-test.cc:(.text.startup+0x296): undefined reference to `google::FlagRegisterer::FlagRegisterer<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)'
json-integration-test.cc:(.text.startup+0x34e): undefined reference to `google::FlagRegisterer::FlagRegisterer<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)'
collect2: error: ld returned 1 exit status

@wesm (Member) commented Feb 13, 2018

Looks like that flag needs to be passed through to the ExternalProject declarations. Try using CMAKE_CXX_FLAGS instead of ARROW_CXXFLAGS? If this doesn't work I can dig in later today

@BryanCutler (Member Author)

Thanks @wesm, that seemed to do the trick by adding it to the cmake command line (it didn't work as an environment variable):
cmake -DCMAKE_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=0" -DARROW_PYTHON=on -DARROW_HDFS=on -DCMAKE_BUILD_TYPE=release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME ..

@BryanCutler (Member Author) commented Feb 14, 2018

Well, I'm able to run the build with ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX but I do need to keep -DCMAKE_INSTALL_PREFIX=$ARROW_HOME for the Python build to work (is this right?), and then LD_LIBRARY_PATH=${ARROW_HOME}/lib:${CONDA_BASE}/lib:${LD_LIBRARY_PATH} to locate libarrow.so.

Now, I get an error with import pyarrow because of some undefined symbol:
ImportError: /home/ubuntu/miniconda/envs/pyarrow-dev/lib/python2.7/site-packages/pyarrow-0.8.1.dev116+g7bf7b2e9-py2.7-linux-x86_64.egg/pyarrow/lib.so: undefined symbol: _ZN5arrow9timestampENS_8TimeUnit4typeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

I don't think this is from an arrow library, but I can't tell where it's coming from, any ideas?

@wesm (Member) commented Feb 14, 2018

Now you need to set the PYARROW_CXXFLAGS environment variable with the gcc5 ABI flag.
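Putting the pieces from this thread together, the working flag setup looks roughly like the following (ARROW_HOME is whatever install prefix the container uses, CONDA_PREFIX stands in for the CONDA_BASE mentioned above; a sketch, not the exact script):

export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX
export LD_LIBRARY_PATH=${ARROW_HOME}/lib:${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}

# C++ build, run from the cpp build directory: pass the pre-gcc5 ABI flag through
# CMAKE_CXX_FLAGS so it also reaches the ExternalProject builds
cmake -DCMAKE_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=0" -DARROW_PYTHON=on -DARROW_HDFS=on \
      -DCMAKE_BUILD_TYPE=release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME ..
make install

# Python build: the same ABI flag has to reach the pyarrow extension build as well
export PYARROW_CXXFLAGS="-D_GLIBCXX_USE_CXX11_ABI=0"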

@BryanCutler changed the title from "[WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests" to "ARROW-1579: [Java] Adding containerized Spark Integration tests" on Feb 14, 2018
@BryanCutler (Member Author)

This is working using ARROW_BUILD_TOOLCHAIN set to the conda env, and I think this could be merged now. Is somebody else able to verify that it is working before merging?

@wesm (Member) commented Feb 15, 2018

Sweet, I can take this for a spin later today, or if @cpcloud wants to look that also works

@wesm (Member) commented Feb 16, 2018

Build failed for me with:

Scanning dependencies of target arrow_static
Scanning dependencies of target arrow_shared
make[2]: *** No rule to make target '/home/ubuntu/miniconda/envs/pyarrow-dev/lib/libbrotlidec.a', needed by 'release/libarrow.so.0.0.0'.  Stop.
CMakeFiles/Makefile2:693: recipe for target 'src/arrow/CMakeFiles/arrow_shared.dir/all' failed
make[1]: *** [src/arrow/CMakeFiles/arrow_shared.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 52%] Linking CXX static library ../../release/libarrow.a
[ 52%] Built target arrow_static
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2

Haven't dug too far into what's wrong. Can we document how to run this someplace other than the Dockerfile?

@BryanCutler (Member Author)

Thanks @wesm, I think that is because the brotli libraries were recently updated and #1554 should fix that. Could you check if you have the latest master checked out? The docker script just takes what is in your current dir; not sure if that is what we want or not.

@BryanCutler (Member Author)

Yeah, we should have it better documented. I copied how the API docs script was done and it's run with docker-compose, but I'm not quite sure how to run it through that. I just used the regular docker run command.

@xhochy (Member) commented Feb 16, 2018

As this is part of a docker-compose setup, this should be run using docker-compose build && docker-compose run spark_integration. It will then use the options declared in the docker-compose.yml (mostly the mount of the current working dir).

@xhochy (Member) commented Feb 16, 2018

To get past the build issues I had to rebase locally. Once the script has run through I will report back.

@BryanCutler (Member Author)

I probably should have rebased here earlier; if anyone else is going to try this out, let me know and I can do that first.

@xhochy (Member) commented Feb 16, 2018

Sadly I get the following error on master:

[warn] Class org.apache.avro.reflect.Stringable not found - continuing with a stub.
[error] /apache-arrow/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala:65: not found: type NullableMapVector
[error]       case (StructType(_), vector: NullableMapVector) =>
[error]                                    ^
[error] /apache-arrow/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala:66: value size is not a member of org.apache.arrow.vector.ValueVector
[error]         val children = (0 until vector.size()).map { ordinal =>
[error]                                        ^
[error] /apache-arrow/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala:67: value getChildByOrdinal is not a member of org.apache.arrow.vector.ValueVector
[error]           createFieldWriter(vector.getChildByOrdinal(ordinal))
[error]                                    ^
[error] /apache-arrow/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala:318: not found: type NullableMapVector
[error]     val valueVector: NullableMapVector,
[error]                      ^
[warn] Class org.apache.avro.reflect.Stringable not found - continuing with a stub.
[warn] two warnings found
[error] four errors found
[error] Compile failed at Feb 16, 2018 6:49:56 PM [50.120s]

@BryanCutler (Member Author)

That error is from #1490 being merged yesterday - what I was talking about in #1319 (comment) - but I was hoping to at least get this running for a little while first before breaking Spark!

I guess we could either rebase this to just before #1490 to ensure it works, or I can provide a patch for Spark to update it for that breaking change in Java?

@xhochy (Member) commented Feb 16, 2018

We definitely need an upstream patch for Spark but I'm happy with merging this PR as-is as it shows that we can now test Spark with it. @wesm is this ok for you?

@wesm (Member) commented Feb 17, 2018

Works for me. Merging now

@wesm closed this in 62b9eb2 on Feb 17, 2018
@BryanCutler (Member Author)

Thanks @wesm, @xhochy, and @felixcheung! Since it can sometimes take a while to get Spark updated, if we get to the point where this is ready to be put in the nightly builds, maybe I could submit a PR to patch Spark and we could configure the docker build to point to that.

@BryanCutler deleted the docker-spark-integration-ARROW-1579 branch on November 19, 2018