Description
Users have reported that it's not possible to dynamically provision delta.io packages to use with PySpark.
The erroneous behavior can be reproduced with this commit.
The error is fixed and the Delta test (and all others except logging) passes with this commit. This fix is only temporary and cannot be merged in its current form since it breaks the logging tests.
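For context, "dynamically provisioning" here means resolving the Delta packages at submit time instead of baking them into the image. A minimal PySpark sketch of what users are attempting (the package coordinates, version, and paths are illustrative assumptions, not taken from the issue):

```python
from pyspark.sql import SparkSession

# Ask Spark to resolve the Delta package from Maven at startup and enable the
# Delta SQL extension and catalog. The coordinates/version below are an
# assumption for illustration; use the build matching your Spark version.
spark = (
    SparkSession.builder
    .appName("delta-dynamic-provisioning")
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Simple smoke test: write and read back a Delta table (path is hypothetical).
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-smoke")
spark.read.format("delta").load("/tmp/delta-smoke").show()
```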
Analysis
The problem is caused by the following two properties, which the operator always adds to spark-submit in order to support log aggregation with Vector:
--conf spark.driver.userClassPathFirst=true
--conf spark.executor.userClassPathFirst=true
In addition, the user classpath is extended like this:
--conf spark.driver.extraClassPath=/stackable/spark/extra-jars/*
--conf spark.executor.extraClassPath=/stackable/spark/extra-jars/*
The contents of /stackable/spark/extra-jars/ are:
bash-4.4$ ls -l /stackable/spark/extra-jars/
total 1868
-rw-r--r-- 1 stackable stackable 126137 Feb 12 08:54 jackson-dataformat-xml-2.15.2.jar
-rw-r--r-- 1 stackable stackable 195909 Feb 12 08:54 stax2-api-4.2.1.jar
-rw-r--r-- 1 stackable stackable 1586395 Feb 12 08:54 woodstox-core-6.5.1.jar
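Putting this together, the failing case is roughly the combination of the operator-injected class-path settings above with the user's dynamically provisioned Delta package. A hedged sketch of that combination (only the four class-path properties come from the issue; the Delta coordinates and settings are illustrative assumptions):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-with-operator-classpath")
    # Injected by the operator to support Vector log aggregation:
    .config("spark.driver.userClassPathFirst", "true")
    .config("spark.executor.userClassPathFirst", "true")
    .config("spark.driver.extraClassPath", "/stackable/spark/extra-jars/*")
    .config("spark.executor.extraClassPath", "/stackable/spark/extra-jars/*")
    # Requested by the user for Delta (illustrative coordinates/version):
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```

The temporary fix referenced above drops the two userClassPathFirst properties, which is also why it currently breaks the logging tests.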
Acceptance Criteria
Since this is an investigation ticket, the following outcomes are possible:
- An integration test showcasing Stackable and Delta with PySpark and S3.
- Updated operator documentation.
- An update to the Spark images to include Delta dependencies.
- A new Spark image with Delta dependencies.
Related PRs
- fix: Remove userClassPathFirst properties #355
- Reorganize logging jars [was: experimental: spark with delta] docker-images#556