Upgrade Hadoop and Hive to version 3 #1636
Conversation
Add the shadow plugin to the Gradle build.
Change the MR project to use JAR inputs instead of classes, since the shading work must be done first.
These aren't needed anymore, since any code in the integration tests that would be shared with the qa tests has been lifted out.
Downgrade the Hadoop version to 3.1.2. We can't test on any higher version because Hive packages an old version of Guava that might be loaded first, which breaks much of the Hadoop code. Force all Hadoop versions to 3.1.2. Add a hive-site.xml file to provide settings to Hive in the test fixture. Expand the test seed in that file to keep the metastore fresh. Update the HiveEmbeddedServer startup for the new Hive version. Switch to the AbstractSerDe class instead of the now-removed SerDe interface. Add code to handle TimestampWritableV2 and DateWritableV2 (see the sketch below).
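For illustration, here is a minimal sketch of how reader code can branch on the V2 writable classes that Hive 3 hands back; the Hive class names are real, but `V2WritableExample` and its `describe` method are hypothetical and not the connector's actual HiveValueReader code.

```groovy
import org.apache.hadoop.hive.serde2.io.DateWritableV2
import org.apache.hadoop.hive.serde2.io.TimestampWritableV2
import org.apache.hadoop.io.Writable

// Hypothetical helper, not the connector's HiveValueReader: Hive 3 returns
// TimestampWritableV2/DateWritableV2 where older versions used the V1 classes.
class V2WritableExample {
    static String describe(Writable value) {
        if (value instanceof TimestampWritableV2) {
            // getTimestamp() returns Hive 3's timezone-agnostic Timestamp type
            return value.getTimestamp().toString()
        }
        if (value instanceof DateWritableV2) {
            return value.get().toString()
        }
        return String.valueOf(value)
    }
}
```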
Get the Hive qa Kerberos tests running again. Hive now requires running the schema tool before it can start, and it needs extra setup steps to patch in libraries from Hadoop in order to run. Some changes were needed to the SQL scripts as well, to account for changes in job planning.
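As a rough sketch of that schema-tool step, a test fixture could initialize the metastore with a Gradle task along these lines; the task name, the `hiveHome` property, and the Derby database type are illustrative assumptions, not the fixture's actual wiring.

```groovy
// Illustrative only: initialize the Hive 3 metastore schema before startup.
// 'hiveHome' is an assumed property pointing at the unpacked Hive distribution.
tasks.register('initHiveMetastore', Exec) {
    // schematool ships with Hive; -initSchema creates the metastore tables,
    // which Hive 3 no longer creates implicitly on first start
    commandLine "${hiveHome}/bin/schematool", '-dbType', 'derby', '-initSchema'
}
```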
These are now pulled into the thirdparty and dist projects instead of MR and root.
Combed through the changes and found a few last cleanup spots:
buildSrc/src/main/groovy/org/elasticsearch/hadoop/gradle/BuildPlugin.groovy
.../groovy/org/elasticsearch/hadoop/gradle/fixture/hadoop/services/HiveServiceDescriptor.groovy
...vy/org/elasticsearch/hadoop/gradle/fixture/hadoop/services/SparkYarnServiceDescriptor.groovy
pig/src/itest/java/org/elasticsearch/hadoop/integration/pig/AbstractPigExtraTests.java
Consider switching to the Hadoop Shaded Client? https://issues.apache.org/jira/browse/SPARK-33212
Feedback during second review
hive/src/main/java/org/elasticsearch/hadoop/hive/HiveValueReader.java
.../groovy/org/elasticsearch/hadoop/gradle/fixture/hadoop/services/HiveServiceDescriptor.groovy
This might be a good thing for us to depend on in the project instead of the regular client libraries we pull in, but I think it should be part of another PR - this one is already too large. I think that shading our dependencies ourselves going forward will give us more flexibility to upgrade them without the concern that they'll break other integrations or user code, but the API dependency could be a good way to avoid pulling in any conflicting transitive dependencies.
The official shaded client only exposes the public Hadoop API, which is quite stable. Shading behavior is not always safe - it can accidentally break libraries that depend on specific Hadoop APIs, and even the official shaded client doesn't get this perfect, e.g. apache/hadoop#2575. From my perspective, I recommend making the project depend only on the public Hadoop API, and testing against the Hadoop Shaded Client (>= 3.0) and the normal Hadoop client (< 3.0), which is what Apache Spark does now.
We aren't shading any artifacts from the Hadoop packages. We're shading dependencies that Hadoop previously provided to jobs at runtime. If we were shading Hadoop dependencies, I would absolutely agree with you - any change to those APIs in the runtime can cause runtime problems - but as it stands, only HTTP Client and Jackson are being shaded in ES-Hadoop.
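For context, a sketch of what depending on the official shaded client artifacts suggested above could look like in a Gradle build; the version and the choice of dependency configurations are illustrative, and this is not what the PR adopts.

```groovy
// Illustrative sketch of depending on Hadoop's official shaded clients
// (hadoop-client-api / hadoop-client-runtime exist as of Hadoop 3.0).
dependencies {
    compileOnly 'org.apache.hadoop:hadoop-client-api:3.1.2'
    testRuntimeOnly 'org.apache.hadoop:hadoop-client-runtime:3.1.2'
}
```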
LGTM
(Jimmy walked Lee and me through the changes, and all tests pass locally.)
I noticed a minor issue with the build with a local repo vs. without one. Can you add the following? (Verify by running ./gradlew clean && ./gradlew clean -PlocalRepo.)

diff --git a/settings.gradle b/settings.gradle
index 259e6e31..e47115ed 100644
--- a/settings.gradle
+++ b/settings.gradle
@@ -1,3 +1,9 @@
+pluginManagement {
+ plugins {
+ id 'com.github.johnrengelman.shadow' version "6.1.0"
+ }
+}
+
rootProject.name = "elasticsearch-hadoop"
include 'thirdparty'
@@ -37,3 +43,4 @@ include 'test:fixtures:minikdc'
include 'qa'
include 'qa:kerberos'
+
diff --git a/thirdparty/build.gradle b/thirdparty/build.gradle
index 03e491b2..c0717fd0 100644
--- a/thirdparty/build.gradle
+++ b/thirdparty/build.gradle
@@ -1,7 +1,7 @@
import org.elasticsearch.hadoop.gradle.BuildPlugin
plugins {
- id 'com.github.johnrengelman.shadow' version "6.1.0"
+ id 'com.github.johnrengelman.shadow'
id 'es.hadoop.build'
}
I've updated the plugin management so that it works with and without a local repo now.
This PR upgrades our dependencies on Hadoop and Hive to version 3.x.
Hadoop removed and isolated some dependencies that we make use of in the connector, primarily the HTTP client libraries. In order to maintain a single-jar deployment model, we are now shading the client library dependencies into our release artifacts. NOTICE files and license checks have been updated for this.
Our version of Hive was too old to function on the newer version of Hadoop, so Hive has also been updated to version 3.x on its own release track. I have tested that the Hive integration code still works with Hive 1.2, but in the next major version we may update the minimum supported version of Hive.
The Hadoop version upgrade also caused some issues with our Spark testing code. Hadoop has upgraded the version of Jackson it depends on, and the version that Hadoop provides conflicts with the version of Jackson that Spark expects. This is normally not a problem when deploying either solution in isolation, but it causes compatibility issues with Jackson modules when attempting to support both frameworks with one library. As such, the required Jackson libraries have also been shaded into the release artifacts for ES-Hadoop. We already packaged some Jackson code for backwards compatibility, so the NOTICE files are already valid for this change.
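As a rough illustration of the shading described above, a shadow-plugin configuration along these lines relocates the HTTP client and Jackson packages; the relocation target prefix is an assumption for this sketch, not necessarily what the build actually uses.

```groovy
// Illustrative shadowJar configuration; the relocated package prefix
// 'org.elasticsearch.hadoop.thirdparty' is an assumption for this sketch.
shadowJar {
    relocate 'org.apache.commons.httpclient', 'org.elasticsearch.hadoop.thirdparty.apache.commons.httpclient'
    relocate 'org.codehaus.jackson', 'org.elasticsearch.hadoop.thirdparty.codehaus.jackson'
}
```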
Several changes went into the Kerberos QA project to support the version updates. The Hive integration tests also required a number of changes due to how the Hive metastore is now initialized, and some additional changes went in to support running on external clusters.