Skip to content

[HADOOP-18342] shaded avro jar #21

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Oct 23, 2022
Merged

Conversation

pjfanning
Copy link
Contributor

https://issues.apache.org/jira/browse/HADOOP-18342

This is based on latest avro jar (1.11.1). The shaded avro jar also contains a shaded copy of jackson (based on version avro needs). This is jackson 2.12.7 which matches what hadoop uses but I think it might be safer not to rely on avro and hadoop needing the same version of jackson and jackson upgrades can prove problematic for downstream projects.

I have omitted these jars from the shaded jar:

  • slf4j-api
  • commons-compress

These 2 are more stable and might be tidier not to completely bloat the avro shaded jar.

@steveloughran
Copy link
Contributor

I'd prefer retaining jackson unshaded just to avoid the need to do another release of this every time there's a jackson CVE emergency

leaving out slf4j is good and flexible; though the dependencies of the shaded jar should declare that and compress as either provided or required.

now, the parquet lib includes the version. I'm not sure that is the right thing to do (as it complicates that update), but it would allow someone brave to have the different versions coexisting.

thoughts?

@pjfanning
Copy link
Contributor Author

Thanks @steveloughran - I'll look into changing the shaded avro jar and pom.xml as you suggest.

I'm not sure if I understand your point about parquet. Do you think parquet needs to be shaded as an extra hadoop-thirdparty jar? Or does hadoop use parquet-avro jar and if so, would we need to shade parquet-avro too?

@steveloughran
Copy link
Contributor

it's that because the parquet bit of the shaded jar has a version in it, upgrading is harder than just bumping up the build number, i need to change the module name etc too.

@steveloughran
Copy link
Contributor

no problem, great that you are working on this.

@pjfanning
Copy link
Contributor Author

pjfanning commented Sep 4, 2022

@steveloughran I have made some progress with the hadoop-avro jar. I still need to fully test it in a hadoop build though. I have noticed that hadoop also uses avro-maven-plugin to compile generated classes.

It does appear that we will need to shade this too because at the moment avro-maven-plugin generates classes that import org.apache.avro classes.

@pjfanning
Copy link
Contributor Author

@steveloughran I think it's easier to keep using the standard avro-maven-plugin and to postprocess the generated source to replace the org.apache.avro package names in the generated source. I have this approach working in a local hadoop build.

@pjfanning
Copy link
Contributor Author

@steveloughran could you review this? apache/hadoop#4854 uses a temp jar I published myself but can be adjusted when this hadoop-thirdparty PR is merged and released.

@steveloughran
Copy link
Contributor

lgtm; needs a newline at the end of the new pom.
+1 pending that

@steveloughran
Copy link
Contributor

do see #19 and discussion about whether artifacts need names. it may be good to use a version number here...

@pjfanning
Copy link
Contributor Author

do see #19 and discussion about whether artifacts need names. it may be good to use a version number here...

I can rename the module and jar to hadoop-shaded-avro_1_11 if that makes sense.

@steveloughran
Copy link
Contributor

let's do that. yes, upgrading is harder, but it's a lot more obvious what is in use.

(the other strategy would be to include it in the version itself, but that'd just confuse people)

Copy link
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@steveloughran steveloughran merged commit 476dbc0 into apache:trunk Oct 23, 2022
@steveloughran
Copy link
Contributor

this is in; we need to get the protobuf update in and it'll be good for a release. though we may also want to update guava, which maybe we can do by adding a guava artifact with a version...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants