
Move to Apache Avro 1.10.1 #1648

Merged: 13 commits into apache:master from fd-move-to-avro-1-10, Jul 13, 2021

Conversation

@Fokko (Contributor) commented Oct 22, 2020

Moving to Apache Avro 1.10.0, with two behavioral changes:

  • GenericRecord.get(...) returned null for a non-existent field before 1.10.0, but it now throws an exception. Setting a field that doesn't exist already threw an exception; retrieving one did not, so a typo in a field name would silently return null. More information in https://issues.apache.org/jira/browse/AVRO-2278. A sketch of the new behavior follows further below.
  • The validation when downcasting decimals was made stricter. With the Parquet-to-Avro decimal conversion, there was an implicit downcast from precision 34 to 10, which now fails (see the sketch below):
org.apache.iceberg.spark.data.parquet.vectorized.TestParquetVectorizedReads > testMostlyNullsForOptionalFields FAILED
    org.apache.avro.AvroTypeException: Cannot encode decimal with precision 14 as max precision 10
        at org.apache.avro.Conversions$DecimalConversion.validate(Conversions.java:141)
        at org.apache.avro.Conversions$DecimalConversion.toFixed(Conversions.java:105)
        at org.apache.iceberg.parquet.ParquetAvro$FixedDecimalConversion.toFixed(ParquetAvro.java:177)
        at org.apache.iceberg.parquet.ParquetAvro$FixedDecimalConversion.toFixed(ParquetAvro.java:156)
        at org.apache.parquet.avro.AvroWriteSupport.convert(AvroWriteSupport.java:293)
        at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:276)
        at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
        at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
        at org.apache.iceberg.parquet.ParquetWriteSupport.write(ParquetWriteSupport.java:62)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
        at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:301)
        at org.apache.iceberg.parquet.ParquetWriteAdapter.add(ParquetWriteAdapter.java:45)
        at org.apache.iceberg.io.FileAppender.addAll(FileAppender.java:32)
        at org.apache.iceberg.io.FileAppender.addAll(FileAppender.java:37)
        at org.apache.iceberg.spark.data.parquet.vectorized.TestParquetVectorizedReads.writeAndValidate(TestParquetVectorizedReads.java:74)
        at org.apache.iceberg.spark.data.parquet.vectorized.TestParquetVectorizedReads.testMostlyNullsForOptionalFields(TestParquetVectorizedReads.java:175)

The current conversion only takes scale into account, so precision could be lost.
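
To make the stricter check concrete, here is a minimal sketch (the class name and value are made up for illustration; precision 14 mirrors the failure above):

    import java.math.BigDecimal;

    import org.apache.avro.Conversions;
    import org.apache.avro.LogicalTypes;
    import org.apache.avro.Schema;

    public class DecimalValidationSketch {
      public static void main(String[] args) {
        // A fixed(8) schema annotated as decimal(10, 2): max precision is 10.
        Schema fixedSchema = LogicalTypes.decimal(10, 2)
            .addToSchema(Schema.createFixed("dec", null, null, 8));

        // Precision 14, scale 2: wider than the schema allows.
        BigDecimal tooWide = new BigDecimal("123456789012.34");

        // Avro 1.10+ validates precision and throws:
        //   AvroTypeException: Cannot encode decimal with precision 14 as max precision 10
        new Conversions.DecimalConversion()
            .toFixed(tooWide, fixedSchema, LogicalTypes.decimal(10, 2));
      }
    }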

Personally, I like the more explicit messaging when it comes to non-existent fields and loss of precision, but this is a matter of taste.
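
For the first change, a minimal sketch of the new GenericRecord.get(...) behavior (the record and field names are made up for illustration):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class GetFieldSketch {
      public static void main(String[] args) {
        Schema schema = SchemaBuilder.record("User").fields()
            .requiredString("name")
            .endRecord();

        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "iceberg");

        // Avro < 1.10.0: the misspelled field name silently returned null.
        // Avro >= 1.10.0: throws AvroRuntimeException ("Not a valid schema field: nmae").
        Object value = record.get("nmae");
      }
    }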

@Fokko Fokko force-pushed the fd-move-to-avro-1-10 branch from 7f13e86 to effe75b on October 22, 2020 19:34
@Fokko Fokko force-pushed the fd-move-to-avro-1-10 branch 2 times, most recently from 43b8d97 to 775b3dc on October 25, 2020 12:45
@shardulm94 (Contributor) commented:

@Fokko I am also interested in getting Iceberg onto Avro 1.10.1. Are you still working on this?

@Fokko (Contributor, Author) commented Jan 8, 2021

@shardulm94 I'll pick this up again since it'll be convenient for #1972; this will make Schema serializable.

@shardulm94 (Contributor) left a review comment:

Mostly looks good to me. A couple of minor comments.

@shardulm94 (Contributor) commented:

@Fokko I see the Spark tests are failing with this error:

    java.lang.ExceptionInInitializerError

        Caused by:
        com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.10.2 requires Jackson Databind version >= 2.10.0 and < 2.11.0

This happens because Jackson expects the com.fasterxml.jackson.core version to match com.fasterxml.jackson.module, and that doesn't seem to be the case here. I ran a Gradle build scan, and it looks like the jackson-databind version is being upgraded to 2.11 because of Avro 1.10.0, which causes this issue.
Gradle build scan link

I think we may need to bump Iceberg's Jackson version in versions.props and in build.gradle.
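
A sketch of what that bump might look like (the keys and version number are assumptions, not the final change; 2.11.3 is the Jackson line Avro 1.10.1 ships with):

    # versions.props (hypothetical)
    com.fasterxml.jackson.core:* = 2.11.3
    com.fasterxml.jackson.module:jackson-module-scala_2.12 = 2.11.3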

@rdblue I am not 100% sure if this is the right thing to do. I know Spark can be finicky about the Jackson version. Do you have any insights here?

@rdblue (Contributor) commented Jan 13, 2021

@shardulm94, our Jackson version should be okay to update. We shade and relocate it in the runtime Jars to avoid conflicts with Spark.
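
For context, a minimal sketch of what shading and relocating Jackson looks like with the Gradle Shadow plugin (the relocated package prefix is an assumption, not Iceberg's actual build script):

    // build.gradle of a runtime module (sketch)
    shadowJar {
      relocate 'com.fasterxml.jackson', 'org.apache.iceberg.shaded.com.fasterxml.jackson'
    }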

@Fokko Fokko mentioned this pull request Jan 13, 2021
@Fokko (Contributor, Author) commented Jan 13, 2021

Not 100% sure if this is the case. The Jackson module in Avro isn't shaded, so that might cause issues with Spark 2. Avro 1.10.1 uses Jackson 2.11, while Avro 1.9 is still on 2.10.

Spark 3.0.1 is still on Jackson 2.10:

+--- org.apache.spark:spark-hive_2.12 -> 3.0.1
|    +--- org.apache.spark:spark-core_2.12:3.0.1
|    |    +--- com.fasterxml.jackson.module:jackson-module-scala_2.12:2.10.0 -> 2.10.2
|    |    |    +--- org.scala-lang:scala-library:2.12.10
|    |    |    +--- com.fasterxml.jackson.core:jackson-core:2.10.2
|    |    |    +--- com.fasterxml.jackson.core:jackson-annotations:2.10.2
|    |    |    +--- com.fasterxml.jackson.core:jackson-databind:2.10.2 (*)
|    |    |    \--- com.fasterxml.jackson.module:jackson-module-paranamer:2.10.2
|    |    |         +--- com.fasterxml.jackson.core:jackson-databind:2.10.2 (*)
|    |    |         \--- com.thoughtworks.paranamer:paranamer:2.8

https://github.com/apache/spark/blob/v3.0.1/pom.xml#L175

Pulling a newer version into the classpath will create a mismatch between the Jackson versions.

They've bumped Jackson in Spark 3.1, for exactly the issue that we're running into: apache/spark@01b73ae

@Fokko Fokko changed the title Move to Apache Avro 1.10.0 Move to Apache Avro 1.10.1 Jan 13, 2021
@Fokko Fokko force-pushed the fd-move-to-avro-1-10 branch 3 times, most recently from 264abc7 to dacb650 on January 18, 2021 17:04
@rdblue (Contributor) commented Feb 1, 2021

@Fokko, could you rebase this? Now that the Jackson update is in, I think we can work on it.

@Fokko (Contributor, Author) commented Feb 2, 2021

@rdblue Sure thing!

@shardulm94 (Contributor) left a review comment:

This looks good to me! @Fokko Just confirming, is the PR ready to merge?

      Assert.assertNotNull("Should project location", projected.get("location"));
    - Assert.assertNull("Should not project longitude", projectedLocation.get("long"));
    + AssertHelpers.assertEmptyAvroField(projected, "long");
A reviewer (Contributor) commented on the diff:

I think this should still use projectedLocation. Looks like it might just be a typo.

The author (Contributor) replied:

Good catch, probably a copy-paste!

    @@ -169,12 +164,14 @@ public String getLogicalTypeName() {

      @Override
      public BigDecimal fromFixed(GenericFixed value, Schema schema, LogicalType type) {
    -   return super.fromFixed(value, schema, decimalsByScale[((ParquetDecimal) type).scale()]);
    +   final ParquetDecimal dec = (ParquetDecimal) type;
    +   return super.fromFixed(value, schema, LogicalTypes.decimal(dec.precision(), dec.scale()));
A reviewer (Contributor) commented:

This change results in creating a new Decimal for every value. While I don't think that we use this very much, I would rather not introduce object allocation for every conversion. Can you add a static cache for these? That was the purpose of decimalsByScale.

The author (Contributor) replied:

I know, but we didn't take the scale into account, so it could actually change the scale. I fixed this in fromFixed by directly creating the BigDecimal; this is what's happening in the Avro library:

    @Override
    public BigDecimal fromFixed(GenericFixed value, Schema schema, LogicalType type) {
      int scale = ((LogicalTypes.Decimal) type).getScale();
      return new BigDecimal(new BigInteger(value.bytes()), scale);
    }

For toFixed this isn't as trivial, since it would mean copying a lot of code.

A reviewer (Contributor) replied:

I think it's fine to delegate to toFixed. I just don't want to call LogicalTypes.decimal every time to create a new decimal LogicalType with the same scale and precision. I think just adding a cache keyed by a Pair of scale and precision would fix it.
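
Something along these lines would satisfy that (a sketch using Iceberg's Pair utility; the class and method names are illustrative):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.avro.LogicalTypes;
    import org.apache.iceberg.util.Pair;

    class DecimalTypeCache {
      // Cache decimal logical types keyed by (precision, scale) so the
      // converters don't allocate a new LogicalType for every value.
      private static final Map<Pair<Integer, Integer>, LogicalTypes.Decimal> DECIMALS =
          new ConcurrentHashMap<>();

      static LogicalTypes.Decimal decimal(int precision, int scale) {
        return DECIMALS.computeIfAbsent(
            Pair.of(precision, scale),
            key -> LogicalTypes.decimal(key.first(), key.second()));
      }
    }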

@rdblue (Contributor) commented Feb 21, 2021

@Fokko, looks correct overall, although there is one error in a test case. I'd also like to fix the allocation problem in the fixed converter.

@Fokko Fokko force-pushed the fd-move-to-avro-1-10 branch from 04f5901 to a0328ac on March 11, 2021 08:38
@rdblue (Contributor) commented Mar 23, 2021

Looks like this would fix #1654, so we should get it into the 0.12.0 release. I don't think we should update Avro with a breaking change in a patch release, so it isn't a good idea to try to get this into 0.11.1.

@Fokko, if you have time please update to 1.10.2 and I'll merge. Otherwise, I'll merge this in a day or so and we can update again. Thanks!

@Fokko (Contributor, Author) commented Mar 23, 2021

Yes, I'll fix it right away.

@Fokko Fokko force-pushed the fd-move-to-avro-1-10 branch from 5cbe6ca to f931bbf on March 23, 2021 20:45
@Fokko Fokko force-pushed the fd-move-to-avro-1-10 branch from b98e1a7 to 30db7ec on March 23, 2021 21:07
@rdblue (Contributor) commented Mar 23, 2021

I think a Jackson dependency update in Avro breaks Spark:

com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.11.4 requires Jackson Databind version >= 2.11.0 and < 2.12.0

Looks like Avro updated Jackson from 2.11.3 to 2.12.2, which is what causes the problem. We should revert the latest update to 1.10.2 and use 1.10.1 to fix the problem, I guess.

@davseitsev commented:

Is there any workaround for this issue? Are you going to release 0.11.2 with this fix?

@rdblue (Contributor) commented Jul 7, 2021

@davseitsev, we don't typically change dependency versions in patch releases. I think that would be a surprise to people who depend on patches containing very few changes. This would be a good one to get into 0.12.0 (releasing in the next couple of weeks) if we can get tests passing.

@Fokko Fokko changed the title Move to Apache Avro 1.10.1 Move to Apache Avro 1.10.2 Jul 12, 2021
Fokko Driesprong added 2 commits July 12, 2021 22:44
@Fokko Fokko changed the title Move to Apache Avro 1.10.2 Move to Apache Avro 1.10.1 Jul 13, 2021
@rdblue rdblue merged commit b3fb81a into apache:master Jul 13, 2021
@rdblue (Contributor) commented Jul 13, 2021

Merged! Thanks for updating this, @Fokko!

minchowang pushed a commit to minchowang/iceberg that referenced this pull request Aug 2, 2021
Co-authored-by: Fokko Driesprong <fdriesprong@ebay.com>