Arrow: Bump Apache Arrow 7.0.0 #4112

Merged
merged 1 commit into from
Feb 16, 2022

pan3793
Member

@pan3793 pan3793 commented Feb 14, 2022

To pick up new improvements & bug fixes from the latest release.
Release Notes: https://arrow.apache.org/release/7.0.0.html

Benchmark result

This PR Arrow 7.0.0

https://github.com/pan3793/iceberg/actions/runs/1844723231

Benchmark                                                                                  Mode  Cnt   Score   Error  Units
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5   6.441 ± 0.359   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5   4.842 ± 0.780   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  19.441 ± 2.028   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5  18.144 ± 1.355   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5   4.421 ± 0.114   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5   4.550 ± 0.653   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5   5.084 ± 0.104   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5   4.494 ± 0.359   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5   5.839 ± 0.849   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5   5.080 ± 0.598   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5   4.132 ± 0.668   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5   4.305 ± 0.640   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5   6.813 ± 0.996   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5   9.497 ± 0.291   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5   4.930 ± 0.750   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5   4.194 ± 0.340   s/op
Benchmark                                                                 Mode  Cnt   Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5   2.179 ± 0.101   s/op
VectorizedReadFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5   2.058 ± 0.279   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  11.474 ± 2.022   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5   8.794 ± 0.162   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5   3.750 ± 0.186   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5   3.372 ± 0.154   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5   3.710 ± 0.258   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5   3.586 ± 0.300   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5   3.254 ± 0.378   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5   2.881 ± 0.232   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5   4.024 ± 0.670   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5   3.324 ± 0.184   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5   5.452 ± 0.740   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5   6.466 ± 0.481   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5   2.404 ± 0.216   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5   1.897 ± 0.137   s/op

Master branch Arrow 6.0.0

https://github.com/pan3793/iceberg/actions/runs/1844725462

Benchmark                                                                                  Mode  Cnt   Score   Error  Units
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5   5.792 ± 0.072   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5   5.194 ± 0.098   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  22.779 ± 0.248   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5  18.235 ± 0.069   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5   6.302 ± 0.086   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5   5.784 ± 0.331   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5   6.403 ± 0.367   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5   4.686 ± 0.115   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5   6.189 ± 0.092   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5   4.365 ± 0.081   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5   4.482 ± 0.091   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5   4.530 ± 0.059   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5   8.277 ± 0.148   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5   8.560 ± 0.185   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5   6.076 ± 0.104   s/op
VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5   4.539 ± 0.101   s/op
Benchmark                                                                 Mode  Cnt   Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5   2.218 ± 0.140   s/op
VectorizedReadFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5   2.046 ± 0.080   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  11.647 ± 0.174   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5   9.253 ± 0.053   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5   4.435 ± 0.154   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5   3.234 ± 0.129   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5   3.985 ± 0.160   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5   3.911 ± 0.081   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5   3.906 ± 0.101   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5   3.512 ± 0.071   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5   4.458 ± 0.142   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5   3.417 ± 0.083   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5   6.553 ± 0.183   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5   5.590 ± 0.063   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5   2.280 ± 0.087   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5   2.567 ± 0.124   s/op
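
For anyone who wants to compare the two runs above line by line, here is a minimal sketch of how the JMH text output could be parsed and diffed. It is not part of the PR or of Iceberg's benchmark tooling: the helper names `parse_jmh` and `percent_change` are hypothetical, and the regular expression assumes the exact row layout shown above (name, mode, count, score, `±`, error, units). The two sample rows per run are copied from the flat Parquet tables.

```python
import re

# One JMH result row: name, mode, count, score, "±", error, units.
# The header row does not match because its third column ("Cnt") is not a number.
ROW = re.compile(r"^(\S+)\s+(\w+)\s+(\d+)\s+([\d.]+)\s+±\s+([\d.]+)\s+(\S+)$")


def parse_jmh(text):
    """Return {benchmark name: (score, error)} for every result row in text."""
    results = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            results[match.group(1)] = (float(match.group(4)), float(match.group(5)))
    return results


def percent_change(baseline, candidate):
    """Percent change in score for benchmarks present in both runs.

    Scores are in s/op, so a negative value means the candidate is faster.
    """
    return {
        name: 100.0 * (candidate[name][0] - baseline[name][0]) / baseline[name][0]
        for name in baseline
        if name in candidate
    }


# Sample rows copied from the Arrow 6.0.0 (master) run.
arrow6 = parse_jmh("""
Benchmark                                                                 Mode  Cnt   Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5   4.435 ± 0.154   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5   2.280 ± 0.087   s/op
""")

# The same benchmarks from the Arrow 7.0.0 (this PR) run.
arrow7 = parse_jmh("""
Benchmark                                                                 Mode  Cnt   Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5   3.750 ± 0.186   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5   2.404 ± 0.216   s/op
""")

changes = percent_change(arrow6, arrow7)
for name, pct in sorted(changes.items()):
    print(f"{name}: {pct:+.1f}%")
```

The percent change ignores the reported error column, so with only five single-shot iterations per benchmark it should be read as a rough screen rather than a verdict.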

@github-actions github-actions bot added the build label Feb 14, 2022
Contributor

@kbendick kbendick left a comment

It looked like a big change jumping from 6.0.0 to 7.0.0, but it looks like the only thing we missed in between was a 6.0.1 release at the end of 2021.

Thought I'd leave that for anybody who, like me, initially looks at this and thinks this is a huge change. We only "missed" one minor patch version. It should be just the combination of 6.0.1 and what's listed in the changelog in the PR description (which isn’t trivial, but the PR description change log should cover most things).

@rdblue
Contributor

rdblue commented Feb 14, 2022

We'll want to run the benchmarks to make sure there's no performance regression before committing this. You should be able to do that using the actions that @nastra set up.

@pan3793
Member Author

pan3793 commented Feb 15, 2022

@kbendick Thanks for the review. I think all of the fixes in v6.0.1 should be included in v7.0.0.
@rdblue Thanks for the tip. I've updated the benchmark results, and I don't see any performance regression that would block us from upgrading to Arrow 7.0.0.

@rdblue
Contributor

rdblue commented Feb 15, 2022

@rymurr, @RussellSpitzer, @emkornfield, any concerns with this update for the 0.14.0 release?

@RussellSpitzer
Member

I probably wouldn't even be averse to doing this for a 13.x release. +1

@emkornfield
Contributor

LGTM. FWIW, Spark also just upgraded on master; I'm not sure about the other engines. As an FYI, one of the more recent features in Arrow Java is bindings to the C++ Parquet Dataset reader, which is reportedly faster than parquet-mr in some cases (I'm not exactly sure how Iceberg is using Arrow Java).

@emkornfield
Contributor

I guess my only concern is potential dependency "hell" for consumers of Iceberg (I haven't had to delve into this in Java for quite some time).

@rdblue
Contributor

rdblue commented Feb 16, 2022

Thanks, @emkornfield! It should be okay because we shade Arrow to avoid conflicting with Spark and other engines.

@rdblue rdblue merged commit 080495f into apache:master Feb 16, 2022
@rdblue
Contributor

rdblue commented Feb 16, 2022

Thanks, @pan3793!

@pan3793 pan3793 deleted the arrow branch February 18, 2022 01:40
sunchao added a commit to sunchao/iceberg that referenced this pull request May 9, 2023
This PR upgrades Apache Arrow version to 7.0.0, to be consistent with Spark & Boson. It's the same as the OSS PR: apache#4112 here.