@Kontinuation
Member

@Kontinuation Kontinuation commented Feb 20, 2025

This PR depends on #12667. It implements part of the iceberg geo spec: #10981.

The iceberg spec requires that geometry and geography types in iceberg are mapped to BINARY physical types with GEOMETRY or GEOGRAPHY logical type annotations. These two spatial logical types were introduced to the Parquet format in apache/parquet-format#240, and the initial implementation of the spec has been merged into parquet-java: apache/parquet-java#2971 and apache/parquet-java#3200.
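
To make the BINARY mapping concrete, here is a minimal sketch of what the bytes of a single point value could look like. It assumes the binary payload is Well-Known Binary (WKB), which is what the Parquet geospatial spec describes but is not stated in this thread. The class and method names (`WkbPointSketch`, `encodePoint`, `decodePoint`) are hypothetical and are not part of parquet-java or Iceberg.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only: encodes and decodes a 2D point as WKB.
final class WkbPointSketch {
  private static final int WKB_POINT = 1;      // WKB geometry type code for Point
  private static final int POINT_LENGTH = 21;  // 1 (byte order) + 4 (type) + 8 (x) + 8 (y)

  private WkbPointSketch() {}

  // Produces little-endian WKB: byte-order flag, geometry type, then x and y as doubles.
  static byte[] encodePoint(double x, double y) {
    ByteBuffer buffer = ByteBuffer.allocate(POINT_LENGTH).order(ByteOrder.LITTLE_ENDIAN);
    buffer.put((byte) 1);      // 1 = little-endian, 0 = big-endian
    buffer.putInt(WKB_POINT);
    buffer.putDouble(x);
    buffer.putDouble(y);
    return buffer.array();
  }

  // Reads the byte-order flag first, then decodes the rest using that order.
  static double[] decodePoint(byte[] wkb) {
    if (wkb.length != POINT_LENGTH) {
      throw new IllegalArgumentException("Expected " + POINT_LENGTH + " bytes, got " + wkb.length);
    }
    ByteOrder order = wkb[0] == 1 ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN;
    ByteBuffer buffer = ByteBuffer.wrap(wkb, 1, wkb.length - 1).order(order);
    int type = buffer.getInt();
    if (type != WKB_POINT) {
      throw new IllegalArgumentException("Not a WKB point, type code: " + type);
    }
    return new double[] {buffer.getDouble(), buffer.getDouble()};
  }
}
```

A byte array like this is what would end up in the BINARY column. The GEOMETRY or GEOGRAPHY annotation only tells readers how to interpret it.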

The parquet-java implementation has not been released yet, so this work-in-progress PR depends on a locally built SNAPSHOT version of parquet-java. We'll mark it ready once we bump the parquet-java version to the new release and pass all the tests.

@Kontinuation
Member Author

Kontinuation commented Feb 20, 2025

I found that it is not easy to upgrade the parquet dependency to the (not-released-yet) next version, because parquet-hadoop now uses a FileSystem API introduced in Hadoop 3: apache/parquet-java#3079. Upgrading parquet dependencies to the latest SNAPSHOT version results in the following failure when running tests in iceberg-data:

'org.apache.hadoop.fs.FutureDataInputStreamBuilder org.apache.hadoop.fs.FileSystem.openFile(org.apache.hadoop.fs.Path)'
java.lang.NoSuchMethodError: 'org.apache.hadoop.fs.FutureDataInputStreamBuilder org.apache.hadoop.fs.FileSystem.openFile(org.apache.hadoop.fs.Path)'
	at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:114)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:925)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:710)
	at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:194)
	at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:76)

We have to remove Hadoop 2 support and migrate to Hadoop 3 for all submodules. There is a stale PR working on this: #10932. I found that #10940 was closed as completed, but there are still lots of submodules depending on Hadoop 2. I'd like to know how we should proceed with upgrading the parquet package. Should we upgrade dependencies from Hadoop 2 to Hadoop 3 to unblock the parquet upgrade? @szehon-ho @rdblue
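
For anyone collecting the affected modules, a small reflection probe may help confirm whether a given test classpath has the method named in the `NoSuchMethodError` above. This is only a sketch: `ApiProbe` and `hasPublicMethod` are hypothetical names, and the probe returns `false` both when the class is missing and when the method is missing.

```java
// Hypothetical helper, for illustration only: checks whether a public method
// with the given parameter types is visible on the current classpath.
final class ApiProbe {
  private ApiProbe() {}

  static boolean hasPublicMethod(
      String className, String methodName, String... parameterTypeNames) {
    try {
      ClassLoader loader = ApiProbe.class.getClassLoader();
      // Load without initializing, so static initializers are not triggered.
      Class<?> owner = Class.forName(className, false, loader);
      Class<?>[] parameterTypes = new Class<?>[parameterTypeNames.length];
      for (int i = 0; i < parameterTypeNames.length; i++) {
        parameterTypes[i] = Class.forName(parameterTypeNames[i], false, loader);
      }
      owner.getMethod(methodName, parameterTypes);
      return true;
    } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
      // Class or method is absent (or cannot be linked) on this classpath.
      return false;
    }
  }
}
```

Calling `ApiProbe.hasPublicMethod("org.apache.hadoop.fs.FileSystem", "openFile", "org.apache.hadoop.fs.Path")` from a module's tests would be expected to return `false` where the Hadoop 2 `FileSystem` is still the one on the classpath, which matches the failure in the stack trace.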

@pvary
Contributor

pvary commented Feb 20, 2025

I would raise this question on the dev list to get a wider audience for the issue, after collecting the list of affected modules.

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 23, 2025
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Mar 30, 2025
@nastra nastra reopened this Apr 3, 2025
@nastra nastra added not-stale and removed stale labels Apr 3, 2025
@szehon-ho
Member

Hi, sorry @Kontinuation for the long delay (Iceberg Summit and internal stuff). I wonder if we can rebase on #12346 and keep only the Parquet part (not the expression part). Also, is it cleaner now that Parquet Format 2.11.0 has been released with the Parquet geo logical types?

@Kontinuation
Member Author

> Hi, sorry @Kontinuation for the long delay (Iceberg Summit and internal stuff). I wonder if we can rebase on #12346 and keep only the Parquet part (not the expression part). Also, is it cleaner now that Parquet Format 2.11.0 has been released with the Parquet geo logical types?

Sure. I'll rework this patch and mark it ready for review once geospatial bounds and spatial predicates have been added to api/core.

@szehon-ho
Member

> We have to remove Hadoop 2 support and migrate to Hadoop 3 for all submodules. There is a stale PR working on this: #10932. I found that #10940 was closed as completed, but there are still lots of submodules depending on Hadoop 2. I'd like to know how we should proceed with upgrading the parquet package. Should we upgrade dependencies from Hadoop 2 to Hadoop 3 to unblock the parquet upgrade? @szehon-ho @rdblue

By the way, I suppose we should try to move away from Hadoop 2 for the remaining submodules in order to move ahead. Let's see where we still have issues.

@freamdx

freamdx commented May 12, 2025

freamdx@929dfae is a simple solution

@Kontinuation Kontinuation force-pushed the pr-geo-parquet-data branch from 20c391a to ae97c63 Compare May 21, 2025 12:39
@talatuyarer
Contributor

Hi @Kontinuation, do we need this PR to use geo types in Iceberg?

@Kontinuation
Member Author

> Hi @Kontinuation, do we need this PR to use geo types in Iceberg?

This PR is still far from getting Geo types working with actual query engines such as Spark. We also need to make changes to the query engine integration (e.g. iceberg-spark and iceberg-spark-extensions) to make it actually usable.

@talatuyarer
Contributor

@Kontinuation If I want to use geo types with Apache Sedona for my Iceberg table, what should I do? I tried to create a table with a geometry column, but it throws an exception.

spark-sql (default)> CREATE TABLE LOCAL.db.icetable (id string, geometry geometry)
                   > USING iceberg
                   > TBLPROPERTIES('format-version'='3');

[UNSUPPORTED_DATATYPE] Unsupported data type "GEOMETRY". SQLSTATE: 0A000
== SQL (line 1, position 53) ==
...b.icetable (id string, geometry geometry)
                                   ^^^^^^^^

Should I patch Apache Sedona with your PR apache/sedona#1830?
