PyIceberg: ORC file format support #6973

@alaturqua

Feature Request / Improvement

We store our Iceberg tables in the ORC file format on Azure storage.

PyIceberg currently supports the Parquet format but not ORC, so scanning an ORC-backed table fails with the error below.

This is a request to add ORC file format support to PyIceberg.

tbl.location()
---> 18 tbl.scan().to_pandas()

File C:\Projects\incubator-iceberg\python\pyiceberg\table\__init__.py:409, in DataScan.to_pandas(self, **kwargs)
    408 def to_pandas(self, **kwargs: Any) -> pd.DataFrame:
--> 409     return self.to_arrow().to_pandas(**kwargs)

File C:\Projects\incubator-iceberg\python\pyiceberg\table\__init__.py:404, in DataScan.to_arrow(self)
    401 def to_arrow(self) -> pa.Table:
    402     from pyiceberg.io.pyarrow import project_table
--> 404     return project_table(
    405         self.plan_files(), self.table, self.row_filter, self.projection(), case_sensitive=self.case_sensitive
    406     )

File C:\Projects\incubator-iceberg\python\pyiceberg\io\pyarrow.py:558, in project_table(tasks, table, row_filter, projected_schema, case_sensitive)
    551 projected_field_ids = {
    552     id for id in projected_schema.field_ids if not isinstance(projected_schema.find_type(id), (MapType, ListType))
    553 }.union(extract_field_ids(bound_row_filter))
    555 with ThreadPool() as pool:
    556     tables = [
    557         table
...
File c:\Python\Python39\lib\site-packages\pyarrow\_parquet.pyx:1227, in pyarrow._parquet.ParquetReader.open()

File c:\Python\Python39\lib\site-packages\pyarrow\error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
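
The error above is raised because the Parquet reader looks for the `PAR1` magic bytes in the file footer, while ORC files instead begin with the bytes `ORC`. As a rough way to confirm that a data file really is ORC rather than a corrupted Parquet file, the magic bytes can be checked directly. This is only a sketch: `sniff_file_format` is a hypothetical helper, not part of PyIceberg or PyArrow, and it assumes the data file has been copied to a local path.

```python
import os

PARQUET_MAGIC = b"PAR1"  # plain Parquet files begin and end with these 4 bytes
ORC_MAGIC = b"ORC"       # ORC files begin with these 3 bytes


def sniff_file_format(path: str) -> str:
    """Hypothetical helper: guess "parquet", "orc", or "unknown" from magic bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        tail = b""
        if size >= 8:
            # Read the last 4 bytes, where the Parquet reader expects its magic.
            f.seek(-4, os.SEEK_END)
            tail = f.read(4)
    if head == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "parquet"
    if head.startswith(ORC_MAGIC):
        return "orc"
    return "unknown"
```

If this returns `"orc"` for the files referenced by the table, the files are intact and the failure comes from the missing ORC read path rather than from corruption.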

Query engine

None