Allow ParquetReader to read LARGE_STRING, LARGE_BINARY, LARGE_LIST types from Arrow #460

Merged: 6 commits merged on Mar 5, 2025

@arhamchopra (Collaborator) commented Feb 28, 2025

This PR adds support for reading the LARGE_STRING, LARGE_BINARY, and LARGE_LIST Arrow types in the ParquetReader.
Resolves #398

With this change, the ParquetReader works more seamlessly with Arrow large types, which polars in particular produces:

import polars as pl
from datetime import datetime, timedelta
# Write polars DF to parquet
df = pl.DataFrame({
    "dt": [datetime.now() + timedelta(seconds=i) for i in range(10)],
    "str": ["a" for i in range(10)],                         # polars uses large_string for strings
    "list_of_str": [["a"] for i in range(10)],               # polars uses large_list for lists
    "binary": [b"a" for i in range(10)],                     # polars uses large_binary for binary data
    "list_of_binary": [[b"a"] for i in range(10)],
})

# df.to_arrow().schema
# dt: timestamp[us]
# str: large_string
# list_of_str: large_list<item: large_string>
#   child 0, item: large_string
# binary: large_binary
# list_of_binary: large_list<item: large_binary>
#   child 0, item: large_binary

import csp
from csp.adapters.parquet import ParquetReader
class MyStruct(csp.Struct):
    dt: datetime
    str: str
    list_of_str: csp.typing.Numpy1DArray[str]
    binary: str
    list_of_binary: csp.typing.Numpy1DArray[str]

@csp.graph
def my_graph() -> csp.ts[MyStruct]:
    reader = ParquetReader([df.to_arrow()], time_column="dt", read_from_memory_tables=True)
    return reader.subscribe_all(MyStruct, MyStruct.default_field_map())
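
For context on what "large" means here: in the Arrow columnar format, the large variants of string, binary, and list differ from the regular ones by using 64-bit instead of 32-bit offsets into the values buffer. The following is a minimal pure-Python sketch of that layout, not code from this PR; `encode_var_binary` and `decode_var_binary` are hypothetical names used only for illustration and are not part of csp or pyarrow.

```python
import struct


def encode_var_binary(values, large=False):
    """Build Arrow-style (offsets, data) buffers for a list of bytes values.

    Illustration only: regular binary/string uses int32 offsets,
    the large variants use int64 offsets.
    """
    code = "q" if large else "i"  # little-endian int64 vs int32
    max_offset = (2**63 - 1) if large else (2**31 - 1)

    # Offsets has one more entry than there are values; value i spans
    # data[offsets[i]:offsets[i + 1]].
    offsets = [0]
    for value in values:
        offsets.append(offsets[-1] + len(value))
    if offsets[-1] > max_offset:
        raise OverflowError("data too large for this offset width")

    offsets_buf = struct.pack(f"<{len(offsets)}{code}", *offsets)
    data_buf = b"".join(values)
    return offsets_buf, data_buf


def decode_var_binary(offsets_buf, data_buf, large=False):
    """Recover the list of bytes values from (offsets, data) buffers."""
    code = "q" if large else "i"
    count = len(offsets_buf) // struct.calcsize("<" + code)
    offsets = struct.unpack(f"<{count}{code}", offsets_buf)
    return [data_buf[start:end] for start, end in zip(offsets, offsets[1:])]


values = [b"a", b"bc", b""]
small_offsets, small_data = encode_var_binary(values, large=False)
large_offsets, large_data = encode_var_binary(values, large=True)
```

In this sketch the data buffer is identical for both variants and only the offsets buffer changes width, so a reader that already handles the regular types mainly needs to read offsets at the wider size.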

@timkpaine added the "type: enhancement" and "adapter: parquet" labels Feb 28, 2025
@arhamchopra force-pushed the ac/parquet_large_string branch from 700b337 to 645ac2f March 2, 2025 21:49
@arhamchopra force-pushed the ac/parquet_large_string branch 2 times, most recently from 944f867 to e8b3379 March 3, 2025 21:52
@arhamchopra marked this pull request as ready for review March 3, 2025 21:53
@arhamchopra requested a review from robambalu March 4, 2025 19:19
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
@arhamchopra force-pushed the ac/parquet_large_string branch from e8b3379 to 42264b3 March 5, 2025 00:14
@arhamchopra requested a review from AdamGlustein March 5, 2025 00:14

@AdamGlustein (Collaborator) left a comment

Approving, but I would appreciate it if @timkpaine or @robambalu could take a final look as well, as they know the Arrow API better than I do.

@arhamchopra changed the title from "Allow ParquetReader to read LARGE_STRING types from Arrow" to "Allow ParquetReader to read LARGE_STRING, LARGE_BINARY, LARGE_LIST types from Arrow" Mar 5, 2025
@arhamchopra merged commit aafe7f9 into main Mar 5, 2025
27 checks passed
@arhamchopra deleted the ac/parquet_large_string branch March 5, 2025 16:16
Labels
adapter: parquet (Issues and PRs related to our Apache Parquet/Arrow adapter)
type: enhancement (Issues and PRs related to improvements to existing features)

Linked issue
Support "large" arrow types in the parquet reader

4 participants