WIP: async stream of Arrow record batches from Parquet file #258

Draft
wants to merge 2 commits into
base: main

kylebarron
Owner

@kylebarron kylebarron commented Nov 13, 2024

The stream takes just 1.1s to start and another 1.0s to fetch the first record batch, whereas downloading the full file takes more than 60s on my internet connection.

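from time import time

# HTTPStore and read_parquet_async are assumed to be imported already;
# their import paths are not shown in this snippet.
# Top-level `await` needs an async-aware REPL such as IPython or Jupyter.

t0 = time()
url = "https://overturemaps-us-west-2.s3.amazonaws.com/release/2024-03-12-alpha.0/theme=buildings/type=building/part-00217-4dfc75cd-2680-4d52-b5e0-f4cc9f36b267-c000.zstd.parquet"
store = HTTPStore.from_url(url)
# The store URL already points at the file, so the path within the store is empty.
stream = await read_parquet_async("", store=store)
t1 = time()
# Fetch only the first record batch from the stream.
first = await stream.__anext__()
t2 = time()

print(t1 - t0)  # 1.1302871704101562 (time for the stream to start)
print(t2 - t1)  # 1.0420188903808594 (time to fetch the first record batch)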
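
For comparing time-to-first-batch against the time to drain the whole stream, here is a minimal sketch of the same measurement as a reusable helper. It does not use the API from this PR: `fake_batches` is a hypothetical stand-in async generator that simulates network latency with `asyncio.sleep`, and `time_stream` is an illustrative name. The assumption is that any object supporting `async for` could be passed in its place.

```python
import asyncio
from time import perf_counter


async def fake_batches(n_batches, delay):
    """Hypothetical stand-in for a record batch stream (not the PR's API)."""
    for i in range(n_batches):
        await asyncio.sleep(delay)  # simulates one network fetch per batch
        yield {"batch_index": i}


async def time_stream(stream):
    """Return (seconds until first item, seconds until exhausted, item count)."""
    t0 = perf_counter()
    first_latency = None
    count = 0
    async for _ in stream:
        if first_latency is None:
            # Record the latency of the first item only.
            first_latency = perf_counter() - t0
        count += 1
    return first_latency, perf_counter() - t0, count


# asyncio.run works in a plain script; in a notebook, `await time_stream(...)` instead.
first_latency, total, count = asyncio.run(time_stream(fake_batches(5, 0.01)))
print(f"first item after {first_latency:.3f}s, all {count} items after {total:.3f}s")
```

Using `async for` rather than calling `__anext__()` directly also handles the end of the stream, since the loop stops cleanly on `StopAsyncIteration`.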

@kylebarron kylebarron enabled auto-merge (squash) November 13, 2024 22:28
@kylebarron kylebarron marked this pull request as draft November 13, 2024 22:28