Dictionary unification fails across multiple batches in large files #3
Hi @bmschmidt, I just ran into what seems to be the exact same issue with my own CSV file. It's a 4.4 GB CSV with three columns (x, y, map_id -> a numeric identifier) and ~94 million rows. Any thoughts on the best workaround for now? Posting my error message below for posterity:

Traceback (most recent call last):
File "/home/rainer/Software/miniconda3/bin/quadfeather", line 8, in <module>
sys.exit(main())
File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 286, in main
tiler.insert_files(files = rewritten_files, schema = schema, recoders = recoders)
File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 356, in insert_files
self.insert_table(tab, tile_budget = self.args.max_files)
File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
[Previous line repeated 2 more times]
File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 598, in insert_table
self.overflow_buffer.write_batch(
File "pyarrow/ipc.pxi", line 503, in pyarrow.lib._CRecordBatchWriter.write_batch
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single non-delta dictionary for a given field across all batches.

FWIW: the conversion works if I use only half of the dataset.
Root cause is tricky to assess here but I see two options.
1. Clone this repo and modify this line:
https://github.com/bmschmidt/quadfeather/blob/60ba3428784faf36a51bdf40129f2a66980452b2/quadfeather/tiler.py#L142
    schema : pa.Schema,
    csv_block_size : int = 1024*1024*128):
That controls how much of the CSV is read in at once; shifting it to something very large should slurp in the whole dataset in a single pass. At 4.4 GB this might be playing it close depending on your machine's memory (see the first sketch below).
2. Convert from CSV to Parquet before ingesting. PyArrow, DuckDB, and Polars all allow incremental building of Parquet files larger than memory. The core issue here is that CSV is a weakly typed format and apparently the parser is guessing wrong on some field; given that your types should be float, float, int, I don't know why. But in general, ingesting Parquet is more predictable.
If I were doing it myself I'd use DuckDB to write the Parquet, being sure to cast x and y to single precision and to force the ids to strings (see the second sketch below).
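For option 1, a minimal sketch of the one-line edit. The exact value is an assumption; anything comfortably larger than the file should do, provided the machine has enough RAM to hold the parsed table:

```python
# quadfeather/tiler.py, at the signature linked above -- sketch of the change only.
# Raising the default well past the size of the CSV should make the whole file
# parse in one pass, so each dictionary column is built exactly once.
        schema : pa.Schema,
        csv_block_size : int = 6 * 1024 * 1024 * 1024):  # ~6 GiB, was 1024*1024*128
```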
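For option 2, a rough sketch of the DuckDB route (input and output file names are placeholders; the column names come from your description). DuckDB streams the CSV and writes the Parquet file without holding everything in memory:

```python
import duckdb

con = duckdb.connect()
# Cast x/y to single precision (FLOAT is 4 bytes in DuckDB) and force map_id to a
# string, so no types have to be re-inferred when the Parquet file is ingested later.
con.execute("""
    COPY (
        SELECT
            CAST(x AS FLOAT)        AS x,
            CAST(y AS FLOAT)        AS y,
            CAST(map_id AS VARCHAR) AS map_id
        FROM read_csv_auto('points.csv')
    ) TO 'points.parquet' (FORMAT PARQUET)
""")
```

Then point quadfeather at points.parquet instead of the CSV.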
Reading from a CSV where dictionary types are inferred, multiple batches seem to produce dictionaries that can't be unified if later batches contain entries that were not present in the first batch (or something like that). I thought this was addressed by tile.remap_all_dicts, but it is not. Not yet reproduced, but see the log trace above. In this case it is fixable by increasing csv_batch_size to float("inf") or equivalent; that won't be possible for larger-than-memory data, though.
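For reference, the failure mode can be illustrated outside quadfeather with a few lines of pyarrow (the column name is made up): the Arrow IPC file format accepts only one non-delta dictionary per field, so writing a second batch whose dictionary differs from the first is rejected.

```python
import pyarrow as pa

# Two batches whose dictionary-encoded column ends up with different dictionaries.
schema = pa.schema([("map_id", pa.dictionary(pa.int32(), pa.string()))])
batch1 = pa.record_batch([pa.array(["a", "b"]).dictionary_encode()], schema=schema)
batch2 = pa.record_batch([pa.array(["c", "d"]).dictionary_encode()], schema=schema)

with pa.OSFile("repro.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        writer.write_batch(batch1)
        writer.write_batch(batch2)  # expected to raise the ArrowInvalid shown above
```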