
Dictionary unification fails across multiple matches in large files #3

Open
bmschmidt opened this issue Dec 31, 2021 · 3 comments

When reading from a CSV where dictionary types are inferred, successive batches seem to produce dictionaries that can't be unified when later batches contain entries not present in the first batch (or something like that).

I thought this was addressed by tile.remap_all_dicts, but it is not.

Not yet reproduced, but the log trace is below. In this case it was fixable by increasing csv_batch_size to float("inf") or equivalent, which forces everything into a single batch; that won't be possible for larger-than-memory data, though.

DEBUG:quadtiler:Opening overflow on (1, 0, 0)
INFO:quadtiler:Done inserting block 4 of 7
INFO:quadtiler:15 partially filled tiles buffered in memory and 2 flushing overflow directly to disk.
INFO:quadtiler:Inserting block 5 of 7
Traceback (most recent call last):
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/bin/quadfeather", line 8, in <module>
    sys.exit(main())
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 264, in main
    tiler.insert(tab, remaining_tiles)
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 608, in insert
    child_tile.insert(subset, tiles_allowed - tiles_allowed_overflow)
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 612, in insert
    self.overflow_buffer.write_batch(
  File "pyarrow/ipc.pxi", line 408, in pyarrow.lib._CRecordBatchWriter.write_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches.

rsimon commented Jun 13, 2023

Hi @bmschmidt,

I just ran into what seems to be the exact same issue with my own CSV file. It's a 4.4 GB CSV with three columns (x, y, and map_id, a numeric identifier) and ~94 million rows. Any thoughts on the best workaround for now?

Posting my error message below for posterity:

Traceback (most recent call last):
  File "/home/rainer/Software/miniconda3/bin/quadfeather", line 8, in <module>
    sys.exit(main())
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 286, in main
    tiler.insert_files(files = rewritten_files, schema = schema, recoders = recoders)
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 356, in insert_files
    self.insert_table(tab, tile_budget = self.args.max_files)
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
    child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
    child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
    child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
  [Previous line repeated 2 more times]
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 598, in insert_table
    self.overflow_buffer.write_batch(
  File "pyarrow/ipc.pxi", line 503, in pyarrow.lib._CRecordBatchWriter.write_batch
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single non-delta dictionary for a given field across all batches.


rsimon commented Jun 13, 2023

FWIW: the conversion works if I use only half of the dataset.


bmschmidt commented Jun 13, 2023 via email
