
Dictionary IDs Arrow IPC #1206

Open
@tustvold


Which part is this question about

The Field data structure contains a dict_id member that stores an i64. The intention appears to be that different dictionaries have different IDs; however, this currently appears to be respected only by the IPC format and isn't widely utilised by arrow-rs.

Describe your question

Most of arrow-rs is completely agnostic to dict_id, with compute kernels completely ignoring it, even those that recompute dictionaries such as concat.

The only parts of the stack that appear to use dict_id are the IPC interfaces, which will error if they encounter the same dict_id multiple times. I think this inconsistency is a tad confusing, and that we should do one of the following:

  • Keep the current agnosticism within arrow-rs and assign IDs in the writers (potentially using Arc::ptr_eq on the values array)
  • Make arrow-rs respect dict_ids

Of these, the first would definitely be simpler to implement, but I'm not familiar enough with the purpose of dict_id to be certain there isn't some use case this would preclude?
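
To make the first option concrete, here is a minimal sketch of how a writer might assign IDs by pointer identity. It is only an illustration under stated assumptions: `Values` is a hypothetical stand-in for a dictionary's values array, and `DictIdAssigner` and `id_for` are invented names, not part of arrow-rs. The only real API it relies on is `std::sync::Arc::ptr_eq`.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a dictionary's values array (not an arrow-rs type).
type Values = Arc<Vec<String>>;

/// Hypothetical writer-side helper: hands out one id per distinct values allocation.
struct DictIdAssigner {
    /// Holding a clone of each Arc keeps the allocation alive, so a pointer
    /// can never be reused by a different dictionary while it is tracked.
    seen: Vec<(Values, i64)>,
    next_id: i64,
}

impl DictIdAssigner {
    fn new() -> Self {
        Self { seen: Vec::new(), next_id: 0 }
    }

    /// Returns the id for `values` and whether this is the first time it was seen.
    fn id_for(&mut self, values: &Values) -> (i64, bool) {
        for (known, id) in &self.seen {
            // Pointer comparison only: contents are never inspected.
            if Arc::ptr_eq(known, values) {
                return (*id, false);
            }
        }
        let id = self.next_id;
        self.next_id += 1;
        self.seen.push((Arc::clone(values), id));
        (id, true)
    }
}

fn main() {
    let mut assigner = DictIdAssigner::new();
    let a: Values = Arc::new(vec!["x".to_string(), "y".to_string()]);
    let a_shared = Arc::clone(&a); // same allocation as `a`
    let b: Values = Arc::new(vec!["x".to_string(), "y".to_string()]); // equal contents, new allocation

    println!("{:?}", assigner.id_for(&a));
    println!("{:?}", assigner.id_for(&a_shared));
    println!("{:?}", assigner.id_for(&b));
}
```

One consequence of this design is that two dictionaries with identical contents but separate allocations receive different IDs. That is safe but may write the same dictionary more than once, and the linear scan is only reasonable for a small number of dictionaries.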

Additional context

As Field is part of the Schema, RecordBatches with different dict_ids will appear to have different schemas. This may have downstream implications for things like DataFusion, which has strong assumptions about schema consistency within a plan.

This cropped up in apache/datafusion#1596 as it is using the arrow IPC format to spill buffers to disk.
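
The schema-equality concern can be illustrated with a small sketch. `SketchField` and `SketchSchema` are hypothetical, heavily simplified stand-ins and not the arrow-rs definitions; the sketch assumes only that dict_id takes part in field equality, as described above.

```rust
/// Hypothetical, simplified stand-in for a schema field (not the arrow-rs Field).
#[derive(Debug, Clone, PartialEq)]
struct SketchField {
    name: String,
    dict_id: i64,
}

/// Hypothetical stand-in for a schema: equality is field-by-field.
#[derive(Debug, Clone, PartialEq)]
struct SketchSchema {
    fields: Vec<SketchField>,
}

/// Builds a one-column schema whose only difference between calls is the dict_id.
fn schema_with_dict_id(dict_id: i64) -> SketchSchema {
    SketchSchema {
        fields: vec![SketchField { name: "category".to_string(), dict_id }],
    }
}

fn main() {
    let first = schema_with_dict_id(0);
    let second = schema_with_dict_id(1);
    // Same column name, but the differing dict_id makes the schemas compare unequal.
    println!("schemas equal: {}", first == second);
}
```

Any consumer that checks schemas for equality would treat these two batches as incompatible, even though their logical column types match.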


Labels: question
