Avro support - Improve existing read capabilities #698

@martin-traverse

Describe the enhancement requested

Raising this ticket as a second step on Avro support, following on from #615. On this ticket I'd like to cover:

  1. Round trip test cases for all supported data types (schema and data)
  2. Fix nullability handling - Avro union of [ null, type ] should be handled as a single nullable field / vector, not create an Arrow union
  3. Expose an API for creating an Arrow schema directly from an Avro schema with the same type mapping as the existing consumers
  4. Expose an API to allow a VSR to be recycled when reading data (the VSR should be resized to accommodate an Avro block)

Regarding point 2, I'm thinking of adding a flag to the AvroToArrowConfig class. The flag can default to false to preserve the current behaviour.
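To illustrate the intended behaviour for point 2, here is a minimal, library-free sketch of the decision the flag would control. It is not the adapter's actual code: the class name, the `handleNullable` parameter and the use of plain type-name strings in place of Avro schema objects are all assumptions made for illustration. It also accepts the null branch in either position, which goes slightly beyond the [ null, type ] case described above.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: Avro union branches are modelled as plain type-name
// strings so the example runs without the Avro or Arrow libraries.
final class NullableUnionSketch {

    // Returns the non-null branch if the union is exactly { null, T },
    // in either order; otherwise returns empty.
    static Optional<String> nullableBranch(List<String> unionTypes) {
        if (unionTypes.size() != 2) {
            return Optional.empty();
        }
        String first = unionTypes.get(0);
        String second = unionTypes.get(1);
        if ("null".equals(first) && !"null".equals(second)) {
            return Optional.of(second);
        }
        if ("null".equals(second) && !"null".equals(first)) {
            return Optional.of(first);
        }
        return Optional.empty();
    }

    // handleNullable stands in for the proposed config flag (assumed name).
    // With the flag off, every union is reported as a union, which mirrors
    // the current behaviour described in this ticket.
    static String describe(List<String> unionTypes, boolean handleNullable) {
        Optional<String> branch = nullableBranch(unionTypes);
        if (handleNullable && branch.isPresent()) {
            return "nullable " + branch.get();
        }
        return "union" + unionTypes;
    }

    public static void main(String[] args) {
        System.out.println(describe(List.of("null", "string"), true));
        System.out.println(describe(List.of("null", "string"), false));
        System.out.println(describe(List.of("null", "int", "string"), true));
    }
}
```

Unions with more than two branches, or with no null branch, would still map to an Arrow union regardless of the flag.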

I have started work on this ticket but will need to wait for #638 to merge before raising a draft PR.

I think two more PRs are needed in this series. The first would provide a high-level API to read / write whole files; it would use the producers / consumers internally, understand Avro's block structure and map one full Arrow VSR to one Avro block. The last PR would add some extra features, including compression and dictionary encoding / enums.