Conversation

jecsand838
Contributor

@jecsand838 jecsand838 commented Jan 10, 2025

Which issue does this PR close?

Part of #4886

Rationale for this change

The primary objective of this PR is to enhance the arrow-rs Avro implementation by introducing full support for Avro data types, adding support for Avro aliases, and implementing a public-facing experimental Avro Reader. These enhancements are crucial for several reasons:

1. Enhanced Data Interoperability:
By supporting these additional types, the Avro reader becomes more compatible with a wider range of Avro schemas. This ensures that users can ingest and process diverse datasets without encountering type-related limitations.

What changes are included in this PR?

Avro Codec + RecordDecoder

  1. Support for new types:
    • Lists
    • Fixed
    • Interval
    • Decimal
    • Map
    • Enum
    • UUID
  2. Extended namespace support to all supported types
  3. Added Alias support
  4. Extended support for local timestamps
  5. Added Unit tests
  6. Expanded nullability support
  7. Added module test cases with .avro files

Are there any user-facing changes?

  • N/A

jecsand838 and others added 17 commits December 28, 2024 12:44
1. Namespaces
2. Enums
3. Maps
4. Decimals
…al types.

Signed-off-by: Connor Sanders <connor@elastiflow.com>
…al types.

Signed-off-by: Connor Sanders <connor@elastiflow.com>
…al types.

Signed-off-by: Connor Sanders <connor@elastiflow.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
* Implemented reader decoder for Avro Lists
* Cleaned up reader/record.rs and added comments for readability.

Signed-off-by: Connor Sanders <connor@elastiflow.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
- Fixed
- Interval

Signed-off-by: Connor Sanders <connor@elastiflow.com>
Added Avro codec + decoder support for new types
Signed-off-by: Connor Sanders <connor@elastiflow.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 10, 2025
@tustvold
Contributor

tustvold commented Jan 11, 2025

👋 thank you for working on this, it is very exciting to see arrow-avro getting some love. I think what would probably help is to break this up into smaller pieces that can be delivered separately. Whilst I accept this is more work for you, it makes reviewing the code much more practical, especially given our relatively limited review bandwidth. Many of the bullets in your PR description would warrant a separate PR IMO.

Contributor

@alamb alamb left a comment


Thank you @jecsand838 I triggered CI for this PR

I haven't had a chance to review the code yet, but I did start to look at the test coverage. What would you think about adding tests using the existing avro testing data in https://github.com/apache/arrow-testing/tree/master/data/avro (already a submodule in this repo)

Key tests in my mind would be:

  1. Read the avro testing files and verify the schema and data read (and leave comments for tests that don't pass)
  2. For the writer implement round trip tests: create a RecordBatch (es), write it to an .avro file and then read it back in and ensure the round tripped batches are equal

You might be able to find code / testing that you can reuse in the datafusion copy: https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/avro_to_arrow/arrow_array_reader.rs

I also wonder how/if this code is related to the avro rust reader/decoder in https://github.com/apache/avro-rs?

FYI @Jefffrey

Contributor

@alamb alamb left a comment


After some review of this PR, I think I could likely find some additional time to review / help it along if/when it has end to end tests of reading existing avro files (that would give me confidence that the code being reviewed did the right thing functionally).

To make a specific proposal for splitting up this PR's functionality as suggested by @tustvold -- #6965 (comment), one way to do so would be:

  1. First PR: Reader improvements / tests showing reading existing .avro files
  2. Second PR: Minimal Writer support (with tests showing round tripping for a few basic data types)
  3. Subsequent PRs: Additional PRs to support writing additional data types

The rationale for breaking it up this way would be to have better confidence in the read path and to ensure round tripping also works as needed.

Let me know what you think

And thanks again

@jecsand838
Contributor Author

@alamb @tustvold Thank you both so much for getting to our PR so quickly! We'd be more than happy to break this PR up as advised and add those additional tests. Your recommendation makes a lot of sense.

I also wonder how/if this code is related to the avro rust reader/decoder in https://github.com/apache/avro-rs?

I'm sure there's functional overlap we can look into. We just attempted to extend the patterns @tustvold put in place. Definitely interested in hearing your thoughts on this however.

@alamb
Contributor

alamb commented Jan 14, 2025

@alamb @tustvold Thank you both so much for getting to our PR so quickly! We'd be more than happy to break this PR up as advised and add those additional tests. Your recommendation makes a lot of sense.

I also wonder how/if this code is related to the avro rust reader/decoder in https://github.com/apache/avro-rs?

I'm sure there's functional overlap we can look into. We just attempted to extend the patterns @tustvold put in place. Definitely interested in hearing your thoughts on this however.

Sounds good. Ideally it would be great to avoid duplication, but if the existing avro-rs crate is row oriented, the amount of reusable code might be small, as this decoder will need to be column oriented.

Thanks for the changes -- I just started the CI checks and took a quick look through this PR. I think the only thing left now is to add some tests that read from the existing .avro files in arrow-testing -- then I imagine it will be ready for a more thorough review.

Thank you for working on this.

@svencowart

@alamb, we're improving the reader and providing more Avro tests.

As I was working on this, I noticed the module signatures for arrow's Parquet reader and DataFusion's arrow_array_reader.rs are based on a builder pattern. To simplify adopting these arrow-avro changes into DataFusion, should we model the module signature on what already exists in DataFusion today? Or are you looking for something else? Also, would you prefer the public signature for the reader be included in this PR, or in a follow-up PR that proves out the reader changes? I'm happy to contribute the public module signature.

@alamb alamb mentioned this pull request Apr 28, 2025
@alamb
Contributor

alamb commented Apr 28, 2025

The chance I am going to be able to find time to review this PR in its current form (2000+ lines) is quite low. Maybe another reviewer will be able to

Is there any possibility to break it up into smaller parts for easier review? Specifically, it would be really great to get whatever parts are non-breaking into their own PR first, or cleanups, etc.

Also see a new related PR here:

@jecsand838
Contributor Author

@alamb

We think it's ready for review.

However, we can break this up to make it easier to review. We do have a completed Avro Reader already implemented on our fork, which we were planning to introduce piece by piece.

I'll work with @nathaniel-elastiflow on creating PRs for each data type first. That should break this down nicely.

@jecsand838
Contributor Author

@alamb @tustvold We went ahead and made our first decomposed PR for Map types: #7451, which is ready for review.

@klion26
Member

klion26 commented Apr 29, 2025

@klion26 is there any chance you can help review this PR?

@jecsand838 and @nathaniel-elastiflow do you think this PR is ready for review?

@alamb I'll try to help review this

alamb pushed a commit that referenced this pull request Jun 17, 2025
# Which issue does this PR close?

Part of #4886

Related to #6965

# Rationale for this change
 
Avro supports arrays as a core data type, but previously arrow-avro had
incomplete decoding logic to handle them. As a result, any Avro file
containing array fields would fail to parse correctly within the Arrow
ecosystem. This PR addresses this gap by:

1. Completing the implementation of explicit `Array` -> `List` decoding:
It completes the `Decoder::Array` logic that reads array blocks in Avro
format and constructs an Arrow `ListArray`.

Overall, these changes expand Arrow’s Avro reader capabilities, allowing
users to work with array-encoded data in a standardized Arrow format.

# What changes are included in this PR?

**1. arrow-avro/src/reader/record.rs:**

* Completed the Array decoding path which leverages blockwise reads of
Avro array data.
* Implemented decoder unit tests for Array types.
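
The blockwise layout described above can be sketched in plain Rust. This is a simplified illustration of the Avro encoding, not the actual `arrow-avro` decoder: each array block starts with a zig-zag varint item count, a negative count is followed by the block's byte size, and a zero count terminates the array.

```rust
// Read one Avro zig-zag varint, advancing `pos`.
fn read_varint(buf: &[u8], pos: &mut usize) -> i64 {
    let mut value: u64 = 0;
    let mut shift = 0;
    loop {
        let b = buf[*pos];
        *pos += 1;
        value |= ((b & 0x7f) as u64) << shift;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // Zig-zag decode: 0, 1, 2, 3, ... map back to 0, -1, 1, -2, ...
    ((value >> 1) as i64) ^ -((value & 1) as i64)
}

// Walk array blocks for an array of Avro ints (child decoding is
// inlined here to keep the sketch self-contained).
fn decode_int_array(buf: &[u8], pos: &mut usize) -> Vec<i32> {
    let mut items = Vec::new();
    loop {
        let mut count = read_varint(buf, pos);
        if count == 0 {
            break; // a zero count terminates the array
        }
        if count < 0 {
            // A negative count is followed by the block size in bytes.
            let _block_bytes = read_varint(buf, pos);
            count = -count;
        }
        for _ in 0..count {
            items.push(read_varint(buf, pos) as i32);
        }
    }
    items
}
```

The real decoder dispatches each item to a child decoder and records offsets for the resulting `ListArray`.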

# Are there any user-facing changes?

N/A
alamb pushed a commit that referenced this pull request Jun 28, 2025
# Which issue does this PR close?

Part of #4886

Related to #6965

# Rationale for this change
 
This change expands upon the Avro reader logic by adding full support
for the Fixed and Uuid types (Uuid relies on Fixed). It builds out the
`Fixed` path currently stubbed out.

# What changes are included in this PR?

Adds `Fixed` and `Uuid` support to the arrow-avro crate with changes to
the following:

1. arrow-avro/src/codec.rs

- Adds support for `Uuid` type
- Adds test

2. arrow-avro/src/reader/cursor.rs:

- Adds a `get_fixed` helper method to read the specified bytes into a
buffer.

3. arrow-avro/src/reader/record.rs:

- Introduces the Fixed decoding path, building out the not-yet-implemented
(`nyi`) `Codec::Fixed` arm in the `Decoder`.
- Introduces the Uuid decoding path, building off of Fixed
- Adds tests.
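
A minimal sketch of a `get_fixed`-style cursor helper (the name and signature are assumptions based on the description above, not the crate's actual API):

```rust
// Return exactly `n` bytes from the buffer, advancing the cursor.
// A real implementation would return an error instead of panicking
// when fewer than `n` bytes remain.
fn get_fixed<'a>(buf: &'a [u8], pos: &mut usize, n: usize) -> &'a [u8] {
    let out = &buf[*pos..*pos + n];
    *pos += n;
    out
}
```

Fixed values carry no length prefix in Avro; the size comes entirely from the schema, which is why the decoder can read them with a single sized take.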

# Are there any user-facing changes?

n/a

---------

Co-authored-by: Connor Sanders <connor@elastiflow.com>
alamb pushed a commit that referenced this pull request Jul 1, 2025
# Which issue does this PR close?

- Part of #4886

- Related to #6965

# Rationale for this change

This PR addresses a feature gap by introducing support for the Avro
`decimal` logical type, which is currently unimplemented as indicated by
the `test_decimal_logical_type_not_implemented` test case. The `decimal`
type is crucial for handling precise numerical data common in financial
and scientific applications, making this a necessary addition for
broader Avro compatibility.

# What changes are included in this PR?

This PR introduces the necessary changes to both parse and decode the
Avro `decimal` logical type into the corresponding Arrow `Decimal128` or
`Decimal256` data types.

The main changes are:
1.  **Schema Parsing (`codec.rs`):**
* Implemented the logic within `make_data_type` to correctly parse the
`decimal` logical type from the Avro schema.
* The `Codec` enum's `Decimal` variant now correctly stores the
precision, scale, and optional fixed-size from the schema's attributes.

2.  **Decoding Logic (`record.rs`):**
* Added `Decoder::Decimal128` and `Decoder::Decimal256` variants to
handle decoding of decimal values from both `bytes` and `fixed` Avro
types.
* The implementation correctly handles sign extension for negative
numbers to ensure accurate representation in Arrow's decimal arrays.
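
Sign extension is the subtle part: an Avro decimal payload may be shorter than 16 bytes, so negative values must be widened with the sign byte before the big-endian bytes become an `i128`. A minimal sketch (the helper name is hypothetical):

```rust
// Sign-extend big-endian two's-complement bytes into an i128,
// e.g. before building a Decimal128 value.
fn sign_extend_to_i128(raw: &[u8]) -> i128 {
    // Fill with 0xff when the leading bit is set so negative values
    // stay negative after widening.
    let fill = if raw.first().map_or(false, |b| b & 0x80 != 0) {
        0xff
    } else {
        0x00
    };
    let mut bytes = [fill; 16];
    bytes[16 - raw.len()..].copy_from_slice(raw);
    i128::from_be_bytes(bytes)
}
```

`Decimal256` follows the same pattern widened to 32 bytes.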

# Are these changes tested?

This PR includes comprehensive tests to validate the new functionality:
* The existing `test_decimal_logical_type_not_implemented` test has been
replaced with concrete test cases.
* Added unit tests in `record.rs` (`test_decimal_decoding_fixed256`,
`test_decimal_decoding_fixed128`,
`test_decimal_decoding_bytes_with_nulls`, etc.) to cover various
scenarios, including:
    *   Decoding from Avro `fixed` and `bytes` primitive types.
* Handling different precisions to select between `Decimal128` and
`Decimal256`.
    *   Correctly processing null values within decimal arrays.

# Are there any user-facing changes?

N/A
alamb pushed a commit that referenced this pull request Jul 5, 2025
# Which issue does this PR close?

- Part of #4886

- Related to #6965

# Rationale for this change

The `arrow-avro` crate currently lacks support for the Avro `enum` type,
which is a standard and commonly used type in Avro schemas. This
omission prevents users from reading Avro files containing enums,
limiting the crate's utility.

This change introduces support for decoding Avro enums by mapping them
to the Arrow `DictionaryArray` type. This is a logical and efficient
representation. Implementing this feature brings the `arrow-avro` crate
closer to full Avro specification compliance and makes it more robust
for real-world use cases.

# What changes are included in this PR?

This PR introduces comprehensive support for Avro enum decoding along
with a minor Avro decimal decoding fix. The key changes are:

1.  **Schema Parsing (`codec.rs`):**
* A new `Codec::Enum(Arc<[String]>)` variant was added to represent a
parsed enum and its associated symbols.
* The `make_data_type` function now parses `ComplexType::Enum` schemas.
It also stores the original symbols as a JSON string in the `Field`'s
metadata under the key `"avro.enum.symbols"` to ensure schema fidelity
and enable lossless round-trip conversions.
* The `Codec::data_type` method was updated to map the internal
`Codec::Enum` to the corresponding Arrow
`DataType::Dictionary(Box<Int32>, Box<Utf8>)`.

2.  **Decoding Logic (`reader/record.rs`):**
* A new `Decoder::Enum(Vec<i32>, Arc<[String]>)` variant was added to
manage the state of decoding enum values.
* The `Decoder` was enhanced to create, decode, and flush `Enum` types:
        *   `try_new` creates the decoder.
        *   `decode` reads the Avro `int` index from the byte buffer.
* `flush` constructs the final `DictionaryArray<Int32Type>` using the
collected indices as keys and the stored symbols as the dictionary
values.
        *   `append_null` was extended to handle nullable enums.

3.  **Minor Decimal Type Decoding Fix (`codec.rs`)**
* A minor decimal decoding fix was implemented in `make_data_type` due
to the `(Some("decimal"), c @ Codec::Fixed(sz))` branch of `match
(t.attributes.logical_type, &mut field.codec)` not being reachable. This
issue was caught by the new decimal integration tests in
`arrow-avro/src/reader/mod.rs`.
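
The decode-and-flush flow for enums can be sketched with plain std types (`EnumDecoder` here is an illustration, not the crate's actual `Decoder::Enum` variant):

```rust
use std::sync::Arc;

// Dictionary-style enum state: Avro int indices become the keys,
// the schema's symbols become the dictionary values.
struct EnumDecoder {
    symbols: Arc<[String]>,
    keys: Vec<i32>,
}

impl EnumDecoder {
    fn new(symbols: Arc<[String]>) -> Self {
        Self { symbols, keys: Vec::new() }
    }

    // Avro encodes an enum value as an int index into the symbol list.
    fn decode(&mut self, avro_index: i32) {
        self.keys.push(avro_index);
    }

    // Resolving a key back to its symbol mirrors a DictionaryArray lookup.
    fn symbol(&self, row: usize) -> &str {
        &self.symbols[self.keys[row] as usize]
    }
}
```

In the actual implementation, `flush` hands the collected keys and symbols to a `DictionaryArray<Int32Type>` rather than resolving them eagerly.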

# Are these changes tested?

*   Yes, test coverage was provided for the new `Enum` type: 
* New unit tests were added to `record.rs` to specifically validate both
non-nullable and nullable enum decoding logic.
* The existing integration test suite in `arrow-avro/src/reader/mod.rs`
was used to validate the end-to-end functionality with a new
`avro/simple_enum.avro` test case, ensuring compatibility with the
overall reader infrastructure.
 *  New tests were also included for the `Decimal` and `Fixed` types:
* This integration test suite was also extended to include tests for
`avro/simple_fixed.avro`, `avro/fixed_length_decimal.avro`,
`avro/fixed_length_decimal_legacy.avro`, `avro/int32_decimal.avro`,
`avro/int64_decimal.avro`

# Are there any user-facing changes?

N/A
alamb pushed a commit that referenced this pull request Jul 11, 2025
# Which issue does this PR close?

- Part of #4886

- Related to #6965

# Rationale for this change

The `arrow-avro` crate currently lacks support for reading `bzip2` and
`xz` compression. This prevents users from reading Avro files compressed
using these compression types, limiting the crate's utility.

# What changes are included in this PR?

* Added `bzip2` compression and feature flag.
* Added `xz` compression and feature flag.
* Updated header decoder to use new compression types.
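
A hedged sketch of what feature-gated codec dispatch can look like (the names are assumptions, not the crate's real API; the compression arms compile only when the matching Cargo feature is enabled):

```rust
#[allow(dead_code)]
enum CompressionCodec {
    Null,
    Bzip2,
    Xz,
}

// Decompress one Avro block. With a feature disabled, its arm is
// compiled out and the request falls through to an error.
fn decompress(codec: &CompressionCodec, block: Vec<u8>) -> Result<Vec<u8>, String> {
    match codec {
        CompressionCodec::Null => Ok(block),
        #[cfg(feature = "bzip2")]
        CompressionCodec::Bzip2 => bzip2_decompress(&block),
        #[cfg(feature = "xz")]
        CompressionCodec::Xz => xz_decompress(&block),
        #[allow(unreachable_patterns)]
        _ => Err("codec support not compiled in; enable the matching feature".to_string()),
    }
}
```

Keeping unsupported codecs as a runtime error, rather than a compile error, lets one reader binary handle whichever files its enabled features allow.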

# Are these changes tested?

Yes, I added new test cases covering these changes to the
`test_alltypes` integration test.

# Are there any user-facing changes?

N/A
alamb pushed a commit that referenced this pull request Jul 14, 2025
…oding (#7889)

# Which issue does this PR close?

- Part of #4886

- Related to #6965

# Rationale for this change

The `arrow-avro` crate currently lacks support for the Avro `duration`
type, which is a standard and commonly used type in Avro schemas. This
omission prevents users from reading Avro files containing duration
types, limiting the crate's utility.

This change introduces support for decoding Avro duration types by
mapping them to the Arrow `Interval` type. This is a logical and
efficient representation. Implementing this feature brings the
`arrow-avro` crate closer to full Avro specification compliance and
makes it more robust for real-world use cases.

# What changes are included in this PR?

This PR contains:

* arrow-avro decoder support for Duration types.
* Minor fixes to UUID decoding: UUID types now map to the `utf8` type to better
align with the [Avro
specification](https://avro.apache.org/docs/1.11.1/specification/#uuid)
* New integration test using a temporary `duration_uuid.avro` file created
using this Python script:
https://gist.github.com/jecsand838/cbdaaf581af78f357778bf87d2f3cf15
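
Both additions decode fixed-width payloads; a simplified sketch assuming the Avro spec's 12-byte `duration` layout (three little-endian u32s: months, days, milliseconds) and a UUID rendered to its canonical hyphenated `utf8` form:

```rust
// Split a 12-byte Avro duration into (months, days, milliseconds).
fn decode_duration(raw: &[u8; 12]) -> (u32, u32, u32) {
    let months = u32::from_le_bytes(raw[0..4].try_into().unwrap());
    let days = u32::from_le_bytes(raw[4..8].try_into().unwrap());
    let millis = u32::from_le_bytes(raw[8..12].try_into().unwrap());
    (months, days, millis)
}

// Render 16 fixed UUID bytes as the canonical 8-4-4-4-12 string.
fn uuid_to_string(raw: &[u8; 16]) -> String {
    let hex: String = raw.iter().map(|b| format!("{b:02x}")).collect();
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}
```

The month/day/millisecond triple maps naturally onto Arrow's month-day-nanosecond interval representation.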

# Are these changes tested?

Yes, this PR includes integration and unit tests covering these
modifications.

# Are there any user-facing changes?

N/A

# Follow-Up PRs

1. PR to update `test_duration_uuid` once
apache/arrow-testing#108 is merged in.

---------

Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
alamb pushed a commit that referenced this pull request Jul 21, 2025
# Which issue does this PR close?

- Part of #4886
- Related to #6965

# Rationale for this change

This change introduces support for Avro files generated by systems like
Impala, which have a specific convention for representing nullable
fields. In Avro, nullability is typically represented by a union of a
type and `null`. This PR updates the Avro reader to correctly interpret
these schemas, ensuring proper handling of nullable data and improving
interoperability with Impala-generated data.

# What changes are included in this PR?

This pull request introduces several changes to support Impala-style
nullability in the Avro reader:
- The Avro schema parser has been updated to recognize unions where `null` is
the second type (e.g., `['type', 'null']`) as a nullable field.
- Logic has been added to handle this nullability convention during Avro
decoding.
- New tests are included to verify that Avro files using this
nullability format are read correctly while ensuring that strict mode
properly identifies them.
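
The branch-ordering rule can be sketched on plain type-name strings (an illustration only; the real parser works on parsed schema nodes, not strings):

```rust
#[derive(Debug, PartialEq)]
enum Nullability {
    NullFirst,  // standard: ["null", T]
    NullSecond, // Impala-style: [T, "null"]
}

// Classify a two-branch union, remembering which branch carries values.
fn union_nullability(branches: &[&str]) -> Option<Nullability> {
    match branches {
        ["null", _] => Some(Nullability::NullFirst),
        [_, "null"] => Some(Nullability::NullSecond),
        _ => None,
    }
}
```

Tracking which branch is `null` matters because Avro encodes union values with the branch index, so the decoder must know which index means a value is present.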

# Are these changes tested?

Yes, I added new test cases covering these changes to the tests named:
`test_nonnullable_impala`, `test_nonnullable_impala_strict`,
`test_nullable_impala` and `test_nullable_impala_strict`.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Connor Sanders <connor@elastiflow.com>
alamb pushed a commit that referenced this pull request Jul 22, 2025
# Which issue does this PR close?

Part of #4886

Completes the breaking down/porting of the changes in
#6965. That PR will be closed
upon merge of this one.

# Rationale for this change

This change brings over the remaining integration tests present in the
original PR, which validate the reader logic against the files from
`testing/data/avro`. PRs containing this logic have already been merged
(but are not yet released) which these tests now validate.

# What changes are included in this PR?

The following files are now read in:

- alltypes_dictionary.avro
- alltypes_nulls_plain.avro
- binary.avro
- dict-page-offset-zero.avro
- avro/list_columns.avro
- nested_lists.snappy.avro
- single_nan.avro
- datapage_v2.snappy.avro
- nested_records.avro
- repeated_no_annotation.avro

# Are these changes tested?

This PR consists of integration tests validating code merged recently
into this crate. No changes in functionality are included.

# Are there any user-facing changes?

N/A
@jecsand838
Contributor Author

Closing this PR now since all functionality has been merged in from the smaller decomposed PRs.

@jecsand838 jecsand838 closed this Jul 26, 2025
@alamb
Contributor

alamb commented Jul 26, 2025

Thanks again for your help and attention to this matter @jecsand838 -- I realize this process isn't as fast as others, but I think it is progressing nicely
