Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Nov 11, 2021

Which issue does this PR close?

Resolves #940

Rationale for this change

It appears the arrow-rs slice logic for structs introduced in #389 by @nevi-me effectively slices the child data when a struct array is sliced.

The validation code from validate.cc (and perhaps the C++ code) assumes that the children are not sliced, so when I ported that logic over it is not correct for sliced structs. You can see by the comment I was somewhat confused about the need for offset even when it was originally introduced

What changes are included in this PR?

  1. Update the struct array validation logic to follow the Rust semantics where offsets are applied to both parent and child
  2. Test case from @bjchambers in https://github.com/bjchambers/arrow-rs/blob/reproduce-validation-error/arrow/src/array/data.rs#L1648

In pictures

In pictures, here is what the testcase looks like:


     0 ┌─────┐                     0 ┌─────┐  ┌─────┐        0 ┌─────┐  ┌─────┐
       │     │                       │     │  │     │          │     │  │     │
       ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
       │     │                       │     │  │     │          │     │  │     │
       ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
       │     │                       │     │  │     │          │     │  │     │
       ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
       │     │                       │     │  │     │          │     │  │     │
       ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
    5  │     │                    5  │     │  │     │       5  │     │  │     │
       └─────┘                       └─────┘  └─────┘          └─────┘  └─────┘

        Valid                         Valid    i32[]            Valid    bool[]


       StructArray                     Child Array #1           Child Array #2
                                        (Int32Array)            (BooleanArray)

In rust, when we do slice(1,3) the offset is applied to both the ArrayData and its children:



       0 ┌─────┐                     0 ┌─────┐  ┌─────┐        0 ┌─────┐  ┌─────┐
         │     │                       │     │  │     │          │     │  │     │
─ ─ ─ ─ ─├─────┤─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─├─────┤─ ┼─────┼ ─ ─ ─ ─ ─├─────┤─ ┼─────┼ ─ ─ ─ ─ ─ ─ ─ ─
         │     │                       │     │  │     │          │     │  │     │
         ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
         │     │                       │     │  │     │          │     │  │     │
         ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
         │     │                       │     │  │     │          │     │  │     │
─ ─ ─ ─ ─├─────┤─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─├─────┤─ ┼─────┼ ─ ─ ─ ─ ─├─────┤─ ┼─────┼ ─ ─ ─ ─ ─ ─ ─ ─
      5  │     │                    5  │     │  │     │       5  │     │  │     │
         └─────┘                       └─────┘  └─────┘          └─────┘  └─────┘

          Valid                         Valid    i32[]            Valid    bool[]


         StructArray                     Child Array #1           Child Array #2
                                          (Int32Array)            (BooleanArray)

However, in the C++ validation logic, the assumption is that the children have no offsets (and the offset of the parent is applied):


       0 ┌─────┐                     0 ┌─────┐  ┌─────┐        0 ┌─────┐  ┌─────┐
         │     │                       │     │  │     │          │     │  │     │
─ ─ ─ ─ ─├─────┤─ ─ ─ ─ ─ ─            ├─────┤  ├─────┤          ├─────┤  ├─────┤
         │     │                       │     │  │     │          │     │  │     │
         ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
         │     │                       │     │  │     │          │     │  │     │
         ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
         │     │                       │     │  │     │          │     │  │     │
─ ─ ─ ─ ─├─────┤─ ─ ─ ─ ─ ─            ├─────┤  ├─────┤          ├─────┤  ├─────┤
      5  │     │                    5  │     │  │     │       5  │     │  │     │
         └─────┘                       └─────┘  └─────┘          └─────┘  └─────┘

          Valid                         Valid    i32[]            Valid    bool[]


         StructArray                     Child Array #1           Child Array #2
                                          (Int32Array)            (BooleanArray)

Are there any user-facing changes?

Sliced StructArrays can be created without validation errors (or using unsafe)

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 11, 2021
@alamb alamb requested a review from nevi-me November 11, 2021 11:43
@bjchambers
Copy link
Contributor

Looks good. The explanation (and pictures) were helpful for my understanding -- thanks for explaining the problem so clearly!

@alamb
Copy link
Contributor Author

alamb commented Nov 11, 2021

FYI @jhorstmann

@alamb alamb merged commit 078bc10 into apache:master Nov 12, 2021
@alamb alamb deleted the alamb/validation_error branch November 12, 2021 11:16
alamb added a commit that referenced this pull request Nov 12, 2021
* reproduce validation error

* Fix validation bug

Co-authored-by: Ben Chambers <bjchambers@gmail.com>
alamb added a commit that referenced this pull request Nov 12, 2021
* reproduce validation error

* Fix validation bug

Co-authored-by: Ben Chambers <bjchambers@gmail.com>

Co-authored-by: Ben Chambers <bjchambers@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Validation error when manually copying array data for a slice array

2 participants