Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interoperability between arrow-rs and nanoarrow #5052

Open
evgenyx00 opened this issue Nov 7, 2023 · 19 comments · May be fixed by #6426
Open

Interoperability between arrow-rs and nanoarrow #5052

evgenyx00 opened this issue Nov 7, 2023 · 19 comments · May be fixed by #6426
Assignees
Labels
question Further information is requested

Comments

@evgenyx00
Copy link

evgenyx00 commented Nov 7, 2023

Which part is this question about
Deserialization from arrow-rs into nanoarrow

Describe your question
I’ve encountered a problem while serializing a basic Arrow object using StreamWriter with a single RecordBatch, and deserialize the object using nanoarrow, it fails while deserializing RecordBatch, due to header alignment verification in flatcc https://github.com/apache/arrow-nanoarrow/blob/d104e9065101401c63e931acdc7c10f114c47eaf/dist/flatcc.c#L2453

The alignment failure occurs in the calculation of the base offset and the offset of the union value relative to the base.
I'm not fully sure, the problem is arrow-rs or flatbuffers.

Additional context
Tested arrow-rs versions 5.0.0 and 47.0.0, so it's not a degradation or never worked.

Steps to reproduce:

  1. Create arrow object and save only RecordBatch bytes

  2. Test using nanoarrow or example code https://github.com/apache/arrow-nanoarrow/blob/d104e9065101401c63e931acdc7c10f114c47eaf/examples/cmake-ipc/src/app.c

  3. Reproduced on Debian 11 x86 and MacOS M1

  4. Code snippet

`

fn get_arrow_bytes() -> Vec<u8> {


    let mut buf: Vec<u8> = Vec::new();

    {

        let schema = Schema::new(vec![
            Field::new("AAAAAAAA", DataType::Utf8, true)
        ]);

        let mut arrow_writer = writer::StreamWriter::try_new(&mut buf, &schema).unwrap();

        let id_array = StringArray::from(vec!["BBBBBBBB".to_string()]);

        let batch = RecordBatch::try_new(
            Arc::new(schema),
            vec![Arc::new(id_array)]
        ).unwrap();

        arrow_writer.write(&batch).unwrap();
                
        arrow_writer.finish().unwrap();
    }

    buf
}

`

@evgenyx00 evgenyx00 added the question Further information is requested label Nov 7, 2023
@tustvold
Copy link
Contributor

tustvold commented Nov 7, 2023

My memory is admittedly a little hazy, but I definitely remember that flatbuffers do not mandate any alignment internally. I am therefore not sure why flatcc would be including this in its verification process... Is nanoarrow using it in some especially pedantic mode or something?

Edit: reviewing the linked example code it does not appear to be doing anything to guarantee the alignment of the buffer the data is being read into - https://github.com/apache/arrow-nanoarrow/blob/d104e9065101401c63e931acdc7c10f114c47eaf/examples/cmake-ipc/src/app.c#L26. Does this work with data written by other systems? If so could you perhaps provide an example of an IPC file that works and one containing the same data that doesn't?

@evgenyx00
Copy link
Author

evgenyx00 commented Nov 7, 2023

recordbatch.tgz
Attaching tgz, with two binary files recordbatch-good.bin and recordbatch-error.bin, working and non working

As well enclosing slices, used to generate binaries with RecordBatch(the working one was non rust generated )

  1. Working
    let buf = [ 0xff, 0xff, 0xff, 0xff, 0x98, 0x00, 0x00, 0x00, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x0c, 0x00, 0x16, 0x00, 0x0e, 0x00, 0x15, 0x00, 0x10, 0x00, 0x04, 0x00, 0x0c, 0x00, 0x00, 0x00, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x04, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x03, 0x0a, 0x00, 0x18, 0x00, 0x0c, 0x00, 0x08, 0x00, 0x04, 0x00, 0x0a, 0x00, 0x00, 0x00, 0x14, 0x00, 0x00, 0x00, 0x48, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00, ];

  2. Non working buf

let buf = [ 0xff, 0xff, 0xff, 0xff, 0xb8, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x0c, 0x00, 0x1a, 0x00, 0x18, 0x00, 0x17, 0x00, 0x04, 0x00, 0x08, 0x00, 0x0c, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03, 0x04, 0x00, 0x0a, 0x00, 0x14, 0x00, 0x0c, 0x00, 0x08, 0x00, 0x04, 0x00, 0x0a, 0x00, 0x00, 0x00, 0x24, 0x00, 0x00, 0x00, 0x0c, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00, ];

@evgenyx00
Copy link
Author

Not sure if it helps, in the failed condition the misalignment is 4 bytes
!((k + offset_size) & ((offset_size - 1) | (align - 1u)));
Where:
offset_size = 4
align = 8
k = 72 base(60) + offset(12)

@tustvold
Copy link
Contributor

tustvold commented Nov 7, 2023

The provided non-working buf doesn't even read with arrow-rs, what code did you use to produce it?

Edit: In fact neither files are valid IPC Streams AFAICT...

@evgenyx00
Copy link
Author

Apologies for not being clear, the provided non-working/working samples contains only a RecordBatch, without preceding schema, so they that can be tested by nanowarrow example(app.c), the example app doesn't iterate over all headers.

Used code to reproduce is at the begging of the thread.

Also enclosing full samples of working and non-working.
arrow-samples.tgz

@tustvold
Copy link
Contributor

tustvold commented Nov 8, 2023

Annotating the relevant RecordBatch we get

Good

0x14, 0x00, 0x00, 0x00, // Message offset (20)
0x00, 0x00, 0x00, 0x00, // Message VTable
0x0c, 0x00, // VTable length (12)
0x16, 0x00, // Object size (22)
0x0e, 0x00, // Field 0 offset (version) (14)
0x15, 0x00, // Field 1 offset (header type) (21)
0x10, 0x00, // Field 2 offset (header offset) (16)
0x04, 0x00, // Field 3 offset (bodyLength) (4)
// Message table
0x0c, 0x00, 0x00, 0x00, // SOffset to VTable (12)
0x18, 0x00, 0x00, 0x00, // (bodyLength) (24)
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Padding
0x04, 0x00, // (version)
0x10, 0x00, 0x00, 0x00, // Offset to RecordBatch (16)
0x00, 0x03, // (header type)  (RecordBatch)
// End of Message

// RecordBatch VTable
0x0a, 0x00, // VTable length (10)
0x18, 0x00, // Object size (24)
0x0c, 0x00, // Field 0 offset (length) (12)
0x08, 0x00, // Field 1 offset (nodes) (8)
0x04, 0x00, // Field 2 offset (buffers) (4)
// RecordBatch table
0x0a, 0x00, 0x00, 0x00, // SOffset to VTable (10)
0x14, 0x00, 0x00, 0x00, // Offset to buffers
0x48, 0x00, 0x00, 0x00, // Offset to nodes (72
0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Length
0x00, 0x00, 0x00, 0x00, //
// Buffers
0x03, 0x00, 0x00, 0x00, // Vector Length 3
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer Offset 0
0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer Length 1
0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer Offset 8
0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer Length 8
0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer Offset 16
0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer Length 8
// Nodes
0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, // Vector Length 1
0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // FieldNode Length 1
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // FieldNode Null Count 0

Bad

0x10, 0x00, 0x00, 0x00, // Message offset (16)
// Message VTable
0x0c, 0x00, // VTable length (12)
0x1a, 0x00, // Object size (26)
0x18, 0x00, // Field 0 offset (version) (24)
0x17, 0x00, // Field 1 offset (header type) (23)
0x04, 0x00, // Field 2 offset (header offset) (4)
0x08, 0x00, // Field 3 offset (bodyLength) (8)
// Message table
0x0c, 0x00, 0x00, 0x00, // SOffset to VTable (12)
0x20, 0x00, 0x00, 0x00, // Offset to header (32)
0x18, 0x00, 0x00, 0x00, // (bodyLength) (8)
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x03, // (header type) (RecordBatch)
0x04, 0x00, // (version)
// End of Message

// RecordBatch VTable
0x0a, 0x00, // VTable length (10)
0x14, 0x00, // Object Size (20)
0x0c, 0x00, // Field 0 offset (length) (12)
0x08, 0x00, // Field 1 offset (nodes) (8)
0x04, 0x00, // Field 2 offset (buffers) (4)
// RecordBatch table
0x0a, 0x00, 0x00, 0x00, // SOffset to VTable (10)
0x24, 0x00, 0x00, 0x00, // Offset to buffers
0x0c, 0x00, 0x00, 0x00, // Offset to nodes
0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Length
// Nodes
0x01, 0x00, 0x00, 0x00, // Vector length 1
0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // FieldNode length 1
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // FieldNode null count 0
// Buffers
0x03, 0x00, 0x00, 0x00, // Vector length 3
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer offset 0
0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer length 1
0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer offset 8
0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer length 8
0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer offset 16
0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // Buffer length 8
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,

In the bad example we can see that the vectors don't contain padding to align them to an 8 byte boundary, instead they only have 4 byte alignment. This in turn means that the structs are not correctly aligned, which I suspect is what flatcc is then complaining about.

This appears to be an upstream bug

Edit: Filed google/flatbuffers#8150

@evgenyx00
Copy link
Author

@tustvold appreciate your prompt assistance :)

@alamb
Copy link
Contributor

alamb commented Sep 25, 2024

@bkietz mentioned on #6449 (comment):

FWIW, #6426 and google/flatbuffers#8398 should fix #5052

@alamb
Copy link
Contributor

alamb commented Sep 25, 2024

We have disabled the CI test in #6449, so as part of closing this PR we should enable the tests

paleolimbot added a commit to apache/arrow-nanoarrow that referenced this issue Oct 7, 2024
Closes #641.

Unfortunately we just have to skip checking Rust compatibility due to
apache/arrow-rs#5052 (e.g.,
apache/arrow-rs#6449 ).

This PR also ensures compatibility with big endian Arrow files and Arrow
files from before the continuation token. Support for those had already
been added in the decoder but hadn't made it to the stream reader yet.

Local check:

```bash
# Assumes arrow-testing, arrow-nanoarrow, and arrow are all checked out in the same dir
export gold_dir=../arrow-testing/data/arrow-ipc-stream/integration 
export ARROW_NANOARROW_PATH=$(pwd)/build 
pip install -e "../arrow/dev/archery/[all]"
archery integration --with-nanoarrow=true --run-ipc \
    --gold-dirs=$gold_dir/0.14.1 \
    --gold-dirs=$gold_dir/0.17.1 \
    --gold-dirs=$gold_dir/1.0.0-bigendian \
    --gold-dirs=$gold_dir/1.0.0-littleendian \
    --gold-dirs=$gold_dir/2.0.0-compression \
    --gold-dirs=$gold_dir/4.0.0-shareddict
```
@paleolimbot
Copy link
Member

It seems like the nanoarrow tests are still failing! The failures look the same as they did before; however, if you think this is a nanoarrow issue let me know and I will debug!

@tustvold
Copy link
Contributor

tustvold commented Oct 22, 2024

nanoarrow issue let me know and I will debug

It's somewhat debatable, the rust flatbuffer implementation doesn't generate flatbuffers with correctly aligned structs. This is an issue, and should be fixed, but seems to be stuck on upstream review capacity.

That being said one could make the case that nanoarrow is being overly strict on this, given no other implementation complains, but that gets into philosophical debates over whether specifications or implementations define standards

@paleolimbot
Copy link
Member

In both cases it is upstream (flatcc is the one being strict, in our case). It doesn't seem like flatcc is interested in relaxing this check and while somebody with sufficient flatbuffer experience could patch nanoarrow's vendored copy, it might still fail in places where a previous install was used. It is mostly just very unfortunate and I hope we can find a solution soon.

@alamb
Copy link
Contributor

alamb commented Oct 23, 2024

FWIW I did make a PR to re-enable the check and indeed it still fails

@bkietz did us a solid by making a proposed PR to the rust flatbuffers implementation here: google/flatbuffers#8398. However, there doesn't seem to be any maintainer capacity there at this time (at least not that has time to engange with us)

@bkietz
Copy link
Member

bkietz commented Oct 24, 2024

We could undraft #6426 if it'd be acceptable for arrow-rs to use patched flatbuffers for now (possibly donating that branch of flatbuffers to an ASF controlled repo). Alternatively we could forgo pointing to patched flatbuffers and just check the regenerated .rs in as a "manual" edit for now

@tustvold
Copy link
Contributor

tustvold commented Oct 24, 2024

I would not be opposed to either solution, the flatbuffer definitions don't exactly change frequently. There is also already some precedent for this in the parquet thrift code, where there is a spicy bash script that post-processes what the thrift compiler generates

@alamb
Copy link
Contributor

alamb commented Oct 25, 2024

Alternatively we could forgo pointing to patched flatbuffers and just check the regenerated .rs in as a "manual" edit for now

This is what I would recommend unless the changes needed to the compiler become more substantial.

where there is a spicy bash script that post-processes what the thrift compiler generates

I think this is a good idea. While it is definitely an "old skool" approach to patches, it also has a lot of precident (though we can forgo the 🌶️ part lol)

One common pattern is:

  1. can check in a patch file (output of running diff -du <actual output> <manual output>
  2. then have a script applies that diff via patch during the build

This would keep the changes from stock flatbuffers in a separate file

@bkietz
Copy link
Member

bkietz commented Oct 25, 2024

I'll update my PR today, thanks

@bkietz
Copy link
Member

bkietz commented Oct 25, 2024

Hmm, I'm not seeing how to apply a local .patch file to a cargo dependency. Reading the cargo book, it seems that we can only point to a repository or a local path. (The unpatched flatbuffers crate does not make PushAlignment public, so it's not possible to override Push::alignment function even while editing the generated .rs)

@bkietz
Copy link
Member

bkietz commented Oct 28, 2024

@alamb Never mind; google/flatbuffers#8398 just merged. I'll rebase #6426 to use google/flatbuffers pointing to a commit hash

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
5 participants