Alignment (probably) shouldn't be required for empty vectors #287

Closed
bkietz opened this issue Aug 14, 2024 · 32 comments

Comments

@bkietz
Contributor

bkietz commented Aug 14, 2024

Flatcc's verify_vector requires that vectors be properly aligned even when the vector is empty. The consumers in https://github.com/google/flatbuffers do not enforce this and moreover the producers will sometimes emit an empty vector which they didn't bother to align.

(Encountered in the context of arrow IPC, which uses flatbuffers to store metadata) A minimal example: https://gist.github.com/bkietz/6150b9aae6d1bea0c85c0527d7fb04d6

@mikkelfj
Contributor

mikkelfj commented Aug 15, 2024

@bkietz thanks for making a note of this. I am not able to dig deeply into this at the moment. I am thrilled that flatcc is being used with Arrow, and I did give some feedback to one of the Arrow designers back then. I have worked a bit with the Arrow format in Julia and Go.

That said, I disagree about alignment. You could get away with it, sort of, for empty vectors, but the (empty) data still needs to be aligned to at least 32 bits because of the length header. Furthermore, pointer types in some languages, notably C, are required to be aligned, at the very least if dereferenced, but even if not, who knows what aggressive optimizations a compiler might do. In C, vectors can be accessed via raw pointers iff the user takes responsibility for endian conversion, or the platform is known to be little endian. Therefore there are functions that are expected to return aligned pointers even if the vector is empty.

Non-alignment can also lead to unexpected errors, such as a user doing pointer math to get the header field, even if the user is not supposed to do that directly, since the binary format is well documented. Even a compiler might do that, though it is unlikely.

As for the google/flatbuffers consumers: no consumers I am aware of, including FlatCC readers, enforce alignment. Notably, it is perfectly valid to read from an unaligned buffer, e.g. from a network buffer, iff you are sure your platform architecture can handle it, such as x86 and some, but not all, ARM. Verification is different though. Google's C++ verifier did not verify alignment very aggressively early on, AFAIR, but that does not mean it should not do that, only that it did not implement it.

From a practical perspective, it would also be an odd exception in the verifier to first check and fail, then add branch logic to check whether it in fact happened to be an empty vector. Doing the empty check first would be an unwanted performance penalty, and still a complication.

Therefore I suggest you try to address the issue at the source.

@mikkelfj
Contributor

On a related note:

I would like the verifier to be able to accept unaligned buffers but still verify that they are properly aligned relative to buffer start. This would be useful for verifying buffers in unaligned network buffers without having to make a copy.

I think most of the code is already doing that, but AFAIR not all the code, and there should be a mode flag to request such behavior. This could be added as a PR, but it is not something I plan to address short term.

It would also be possible to have a verifier that entirely avoids checking for alignment. This would be done with a compile-time config flag, preferably such that the main logic is unchanged but all align helper functions just return true. Again, subject to PR.
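
A minimal sketch of the two ideas in this comment, assuming hypothetical names (`VERIFY_SKIP_ALIGNMENT`, `is_aligned_offset`) that are not part of flatcc. Alignment is checked against the offset from the buffer start rather than the absolute address, and a compile-time flag turns the helper into a constant `true` without touching the calling logic.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical compile-time switch; flatcc's real configuration may differ.
#ifndef VERIFY_SKIP_ALIGNMENT
#define VERIFY_SKIP_ALIGNMENT 0
#endif

// Returns true if `offset` (measured from the start of the buffer, not as an
// absolute address) is a multiple of `align`. `align` must be a power of two.
inline bool is_aligned_offset(std::size_t offset, std::size_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
#if VERIFY_SKIP_ALIGNMENT
  (void)offset;
  return true;  // Alignment checks compiled out; callers stay unchanged.
#else
  return (offset & (align - 1)) == 0;
#endif
}
```

Because only offsets are inspected, the same check gives the same answer wherever the buffer happens to sit in memory, for example inside an unaligned network buffer.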

@bkietz
Contributor Author

bkietz commented Aug 15, 2024

> Therefore I suggest you try to address the issue at the source.

I will not be able to do that; de-facto these empty vectors are acceptable to many other readers of the arrow format (and some are even checked in as gold files). Therefore we're going to need to patch our copy of flatcc to handle this corner case.

@bkietz
Contributor Author

bkietz commented Aug 15, 2024

> Furthermore pointer types in some languages, notably C, are required to be aligned

The alignment requirement need not be an obstacle here.
In the case of empty vectors the pointer returned by SomeTable_somevector() could be

  • NULL (but that might be unfavorable due to memcpy and other functions which forbid NULL even when size is zero)
  • a pointer to a max aligned static buffer reserved for empty vectors
  • any properly aligned pointer to any byte in the buffer (if we want to preserve the guarantee that pointers all point into the same buffer)

@aardappel

I agree with @mikkelfj that vectors should be aligned even if empty. In C++, even if you don't dereference (which would be UB and/or a crash on some architectures), even just converting unaligned pointers may result in "unspecified" values. While this is unlikely to happen in practice, it would be a bad idea to allow it.

Thus, the alignment for a vector of double is 8, and the minimum alignment of ANY vector is 4, since we access the size field from it.

If the C++ verifier doesn't check this, then that is a bug / oversight. My apologies. Ideally this would be fixed, but we'd need to audit if there are any past buggy serializers that allowed this unalignment first. @dbaileychess

@bkietz how did the project end up with these unaligned vectors? which writing code in what language produced these?

The question remains how to recover from the situation. I guess Arrow needs a patched verifier that disables this particular check. I'd highly recommend fixing any writers to ensure alignment from now on.

@mikkelfj
Contributor

mikkelfj commented Aug 15, 2024

@bkietz FYI aardappel is the original FB designer and dbaileychess is the current Google maintainer. I support aardappel's response.

I agree that a patched version of the flatcc verifier is probably the best short term solution, but I still recommend to address the problem at the source, perhaps through deprecation.

Flatcc runtime files include a config.h file that users can use to override flags. This could also be used to add a flag for the patched behaviour: https://github.com/dvidelabs/flatcc/blob/master/config/config.h

To address your suggestions:

  • NULL is not acceptable because empty is different from absent.
  • A pointer to a default buffer is not suitable for several reasons, although the idea is sound: there would be an unacceptable performance overhead to check for this at runtime (any additional branch is unacceptable), and some applications might assume identity of a buffer location. We did discuss such default objects in a separate context where one could optionally have default tables and other objects, but it was never decided, it is also a different case, and any performance overhead would be opt in.
  • any aligned pointer has the same identity and performance issues.
  • all of the above adds unwanted complexity into code that is already fairly complex, only to address behaviour that is not really desirable.

This is not to be dismissive. I want to support an efficient solution for arrow, but it should not happen by changing the nature of the core format.

Also, I am not sure if it was clear to you before, but a verifier is very different from a reader. A reader can take all the shortcuts it wants, almost, because it can assume the buffer is valid in exchange for speed and other objectives. A verifier should ideally be fast, but it should first and foremost be correct and assume a buffer might be invalid.

EDIT: the verifier uses https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/flatcc_rtconfig.h while the main parser/compiler uses the above mentioned config.

@bkietz
Contributor Author

bkietz commented Aug 15, 2024

> @bkietz how did the project end up with these unaligned vectors? which writing code in what language produced these?

I was primarily testing against the C++ library (which is also what produced the repro case above). I have not checked if any other libraries produce such vectors. I'll try to extract a minimal producer-side repro using the C++ library.

> I'd highly recommend fixing any writers to ensure alignment from now on.

That sounds reasonable to me.

FWIW flatcc is the only implementation thus far which has balked at these empty vectors. The arrow integration tests pass flatbuffer metadata between C++, Java, Rust, C#, and Go. The intent is to make sure everything is inter-valid so all arrow implementers verify incoming messages in addition to reading. None of those has raised this error before, so if alignment starts to be asserted for empty vectors in more flatbuffers libraries I suspect more users than arrow will break.

@aardappel

Argh.. looks like this has been a bug in the C++ library all along. StartVector passes the alignment of element type T to the centrally used function PreAlign, which then does:

  // Aligns such that when "len" bytes are written, an object can be written
  // after it (forward in the buffer) with "alignment" without padding.
  void PreAlign(size_t len, size_t alignment) {
    if (len == 0) return;
    TrackMinAlign(alignment);
    buf_.fill(PaddingBytes(GetSize() + len, alignment));
  }

Note the early out on len == 0.
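
A minimal sketch of the effect of that early out, using a toy model that tracks only the number of bytes written so far (FlatBuffers builds buffers back to front). `ToyBuilder`, `pre_align`, and `pre_align_always` are hypothetical names for illustration; the padding formula is modeled on the quoted `PaddingBytes` call and is an assumption about its behavior.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of padding needed so that `size + padding` is a multiple of
// `alignment`, which must be a power of two.
inline std::size_t padding_bytes(std::size_t size, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (~size + 1) & (alignment - 1);
}

// Toy back-to-front builder: `size` is the number of bytes written so far.
struct ToyBuilder {
  std::size_t size = 0;

  // Mirrors the quoted PreAlign, including the early out on len == 0.
  void pre_align(std::size_t len, std::size_t alignment) {
    if (len == 0) return;
    size += padding_bytes(size + len, alignment);
  }

  // Variant without the early out: pads even when nothing will be written.
  void pre_align_always(std::size_t len, std::size_t alignment) {
    size += padding_bytes(size + len, alignment);
  }
};
```

Starting from 20 bytes written, `pre_align(0, 8)` leaves the size at 20, which is not a multiple of 8, while `pre_align_always(0, 8)` pads it to 24.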

Worse, this same function is used for alignment to LenT, the size field, which is ALSO skipped.. this is much more serious, since in theory that can cause the size field to cause an unaligned read. Wonder why that has never been an issue.. @dbaileychess

So this means all buffers generated by the C++ implementation have possibly unaligned empty vectors in them. And any languages that were implemented by studying/copying the C++ implementation in any way may have it too.

So while to be correct, I still think an empty vector should be aligned, I have to admit it does not make sense for any verifier in any language to require it.. including C.

This bug is essentially grandfathered in.

I'd still think it would be good to fix this bug where possible. Even that will be challenging, as it may grow the size of some FlatBuffers and brittle tests the world over will break :)

In terms of how to fix it, it may be tempting to just remove the len == 0 check, because in the end when writing 0 bytes, you may still intend to derive a pointer from the current location. But maybe better to first audit the callers of PreAlign, and see which ones could possibly benefit from this "optimisation", and split into 2 functions if any.

The call PreAlign<LenT>(len * elemsize) in StartVector is flat out wrong, since it says I am going to write 0 bytes for LenT alignment, when actually we want to write sizeof(LenT) bytes.. but PreAlign<LenT>(sizeof(LenT)) is also wrong since the bytes are to be written before the alignment point, not after. So this should do something else entirely different if the len == 0 check is kept. This should for sure be fixed even if the bigger alignment of T is not.

@bkietz
Contributor Author

bkietz commented Aug 15, 2024

I'll open a draft PR to flatcc adding FLATCC_ALLO_MISALIGNED_EMPTY_VECTORS to flatcc_rtconfig.h

@mikkelfj
Contributor

> I'll open a draft PR to flatcc adding FLATCC_ALLO_MISALIGNED_EMPTY_VECTORS to flatcc_rtconfig.h

aside from minor typo:
FLATCC_ALLOW_MISALIGNED_EMPTY_VECTORS
I am fine with that. Add comment to explain why the setting exists. Documentation in main README also needs update.

The question is whether it should be enabled by default?

I tend to prefer to keep things as they are by default, partly because that is the correct behavior, and partly because anything else is technically not guaranteed to work and C compilers get increasingly aggressive with optimizations - this is the number one reason for bug reports and fixes in flatcc by far, though mostly due to harmless warnings.

However, given the scale of the bug, I am not entirely sure.

As for other languages not complaining: AFAIK only C and C++ implement verifier logic. Other languages generally just depend on the built-in language safety features such as buffer overrun protection. So it is not surprising that other languages do not complain; even C would not do that without explicit verification. However, that does not mean that all those languages have been reading data correctly, though they likely have.

@mikkelfj
Contributor

For reference, the flatcc verifier did have an issue related to verifying buffers that contained alignment of size 8 or above, but it only happened with size prefixed buffers (a feature that was added some years after the original FB release). It was a rare occurrence and only fixed in late 2023, so there is a chance that this fix might have triggered something with arrow.

#210

@dbaileychess

I don't think this is an issue for Flatbuffers. @aardappel is correct that in StartVector() we special-case len == 0; it's basically a no-op. BUT, you have to finish a vector with an EndVector() call that:

  /// @cond FLATBUFFERS_INTERNAL
  template<typename LenT = uoffset_t, typename ReturnT = uoffset_t>
  ReturnT EndVector(size_t len) {
    FLATBUFFERS_ASSERT(nested);  // Hit if no corresponding StartVector.
    nested = false;
    return PushElement<LenT, ReturnT>(static_cast<LenT>(len));
  }

Pushes the size of the vector (0) into the buffer. That function aligns the value being written:

  // Write a single aligned scalar to the buffer
  template<typename T, typename ReturnT = uoffset_t>
  ReturnT PushElement(T element) {
    AssertScalarT<T>();
    Align(sizeof(T));
    buf_.push_small(EndianScalar(element));
    return CalculateOffset<ReturnT>();
  }

@dbaileychess

I made a test example:

table MyRoot {
  a:string;
  vector:[int];
  c:string;
}

root_type MyRoot;

And made it with this json:

{
  "a": "aaaaaaa",
  "vector": [],
  "c": "cccccccc"
}

And the annotated buffer is:

// Annotated Flatbuffer Binary
//
// Schema file: test.fbs
// Binary file: test.bin

header:
  +0x00 | 10 00 00 00             | UOffset32  | 0x00000010 (16) Loc: 0x10 | offset to root table `MyRoot`

padding:
  +0x04 | 00 00                   | uint8_t[2] | ..                        | padding

vtable (MyRoot):
  +0x06 | 0A 00                   | uint16_t   | 0x000A (10)               | size of this vtable
  +0x08 | 10 00                   | uint16_t   | 0x0010 (16)               | size of referring table
  +0x0A | 04 00                   | VOffset16  | 0x0004 (4)                | offset to field `a` (id: 0)
  +0x0C | 08 00                   | VOffset16  | 0x0008 (8)                | offset to field `vector` (id: 1)
  +0x0E | 0C 00                   | VOffset16  | 0x000C (12)               | offset to field `c` (id: 2)

root_table (MyRoot):
  +0x10 | 0A 00 00 00             | SOffset32  | 0x0000000A (10) Loc: 0x06 | offset to vtable
  +0x14 | 20 00 00 00             | UOffset32  | 0x00000020 (32) Loc: 0x34 | offset to field `a` (string)
  +0x18 | 18 00 00 00             | UOffset32  | 0x00000018 (24) Loc: 0x30 | offset to field `vector` (vector)
  +0x1C | 04 00 00 00             | UOffset32  | 0x00000004 (4) Loc: 0x20  | offset to field `c` (string)

string (MyRoot.c):
  +0x20 | 08 00 00 00             | uint32_t   | 0x00000008 (8)            | length of string
  +0x24 | 63 63 63 63 63 63 63 63 | char[8]    | cccccccc                  | string literal
  +0x2C | 00                      | char       | 0x00 (0)                  | string terminator

padding:
  +0x2D | 00 00 00                | uint8_t[3] | ...                       | padding

vector (MyRoot.vector):
  +0x30 | 00 00 00 00             | uint32_t   | 0x00000000 (0)            | length of vector (# items)

string (MyRoot.a):
  +0x34 | 07 00 00 00             | uint32_t   | 0x00000007 (7)            | length of string
  +0x38 | 61 61 61 61 61 61 61    | char[7]    | aaaaaaa                   | string literal
  +0x3F | 00                      | char       | 0x00 (0)                  | string terminator

You can see the empty vector size field is placed at 0x30 (which is properly aligned) and just contains the value 0. Other padding is added afterwards for the other field.

I tried this with all sorts of sizes of both a and c and never saw a misplaced alignment for the empty vector.
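
The `Loc:` values in the annotation above can be reproduced by hand: a UOffset32 stored at some position points forward to that position plus its value. A minimal sketch of that arithmetic, with hypothetical helper names chosen for illustration:

```cpp
#include <cassert>
#include <cstdint>

// A UOffset32 stored at `location` refers to the object at `location + value`.
inline std::uint32_t uoffset_target(std::uint32_t location,
                                    std::uint32_t value) {
  return location + value;
}

// True if `position` (an offset from the buffer start) is a multiple of
// `alignment`.
inline bool is_aligned(std::uint32_t position, std::uint32_t alignment) {
  assert(alignment != 0);
  return position % alignment == 0;
}
```

For the `vector` field, `uoffset_target(0x18, 0x18)` gives 0x30, and `is_aligned(0x30, 4)` holds, matching the annotated dump.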

@dbaileychess

Here is an example where the previously written thing has a non-4 byte alignment (0x1E is not 4 byte aligned), but the C++ code does properly add in the 2 padding bytes to force the vector to be 4 byte aligned at 0x18.

vector (MyRoot.vector):
  +0x18 | 00 00 00 00             | uint32_t   | 0x00000000 (0)            | length of vector (# items)

padding:
  +0x1C | 00 00                   | uint8_t[2] | ..                        | padding

vtable (MyLeaf):
  +0x1E | 06 00                   | uint16_t   | 0x0006 (6)                | size of this vtable

@mikkelfj
Contributor

@dbaileychess Thanks for looking into this. It seems the flatcc verifier should remain as is.

I'll wait for @bkietz to suggest next steps.

@bkietz
Contributor Author

bkietz commented Aug 16, 2024

@dbaileychess alignment to 4 bytes is indeed guaranteed by the presence of the vector size. However the empty vector in question has elements which require 8 byte alignment. I've annotated the buffer from my repro case:

// Annotated Flatbuffer Binary
//
// Schema file: /home/bkietz/arrow/format/Message.fbs
// Binary file: dump.bin

header:
  +0x00 | 14 00 00 00             | UOffset32  | 0x00000014 (20) Loc: 0x14 | offset to root table `org.apache.arrow.flatbuf.Message`

padding:
  +0x04 | 00 00 00 00 00 00       | uint8_t[6] | ......                    | padding

vtable (org.apache.arrow.flatbuf.Message):
  +0x0A | 0A 00                   | uint16_t   | 0x000A (10)               | size of this vtable
  +0x0C | 0E 00                   | uint16_t   | 0x000E (14)               | size of referring table
  +0x0E | 06 00                   | VOffset16  | 0x0006 (6)                | offset to field `version` (id: 0)
  +0x10 | 05 00                   | VOffset16  | 0x0005 (5)                | offset to field `header_type` (id: 1)
  +0x12 | 08 00                   | VOffset16  | 0x0008 (8)                | offset to field `header` (id: 2)

root_table (org.apache.arrow.flatbuf.Message):
  +0x14 | 0A 00 00 00             | SOffset32  | 0x0000000A (10) Loc: 0x0A | offset to vtable
  +0x18 | 00                      | uint8_t[1] | .                         | padding
  +0x19 | 03                      | UType8     | 0x03 (3)                  | table field `header_type` (UType)
  +0x1A | 04 00                   | int16_t    | 0x0004 (4)                | table field `version` (Short)
  +0x1C | 10 00 00 00             | UOffset32  | 0x00000010 (16) Loc: 0x2C | offset to field `header` (union of type `RecordBatch`)
  +0x20 | 00 00                   | uint8_t[2] | ..                        | padding

vtable (org.apache.arrow.flatbuf.RecordBatch):
  +0x22 | 0A 00                   | uint16_t   | 0x000A (10)               | size of this vtable
  +0x24 | 0C 00                   | uint16_t   | 0x000C (12)               | size of referring table
  +0x26 | 00 00                   | VOffset16  | 0x0000 (0)                | offset to field `length` (id: 0) <defaults to 0> (Long)
  +0x28 | 04 00                   | VOffset16  | 0x0004 (4)                | offset to field `nodes` (id: 1)
  +0x2A | 08 00                   | VOffset16  | 0x0008 (8)                | offset to field `buffers` (id: 2)

table (org.apache.arrow.flatbuf.RecordBatch):
  +0x2C | 0A 00 00 00             | SOffset32  | 0x0000000A (10) Loc: 0x22 | offset to vtable
  +0x30 | 0C 00 00 00             | UOffset32  | 0x0000000C (12) Loc: 0x3C | offset to field `nodes` (vector)
  +0x34 | 04 00 00 00             | UOffset32  | 0x00000004 (4) Loc: 0x38  | offset to field `buffers` (vector)

vector (org.apache.arrow.flatbuf.RecordBatch.buffers):
  +0x38 | 00 00 00 00             | uint32_t   | 0x00000000 (0)            | length of vector (# items)

vector (org.apache.arrow.flatbuf.RecordBatch.nodes):
  +0x3C | 01 00 00 00             | uint32_t   | 0x00000001 (1)            | length of vector (# items)
  +0x40 | 00 00 00 00 00 00 00 00 | int64_t    | 0x0000000000000000 (0)    | struct field `[0].length` of 'org.apache.arrow.flatbuf.FieldNode' (Long)
  +0x48 | 00 00 00 00 00 00 00 00 | int64_t    | 0x0000000000000000 (0)    | struct field `[0].null_count` of 'org.apache.arrow.flatbuf.FieldNode' (Long)

The vector size for RecordBatch.buffers is placed at 0x38, which IIUC means the empty vector element span is the byte range [0x3c, 0x3c). That span is not aligned to 8 bytes.

[dump.zip](https://github.com/user-attachments/files/16638610/dump.zip)
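
The misalignment can be checked with plain arithmetic on the offsets from the dump. A minimal sketch, with hypothetical helper names, assuming the element span starts directly after the 4-byte length field:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Size of the uint32_t length field that precedes a vector's elements.
const std::size_t kLengthFieldSize = sizeof(std::uint32_t);

// Offset of the first element, given the offset of the vector's length field.
inline std::size_t element_span_start(std::size_t length_field_offset) {
  return length_field_offset + kLengthFieldSize;
}

// Remainder of `offset` modulo `alignment`; zero means aligned.
inline std::size_t misalignment(std::size_t offset, std::size_t alignment) {
  assert(alignment != 0);
  return offset % alignment;
}
```

For `buffers`, the length field at 0x38 puts the element span at 0x3c, which is 4 bytes past an 8-byte boundary. For `nodes`, the length field at 0x3C puts the elements at 0x40, which is 8-byte aligned.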

@dbaileychess

Thanks for making the annotation, it's always easier for me to see the layout!

So my take is if a vector is empty, it doesn't really matter what alignment it is supposed to have. There is no data in the buffer for it to take an alignment.

I don't count the size prefix location as the start of the vector either. The data part is the part directly after the length_ field. That part should be aligned IMO to the element size. And since there are no elements, it doesn't matter.

I guess a question is, are you checking the length of the vector before trying to access the "backing" data of it?

@aardappel

@mikkelfj I feel you should by default allow unaligned empty vectors for alignments > 4, because apparently that is the de facto standard. Checking for alignment > 4 should be an option.

@dbaileychess you are correct at least the size is never misaligned, thanks to the (redundant) per element alignment. But it is still a bug that the StartVector code essentially does nothing, this bug has been masked. If instead we had written the vector data assuming correct alignment was already guaranteed, this would be a big bug.

A [double] can be misaligned still, and this is a problem in C/C++, see my first comment on this issue. This is apparently benign in that no compiler is able to know that this pointer is potentially misaligned, and no hardware does funny things with pointer bits for supposedly aligned pointers, but.. they could, in theory.

We could codify going forward that empty vectors are merely aligned to 4, and assume C++ compilers will never be able to find out.

Or, we can fix the C++ builder (and who knows what other languages) going forward to align to the type also?

@aardappel

And, while I am ranting, can we reflect for a moment how relatively complicated such a simple thing as alignment can be, and how much effort over the life of FlatBuffers has been spent on it?

Like I said in google/flatbuffers#5875, I would totally just skip alignment entirely if I could do it all over again :)

@bkietz
Contributor Author

bkietz commented Aug 16, 2024

> I guess a question is, are you checking the length of the vector before trying to access the "backing" data of it?

Either users who specify this option take responsibility for checking size before otherwise accessing the vector field, or we need to also modify reader codegen to ensure that misaligned pointers are never materialized (though it would be odd to react to an rtconfig option in the compiler source...).

@mikkelfj
Contributor

mikkelfj commented Aug 16, 2024

@aardappel I'm fluctuating on whether flatcc should default allow unaligned empty because @dbaileychess suggested it was not a native bug, but then it was anyway, and that obviously affects that decision.

> I guess a question is, are you checking the length of the vector before trying to access the "backing" data of it?

It would be terrible to have extra runtime checks for that, performance wise. There is already an unavoidable null check that hopefully gets optimized out of any inner loops. A verifier can special-case a failed vector alignment check and then see if the length is zero. This only adds overhead to the verifier, and only when relevant.
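
A minimal sketch of that verifier-side fallback, assuming hypothetical names (`VerifyResult`, `verify_vector_alignment`) and a runtime flag in place of whatever configuration mechanism flatcc actually uses. The empty-vector test runs only after the ordinary alignment check has already failed, so well-formed buffers pay nothing extra.

```cpp
#include <cassert>
#include <cstddef>

enum class VerifyResult { Ok, LengthFieldMisaligned, ElementsMisaligned };

// `data_offset`: offset of the first element from the buffer start.
// `count`: element count, already read from the length field.
// `elem_align`: required element alignment, a power of two.
// `allow_misaligned_empty`: hypothetical switch for the relaxed behavior.
inline VerifyResult verify_vector_alignment(std::size_t data_offset,
                                            std::size_t count,
                                            std::size_t elem_align,
                                            bool allow_misaligned_empty) {
  assert(elem_align != 0 && (elem_align & (elem_align - 1)) == 0);
  // The length field sits 4 bytes before the data, so the data offset is
  // 4-byte aligned exactly when the length field is.
  if ((data_offset & 3u) != 0) return VerifyResult::LengthFieldMisaligned;
  // Common path: the elements are aligned and nothing else is checked.
  if ((data_offset & (elem_align - 1)) == 0) return VerifyResult::Ok;
  // Only reached after a failed check: tolerate the case if the vector is empty.
  if (allow_misaligned_empty && count == 0) return VerifyResult::Ok;
  return VerifyResult::ElementsMisaligned;
}
```

With the offsets from the Arrow dump, an empty vector of 8-byte elements at data offset 0x3c passes when the switch is on and fails when it is off.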

> And, while I am ranting, can we reflect for a moment how relatively complicated such a simple thing as alignment can be, and how much effort over the life of FlatBuffers has been spent on it?

I want nested flatbuffers to be dead for that reason. It is by far the most complicated part of the flatcc builder runtime, and a nutritious bug feeder. I don't even think C++ tries to do it right; last I heard it just used 8-byte alignment. FlatCC tries to create nested flatbuffers that are properly aligned both relative to the nested buffer and to the parent buffer, and it is not pretty.

But alignment is important for many reasons, including performance, atomic access, GPUs, optimizations. I can see a second format without alignment and it would be much simpler and denser, but it is not FlatBuffers as we know it.

@bkietz feel free to submit PR on the verifier, then we can always fight over what default to use, but we need the code in place it seems, not only for arrow.

@mikkelfj
Contributor

As to incorrect behaviour of unaligned empty buffers:

I don't think we should runtime fix this, that is costly. We should require buffers to be valid on platforms that are sensitive to it, and rely on "it probably works anyway" on other platforms.

I don't think we should define empty to be always 4 bytes, simply because of pointer types.

@dbaileychess

Perhaps I don't understand the C implementation, but you always have to check the length of a vector before starting to read from it. How do you know how long it is? You shouldn't be dereferencing the .data() field without knowing the length.

@bkietz
Contributor Author

bkietz commented Aug 16, 2024

> You shouldn't be dereferencing the .data() field without knowing the length.

In fact C forbids that misaligned pointers be materialized at all; 6.3.2.3

> A pointer to an object or incomplete type may be converted to a pointer to a different
> object or incomplete type. If the resulting pointer is not correctly aligned for the
> pointed-to type, the behavior is undefined.

So (int*)ptr can be UB even without the dereference.
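
A minimal sketch of how code can avoid forming a misaligned typed pointer in the first place, written in C++ with hypothetical helper names. It assumes the common implementation behavior that a pointer converted to `std::uintptr_t` exposes its address bits; the standard leaves that mapping implementation-defined.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// True if `p` is suitably aligned for T. The conversion goes to an integer,
// so no T* is created while the answer is still unknown.
template <typename T>
bool is_aligned_for(const void *p) {
  return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Reads a T from possibly unaligned bytes without ever forming a T* to them.
template <typename T>
T load_unaligned(const unsigned char *bytes) {
  assert(bytes != nullptr);
  T value;
  std::memcpy(&value, bytes, sizeof(T));
  return value;
}
```

A reader can test `is_aligned_for<T>` before casting, and fall back to `load_unaligned<T>` for individual values when the test fails.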

credit

@aardappel

@mikkelfj we have 10+ years worth of the C++ implementation (and maybe others) allowing incorrectly aligned vectors (for sizeof(T) > 4) to be created and stored. This is now the de facto standard that implementations must handle. So yes, the default should be off. You can only enforce correct alignment if you know for sure the writer was also C; that sounds very much like an opt-in situation.

I do believe C++ tries to correctly align nested FlatBuffers, but also possible there are implementations that don't. They have actually proven rather useful and are here to stay, so you'll have to deal with them :)

@mikkelfj
Contributor

mikkelfj commented Aug 16, 2024

@dbaileychess This is from memory, the code is 10 years old by now, but internally the C reader returns a pointer to the first element. AFAIR C++ returns a pointer (internally) to the size header field. The returned pointer is an abstract pointer type via typedef because it needs to be endian translated, but the accessor method goes like:

Tmyvec v = get_vec(buf, ...);
size_t n = get_vec_len(v);
Tmyvecelem x = get_vec_at(v, index);

Here the problem is get_vec_at. It doesn't want to check the length. I don't even think it wants to check for null, but I am not sure. For little endian, this becomes an unconditional direct memory access instruction.

EDIT: for clarity: get_vec_len(v) will in practice do a ((uint32_t *)v)[-1] operation, or whatever changes were made to abstract away from aggressive optimizations. I think it actually first casts to a uint8_t pointer (aka char pointer), which is a safe cast, then moves the pointer down, then casts to uint32_t. Either way, it should be fairly robust to any alignment at 4 bytes or above. So I don't think C runtime reader code will have any problems with the current situation. But the pointer still needs to be the native element type internally and it is exposed to the user by some API methods.
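
A minimal sketch of a length accessor along those lines, written in C++ with a hypothetical name (`vec_len`); it is not flatcc's actual code. It steps back from the first element through a byte pointer and assembles the little-endian length from individual bytes, so it depends on neither the element alignment nor the host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// `first_element` points at the first element of a vector. The little-endian
// uint32_t length is stored in the 4 bytes immediately before it.
inline std::uint32_t vec_len(const void *first_element) {
  assert(first_element != nullptr);
  const unsigned char *p =
      static_cast<const unsigned char *>(first_element) - sizeof(std::uint32_t);
  // Byte-wise assembly: no aligned load and no host endianness assumption.
  return static_cast<std::uint32_t>(p[0]) |
         (static_cast<std::uint32_t>(p[1]) << 8) |
         (static_cast<std::uint32_t>(p[2]) << 16) |
         (static_cast<std::uint32_t>(p[3]) << 24);
}
```

For an empty vector the pointer passed in is one past the length field, and the function still reads only the four header bytes.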

@mikkelfj
Contributor

mikkelfj commented Aug 16, 2024

@bkietz that is probably true, wrt. ptr materialization. Certainly pointer casts have given a lot of headaches in flatcc and elsewhere. However, for unreferenced pointers, it is very common to cast them at least to size_t, then tag pointers by adding one to e.g. distinguish between leaf and non-leaf tree nodes. The size_t type is then cast back to a pointer, but the pointer is cleaned up via cast to size_t and back before access. This is such a common pattern that it can be assumed valid. However, I also think it is valid to cast to size_t though not necessarily back if unaligned, even if it is done in practice.
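
A minimal sketch of the pointer-tagging pattern described above, written in C++ with hypothetical names. It uses `std::uintptr_t` rather than `size_t` as the integer type, which is a choice made here, and it keeps the tagged value as an integer so that no misaligned pointer is ever formed. It assumes the low address bit of a properly aligned object is zero, which holds on mainstream platforms but is implementation-defined.

```cpp
#include <cassert>
#include <cstdint>

struct Node {
  int value;
};

// Marks a node as a leaf by setting the low bit of its address bits.
inline std::uintptr_t tag_leaf(const Node *n) {
  std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(n);
  assert((bits & 1u) == 0);  // Requires the node address to be even.
  return bits | 1u;
}

inline bool is_leaf(std::uintptr_t tagged) { return (tagged & 1u) != 0; }

// Clears the tag before converting back, so the resulting pointer is aligned.
inline const Node *untag(std::uintptr_t tagged) {
  return reinterpret_cast<const Node *>(tagged &
                                        ~static_cast<std::uintptr_t>(1));
}
```

The tag is stripped while the value is still an integer, so the only pointer conversion that happens is back to the original, aligned address.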

The problem is really when the pointer is accessed or manipulated in ways that are not carefully implemented for such behavior.

@mikkelfj
Contributor

mikkelfj commented Aug 16, 2024

@aardappel I'm not arguing either way on what flatcc should do, I am trying to understand. I just stated that my previous statements were affected by conflicting information.

That said, I think you are saying now that C++ does it correctly when writing 0 elements of size 8 [EDIT you did not say that], but earlier you said it did not, and @bkietz also seems to confirm this. If this is the case, the C++ writer should be fixed, but it does not remove the 10 years of precedent. As for whatever other languages might do, it does not matter if C++ has the bug. If C++ does not have the bug, I think we need to figure out what other languages do.

@aardappel

I think I have been quite clear that C++ has the bug.

@mikkelfj
Contributor

Updated README in verifier section to make a note of this:
86b2f78

@mikkelfj
Contributor

mikkelfj commented Aug 16, 2024

> I think I have been quite clear that C++ has the bug.

yes indeed, I already added a correction to my comment. I was misreading another sentence that referenced something else. Sorry about that.

@mikkelfj
Contributor

This is now resolved in #289 with FlatCC by defaulting to only requiring empty vectors to be aligned to the size field. The original stricter behavior can be enabled by a compile-time flag. This affects the verifier only (with the admittedly unlikely but potential risk of UB in other code due to misaligned pointers).

bkietz added a commit to apache/arrow that referenced this issue Sep 24, 2024
…3715)

### Rationale for this change

Nanoarrow can now read and write IPC files as of apache/arrow-nanoarrow#585 so it should no longer be skipped as a producer/consumer

### What changes are included in this PR?

Nanoarrow's tester is updated to point to the new integration executable and to report nanoarrow as a consumer/producer of IPC files.

Notably the `null_trivial` case is skipped even though nanoarrow nominally supports it since it represents a corner case in which nanoarrow's flatbuffers library will not accept some vectors produced by other flatbuffers libraries dvidelabs/flatcc#287

### Are these changes tested?

Yes

### Are there any user-facing changes?

No

* GitHub Issue: #43680

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>