- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.4k
[skip-ci][ntuple][doc] Add reference doc for RNTuple merging #20191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! A couple of comments for discussion.
| ## Goal | ||
| The goal of the RNTuple merging process is producing one output RNTuple from *N* input RNTuples that can be used as if it were produced directly in the merged state. This means that: | ||
|  | ||
| * R1: All fields in the output RNTuple are accessible and have a type compatible with the original fields of the input RNTuples. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the "compatible" bit, it would be good to also clarify how compatibility is defined in this context
|  | ||
| The first input is attached in `EDescriptorDeserializeMode::kForWriting` mode, which doesn't collate the extended header with the non-extended header. Since we use the first input's descriptor as the output schema (barring late model extensions, see later), opening in `kForWriting` mode allows us to write the output to disk preserving the non-extended schema description as per requirement R3. A consequence of this choice is that the merger never produces (new) deferred columns in the output RNTuple's header. | ||
|  | ||
| In `Union` mode only, we allow any following input RNTuple to define new fields that don't appear in the first input. These fields, after being validated, are late model extended into the output model and will thus appear in the output RNTuple's extended header on disk. This means that all columns that were not part of the first input's schema become deferred columns in the output RNTuple (unless the first source had 0 entries). | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unless the first source had 0 entries
In which case the non-extended output schema is equal to the non-extended schema of the second input? Or is there something more?
| As for common fields, they are matched by name and validated as follows: | ||
| - any field that is projected in the destination must be also projected in the source and must be projected to the same field; | ||
| - any field that is not projected in the destination must also not be projected in the source; | ||
| - the field types names must be **identical** (*this could probably be relaxed in the future to allow for different but compatible types*) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although not in direct contrast with R1, this sounds related to it, maybe it should be mentioned?
| - any field that is projected in the destination must be also projected in the source and must be projected to the same field; | ||
| - any field that is not projected in the destination must also not be projected in the source; | ||
| - the field types names must be **identical** (*this could probably be relaxed in the future to allow for different but compatible types*) | ||
| - the type checksums, if present, must be identical. Note that if a field has a type checksum and the other doesn't, we consider this valid (*is this sound?*); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note that mode (L4) would allow this to be fully relaxed.
- the input any checksum will do (within the constraint of regular schema evolution support)
- the destination checksum would have the same relationship to the in-memory class layout/checksum as with regular RNTuple Write (i.e. be the same)
| - the type checksums, if present, must be identical. Note that if a field has a type checksum and the other doesn't, we consider this valid (*is this sound?*); | ||
| - the type versions must be identical; | ||
| - the fields' structural roles must be identical; | ||
| - the column representations must match, as follows: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those limitations are also likely removable for mode (L4)
Following this comment's suggestion.
This documentation is meant to be useful for future reference for ourselves (and maybe advanced users who need this information) and we should try to keep it up-to-date as the Merger is updated.