-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ntuple] Add initial in-memory index prototype #15116
Conversation
Starting build on |
Build failed on ROOT-ubuntu2204/nortcxxmod. Failing tests: |
Build failed on mac12arm/cxx20. Errors:
Warnings:
|
Build failed on windows10/default. Errors:
|
Test Results 14 files 14 suites 3d 13h 36m 59s ⏱️ Results for commit 68e90b3. ♻️ This comment has been updated with latest results. |
Build failed on ROOT-ubuntu2004/python3. |
510337a
to
4dc20b9
Compare
fd64176
to
825de2f
Compare
672671b
to
68b602a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool new functionality!
Some comments:
- I think that is a fundamental new functionality that should have a section in the architecture.md
- Instead of adding the
GetHash()
member function, perhaps we can use a "hash visitor". That would allow for using only available public interfaces. TheRPrintValueVisitor
can serve as a blueprint (the actual hash visitor class is probably better placed inRNTupleIndex.[hxx,cxx]
or in a separate compilation unit).
tree/ntuple/v7/src/RField.cxx
Outdated
@@ -1076,6 +1076,12 @@ void ROOT::Experimental::RFieldBase::AcceptVisitor(Detail::RFieldVisitor &visito | |||
visitor.VisitField(*this); | |||
} | |||
|
|||
ROOT::Experimental::NTupleIndexValue_t ROOT::Experimental::RFieldBase::GetIndexRepresentation(void * /*from*/) | |||
{ | |||
R__ASSERT(false && "indexing is not supported for this field type"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throw exception instead?
tree/ntuple/v7/test/CMakeLists.txt
Outdated
@@ -35,6 +35,7 @@ ROOT_GENERATE_DICTIONARY(RXTupleDict ${CMAKE_CURRENT_SOURCE_DIR}/RXTuple.hxx | |||
ROOT_ADD_GTEST(ntuple_descriptor ntuple_descriptor.cxx LIBRARIES ROOTNTuple CustomStruct) | |||
ROOT_ADD_GTEST(ntuple_endian ntuple_endian.cxx LIBRARIES ROOTNTuple) | |||
ROOT_ADD_GTEST(ntuple_friends ntuple_friends.cxx LIBRARIES ROOTNTuple CustomStruct) | |||
ROOT_ADD_GTEST(ntuple_index ntuple_index.cxx LIBRARIES ROOTNTuple CustomStruct) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ROOT_ADD_GTEST(ntuple_index ntuple_index.cxx LIBRARIES ROOTNTuple CustomStruct) | |
ROOT_ADD_GTEST(ntuple_index ntuple_index.cxx LIBRARIES ROOTNTuple) |
tree/ntuple/v7/inc/ROOT/RField.hxx
Outdated
@@ -542,6 +544,8 @@ protected: | |||
/// Called by `ConnectPageSource()` once connected; derived classes may override this as appropriate | |||
virtual void OnConnectPageSource() {} | |||
|
|||
virtual std::uint64_t GetHash(const void *from) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is calculating a hash, I'd suggest
virtual std::uint64_t GetHash(const void *from) const; | |
virtual std::uint64_t Hash(const void *from) const; |
std::vector<void *> ptrs; | ||
ptrs.reserve(fieldValues.size()); | ||
for (auto &fieldValue : fieldValues) { | ||
fieldValue.Read(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good use for bulk reading. Doesn't necessarily need to be deployed in this commit but a TODO
could be added.
///////////////////////////////////////////////////////////////////////////// | ||
/// Container for the combined hash of the indexed fields. Uses the implementation from `boost::hash_combine` (see | ||
/// https://www.boost.org/doc/libs/1_55_0/doc/html/hash/reference.html#boost.hash_combine). | ||
struct RIndexValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible, make it a private sub class of RNTupleIndex
because the name RIndexValue
may be a little too generic to be directly part of the ROOT
namespace.
|
||
std::size_t GetNElems() const { return fIndex.size(); } | ||
|
||
void Add(const std::vector<void *> &valuePtrs, NTupleSize_t entry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make private?
/// | ||
/// Note that in case multiple entries corresponding to the provided index value exist, the first occurrence is | ||
/// returned. Use RNTupleIndex::GetEntryIndices to get all entries. | ||
NTupleSize_t GetEntryIndex(const std::vector<void *> &valuePtrs) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, and further down: how about using "entry number" instead of "entry index" to not confuse it with the index itself. Also, I'd suggest for clarity GetFirstEntryNumber()
and GetAllEntryNumbers()
NTupleSize_t GetEntryIndex(const std::vector<void *> &valuePtrs) const; | |
NTupleSize_t GetFirstEntryNumber(const std::vector<void *> &valuePtrs) const; |
{ | ||
if (sizeof...(Ts) != fFields.size()) | ||
throw RException(R__FAIL("number of value pointers must match number of indexed fields")); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may also want to check that the types match, e.g. RField<T>::TypeName() == fFields[i]->GetTypeName()
Perhaps a TODO at this point. We probably need a typeid like check as we don't want a string comparison every time.
tree/ntuple/v7/src/RNTupleIndex.cxx
Outdated
#include <ROOT/RNTupleIndex.hxx> | ||
|
||
ROOT::Experimental::Internal::RNTupleIndex::RNTupleIndex(std::vector<std::unique_ptr<RFieldBase>> &fields, | ||
RPageSource &pageSource) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only need it for the number of entries.
RPageSource &pageSource) | |
const RPageSource &pageSource) |
You can also consider passing just that number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I agree
tree/ntuple/v7/src/RNTupleIndex.cxx
Outdated
auto fieldOrException = RFieldBase::Create(fieldDesc.GetFieldName(), fieldDesc.GetTypeName()); | ||
if (!fieldOrException) { | ||
throw RException(R__FAIL("could not construct field \"" + std::string(fieldName) + "\"")); | ||
} | ||
auto field = fieldOrException.Unwrap(); | ||
field->SetOnDiskId(fieldDesc.GetId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better: use RFieldDescriptor::CreateField()
94a517c
to
67cbc14
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the direction of this new feature, it shows a great deal of thinking! I left just some small comments that came up while reading the code.
/// \brief Get the entry number containing the given index value. | ||
/// | ||
/// \param[in] value The indexed value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment is outdated
tree/ntuple/v7/src/RNTupleIndex.cxx
Outdated
#include <ROOT/RNTupleIndex.hxx> | ||
|
||
ROOT::Experimental::Internal::RNTupleIndex::RNTupleIndex(std::vector<std::unique_ptr<RFieldBase>> &fields, | ||
RPageSource &pageSource) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I agree
tree/ntuple/v7/test/ntuple_index.cxx
Outdated
auto secondaryPageSource = RPageSource::Create("secondary", fileGuardSecondary.GetPath()); | ||
auto index = ROOT::Experimental::Internal::CreateRNTupleIndex({"event"}, *secondaryPageSource); | ||
auto secondaryNTuple = RNTupleReader::Open("secondary", fileGuardSecondary.GetPath()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern of creating a page source, then the index, then using the RNTupleReader seems a bit odd. For now the API is not the main concern since anyway we'll integrate it in the RDF processing. But something to consider in general
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only for testing. Initially, I added a CreateIndex
method to the RNTupleReader
, but since we want to use the index only internally (at least for the foreseeable future) I removed it.
tree/ntuple/v7/src/RNTupleIndex.cxx
Outdated
if (!fIndex.count(indexValue)) | ||
return {}; | ||
|
||
return fIndex.at(indexValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that here we're not constructing the returned vector, but rather copying an existing one, we might want to return a const vector<> *
instead.
But this depends on 3 factors which I'm not aware of, so I'll let you decide what's best:
- how big can this vector get? If it's just a few values, the copy is no problem at all.
- how hot will this method be?
- if later we will want to change the implementation so that we actually need to construct a vector rather than just referencing it, how big of a problem would that be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the reason to return a const pointer rather than a const ref in this case?
EDIT: I realize it's due to the empty vector case
how big can this vector get? If it's just a few values, the copy is no problem at all.
Very likely it will only contain one element. Maybe in some exotic cases a couple of them (depending on pileup), but that will definitely not be a common case.
how hot will this method be?
Since it's called by GetFirstEntryNumber
it will be quite hot (i.e. once every event in the event loop). Possibly GetFirstEntryNumber
can be refactored to not use this one.
if later we will want to change the implementation so that we actually need to construct a vector rather than just referencing it, how big of a problem would that be?
I think not a problem at all, based the points you raised I think it makes sense to change it to a reference.
d15cfe3
to
2cae07a
Compare
While thinking about collisions and the index storage, one thought that crossed my mind is to template the index based on the index field type(s). I'm not sure if that's a good idea, one of the immediate question being "do we want |
This is how I initially implemented it. Indeed it makes the index itself much more straightforward. However, when I started prototyping the actual join/unaligned friends it really didn't work without making that interface overly complicated so in the end I opted for a non-templated version. Perhaps there's some template trickery to still make it play nice with the foreseen interface (or allow for a slightly different but still simple enough interface), I will think about it for a bit. |
Not sure if it applies here, but a common "template trickery" is to have a type-punned base class that provides enough functionality to cater for the hot paths during the event loop. We already use this in RNTuple, for example with |
2cae07a
to
eeacbc6
Compare
After going back and forth between different resolutions, I've decided to for now only allow integral-type fields as index fields (thus dropping floating-point types and strings), and storing their values as While it will solve the potential index collisions, storing the separate values will incur some memory overhead, but preliminary benchmarks indicate it should be manageable (to be followed-up, though). |
9c757b8
to
f4ae977
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice and clean! Some comments for consideration before merging.
|
||
void ROOT::Experimental::Internal::RNTupleIndex::Build() | ||
{ | ||
std::vector<RFieldBase::RValue> fieldValues; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to make this method idempotent, i.e. start with
if (fIsBuilt)
return;
tree/ntuple/v7/src/RNTupleIndex.cxx
Outdated
if (!field->IsSimple() || field->GetTypeName() == "float" || field->GetTypeName() == "double") { | ||
throw RException(R__FAIL("Cannot use field \"" + field->GetFieldName() + "\" with type \"" + | ||
field->GetTypeName() + "\" for indexing. Only integral types are allowed.")); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps better to explicitly enumerate the supported types, e.g. check field->GetTypeName()
against a static std::unordered_set<std::string>
. Or use a dynamic cast with the recent RIntegralField
.
Good. Having a floating point 'as is' as the index would be a disaster due to the instability of equality. (One could envision having a floating pointer index only the user provide a transformation from floating point to integer). |
This adds (a first version of) the `RNTupleIndex`, which is an in-memory structure that maps integral RNTuple field values (or combinations thereof) to an entry index in the RNTuple for which the index was built. At this point, the index only resides in memory and thus has to be (re)build each time. `RNTupleIndex` will be used by the `RNTupleProcessor` to enable dataset joins and will be as transparent as possible to users. Currently, no public interface is foreseen.
f4ae977
to
68e90b3
Compare
This PR adds (a first version of) the
RNTupleIndex
, which is an in-memory structure that maps RNTuple field values (or combinations thereof) to an entry index in the RNTuple for which the index was built. Currently, the index only resides in memory and thus has to be (re)build each time.RNTupleIndex
will be used by theRNTupleProcessor
to enable dataset joins and will be as transparent as possible to users. Currently, no public interface is foreseen.At this point, no persistification is foreseen, but this might be added in the future. The implementation of the
RNTupleIndex
in this PR is hash-based. An implementation that is vector-based (but with the same interface) will also be considered.The idea is to benchmark and evaluate both implementations (and potentially more). Based on the results we can decide which one to actually use (or alternatively make multiple implementations available if they show clear tradeoffs in different use cases).