Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ntuple] Add initial in-memory index prototype #15116

Merged
merged 1 commit into from
Sep 23, 2024

Conversation

enirolf
Copy link
Contributor

@enirolf enirolf commented Apr 3, 2024

This PR adds (a first version of) the RNTupleIndex, which is an in-memory structure that maps RNTuple field values (or combinations thereof) to an entry index in the RNTuple for which the index was built. Currently, the index only resides in memory and thus has to be (re)build each time. RNTupleIndex will be used by the RNTupleProcessor to enable dataset joins and will be as transparent as possible to users. Currently, no public interface is foreseen.

At this point, no persistification is foreseen, but this might be added in the future. The implementation of the RNTupleIndex in this PR is hash-based. An implementation that is vector-based (but with the same interface) will also be considered.

The idea is to benchmark and evaluate both implementations (and potentially more). Based on the results we can decide which one to actually use (or alternatively make multiple implementations available if they show clear tradeoffs in different use cases).

@enirolf enirolf requested a review from vepadulano April 3, 2024 08:43
@enirolf enirolf self-assigned this Apr 3, 2024
@phsft-bot
Copy link
Collaborator

Starting build on ROOT-performance-centos8-multicore/soversion, ROOT-ubuntu2204/nortcxxmod, ROOT-ubuntu2004/python3, mac12arm/cxx20, windows10/default
How to customize builds

@phsft-bot
Copy link
Collaborator

Build failed on ROOT-ubuntu2204/nortcxxmod.
Running on root-ubuntu-2204-3.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

@phsft-bot
Copy link
Collaborator

Build failed on mac12arm/cxx20.
Running on 194.12.161.128:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2024-04-03T09:41:19.032Z] FAILED: tree/ntuple/CMakeFiles/ROOTNTuple.dir/v7/src/RField.cxx.o
  • [2024-04-03T09:41:19.641Z] /Users/sftnight/build/workspace/root-pullrequests-build/root/tree/ntuple/v7/src/RField.cxx:1076:47: error: return type of out-of-line definition of 'ROOT::Experimental::RFieldBase::GetIndexRepresentation' differs from that in the declaration

Warnings:

  • [2024-04-03T09:41:19.641Z] /Users/sftnight/build/workspace/root-pullrequests-build/root/tree/ntuple/v7/inc/ROOT/RNTupleIndex.hxx:76:68: warning: unused parameter 'lowerBound' [-Wunused-parameter]
  • [2024-04-03T09:41:20.407Z] /Users/sftnight/build/workspace/root-pullrequests-build/root/tree/ntuple/v7/inc/ROOT/RNTupleIndex.hxx:76:68: warning: unused parameter 'lowerBound' [-Wunused-parameter]

@phsft-bot
Copy link
Collaborator

Build failed on windows10/default.
Running on null:C:\build\workspace\root-pullrequests-build
See console output.

Errors:

  • [2024-04-03T09:31:03.191Z] C:\build\workspace\root-pullrequests-build\root\tree\ntuple\v7\src\RField.cxx(1077,1): error C2556: 'uint64_t ROOT::Experimental::RFieldBase::GetIndexRepresentation(void *)': overloaded function differs only by return type from 'ROOT::Experimental::NTupleIndexValue_t ROOT::Experimental::RFieldBase::GetIndexRepresentation(void *)' [C:\build\workspace\root-pullrequests-build\build\tree\ntuple\ROOTNTuple.vcxproj]
  • [2024-04-03T09:31:03.192Z] C:\build\workspace\root-pullrequests-build\root\tree\ntuple\v7\src\RField.cxx(1076,47): error C2371: 'ROOT::Experimental::RFieldBase::GetIndexRepresentation': redefinition; different basic types [C:\build\workspace\root-pullrequests-build\build\tree\ntuple\ROOTNTuple.vcxproj]

Copy link

github-actions bot commented Apr 3, 2024

Test Results

    14 files      14 suites   3d 13h 36m 59s ⏱️
 2 697 tests  2 697 ✅ 0 💤 0 ❌
35 511 runs  35 511 ✅ 0 💤 0 ❌

Results for commit 68e90b3.

♻️ This comment has been updated with latest results.

@phsft-bot
Copy link
Collaborator

Build failed on ROOT-ubuntu2004/python3.
Running on root-ubuntu-2004-1.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build
See console output.

@enirolf enirolf force-pushed the ntuple-index branch 2 times, most recently from 510337a to 4dc20b9 Compare April 10, 2024 10:07
@enirolf enirolf force-pushed the ntuple-index branch 2 times, most recently from fd64176 to 825de2f Compare May 8, 2024 18:14
@enirolf enirolf changed the title [ntuple] Add indexing prototype [ntuple] Add initial in-memory index prototype Jun 27, 2024
@enirolf enirolf force-pushed the ntuple-index branch 3 times, most recently from 672671b to 68b602a Compare June 28, 2024 07:59
@enirolf enirolf marked this pull request as ready for review June 28, 2024 08:00
tree/ntuple/v7/inc/ROOT/RField.hxx Outdated Show resolved Hide resolved
tree/ntuple/v7/inc/ROOT/RNTupleIndex.hxx Outdated Show resolved Hide resolved
tree/ntuple/v7/inc/ROOT/RNTupleIndex.hxx Outdated Show resolved Hide resolved
tree/ntuple/v7/inc/ROOT/RNTupleIndex.hxx Outdated Show resolved Hide resolved
tree/ntuple/v7/inc/ROOT/RNTupleIndex.hxx Outdated Show resolved Hide resolved
tree/ntuple/v7/inc/ROOT/RNTupleIndex.hxx Outdated Show resolved Hide resolved
tree/ntuple/v7/inc/ROOT/RNTupleIndex.hxx Show resolved Hide resolved
tree/ntuple/v7/src/RNTupleIndex.cxx Outdated Show resolved Hide resolved
tree/ntuple/v7/src/RNTupleIndex.cxx Show resolved Hide resolved
tree/ntuple/v7/src/RNTupleIndex.cxx Outdated Show resolved Hide resolved
Copy link
Contributor

@jblomer jblomer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool new functionality!

Some comments:

  • I think that is a fundamental new functionality that should have a section in the architecture.md
  • Instead of adding the GetHash() member function, perhaps we can use a "hash visitor". That would allow for using only available public interfaces. The RPrintValueVisitor can serve as a blueprint (the actual hash visitor class is probably better placed in RNTupleIndex.[hxx,cxx] or in a separate compilation unit).

@@ -1076,6 +1076,12 @@ void ROOT::Experimental::RFieldBase::AcceptVisitor(Detail::RFieldVisitor &visito
visitor.VisitField(*this);
}

ROOT::Experimental::NTupleIndexValue_t ROOT::Experimental::RFieldBase::GetIndexRepresentation(void * /*from*/)
{
R__ASSERT(false && "indexing is not supported for this field type");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw exception instead?

@@ -35,6 +35,7 @@ ROOT_GENERATE_DICTIONARY(RXTupleDict ${CMAKE_CURRENT_SOURCE_DIR}/RXTuple.hxx
ROOT_ADD_GTEST(ntuple_descriptor ntuple_descriptor.cxx LIBRARIES ROOTNTuple CustomStruct)
ROOT_ADD_GTEST(ntuple_endian ntuple_endian.cxx LIBRARIES ROOTNTuple)
ROOT_ADD_GTEST(ntuple_friends ntuple_friends.cxx LIBRARIES ROOTNTuple CustomStruct)
ROOT_ADD_GTEST(ntuple_index ntuple_index.cxx LIBRARIES ROOTNTuple CustomStruct)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ROOT_ADD_GTEST(ntuple_index ntuple_index.cxx LIBRARIES ROOTNTuple CustomStruct)
ROOT_ADD_GTEST(ntuple_index ntuple_index.cxx LIBRARIES ROOTNTuple)

@@ -542,6 +544,8 @@ protected:
/// Called by `ConnectPageSource()` once connected; derived classes may override this as appropriate
virtual void OnConnectPageSource() {}

virtual std::uint64_t GetHash(const void *from) const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is calculating a hash, I'd suggest

Suggested change
virtual std::uint64_t GetHash(const void *from) const;
virtual std::uint64_t Hash(const void *from) const;

std::vector<void *> ptrs;
ptrs.reserve(fieldValues.size());
for (auto &fieldValue : fieldValues) {
fieldValue.Read(i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good use for bulk reading. Doesn't necessarily need to be deployed in this commit but a TODO could be added.

/////////////////////////////////////////////////////////////////////////////
/// Container for the combined hash of the indexed fields. Uses the implementation from `boost::hash_combine` (see
/// https://www.boost.org/doc/libs/1_55_0/doc/html/hash/reference.html#boost.hash_combine).
struct RIndexValue {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, make it a private sub class of RNTupleIndex because the name RIndexValue may be a little too generic to be directly part of the ROOT namespace.


std::size_t GetNElems() const { return fIndex.size(); }

void Add(const std::vector<void *> &valuePtrs, NTupleSize_t entry);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make private?

///
/// Note that in case multiple entries corresponding to the provided index value exist, the first occurrence is
/// returned. Use RNTupleIndex::GetEntryIndices to get all entries.
NTupleSize_t GetEntryIndex(const std::vector<void *> &valuePtrs) const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, and further down: how about using "entry number" instead of "entry index" to not confuse it with the index itself. Also, I'd suggest for clarity GetFirstEntryNumber() and GetAllEntryNumbers()

Suggested change
NTupleSize_t GetEntryIndex(const std::vector<void *> &valuePtrs) const;
NTupleSize_t GetFirstEntryNumber(const std::vector<void *> &valuePtrs) const;

{
if (sizeof...(Ts) != fFields.size())
throw RException(R__FAIL("number of value pointers must match number of indexed fields"));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may also want to check that the types match, e.g. RField<T>::TypeName() == fFields[i]->GetTypeName()

Perhaps a TODO at this point. We probably need a typeid like check as we don't want a string comparison every time.

#include <ROOT/RNTupleIndex.hxx>

ROOT::Experimental::Internal::RNTupleIndex::RNTupleIndex(std::vector<std::unique_ptr<RFieldBase>> &fields,
RPageSource &pageSource)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need it for the number of entries.

Suggested change
RPageSource &pageSource)
const RPageSource &pageSource)

You can also consider passing just that number.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I agree

Comment on lines 89 to 95
auto fieldOrException = RFieldBase::Create(fieldDesc.GetFieldName(), fieldDesc.GetTypeName());
if (!fieldOrException) {
throw RException(R__FAIL("could not construct field \"" + std::string(fieldName) + "\""));
}
auto field = fieldOrException.Unwrap();
field->SetOnDiskId(fieldDesc.GetId());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better: use RFieldDescriptor::CreateField()

@enirolf enirolf force-pushed the ntuple-index branch 2 times, most recently from 94a517c to 67cbc14 Compare July 2, 2024 11:12
Copy link
Member

@vepadulano vepadulano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the direction of this new feature, it shows a great deal of thinking! I left just some small comments that came up while reading the code.

tree/ntuple/v7/inc/ROOT/RNTupleIndex.hxx Show resolved Hide resolved
Comment on lines 77 to 79
/// \brief Get the entry number containing the given index value.
///
/// \param[in] value The indexed value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is outdated

#include <ROOT/RNTupleIndex.hxx>

ROOT::Experimental::Internal::RNTupleIndex::RNTupleIndex(std::vector<std::unique_ptr<RFieldBase>> &fields,
RPageSource &pageSource)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I agree

Comment on lines 64 to 150
auto secondaryPageSource = RPageSource::Create("secondary", fileGuardSecondary.GetPath());
auto index = ROOT::Experimental::Internal::CreateRNTupleIndex({"event"}, *secondaryPageSource);
auto secondaryNTuple = RNTupleReader::Open("secondary", fileGuardSecondary.GetPath());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern of creating a page source, then the index, then using the RNTupleReader seems a bit odd. For now the API is not the main concern since anyway we'll integrate it in the RDF processing. But something to consider in general

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only for testing. Initially, I added a CreateIndex method to the RNTupleReader, but since we want to use the index only internally (at least for the foreseeable future) I removed it.

if (!fIndex.count(indexValue))
return {};

return fIndex.at(indexValue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that here we're not constructing the returned vector, but rather copying an existing one, we might want to return a const vector<> * instead.
But this depends on 3 factors which I'm not aware of, so I'll let you decide what's best:

  1. how big can this vector get? If it's just a few values, the copy is no problem at all.
  2. how hot will this method be?
  3. if later we will want to change the implementation so that we actually need to construct a vector rather than just referencing it, how big of a problem would that be?

Copy link
Contributor Author

@enirolf enirolf Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the reason to return a const pointer rather than a const ref in this case?

EDIT: I realize it's due to the empty vector case

how big can this vector get? If it's just a few values, the copy is no problem at all.

Very likely it will only contain one element. Maybe in some exotic cases a couple of them (depending on pileup), but that will definitely not be a common case.

how hot will this method be?

Since it's called by GetFirstEntryNumber it will be quite hot (i.e. once every event in the event loop). Possibly GetFirstEntryNumber can be refactored to not use this one.

if later we will want to change the implementation so that we actually need to construct a vector rather than just referencing it, how big of a problem would that be?

I think not a problem at all, based the points you raised I think it makes sense to change it to a reference.

@enirolf enirolf force-pushed the ntuple-index branch 2 times, most recently from d15cfe3 to 2cae07a Compare July 9, 2024 09:32
@hahnjo
Copy link
Member

hahnjo commented Jul 22, 2024

While thinking about collisions and the index storage, one thought that crossed my mind is to template the index based on the index field type(s). I'm not sure if that's a good idea, one of the immediate question being "do we want RNTupleIndex<std::string> and RNTupleIndex<std::uint64_t> to be different types?" But still maybe something to consider, it might simplify the field value storage (if needed) and the entire hashing business...

@enirolf
Copy link
Contributor Author

enirolf commented Aug 13, 2024

While thinking about collisions and the index storage, one thought that crossed my mind is to template the index based on the index field type(s). I'm not sure if that's a good idea, one of the immediate question being "do we want RNTupleIndex<std::string> and RNTupleIndex<std::uint64_t> to be different types?" But still maybe something to consider, it might simplify the field value storage (if needed) and the entire hashing business...

This is how I initially implemented it. Indeed it makes the index itself much more straightforward. However, when I started prototyping the actual join/unaligned friends it really didn't work without making that interface overly complicated so in the end I opted for a non-templated version. Perhaps there's some template trickery to still make it play nice with the foreseen interface (or allow for a slightly different but still simple enough interface), I will think about it for a bit.

@hahnjo
Copy link
Member

hahnjo commented Aug 14, 2024

Not sure if it applies here, but a common "template trickery" is to have a type-punned base class that provides enough functionality to cater for the hot paths during the event loop. We already use this in RNTuple, for example with RFieldBase that can be conveniently passed around if you only need its interface functions.

@enirolf
Copy link
Contributor Author

enirolf commented Sep 10, 2024

After going back and forth between different resolutions, I've decided to for now only allow integral-type fields as index fields (thus dropping floating-point types and strings), and storing their values as std::uint64_t. For now, we don't have any concrete examples where a non-integral field is desired to be used for indexing, so to stay within the scope of "initial prototype", I think this approach should suffice.

While it will solve the potential index collisions, storing the separate values will incur some memory overhead, but preliminary benchmarks indicate it should be manageable (to be followed-up, though).

@enirolf enirolf force-pushed the ntuple-index branch 2 times, most recently from 9c757b8 to f4ae977 Compare September 11, 2024 15:26
Copy link
Contributor

@jblomer jblomer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice and clean! Some comments for consideration before merging.


void ROOT::Experimental::Internal::RNTupleIndex::Build()
{
std::vector<RFieldBase::RValue> fieldValues;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to make this method idempotent, i.e. start with

if (fIsBuilt)
   return;

Comment on lines 82 to 93
if (!field->IsSimple() || field->GetTypeName() == "float" || field->GetTypeName() == "double") {
throw RException(R__FAIL("Cannot use field \"" + field->GetFieldName() + "\" with type \"" +
field->GetTypeName() + "\" for indexing. Only integral types are allowed."));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps better to explicitly enumerate the supported types, e.g. check field->GetTypeName() against a static std::unordered_set<std::string>. Or use a dynamic cast with the recent RIntegralField.

@pcanal
Copy link
Member

pcanal commented Sep 20, 2024

I've decided to for now only allow integral-type fields as index fields (thus dropping floating-point types

Good. Having a floating point 'as is' as the index would be a disaster due to the instability of equality. (One could envision having a floating pointer index only the user provide a transformation from floating point to integer).

This adds (a first version of) the `RNTupleIndex`, which is an
in-memory structure that maps integral RNTuple field values (or
combinations thereof) to an entry index in the RNTuple for which the
index was built. At this point, the index only resides in memory and
thus has to be (re)build each time.

`RNTupleIndex` will be used by the `RNTupleProcessor` to enable
dataset joins and will be as transparent as possible to users.
Currently, no public interface is foreseen.
@enirolf enirolf merged commit 9d986f9 into root-project:master Sep 23, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants