Description
Currently RDF's special column rdfentry_, despite the name, does not correspond to the global TChain entry number in MT runs (see also the relevant docs).
This is surprising for users (hence the big warning in the docs linked above) and makes it unnecessarily difficult to e.g. attach a numpy array as an additional column (because it's hard to index into it correctly without stable global row numbers).
We could instead make rdfentry_ always match the "real" (global) entry number in the dataset -- if only each MT task knew the offset of the current tree w.r.t. all other trees in the chain.
Proposed solution
- have TTreeProcessorMT tell each MT task which tree it is processing w.r.t. the global chain (#1, #2, #3, ...)
- have each task calculate its tree's offset by going over a list of tree entry numbers, filling missing values as needed (the list of entry numbers would be implemented as an array of fixed size nTrees with atomic elements. This, plus the fact that threads only need to write into the atomic elements if they see the value has not been calculated yet, should minimize thread contention)
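To make the idea concrete, here is a minimal sketch of the atomic array in standard C++ only. It is not existing ROOT code: the class name TreeOffsetCache, its methods and the getEntries callback (a stand-in for opening tree i and reading its number of entries) are all hypothetical names chosen for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper, not part of ROOT: a fixed-size array of atomics caching
// the number of entries of each tree in the chain.
class TreeOffsetCache {
public:
   static constexpr long long kUnknown = -1;

   explicit TreeOffsetCache(std::size_t nTrees) : fEntries(nTrees)
   {
      for (auto &e : fEntries)
         e.store(kUnknown, std::memory_order_relaxed);
   }

   // Offset of tree `treeIdx`: the sum of the entry counts of all preceding trees.
   // `getEntries(i)` is assumed to return the number of entries of tree i; it is
   // only invoked for trees whose count has not been cached yet.
   long long Offset(std::size_t treeIdx, const std::function<long long(std::size_t)> &getEntries)
   {
      long long offset = 0;
      for (std::size_t i = 0; i < treeIdx; ++i) {
         long long n = fEntries[i].load(std::memory_order_acquire);
         if (n == kUnknown) {
            n = getEntries(i);
            // Racing threads compute the same value, so a plain store suffices.
            fEntries[i].store(n, std::memory_order_release);
         }
         offset += n;
      }
      return offset;
   }

   // Global entry number = offset of the current tree + local entry number.
   long long GlobalEntry(std::size_t treeIdx, long long localEntry,
                         const std::function<long long(std::size_t)> &getEntries)
   {
      return Offset(treeIdx, getEntries) + localEntry;
   }

private:
   std::vector<std::atomic<long long>> fEntries;
};
```

In this sketch two threads may both read the entry count of the same tree if they race on an unknown slot; that duplicates some work but never produces a wrong offset, and once a slot is filled it is only ever read.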
Other solution considered
- we could always build a global TChain, for every task, and always use global entry numbers everywhere. However, this would require that TTreeProcessorMT reads out the number of entries in each tree before the tasks even start, because it first needs to come up with entry ranges for each task. My intuition is that this would bring a larger performance impact than the proposed solution: we know from DistRDF that the (redundant) opening of O(1k) remote files at startup is a significant cost.
- we could do nothing: rdfentry_ would be unstable and it could not be relied upon to e.g. index into manually added "friend columns" or to fill TEntryLists (like this user would have liked to do)
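For comparison, the first alternative above amounts to computing every tree offset up front, which is only possible once the entry count of every tree is known. A minimal sketch in standard C++, where ComputeTreeOffsets is a hypothetical name and the input vector is assumed to have been filled by opening each file in the chain:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of ROOT: given the number of entries of every
// tree in the chain, return the global entry number at which each tree starts.
std::vector<long long> ComputeTreeOffsets(const std::vector<long long> &entriesPerTree)
{
   std::vector<long long> offsets(entriesPerTree.size(), 0);
   for (std::size_t i = 1; i < entriesPerTree.size(); ++i)
      offsets[i] = offsets[i - 1] + entriesPerTree[i - 1];
   return offsets;
}
```

The computation itself is trivial; the cost discussed above lies entirely in obtaining entriesPerTree before any task can start.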