Guide to the codebase
This page is to help developers get a sense of where to find things in the Uproot codebase.
The purpose of Uproot is to provide an array-oriented, pure Python way of reading and writing ROOT files. It doesn't address any part of ROOT other than I/O, and it provides basic, low-level I/O. It's not an environment/user interface: if someone wants an immersive experience, they can write packages on top of Uproot, such as uproot-browser. However, we do want to streamline the process of navigating through files with Uproot, to the point of being "not annoying."
Although the style of programming is almost entirely imperative, not array-oriented like Awkward Array, there's a wide range in "depth" of code editing. Some changes would be "surface level," making a function more friendly/ergonomic to use, while others are "deep," manipulating low-level byte streams.
All of the source code is in src/uproot. The tests are (roughly) numbered by PR or issue number, and the version number is controlled by src/uproot/version.py (not by pyproject.toml, even though this is a hatchling-based project). If there is no version.py or it has a placeholder version, the information in this paragraph may be out-of-date. (Please update!)
Within src/uproot, all of the files and directories are for reading except writing, sink, and serialization.py, which are for writing. A few are shared, such as models, _util.py, compression.py, and streamers.py.
Everything is for conventional ROOT I/O, not RNTuple, except for models/RNTuple.py with some shared utilities in compression.py and const.py.
So almost all of the code, per line and per file, is for reading conventional ROOT I/O.
A ROOT file consists of
- the TFile header, which starts at byte 0 of the file and has one of two fixed sizes (one uses 32-bit pointers and the other uses 64-bit pointers);
- the TStreamerInfo, which describes how to deserialize most classes—which byte means what—reading this is optional;
- the root (no pun) TDirectory, which describes where to find named objects, including other TDirectories;
- the named objects themselves, which can each be read separately, but must each be read in their entirety;
- a few non-TDirectory classes (TTree and RNTuple are the only ones I know of) point to data beyond themselves;
- chunks of data associated with a TTree (TBasket) or RNTuple (RBlob);
- the TFree list, which keeps track of which parts of the ROOT file are unoccupied by real data. This can be completely ignored when reading a ROOT file.
None of the objects listed above except the TFile header has a fixed location in the file. To know the byte location of any object, one must find it by following a chain from the TFile header to the root TDirectory to any subdirectory to the object and maybe to a TBasket if the object is a TTree.
RNTuple has its own system of navigation, starting at a ROOT::Experimental::RNTuple instance, which is a conventional ROOT I/O object that can live in a TDirectory like any TTree or histogram, but its headers, footers, column metadata, etc., are all new, custom objects, exposed to the conventional ROOT I/O system as generic RBlobs.
TBaskets and RBlobs can't be (or at least, aren't in practice) stored in a TDirectory.
If multiple objects have the same name in the same TDirectory, they're distinguished by a cycle number. It's common for ROOT to gradually update an object (such as TTree) by writing updates in the same directory with different cycle numbers, keeping only the most recent two. (Uproot updates TTrees in place with a single cycle number.)
Addressing and reading/writing data in a ROOT file is like reading/writing data in RAM, but instead of pointers, we have seek positions. Most conventional ROOT I/O seek positions are 32-bit, but there are modes in which objects can point to other objects with 64-bit seek positions when necessary. A ROOT file can (and often does) have a mixture of 32-bit and 64-bit seek positions.
Also like addressing data in RAM, space has to be allocated and freed when objects are created or deleted (when writing). Deleting an object creates a gap that is not filled by moving everything else in the file (which can be many GB), and new objects should take advantage of this space if they'll fit, rather than always allocating at the end of the file. This is why the file maintains a TFree list, just like malloc and free in libc. This can be ignored while reading, but keep in mind that any part of a ROOT file might be uninitialized junk, just like RAM.
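As a rough illustration of the free-list idea, here is a toy first-fit allocator over byte ranges. The function names and the `(start, stop)` representation are hypothetical; this is neither ROOT's nor Uproot's TFree implementation, only a sketch of why a gap left by a deleted object can be reused.

```python
# Toy first-fit allocator. All names are hypothetical; this is not ROOT's
# or Uproot's TFree code, just the malloc/free analogy made concrete.

def allocate(free_segments, num_bytes):
    """Return (seek_position, updated_free_segments) using first fit.

    free_segments is a sorted list of (start, stop) byte ranges. The last
    one is assumed to be open-ended, because the file can always grow.
    """
    for i, (start, stop) in enumerate(free_segments):
        if stop - start >= num_bytes:
            updated = free_segments[:i]
            if start + num_bytes < stop:
                # keep whatever is left of this segment
                updated.append((start + num_bytes, stop))
            updated.extend(free_segments[i + 1:])
            return start, updated
    raise ValueError("no free segment is large enough")


def release(free_segments, start, stop):
    """Mark [start, stop) as free again, merging touching segments."""
    merged = []
    for seg in sorted(free_segments + [(start, stop)]):
        if merged and seg[0] <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], seg[1]))
        else:
            merged.append(seg)
    return merged


END = float("inf")
free = [(100, END)]                # pretend bytes 0-99 hold the file header
a, free = allocate(free, 50)       # object A
b, free = allocate(free, 30)       # object B, placed right after A
free = release(free, a, a + 50)    # delete A, leaving a 50-byte gap
c, free = allocate(free, 40)       # object C fits in the gap left by A
```

After the last call, `c` has the same seek position that `a` had, and the free list holds the 10 leftover bytes of the gap plus the open-ended segment at the end.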
A TDirectory consists of an array of TKeys, which specify the name, cycle number, title, class name, compressed size, uncompressed size, and seek position of the object. At the seek position, there's another TKey with nearly all the same fields, to characterize the object if you didn't find it from a TDirectory (such as TBaskets and RBlobs). TDirectory and TKey are never compressed; the data they point to may be.
Any C++ class instance that ROOT's reflection understands (i.e. anything compiled by Cling) can be written to the file and read back later. What actually gets written are the data members of the C++ class (public and private) and none of the code. Class definitions change, and a ROOT file may be written with one version of a class (with members x and y, say) and read by a process in which the class has different members (x, y, and z). Thus, each class needs a numerical version (a single, increasing integer), and the ROOT file should have TStreamerInfo records for all the versions of all the classes it contains.
ROOT files don't always have TStreamerInfo records for all the classes they contain. Some very basic classes, defined before TStreamerInfo "dictionary" generation was automated, have TStreamerInfo records that don't seem to match the actual serialization or none at all. Also, the classes needed to read the TStreamerInfo can't be derived from TStreamerInfo itself. (This includes TStreamerInfo, all of the subclasses of TStreamerElement, TList, and TString.) Most often, files lacking TStreamerInfo records that are absolutely necessary for deserializing the objects were produced by hadd. (This comes up repeatedly in issues: there's nothing we can do if we don't have the TStreamerInfo.)
C++ ROOT has a large number of built-in classes. If a ROOT file contains objects of the same class names and versions that were compiled into that version of ROOT, ROOT can use its built-in streamer knowledge. Uproot has a smaller set of built-in streamer knowledge, consisting of histograms and TTrees from the past 10 years (beginning 5 years before the Uproot project started and staying up to date as new ROOT versions come out).
It also sometimes happens that users compile non-release versions of ROOT (from GitHub or nightlies) and the C++ class name-version combinations in these ROOT executables have different TStreamerInfo from the same name-version combinations in released versions of ROOT. Uproot needs to be flexible with the assumptions it makes about how data are serialized. In practice, this means that Uproot makes up to two attempts to read each object: first using its built-in streamer knowledge (so that it doesn't need to read a file's TStreamerInfo) and, if that fails, it reads the file's TStreamerInfo and attempts to read the object again.
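The two-attempt strategy can be sketched in a few lines. Everything below is hypothetical (the function names, the error class, and the toy "streamer" format of name/struct-code pairs); it only shows the control flow of trying built-in knowledge first and paying for the streamer lookup only on failure.

```python
import struct


class DeserializationError(Exception):
    """Hypothetical error raised when bytes don't match the expected layout."""


def read_with_fallback(raw, builtin_reader, load_streamers, make_reader):
    """Try built-in knowledge first; consult the file's streamers on failure."""
    try:
        return builtin_reader(raw), "builtin"
    except DeserializationError:
        streamers = load_streamers()          # only paid for when needed
        return make_reader(streamers)(raw), "streamers"


def builtin_reader(raw):
    # Built-in knowledge: the class has two 32-bit integer members, x and y.
    if len(raw) != 8:
        raise DeserializationError("unexpected size for the built-in layout")
    x, y = struct.unpack(">ii", raw)
    return {"x": x, "y": y}


streamer_reads = []                           # counts how often we had to look


def load_streamers():
    # Stand-in for reading TStreamerInfo: this file's version also has z.
    streamer_reads.append(1)
    return [("x", "i"), ("y", "i"), ("z", "i")]


def make_reader(streamers):
    fmt = ">" + "".join(code for _, code in streamers)
    names = [name for name, _ in streamers]
    return lambda raw: dict(zip(names, struct.unpack(fmt, raw)))


old_obj, how_old = read_with_fallback(
    struct.pack(">ii", 1, 2), builtin_reader, load_streamers, make_reader
)
new_obj, how_new = read_with_fallback(
    struct.pack(">iii", 1, 2, 3), builtin_reader, load_streamers, make_reader
)
```

The first object is read without touching the streamers at all; only the second, whose layout differs from the built-in assumption, triggers the lookup.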
In principle, the serialization format of C++ class instances in TTrees is the same as the serialization format of the same class elsewhere, in a TDirectory, for instance. Some optimizations complicate that story, however.
- Most often, objects in TTrees are "split" into their constituent fields, with one TBranch per field. This is why a TTree's TBranches can contain child TBranches, to keep track of which TBranches came from the same class. Even though this changes how the data are laid out, we like split objects because (1) numerical data can be read much more quickly than if it had been embedded in classes, that would have to be iterated over, in Python, (2) if part of a class is unreadable for some reason, it's likely that the parts a user cares about are in numerical fields, which are readable as separate TBranches, and (3) if a user is only interested in a few members of a class, they don't have to read the other members at all. This last reason was the motivation for splitting in the first place. (RNTuple is based on splitting at all levels, everywhere, like Awkward Arrays.)
- Normally, class instances are preceded by a 4-byte integer specifying the number of serialized bytes and a 2-byte class version. This applies not only to the top-level class named in the TDirectory (such as TH1F), but also its constituent superclasses (such as TH1, TNamed, TObject, ...) and members (such as TAxis, TObjArrays of TFunctions, ...). High bits in the 4-byte integer can specify that the class version will be skipped (saving 2 bytes per nested object), and some TBranches specify that all headers will be skipped (saving 6 bytes per object). We don't know where all of the indicators of whether headers are skipped or not are located, which is the source of a few issues.
- TTree data has an additional mode called "memberwise splitting," which is indicated in the high bits of the 4-byte header. Memberwise splitting is like TBranch splitting but at a smaller level of granularity: instead of all members x of a TTree's classes being in TBranch parent_branch.x and all members y of that class being in TBranch parent_branch.y, a memberwise-split TBranch has all x contiguous for list items within an entry/event followed by all y within the same entry/event. They are not split between TBranches and they are not split between entries (which usually correspond to events in HEP). Uproot has not implemented reading of memberwise-split data, except in one experimental case. We can, however, identify when memberwise splitting is used and raise an error.
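To make the memberwise layout concrete, here is a small sketch with a list of three items in one entry, each item having an integer x and a floating-point y. The byte layouts are simplified (no headers, hypothetical member types); the point is only the ordering of members.

```python
import struct

# One entry/event containing a list of three (x, y) items.
items = [(1, 1.5), (2, 2.5), (3, 3.5)]

# Object-wise: each item's members are adjacent (x1 y1 x2 y2 x3 y3).
objectwise = b"".join(struct.pack(">id", x, y) for x, y in items)

# Member-wise: all x for the entry, then all y (x1 x2 x3 y1 y2 y3).
memberwise = b"".join(struct.pack(">i", x) for x, _ in items) + b"".join(
    struct.pack(">d", y) for _, y in items
)


def read_objectwise(raw, n):
    return [struct.unpack_from(">id", raw, 12 * i) for i in range(n)]


def read_memberwise(raw, n):
    xs = struct.unpack_from(f">{n}i", raw, 0)
    ys = struct.unpack_from(f">{n}d", raw, 4 * n)
    return list(zip(xs, ys))
```

Both layouts hold the same number of bytes and decode to the same items, but a reader has to know which one it is looking at, which is why detecting the memberwise flag matters even when the data can't be read.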
Whole objects—that is, each entire object with all its superclasses and members—addressed in a TDirectory can be compressed. Compression is identified by the compressed size being smaller than its uncompressed size. (Otherwise, we assume that it is not compressed.) In a TTree, compression is only applied at the level of a whole TBasket, which can contain many objects. A compressed object is a sequence of independently compressed blocks, each with a header (compression algorithm, compressed size, uncompressed size, and a checksum in the case of LZ4) and the compressed data. It's a sequence because the compressed data size can be larger than the largest expressible compressed size, which is a 3-byte number.
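The "sequence of blocks" idea can be sketched with a toy framing: a 2-byte algorithm tag followed by 3-byte compressed and uncompressed sizes. The exact field order and byte order used here are assumptions for illustration, not a byte-exact copy of ROOT's block header; what the sketch shows is why a 3-byte size field forces large payloads to be split.

```python
import zlib

MAX_BLOCK = 0xFFFFFF  # largest value a 3-byte size field can express


def write_blocks(data, block_size=MAX_BLOCK):
    """Frame data as independently compressed blocks (toy format)."""
    out = bytearray()
    for start in range(0, len(data), block_size):
        piece = data[start:start + block_size]
        packed = zlib.compress(piece)
        # Toy header: tag, compressed size, uncompressed size.
        # (Incompressible data near MAX_BLOCK could overflow the 3-byte
        # compressed size; a real writer would need to handle that.)
        out += b"ZL"
        out += len(packed).to_bytes(3, "little")
        out += len(piece).to_bytes(3, "little")
        out += packed
    return bytes(out)


def read_blocks(raw):
    """Return the list of decompressed pieces, one per block."""
    pos = 0
    pieces = []
    while pos < len(raw):
        tag = raw[pos:pos + 2]
        compressed_size = int.from_bytes(raw[pos + 2:pos + 5], "little")
        uncompressed_size = int.from_bytes(raw[pos + 5:pos + 8], "little")
        pos += 8
        if tag != b"ZL":
            raise ValueError(f"unrecognized algorithm tag {tag!r}")
        piece = zlib.decompress(raw[pos:pos + compressed_size])
        if len(piece) != uncompressed_size:
            raise ValueError("uncompressed size does not match the header")
        pieces.append(piece)
        pos += compressed_size
    return pieces


data = bytes(range(256)) * 10                  # 2560 bytes
framed = write_blocks(data, block_size=1000)   # small blocks, for the demo
pieces = read_blocks(framed)
```

With a block size of 1000, the 2560-byte payload comes back as three pieces that concatenate to the original data.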
The actual compression algorithm and compression level used may be entirely different from the fCompress specified in the TFile header, the TTree, and the TBranch that the data belongs to. For instance, TStreamerInfo blocks are often ZLIB compressed, even if the TFile specifies LZ4.
As stated above, RNTuple is entirely different. After navigating to the ROOT::Experimental::RNTuple object (also called an "anchor"), a newly designed layout takes over, which has very little in common with the old ROOT I/O (one exception: compressed objects have the same format). This new format has a specification, so many of the problems we have finding information (e.g. about whether headers exist or not) wouldn't even come up. RNTuple is functionally equivalent to an Awkward/Apache Arrow/Feather dataset on disk: fully split into columns, with metadata to find the columns and specify their data types.
Uproot is not only an independent implementation of ROOT I/O, but also Python, rather than C++, so we make some different decisions from ROOT.
First of all, we don't assume that a ROOT file can change while we're reading it and we don't assume that another process can change the file while we're writing it. We assume that users treat ROOT files as fixed artifacts, copying from an input file to an output file if need be, rather than using it as a shared filesystem. Although Uproot has an "update" mode that can add or delete objects from an existing ROOT file, it is not thread-safe: multiple Python threads cannot write to the same file. Also when writing objects to a file, Uproot uses a different allocation strategy than ROOT (always keeps the TFree at the end of the file), but as long as it maintains a correct TFree list, it's compatible.
Uproot does not run Cling or any C++, so class methods are either reimplemented in Python or are not available at all. A C++ user who creates custom classes with custom methods has to load a shared library/DLL to use those methods; there's no equivalent in Uproot. Moreover, Uproot is not a look-alike of C++ ROOT: it implements different methods than ROOT because Python has different needs.
Perhaps the biggest difference is in TTree-reading: ROOT is designed around iterating over TTree data, producing C++ class instances on demand and sometimes reusing a preallocated instance to avoid memory management in the loop, but Uproot is designed around filling arrays—NumPy, Awkward, or Pandas—for other libraries to perform computations on. Some TTrees are so large that the TBranches of interest can't be fully loaded into RAM, and for this case, uproot.iterate loads contiguous sets of entries/events in each loop iteration, but this is an elaboration of the primary access method, which is about eagerly loading data into memory.
Accordingly, normal access methods in ROOT hide the splitting of classes into sub-TBranches, so that the split-level is an optimization detail. Uproot always exposes each TBranch as an individual object to read—in this sense, Uproot is more low-level, since the way that you'd read a split TTree is different from the way you'd read an unsplit TTree.
The equivalent of a C++ class in Uproot is a Model. Model instances generally aren't created with a constructor, but are read directly from a ROOT file (with the Model.read classmethod). Rather than mapping C++ classes onto Python classes directly (mapping C++ members to Python attributes and the C++ class hierarchy onto the Python class hierarchy), Uproot's Models are representations of the C++ class as data:
- C++ member data are in a Model._members dict
- C++ object superclasses are in a Model._bases list
so getting an inherited member from some model means checking the local dict, then recursively searching the members of the Model instances in the Model._bases list. There are Python methods for doing these searches, but C++ data are "held at arm's length" from Python itself.
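The member search described above can be sketched with a toy class. `ToyModel` and its `member` method are stand-ins written for this page, not Uproot's Model class; only the `_members` dict and `_bases` list mirror the description.

```python
class ToyModel:
    """Toy stand-in for a Model: C++ data held as plain Python data."""

    def __init__(self, members, bases=()):
        self._members = dict(members)   # this class's own C++ members
        self._bases = list(bases)       # instances for each C++ superclass

    def member(self, name):
        # Check the local dict first, then search the bases recursively.
        if name in self._members:
            return self._members[name]
        for base in self._bases:
            try:
                return base.member(name)
            except KeyError:
                continue
        raise KeyError(name)


# A histogram-like object whose C++ class inherits from TNamed and TObject.
tobject = ToyModel({"fBits": 0})
tnamed = ToyModel({"fName": "hist", "fTitle": "a histogram"}, bases=[tobject])
th1 = ToyModel({"fEntries": 1000.0}, bases=[tnamed])

name = th1.member("fName")      # found one level up, in the TNamed part
bits = th1.member("fBits")      # found two levels up, in the TObject part
```

Note that `fName` is not a Python attribute of `th1`; it is reachable only through the lookup method, which is what "held at arm's length" means in practice.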
(Historical note: before Uproot 4, C++ classes and Python classes were directly mapped, but it was harder to maintain because "Uproot doing its own work" got mixed with "data acting like ROOT data.")
The Model class name encodes the C++ class name-version pair through classname_encode/classname_decode functions. C++ class names include namespaces (::) and template instantiations (<, >, and ,), which can't be included in a Python class name. These characters, as well as any underscore in the C++ name, are converted into hexadecimal codes surrounded by underscores. The whole name is prefixed with Model_ and suffixed with _v# where # is the class version number, so it's impossible to confuse a C++ class name for a Python Model name, even if the C++ name doesn't use any illegal characters.
Here's an example:
>>> import uproot
>>> cpp_name = "std::sorted_map<int,std::string>"
>>> cpp_version = 6
>>> model_name = uproot.model.classname_encode(cpp_name, cpp_version)
>>> model_name
'Model_std_3a3a_sorted_5f_map_3c_int_2c_std_3a3a_string_3e__v6'
>>> uproot.model.classname_decode(model_name)
('std::sorted_map<int,std::string>', 6)
You'll see a lot of _3a3a_ (for ::) and _3c_ ... _3e_ (for < ... >) in Model class names. Note that translating the underscores into _5f_ (between sorted and map) ensures that the transformation is always reversible, and it's not possible to confuse any _v# suffixes that users put at the ends of their class names with ours.
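For readers who want to see why the encoding is reversible, here is a toy re-implementation of the scheme as described. `toy_encode` and `toy_decode` are written for this page and are not Uproot's code; they assume that each run of non-alphanumeric characters becomes one underscore-delimited hexadecimal group.

```python
import re


def toy_encode(cpp_name, version=None):
    """Encode a C++ class name (and optional version) as an identifier."""
    body = re.sub(
        r"[^a-zA-Z0-9]+",
        lambda m: "_" + m.group().encode().hex() + "_",
        cpp_name,
    )
    suffix = "" if version is None else f"_v{version}"
    return "Model_" + body + suffix


def toy_decode(model_name):
    """Invert toy_encode, returning (cpp_name, version or None)."""
    if not model_name.startswith("Model_"):
        raise ValueError("not an encoded Model name")
    body = model_name[len("Model_"):]
    version = None
    m = re.search(r"_v([0-9]+)$", body)
    # Underscores in the body always come in pairs (one hex group each), so
    # a trailing _v# is our suffix only if an even number precede it.
    if m is not None and body[: m.start()].count("_") % 2 == 0:
        version = int(m.group(1))
        body = body[: m.start()]
    cpp_name = re.sub(
        r"_((?:[0-9a-f]{2})+)_",
        lambda m: bytes.fromhex(m.group(1)).decode(),
        body,
    )
    return cpp_name, version


encoded = toy_encode("std::sorted_map<int,std::string>", 6)
decoded = toy_decode(encoded)
tricky = toy_decode(toy_encode("Foo_v3"))   # user's own _v3, no version
```

Because every underscore in the C++ name is itself encoded, the only bare `_v#` that can follow a balanced set of underscores is the one the encoder appended.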
Models for most C++ classes that exist are generated on the fly. When a deserialization routine encounters a class that isn't in the global uproot.classes dict or the relevant file's ReadOnlyFile._custom_classes dict, Uproot reads the file's ReadOnlyFile.streamers (if it hasn't already) and uses the TStreamerInfo to generate Python code for the class and evaluate it. Then it has a new Model to Model.read the object. The Model class definitions are put into a Python module called uproot.dynamic, which is empty when Uproot is first imported. (It's not necessary for dynamically generated classes to be in a module in Python; this is for possible user convenience.)
Models can be versionless (no _v# suffix in the name) or versioned; all dynamically generated Models are versioned. Models for the most commonly used classes (histograms and TTrees), for ROOT classes that don't seem to agree with their TStreamerInfo or don't have TStreamerInfo (basic classes like TObject, TRef, TDatime, ...), and for classes needed to read TStreamerInfo itself (TStreamerInfo, all the TStreamerElement subclasses, and TList) are predefined in the uproot.models module. Each of these classes is a submodule containing the built-in class. Since they are hand-written, many of them are versionless. If there are any version-specific differences in deserialization, they may be handled with if-then-else clauses in their read classmethod or read_members method. (See, for instance, this version-dependent branch in Model_TStreamerInfo.)
The uproot.classes dict is the global collection of Model classes built so far. This dict maps C++ class names (strings) to versionless Model class definitions or to a DispatchByVersion class object (see src/uproot/model.py), which keeps a known_versions dict to map version numbers (integers) to versioned Model class definitions. Some of these DispatchByVersion class objects have been built by hand, such as those in src/uproot/models/TAtt.py.
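The two kinds of entries in the class registry can be sketched as follows. All names here are toys written for this page (including `class_of_version` and `resolve`), and the dispatcher is simplified to a plain instance; the sketch only shows how a lookup by C++ name and version might branch on the kind of entry it finds.

```python
class ToyDispatchByVersion:
    """Toy dispatcher: maps version numbers to versioned classes."""

    def __init__(self):
        self.known_versions = {}

    def class_of_version(self, version):
        return self.known_versions.get(version)


class Toy_TList:        # versionless: one class handles every version
    pass


class Toy_TH1F_v2:      # versioned: one class per C++ class version
    pass


class Toy_TH1F_v3:
    pass


toy_classes = {"TList": Toy_TList, "TH1F": ToyDispatchByVersion()}
toy_classes["TH1F"].known_versions.update({2: Toy_TH1F_v2, 3: Toy_TH1F_v3})


def resolve(cpp_name, version):
    """Return the class to read with, or None if one must be generated."""
    entry = toy_classes.get(cpp_name)
    if entry is None:
        return None
    if isinstance(entry, ToyDispatchByVersion):
        return entry.class_of_version(version)
    return entry        # versionless entries ignore the version number
```

A `None` result corresponds to the situation in which a new class would have to be generated from the file's TStreamerInfo.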
Sometimes, we can't actually build a Model class object, for a variety of reasons. At the very least, we build a Model whose name starts with Unknown_ (rather than Model_) and put it in the uproot.unknown_classes dict. In some cases, these unknown objects can be skipped over, allowing subsequent data to be deserialized. In other cases, they can't, and deserializing an unknown Model instance raises an error.
Very few Models can be serialized, mostly just those that support Uproot's writing of histograms and TTrees. All of the serializable Models have hand-written serialization methods—generic serialization from TStreamerInfo has not been implemented.
However, those that can be serialized can be converted into PyROOT objects, and all PyROOT objects can be converted into Uproot Models (if the Model class is in uproot.classes), using the functions in src/uproot/pyroot.py. These translations go through ROOT's TMessage system (serializing and deserializing in memory).
C++ classes are loaded into Python Models without any of their class methods. For classes that we do not recognize—because the set of classes in ROOT is vast or because the set of classes users can define is infinite—there is no way to get the functionality of the C++ method other than to write it in Python.
Some classes are important enough to Uproot's functionality or to HEP data analysis that they do have hand-written methods. These can't be Model methods, at least not for dynamically generated Models, so they are defined as mix-in classes: classes without constructors or non-transient data attributes that exist only to be superclasses of Models. They provide methods only, which can be "mixed in" to the Model class that manages the non-transient data. (I'm specifying "non-transient data" because some methods add hidden attributes to the class to cache expensive operations, but caches can be dropped and regenerated without loss of functionality.)
Behaviors are defined in the src/uproot/behaviors module. These mix-ins get associated with new Model class objects automatically when the Model is first defined. This association (implemented in src/uproot/behavior.py) is by name: each submodule in src/uproot/behaviors is named for the corresponding C++ class without template parameters, and there is a Python class in this submodule with the same name, which ultimately gets attached to the new Model.
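A minimal sketch of association by name, assuming a plain dict in place of the behaviors submodules. The class and function names below are hypothetical, and the member names (`fX`, `fY`, `fVal`) are only illustrative; the point is that the mix-in is chosen from the C++ class name with template parameters removed, and is attached as a superclass when the Model class is created.

```python
class ToyModel:
    """Toy Model: owns the non-transient data."""

    def __init__(self, members):
        self._members = dict(members)

    def member(self, name):
        return self._members[name]


class TGraph:
    """Behavior mix-in: methods only, no constructor, no data."""

    def values(self):
        return self.member("fX"), self.member("fY")


class TParameter:
    """Behavior mix-in shared by every TParameter<...> instantiation."""

    @property
    def value(self):
        return self.member("fVal")


toy_behaviors = {"TGraph": TGraph, "TParameter": TParameter}


def behavior_name(cpp_name):
    # Strip template parameters: "TParameter<double>" -> "TParameter".
    return cpp_name.split("<", 1)[0]


def make_model_class(cpp_name, version):
    mixin = toy_behaviors.get(behavior_name(cpp_name))
    bases = (ToyModel,) if mixin is None else (mixin, ToyModel)
    return type(f"ToyModel_{cpp_name}_v{version}", bases, {})


GraphModel = make_model_class("TGraph", 4)
ParamModel = make_model_class("TParameter<double>", 2)
OtherModel = make_model_class("TSomethingElse", 1)

graph = GraphModel({"fX": [1.0, 2.0], "fY": [3.0, 4.0]})
param = ParamModel({"fVal": 3.14})
```

Classes with no matching behavior, like `OtherModel` here, still work as data containers; they just have no extra methods.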
See TGraph.py for a typical example (written by an Uproot user) and TParameter.py for an example with C++ templates.
The largest set of behaviors is in TBranch.py, since this implements all TTree-handling (described in its own section, below).
Three classes representing ROOT data are not Models: ReadOnlyFile, ReadOnlyDirectory, and ReadOnlyKey (all defined in src/uproot/reading.py).
The entry point for most Uproot users, uproot.open, returns a ReadOnlyDirectory (the root directory) containing a ReadOnlyFile, which they can access through ReadOnlyDirectory.file (for access to streamers and other file-level things).
ReadOnlyKeys are generally invisible to users. ReadOnlyKeys are the objects that actually fetch data and invoke the decompression, if necessary, by calling decompress from compression.py.
ReadOnlyFile, ReadOnlyDirectory, ReadOnlyKey, TTree, and RNTuple are the only classes that can be used to read more of the file. They all have context managers so that
with uproot.open("some_file.root") as file:
    # do something
and even
with uproot.open("some_file.root")["deeply/nested/object"] as obj:
    # do something
close the file when Python exits the with statement. Each object that gets produced by reading from the file points back to the _parent that initiated the reading, all the way back to the ReadOnlyFile, so if it goes out of a with block scope, it can close the file through a chain of __exit__ calls.
Objects that can't be used to read more data from the file, such as histograms (anything but the five mentioned above), detach themselves from the ReadOnlyFile so that they are no longer "live" objects. They can be pickled and unpickled without having access to the original ROOT file, for instance. This is why ReadOnlyFile has a DetachedFile counterpart, to hold the file's information without the handle to read more data.
Uproot can read local files, files accessed through HTTP, and files accessed through XRootD. (Someday soon, any fsspec-enabled source may be added to the list, and the fsspec implementation may replace the direct HTTP and XRootD. See #692.) These backends are implemented through Sources in the src/uproot/source module.
Each Source has the same two interfaces for getting bytes from a file: if Uproot needs all the bytes between two seek locations, Uproot asks for Source.chunk(start, stop) and gets a Chunk back. A Chunk is potentially delayed data, like a future/promise, that waits for data when its get method is called (perhaps immediately). The get method returns a Python bytes object or a NumPy array.
The second interface is Source.chunks(ranges, notifications), which asks the backend for a set of data ranges as a list of (start, stop) pairs and provides a notification Queue which will receive events as data arrive. This is for TTree and RNTuple, which know up-front about all of the TBaskets/RBlobs a user wants to read. If a server permits it, all of these ranges are requested in one network round trip (a "multi-part GET" for HTTP and a "vector read" for XRootD), to avoid wasting latency with many small requests. The responses may come back in any order, so while the Source.chunks method immediately returns Chunk objects (which are futures/promises), the Chunks get filled asynchronously. The TTree code (and hopefully RNTuple) uses notifications to process the first to be filled first, rather than waiting on an unfilled Chunk while other Chunks are already filled.
If a server does not accept multi-part/vector requests, HTTP and XRootD have fallback Sources that keep the server busy with a number of worker threads, by default 10. Either way, the Chunks are filled asynchronously and notifications is a good way to process data in the order that they are received.
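The pattern of "return futures immediately, then consume a notification queue" can be sketched with the standard library. `ToyChunk` and `toy_chunks` are hypothetical stand-ins, with sleeping threads in place of network requests; they are not Uproot's Source or Chunk classes.

```python
import queue
import threading
import time


class ToyChunk:
    """Toy future/promise for the bytes between two seek positions."""

    def __init__(self, start, stop):
        self.start, self.stop = start, stop
        self._ready = threading.Event()
        self._data = None

    def fill(self, data):
        self._data = data
        self._ready.set()

    def get(self):
        self._ready.wait()      # blocks only if the data haven't arrived
        return self._data


def toy_chunks(blob, ranges, notifications, delays):
    """Return ToyChunks right away; fill them later on background threads."""
    chunks = [ToyChunk(start, stop) for start, stop in ranges]

    def worker(chunk, delay):
        time.sleep(delay)                       # pretend network latency
        chunk.fill(blob[chunk.start:chunk.stop])
        notifications.put(chunk)                # announce that it's ready

    for chunk, delay in zip(chunks, delays):
        threading.Thread(target=worker, args=(chunk, delay), daemon=True).start()
    return chunks


blob = bytes(range(100))
ranges = [(0, 10), (40, 50), (90, 100)]
notifications = queue.Queue()
chunks = toy_chunks(blob, ranges, notifications, delays=[0.05, 0.0, 0.02])

arrived = []
for _ in ranges:
    chunk = notifications.get()                 # whichever is filled first
    arrived.append(((chunk.start, chunk.stop), chunk.get()))
```

The consumer never blocks on a slow Chunk while a fast one is already available; the order of `arrived` follows arrival, not the order of the request.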
Because of the multithreaded backends, the "file object" that with opens and closes can also launch and shut down threads. That makes the with context manager doubly important: unclosed files are also unused threads.
The first step in data-reading/interpretation is the Source, which either provides one Chunk or a set of Chunks. Each Chunk is asynchronously filled on a background thread. The ReadOnlyFile object starts with a Chunk for the beginning of the file attached, and if this Chunk is big enough, some desired data (TStreamerInfo? the root TDirectory?) may already be in it, so all requests for a Chunk go through the ReadOnlyFile.chunk method.
The only class that accesses a Chunk directly is a Cursor. Apart from its asynchronously loaded data, the Chunk is stateless—a Cursor represents "where we are in the file." It has methods for skipping bytes, interpreting a field, a ROOT-style string, an array, special types like Float16 and Double32, etc. Cursors are frequently copied, to keep many "fingers" or "pins" at various points in the file, to be able to pick up reading where we left off. (This is part of the reason that everything would break if another process changes the file while Uproot is reading it.)
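The split between stateless data and a movable position can be sketched like this. `ToyCursor` is written for this page and takes a plain bytes object where the real thing takes a Chunk; the string format (a 1-byte length, with 255 signaling that a 4-byte length follows) is stated from memory of ROOT's string serialization and should be treated as an assumption.

```python
import struct


class ToyCursor:
    """Toy cursor: the only state is an index into some bytes."""

    def __init__(self, index):
        self.index = index

    def copy(self):
        return ToyCursor(self.index)

    def skip(self, num_bytes):
        self.index += num_bytes

    def field(self, data, fmt):
        # Interpret one fixed-size field and advance past it.
        s = struct.Struct(fmt)
        (value,) = s.unpack_from(data, self.index)
        self.index += s.size
        return value

    def string(self, data):
        # Assumed format: 1-byte length, or 255 followed by a 4-byte length.
        length = data[self.index]
        self.index += 1
        if length == 255:
            length = self.field(data, ">i")
        raw = data[self.index:self.index + length]
        self.index += length
        return raw.decode()


data = struct.pack(">i", 42) + bytes([5]) + b"hello" + struct.pack(">d", 2.5)

cursor = ToyCursor(0)
number = cursor.field(data, ">i")
bookmark = cursor.copy()            # a "pin" at the start of the string
text = cursor.string(data)
real = cursor.field(data, ">d")
text_again = bookmark.string(data)  # pick up reading from the pin
```

Copying the cursor is cheap because the data are never copied; both cursors read from the same immutable bytes.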
Cursors are directly accessed by the code that constructs a Model instance. This starts in the classmethod named read, which creates the Model instance and calls methods numbytes_version and read_members on it. Most Model subclasses only overload read_members (though some take control of numbytes_version to disable it, or the whole read process).
If part of a serialized object involves different classes for different instances (i.e. the class has a polymorphic pointer as a member), it goes through the read_object_any function of src/uproot/deserialization.py. This function might find that it's a nullptr (Python None), a new class (might trigger the creation of a new Model class), a new instance, or a previously-read instance, which makes it possible to construct cycles in the object graph. The Cursor carries a refs dict to keep track of previously-referenced data, which ROOT indicates with integer seek points.
When Model.read is done, a new Model instance has been created. This is the pathway that objects outside of TTrees and RNTuples (such as histograms) take.
TTrees are a bit like TDirectories, at least in the sense that they can initiate more reading; the Model_TTree_v# instance contains metadata describing data names and types, and its behaviors (mixed in methods) seek elsewhere in the file to read TBasket data and reformat them into arrays (NumPy, Awkward, or Pandas).
A TTree's data members include a TObjArray of TBranch objects, which contain most of the information: one TBranch for each "column" of data. (To make a table, the lengths of all the TBranches must be, and are, identical.) TBranches can contain sub-TBranches, but Uproot doesn't do much with this structure: the look-up functions are by default recursive (like the look-up functions in TDirectory). Most interior TBranches don't have any TBasket data anyway (exception: data with type TClonesArray).
TBranches also have a TObjArray of TLeaf instances, and true to their name, a TLeaf does not have any child TLeaves. But that's where the tree-branch-leaf analogy stops. TBranch is primarily a pointer to data elsewhere in the file—it knows the seek locations of all of the TBaskets where the data are actually stored—and TLeaf primarily describes the data type. There are different TLeaf subclasses for all of the basic data types (e.g. TLeafI for 32-bit integer, TLeafD for 64-bit float), but information about more complex data types is stored elsewhere, such as the TStreamerInfo.
Interior TBranches typically don't have any TLeaves and the "leaf" TBranches (sorry for the overloaded term!) typically have exactly one TLeaf. When a TBranch actually has more than one TLeaf, it's called a leaf-list and the data type it describes is an unsplit C struct. These aren't very common because these data can be and usually are written in "split" mode (one TBranch per struct member) and more complex data structures (C++ class with inheritance or any nestedness, particularly with variable-length arrays or std::vector) defer to TStreamerInfo anyway.
(All of this is historical: more complex cases were added over the years, and they didn't use the preexisting mechanisms.)
Uproot's approach is to read a whole Model_TTree_v# object (the metadata), including all of its TBranches and TLeaves (skipping some of the metadata that we don't use, for speed). TTree mix-in behaviors define such methods as TBranch.array, HasBranches.arrays, and HasBranches.iterate, where HasBranches is a mix-in for any Models that can contain TBranches (TTree and TBranch). These methods identify seek points in the file corresponding to the requested TBaskets, request them from the Source with the chunks (plural) method, and interpret them as they come in, filling the arrays that are returned to the user. Interpretation is a major system in itself (next section).
All of these methods are defined in src/uproot/behaviors/TBranch.py, which is by far the largest mix-in behavior (a little over 3000 lines). Mix-ins for src/uproot/behaviors/TTree.py and src/uproot/behaviors/TBranchElement.py are minimal; they don't have much beyond the TBranch itself. A few top-level functions, uproot.iterate and uproot.concatenate, are also defined with the TBranch behaviors, since they're just multi-file versions of the HasBranches.iterate and HasBranches.arrays methods. Arguably, uproot.dask is also such a function, but it's defined in src/uproot/_dask.py.
A lot of what the TBranch.array and HasBranches.arrays/HasBranches.iterate functions do is regularize their arguments, bringing in user-provided arguments in a variety of formats and converting them into a common format for central processing. Once converted, these methods prepare a list of ranges_or_baskets (ranges are (start, stop) seek positions and baskets are already-read TBaskets) for the monster _ranges_or_baskets_to_arrays function. The core of this function is a loop that consumes the notifications queue (remember that from the Source.chunks interface?), converting a Chunk into a TBasket as Chunks arrive, and converting a TBasket into an intermediate array when that's ready. The intermediate arrays are then concatenated and finalized and returned to the user. Each step in the process is a locally defined function.
It has this asynchronous structure because it can run in a parallel-processing mode (when decompression_executor and/or interpretation_executor are not None). The downside is that if anything goes wrong (it hasn't in a while), it could hang here, spinning in the notification queue handler, waiting for notifications that will never come.
The HasBranches.arrays method differs from TBranch.array in that it reads multiple TBranches, selecting them by filter_name or expressions. The expressions are strings interpreted as Python code, using the language module defined in src/uproot/language/python.py. The idea was that more languages could be added, such as the TTree::Draw syntax that physicists are so familiar with, but that hasn't happened, at least not yet. (The formulate package isn't in a good state yet.)
The HasBranches.iterate method differs from HasBranches.arrays in that it returns an iterator over sequential, non-overlapping entry ranges (e.g. 0‒1000, 1000‒2000, 2000‒3000, ...). The entry-boundaries between TBaskets might differ from the user-specified step_size, so one iteration step might need the first half of a TBasket and the next iteration step needs the second half of that same TBasket. HasBranches.iterate maintains a ranges_or_baskets list with the (start, stop) ranges of the TBaskets it needs and the actual TBasket objects that have already been read by the previous iteration. (That's why it has that interface.)
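The mismatch between step boundaries and TBasket boundaries is easy to see with a little arithmetic. The functions and the list of basket entry offsets below are hypothetical; they only compute which baskets overlap each iteration step.

```python
def baskets_for_range(basket_entry_offsets, entry_start, entry_stop):
    """Indices of baskets overlapping the entries [entry_start, entry_stop)."""
    needed = []
    for i in range(len(basket_entry_offsets) - 1):
        first, last = basket_entry_offsets[i], basket_entry_offsets[i + 1]
        if first < entry_stop and last > entry_start:
            needed.append(i)
    return needed


def plan_iteration(basket_entry_offsets, step_size):
    """List ((entry_start, entry_stop), basket indices) for each step."""
    num_entries = basket_entry_offsets[-1]
    plan = []
    for start in range(0, num_entries, step_size):
        stop = min(start + step_size, num_entries)
        plan.append(
            ((start, stop), baskets_for_range(basket_entry_offsets, start, stop))
        )
    return plan


# Four baskets covering entries 0-700, 700-1400, 1400-2100, and 2100-2500.
offsets = [0, 700, 1400, 2100, 2500]
plan = plan_iteration(offsets, step_size=1000)
```

In this plan, basket 1 is needed by both the first and second steps, and basket 2 by both the second and third, which is why it pays to carry already-read TBaskets from one step to the next instead of reading them twice.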
TBaskets are chunks of data, which we interpret as a part of an array. `HasBranches.iterate` returns a different entry range in each iteration step, but all three array-fetching functions have `entry_start` and `entry_stop` arguments that can lead to reading fewer TBaskets. A TBasket is the smallest unit of TTree data that can be read.
The byte-for-byte reading (Model) and methods (Behavior) of a TBasket are both defined in src/uproot/models/TBasket.py. Since this model is versionless (version-dependent deserialization, if there were any, would be handled by `if` statements) and not usually accessed by users, the model/behavior distinction is not important, so we just make it a Model. The data interpreted by this Model (in `read_members`) includes its TKey, so that we can jump over the name, title, and class name. (Strings are slower to deserialize and unnecessary for this object.)
If the TBranch has a simple data type, such as a single TLeafI (32-bit integers) or TLeafD (64-bit floats), then all of the TBasket data after the header is just an array (possibly compressed). The fast-reading that Uproot was designed for comes from not actually iterating over all of those numbers, but instead just casting them as an array. Even with a big-endian → little-endian byte-swap (performed by NumPy), this is where most of the time is saved.
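As a minimal illustration of casting rather than iterating (assuming NumPy is available; the byte string here is made up, not read from a real TBasket):

```python
import struct

import numpy as np

# Six big-endian 32-bit integers, laid out the way they would sit on disk.
raw = struct.pack(">6i", 1, -2, 3, 40, 500, -6000)

# Reinterpret the bytes as an array without looping over values in Python.
big_endian = np.frombuffer(raw, dtype=">i4")

# Converting to the native integer type performs the byte-swap on
# little-endian machines; NumPy does this in compiled code.
native = big_endian.astype(np.int32)
```

The `frombuffer` call does not copy the data; only the final conversion to native byte order allocates a new array.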
If the TBranch has a ragged array data type, the TBasket data is just an array of content followed by an array of offsets (possibly compressed), which are the 1-dimensional buffers passed to an Awkward `ListOffsetArray(NumpyArray)`. There are a few complicated details, like the fact that ROOT's offsets are byte offsets, starting at the beginning of the TKey header, but that just means that the `fKeylen` has to be subtracted and the offsets divided by the content `itemsize`. Also, the ROOT offsets are missing the last value, which is stored in a header value called `fLast`, and there's an extra 4 bytes before and after the offsets. All of these things can be managed in NumPy.
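The offset arithmetic can be sketched as follows. The numbers are invented, `entry_offsets` is a hypothetical helper, the sketch assumes `fLast` is measured from the same origin as the other offsets, and it leaves out the extra 4 bytes surrounding the offsets.

```python
import numpy as np


def entry_offsets(root_byte_offsets, fKeylen, fLast, itemsize):
    """Convert ROOT-style byte offsets into element offsets.

    Appends the missing final value (fLast), shifts the origin from the
    start of the TKey header to the start of the content, and converts
    bytes to items.
    """
    complete = np.append(root_byte_offsets, fLast)
    return (complete - fKeylen) // itemsize


# Hypothetical basket: three entries [1.1, 2.2, 3.3], [], [4.4, 5.5]
# stored as big-endian 64-bit floats after an 80-byte TKey header.
fKeylen = 80
content = np.array([1.1, 2.2, 3.3, 4.4, 5.5], dtype=">f8")
root_byte_offsets = np.array([80, 104, 104], dtype=">i4")
fLast = 120

offsets = entry_offsets(root_byte_offsets, fKeylen, fLast, content.itemsize)
entries = [
    content[start:stop].tolist()
    for start, stop in zip(offsets[:-1], offsets[1:])
]
```

The resulting `offsets` and `content` are the two flat buffers that a list-offset array needs; the Python list comprehension is only there to make the result easy to inspect.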
The most significant slowdown for ragged arrays comes when they're `std::vector` in C++, rather than variable-length arrays (constructed as `"branch_name[counter_name]"` in ROOT). The `std::vector` ragged array has a 10-byte header at the beginning of each entry in the content array. This is managed with some fancy NumPy (see the `else` clause after `self._header_bytes == 0` in src/uproot/interpretation/jagged.py), which accounts for a factor-of-several slow-down, but it's still often less expensive than the decompression step, and therefore invisible.
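One way to remove per-entry headers without a Python loop is a boolean mask, sketched below. This is not the code in jagged.py; `strip_entry_headers` is a hypothetical helper, the header bytes are treated as opaque filler, and every entry (including empty ones) is assumed to carry a header.

```python
import numpy as np


def strip_entry_headers(data, byte_offsets, header_bytes):
    """Drop the first `header_bytes` of every entry in a flat byte array."""
    starts = byte_offsets[:-1]
    keep = np.ones(len(data), dtype=bool)
    # Positions of all header bytes: each start plus 0..header_bytes-1.
    header_positions = (starts[:, np.newaxis] + np.arange(header_bytes)).ravel()
    keep[header_positions] = False
    # Offset i loses one header for each of the i entries before it.
    new_offsets = byte_offsets - header_bytes * np.arange(len(byte_offsets))
    return data[keep], new_offsets


# Three entries of big-endian int32 values, each preceded by 10 filler bytes.
payloads = [[7, 8], [], [9]]
header = b"\xaa" * 10
raw = b"".join(header + np.array(p, dtype=">i4").tobytes() for p in payloads)
data = np.frombuffer(raw, dtype=np.uint8)
byte_offsets = np.array([0, 18, 28, 42])

content_bytes, content_byte_offsets = strip_entry_headers(data, byte_offsets, 10)
content = content_bytes.view(">i4")
offsets = content_byte_offsets // content.itemsize
```

The masking step copies the content once, which is the kind of extra work that makes this case slower than the header-free one.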
The worst case is if the TBranch has an arbitrary data type. Then the data in the TBasket has content and offsets, but the content is generic data, like objects in TDirectories. In fact, it is even possible to put histograms inside of a TTree (and Uproot can read them). These generally follow the rules set by TStreamerInfo, but with exceptions like sometimes-missing headers or memberwise splitting. Data like these are on the rough edge of what Uproot can support.
Before Uproot 5, these complex data structures were read with Python, using the `Model.read` and `Model.read_members` methods of auto-generated classes. Now much of this Python code is supplanted with AwkwardForth, a much faster interpreter, optimized for making Awkward Arrays (see below). The original code still exists as a fall-back for cases that couldn't be implemented in Forth, which adds complexity. But the benefit is that these data can now be read hundreds of times faster. This includes cases like `std::vector<std::vector<T>>`, which doesn't count as a simple ragged array because of the double-nesting: this relatively common data type is now 400× faster to read.
The actual conversion of bytes in TBaskets into arrays is handled by the Interpretations (next section).
Almost all of the TBaskets that a TTree wants to read are free-floating objects at random locations in the ROOT file. The one exception is that if the writing process is ended before it finishes its last TBasket, that TBasket will be found, uncompressed, inside the TBranch, among the TTree metadata. In src/uproot/behaviors/TBranch.py and src/uproot/models/TBasket.py, these are called "embedded" TBaskets. In earlier versions of Uproot (3 and before), this was called "recovery mode."
An Interpretation is an object that translates TBasket content bytes into arrays. All of the Interpretations are defined in the src/uproot/interpretation directory, along with the abstract superclass in `src/uproot/interpretation/__init__.py`, which explains the general procedure in its docstrings.
The complete set of Interpretation types is:
- `AsDtype` and its subclasses in src/uproot/interpretation/numerical.py. The basic `AsDtype` consists of purely numerical data, like 32-bit integers or 64-bit floats. There's also an `AsDtypeInPlace` to fill a user-supplied array, rather than allocating a new one (just an optimization), `AsFloat16` and `AsDouble32` for ROOT's compact floating point types, and `AsSTLBits` for `std::bitset<N>` objects. That last one still isn't implemented (there was an implementation in Uproot 3!), but no one has complained.
- `AsJagged` (src/uproot/interpretation/jagged.py) for ragged arrays. It takes an `AsDtype` as a parameter, but `AsJagged` can't contain arbitrary Interpretations, the way that `ListArray` can contain arbitrary `Content` types in Awkward Array. An `AsJagged` Interpretation is also parameterized by a number of `header_bytes`, as described in the previous section.
- `AsStrings` (src/uproot/interpretation/strings.py) for string data. ROOT strings consist of a 1-byte length followed by the data or, if the data are 255 bytes or longer, a 1-byte `0xff` followed by a 4-byte length and then the data, so they need to be handled differently from other ragged data. (There's a variant of string data that always has a 4-byte length.) This Interpretation has been accelerated by AwkwardForth because there is a benefit in doing so, unlike `AsJagged` (even with `header_bytes != 0`).
- `AsObjects` (src/uproot/interpretation/objects.py) for arbitrary, generic data. In the most general case, `AsObjects` uses Python or AwkwardForth code to walk through the bytes, creating Python objects or Awkward Array buffers, respectively. However, there's also an `AsStridedObjects` case that applies if the byte widths of the data type and all of its members are fixed. For instance, `TLorentzVector` contains (superclass) `TObject` data and (nested) `TVector3` data, but all of these components have the same number of bytes from one `TLorentzVector` to the next. This special case can be interpreted as a NumPy structured array, bypassing the need to iterate over the bytes, which is a performance cost in AwkwardForth or in Python (though a much larger cost in Python). `AsStridedObjects` can even be a content of `AsJagged`.
- `AsGrouped` (src/uproot/interpretation/grouped.py) is not an Interpretation in the reading-data sense. If users ask for data from a TBranch that has no TBaskets, only sub-TBranches, this groups the sub-TBranches into a single array package. The meaning of a "package" depends on the backend: NumPy, Awkward, or Pandas.
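The length-prefix rule described for `AsStrings` above can be sketched with the standard library. The helper names are hypothetical, and the sketch assumes the 4-byte length is a big-endian unsigned integer; it is an illustration of the encoding, not Uproot's reader.

```python
import struct


def write_root_string(data):
    """Encode bytes with a 1-byte length, or 0xff plus a 4-byte length."""
    if len(data) < 255:
        return bytes([len(data)]) + data
    return b"\xff" + struct.pack(">I", len(data)) + data


def read_root_string(buffer, position):
    """Decode one string at `position`; return (data, next_position)."""
    length = buffer[position]
    position += 1
    if length == 0xFF:
        # Long string: the real length follows as 4 bytes (assumed big-endian).
        (length,) = struct.unpack_from(">I", buffer, position)
        position += 4
    return bytes(buffer[position:position + length]), position + length


short = b"muon"
long = b"x" * 300
buffer = write_root_string(short) + write_root_string(long)

first, position = read_root_string(buffer, 0)
second, position = read_root_string(buffer, position)
```

Because each string's length has to be read before the next string's position is known, this data cannot be reinterpreted as an array in one step, which is why it is treated separately from other ragged data.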
src/uproot/interpretation/library.py defines the array backends. `NumPy` produces NumPy arrays, with `dtype="O"` if it needs to store non-numerical data (Model instances, usually), and its "package" type is a Python dict of arrays. `Awkward` produces Awkward Arrays, whose "package" type is an Awkward `RecordArray`. `Pandas` produces Pandas Series, with awkward-pandas for non-numerical data (including simple ragged arrays and strings). Its "package" type is a Pandas DataFrame.
An Interpretation "machine" has several well-defined steps (described in `src/uproot/interpretation/__init__.py`):
- An Interpretation's `basket_array` method takes TBasket `data` and `byte_offsets` (after `fKeylen` subtraction) and produces an intermediate array. The intermediate array types are not seen by users; they're defined with the Interpretations themselves as implementation details.
- An Interpretation's `final_array` method takes the full set of intermediate arrays, trims off parts of the first and last if the `entry_start` and `entry_stop` don't align with TBasket boundaries, concatenates them, and passes them to...
- The Library's `finalize` method, to make an array of the appropriate type: NumPy, Awkward, or Pandas.
- The `HasBranches.arrays` and `HasBranches.iterate` functions call the Library's `group` method directly, if they need to group arrays from a TBranch's sub-TBranches into a package.
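The first three steps above can be mimicked with a toy class. The method names follow the text, but the signatures, the class, and the tuple-producing `finalize` are simplifications invented for this sketch; the real methods take more arguments.

```python
import struct


class ToyInt32Interpretation:
    """Hypothetical stand-in for an Interpretation of big-endian int32 data."""

    def basket_array(self, data):
        # One basket's bytes -> intermediate array (here, a plain list).
        return list(struct.unpack(f">{len(data) // 4}i", data))

    def final_array(self, basket_arrays, first_entry, entry_start, entry_stop):
        # Concatenate the intermediates, then cut to the requested entries.
        # This has the same effect as trimming the first and last
        # intermediate arrays before concatenating.
        joined = [value for array in basket_arrays for value in array]
        return joined[entry_start - first_entry:entry_stop - first_entry]


def finalize(array):
    """Toy 'library' step: convert to the user-facing type (a tuple here)."""
    return tuple(array)


# Three baskets covering entries 0-2, 3-5, and 6-8.
baskets = [
    struct.pack(">3i", 0, 10, 20),
    struct.pack(">3i", 30, 40, 50),
    struct.pack(">3i", 60, 70, 80),
]

interpretation = ToyInt32Interpretation()
intermediates = [interpretation.basket_array(data) for data in baskets]
trimmed = interpretation.final_array(intermediates, 0, entry_start=2, entry_stop=7)
result = finalize(trimmed)
```

Only `basket_array` touches raw bytes, so it is the step that can run once per basket in parallel; `final_array` and `finalize` run once, after all baskets are in.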
Each TBranch has an `interpretation` property with the Interpretation that it will use by default. The default can be overridden in `TBranch.array` (singular). The choice of `interpretation` is derived from a variety of sources (this information is not all in one place in ROOT), using the `interpretation_of` function in src/uproot/interpretation/identify.py.
This function uses TLeaf types, TStreamerInfo, and directives from C++ comments that ROOT inserted into TBranch and TLeaf titles: anything we can find that will shed light on how a particular TBranch is supposed to be interpreted. Some deserialization errors are not solved by walking through the byte-for-byte interpretation, but by finding a difference between two TBranches that indicates that a different Interpretation object should be constructed (e.g. passing `header=False` instead of `header=True`). This module also has a fairly complete C++ type parser, to recognize composed STL container types (implemented in src/uproot/containers.py).
(If the data can be `AsStridedObjects`, this function builds up the `AsObjects` interpretation first, then calls `simplify` to see if it can be simplified to an `AsStridedObjects`.)
A new feature of Uproot 5 is that `AsObjects` and `AsStrings` can be interpreted with AwkwardForth, rather than Python. AwkwardForth is a subset of Standard Forth with built-in functions for interpreting bytes and filling array buffers. The reference manual for AwkwardForth can be useful if you need to interpret the Forth code itself.
Only `AsStrings` takes a fixed Forth program; the general purpose of AwkwardForth is for data whose types are discovered at runtime. The `AsObjects` Interpretation constructs Forth code from strings based on the data it finds. In fact, the Interpretation starts using the old Uproot 4 Python-based object reading, and builds the Forth code while reading the data, when it has the maximum amount of information (which headers are going to be present, and such) for at least the first whole entry. After one entry, possibly more (if any lists in the first entry happen to be empty), it builds a complete Forth program for interpreting that particular data type, builds a Forth virtual machine, and runs it for the rest of the entries. If it gets to the end of the dataset and still doesn't have a complete Forth program, at least it has the Python objects, which it converts into Awkward Arrays as in Uproot 4.
Since `AsObjects.basket_array` may run in parallel, the Forth programs are built on a thread-local variable. (Accumulating Forth code is not thread-safe.) Only one thread can make the finalized Forth source code (there's a lock for that), and each thread individually makes its own Forth virtual machine. The virtual machine state is also not thread-safe, but each runs in its own thread.
All of this should be local to a single AsObjects instance—the Forth-building/running of any one AsObjects instance should not be visible to any others, and each TBranch instance has its own AsObjects instance.
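The threading pattern (thread-local accumulation, a lock around finalization, one machine per thread) can be sketched with the standard library. `ToyProgramBuilder` and its methods are hypothetical, the "fragments" are placeholder strings rather than real Forth, and the "machine" is just a dict.

```python
import threading


class ToyProgramBuilder:
    """Hypothetical sketch of per-thread code building with shared output."""

    def __init__(self):
        self._local = threading.local()  # per-thread scratch space
        self._lock = threading.Lock()    # guards the finalized source
        self._final_source = None

    def add_fragment(self, fragment):
        # Each thread accumulates into its own list, so no locking is needed.
        if not hasattr(self._local, "fragments"):
            self._local.fragments = []
        self._local.fragments.append(fragment)

    def finalize(self):
        # Only the first thread to get here sets the finalized source.
        with self._lock:
            if self._final_source is None:
                self._final_source = " ".join(self._local.fragments)
        return self._final_source

    def make_machine(self):
        # Each thread builds and keeps its own (stateful) machine.
        if not hasattr(self._local, "machine"):
            self._local.machine = {"source": self._final_source, "stack": []}
        return self._local.machine


def worker(builder, results):
    for fragment in ("read-header", "read-values"):
        builder.add_fragment(fragment)
    source = builder.finalize()
    results.append((source, builder.make_machine()))


builder = ToyProgramBuilder()
results = []
threads = [
    threading.Thread(target=worker, args=(builder, results)) for _ in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

All four threads end up with the same source text but four distinct machine objects, and a second `ToyProgramBuilder` instance would share nothing with the first.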
File-writing is an entirely separate system from file-reading, and it is much less capable: many data types can be read from a range of ROOT versions, but only a few data types, each representing a single version, can be written. However, the file-writing code can't ignore structures of interest to ROOT and not Uproot, the way that file-reading can. For instance, the TFree record must be correctly constructed when writing a file, but it is ignored by the Uproot reader.
Accordingly, there's a `WritableFile`, `WritableDirectory`, `WritableTree`, and `WritableBranch` defined in src/uproot/writing/writable.py, which are writable versions of `ReadOnlyFile`, `ReadOnlyDirectory`, `TTree`, and `TBranch`. Some read methods, such as `show`, are implemented for the writable classes by having them initiate a read of the file they are writing. Since the reading functions assume that the file structure doesn't change, these functions open new file handles every time they're called.
The scope of file-writing is:
- uproot.create and uproot.recreate create a new file, while uproot.update updates an existing file, which may have been created by ROOT. In the latter case, we have to recognize data structures as ROOT has made them, in any configuration that is legal for ROOT. When creating a new file from scratch, we can choose our own configuration, as long as it's within the set of legal configurations.
- only local files can be written, no remote (which would be terribly inefficient, anyway);
- creation of nested TDirectories (by naming objects in assignments with names containing "`/`");
- writing TObjStrings (arbitrary string data);
- writing histograms;
- writing TTrees containing numerical data, rectilinear arrays (like NumPy), and up to one level of raggedness (single `ListArray`/`ListOffsetArray` from Awkward Array).
The interface for writing is:
    with uproot.recreate("some_file.root") as file:
        file["name/of/object"] = python_object

with interpreters that translate certain types of `python_object` into ROOT C++ equivalents defined in src/uproot/writing/identify.py.
To support this interface, the file must remain open and get updated. Our implementation strategy is to convert valid ROOT-file states before any user command into valid ROOT-file states after the user command; there is no "flushing" or "synchronizing" after the fact (except what the operating system does between a file's `write` and the bytes actually getting written to disk). Therefore, the context manager (`with` statement) isn't strictly needed: terminating Python without closing `file` won't put the ROOT file into an invalid state. But it's good Python practice, so we always recommend it.
The file on disk is supported by a scaffolding of Python objects that remember where various seek points are so that they can update them. For instance, if a user calls WritableTree.extend to add a TBasket to every TBranch in a `WritableTree`, the number of entries in the TTree, among other things, needs to change. There's a Cursor pointing to the seek location in the file where the TTree's number of entries is stored; that number gets updated in the file in-place. (This involves seeking to that location and overwriting the number, which is why this would be terribly inefficient in a remote protocol, due to the many network round trips.)
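The in-place update amounts to a seek followed by a fixed-width overwrite. Here is a minimal sketch with an in-memory file; `ToyCursor` is a hypothetical class, and the 8-byte big-endian counter between two marker strings is an invented layout, not the real TTree format.

```python
import io
import struct


class ToyCursor:
    """Hypothetical: remembers where one fixed-width number lives in a file."""

    def __init__(self, location, fmt):
        self.location = location
        self.fmt = fmt

    def write(self, file, value):
        # Seek to the remembered position and overwrite the bytes in place.
        file.seek(self.location)
        file.write(struct.pack(self.fmt, value))

    def read(self, file):
        file.seek(self.location)
        size = struct.calcsize(self.fmt)
        return struct.unpack(self.fmt, file.read(size))[0]


# A made-up layout: 4 bytes of preamble, an 8-byte entry count, 4 more bytes.
file = io.BytesIO(b"HEAD" + struct.pack(">q", 0) + b"TAIL")
num_entries = ToyCursor(location=4, fmt=">q")

num_entries.write(file, 1000)
updated = num_entries.read(file)
contents = file.getvalue()
```

Because the new value has the same width as the old one, nothing else in the file moves; only those 8 bytes change.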
This scaffolding is a suite of classes with "Cascade" in their names because changes to one part of the file often imply changes to other parts. It's a system like React for web development, in which invariants are maintained actively, by a lower-level layer, so if you say "`x = y + 1`" somewhere and change `y`, the system will go and update `x`.
In Uproot's cascade system (src/uproot/writing/_cascade.py), the `CascadeLeaf` superclass describes scaffolding with a direct responsibility for some bytes on disk and the `CascadeNode` superclass describes collections of CascadeLeaves or other CascadeNodes. It's strictly a DAG (no cycles) so that a change in some CascadeNode value does not lead to infinite loops.
The hierarchy of dependencies may be opposite of what you'd expect: we tend to think of a ROOT file containing TDirectories, which contain TTrees, which contain TBranches and such, but these CascadeNodes point from specific structures to more general ones. Changing a TBranch might require the TTree to get rewritten, which might not have room in a preallocated TDirectory, so that gets rewritten too. And whenever any new object allocates disk-memory for itself, it needs to update the whole file's TFree record, so that's at the bottom of a lot of cascade networks.
In the cascade system, the word "location" refers to a seek point and "allocation" refers to an object's size. (The byte range that a CascadeLeaf is responsible for is from `location` to `location + allocation`.) Some allocations are deliberately bigger than the current size of an object, to allow it to grow. For instance, a TDirectory's `DirectoryData` is preallocated with room for some number of TKeys (10?). Users can add objects to the TDirectory, but when it exceeds the preallocated number, a new `DirectoryData` has to be allocated elsewhere. (TDirectory comes in two parts; its `DirectoryHeader` points to the current `DirectoryData`, so the `DirectoryHeader` doesn't need to move as it grows, and it can have a stable pointer for other objects to point to.)
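The header/data split can be illustrated with toy classes. Everything here is hypothetical: the class names, the 64-byte key size, the 32-byte header, the append-only allocator, and the choice to double the capacity on reallocation are assumptions of this sketch, not a description of _cascade.py.

```python
class ToyFile:
    """Hands out non-overlapping byte ranges by appending at the end."""

    def __init__(self):
        self.end = 0

    def allocate(self, num_bytes):
        location = self.end
        self.end += num_bytes
        return location


class ToyDirectoryData:
    KEY_BYTES = 64  # made-up size of one key

    def __init__(self, file, capacity):
        self.capacity = capacity
        self.keys = []
        # The allocation covers `capacity` keys, even while mostly empty.
        self.allocation = capacity * self.KEY_BYTES
        self.location = file.allocate(self.allocation)


class ToyDirectoryHeader:
    def __init__(self, file, capacity=10):
        self.file = file
        self.location = file.allocate(32)  # the header itself never moves
        self.data = ToyDirectoryData(file, capacity)

    def add_key(self, name):
        if len(self.data.keys) == self.data.capacity:
            # Out of preallocated room: allocate a bigger block elsewhere
            # and repoint the header at it (doubling is an arbitrary choice).
            bigger = ToyDirectoryData(self.file, 2 * self.data.capacity)
            bigger.keys = list(self.data.keys)
            self.data = bigger
        self.data.keys.append(name)


file = ToyFile()
directory = ToyDirectoryHeader(file, capacity=2)
header_location = directory.location
first_data_location = directory.data.location

for name in ("a", "b", "c"):
    directory.add_key(name)
```

After the third key, the data block lives at a new location while the header's location is unchanged, so anything that points at the header stays valid.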
The cascading system is used as described for all TDirectory and histogram implementations, but the WritableTree implementation (in src/uproot/writing/_cascadetree.py) cuts some corners: it does not exactly follow the meaning of the CascadeLeaf/CascadeNode classes, but it uses the same cascading concept.
The whole cascade system is extremely low-level and very hidden from users. Each of the CascadeNode/CascadeLeaf types for common objects is wrapped by a high-level type in src/uproot/writing/writable.py.
I came to the end of the things I had intended to write about, but if there's something that isn't covered here that needs to be explained, just ask and I'll add more sections.