Skip to content

Conversation

ariostas
Copy link
Collaborator

This PR fixes #1442 (at least partially). It's not clear to me why this getter is needed, but it has a huge performance cost, so I switched to using the branch names directly.

This change gives it a huge boost in performance for trees with many branches, although it is still significantly slower than Uproot3.

For comparison, I generated a file with 1000 branches, as done in #1442 and the output of

%timeit uproot.open("test.root:df").arrays(library="pd")

went from

3.07 s ± 46.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

to

353 ms ± 7.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@ianna do you know why this getter could have been needed in the first place? All the tests seem to pass, so I'm not sure why it was needed.

@ikrommyd
Copy link
Contributor

Ci is failing due to numpy 2.3 and awkward. Perhaps we need a quick awkward patch release.

@ariostas
Copy link
Collaborator Author

I just found out that another big improvement could be made by changing the way cache_key works. Currently, it uses the index, which is very slow to find for TTrees with many branches, so we should think of a better alternative.

@property
def cache_key(self):
"""
String that uniquely specifies this ``TBranch`` in its path, to use as
part of object and array cache keys.
"""
if self._cache_key is None:
sep = ":" if isinstance(self._parent, uproot.behaviors.TTree.TTree) else "/"
self._cache_key = f"{self.parent.cache_key}{sep}{self.name}({self.index})"
return self._cache_key
@property
def index(self):
"""
Integer position of this ``TBranch`` in its parent's list of branches.
Useful for cases in which the
:ref:`uproot.behaviors.TBranch.TBranch.name` is not unique: the
non-recursive index is always unique.
"""
for i, branch in enumerate(self.parent.branches):
if branch is self:
return i
else:
raise AssertionError

@ariostas
Copy link
Collaborator Author

After the last commit it's actually back to the speed of Uproot3!

@ariostas
Copy link
Collaborator Author

Here are the results from the line profiler for uproot.open('test.root:df').arrays(library='pd')

Total time: 13.6904 s
File: uproot5/src/uproot/behaviors/TBranch.py
Function: __getitem__ at line 1684

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1684                                               def __getitem__(self, where):
  1685      2002     219000.0    109.4      0.0          original_where = where
  1686                                           
  1687      2002    6412000.0   3202.8      0.0          if uproot._util.isint(where):
  1688                                                       return self.branches[where]
  1689      2002     251000.0    125.4      0.0          elif isinstance(where, str):
  1690      2002    1737000.0    867.6      0.0              where = uproot._util.ensure_str(where)
  1691                                                   else:
  1692                                                       raise TypeError(f"where must be an integer or a string, not {where!r}")
  1693                                           
  1694      2002     703000.0    351.1      0.0          if where.startswith("/"):
  1695                                                       recursive = False
  1696                                                       where = where[1:]
  1697                                                   else:
  1698      2002     177000.0     88.4      0.0              recursive = True
  1699                                           
  1700      2002    1139000.0    568.9      0.0          got = self._lookup.get(where)
  1701      2002     157000.0     78.4      0.0          if got is not None:
  1702      1001     487000.0    486.5      0.0              return got
  1703                                           
  1704      1001     376000.0    375.6      0.0          if "/" in where:
  1705                                                       this = self
  1706                                                       try:
  1707                                                           for piece in where.split("/"):
  1708                                                               if piece != "":
  1709                                                                   this = this[piece]
  1710                                                       except uproot.KeyInFileError:
  1711                                                           raise uproot.KeyInFileError(
  1712                                                               original_where,
  1713                                                               keys=self.keys(recursive=recursive),
  1714                                                               file_path=self._file.file_path,
  1715                                                               object_path=self.object_path,
  1716                                                           ) from None
  1717                                                       return this
  1718                                           
  1719      1001     112000.0    111.9      0.0          elif recursive:
  1720      1001 1862508000.0    2e+06     13.6              got = _get_recursive(self, where)
  1721      1001     173000.0    172.8      0.0              if got is not None:
  1722                                                           return got
  1723                                                       else:
  1724      2002    8413000.0   4202.3      0.1                  raise uproot.KeyInFileError(
  1725      1001      74000.0     73.9      0.0                      original_where,
  1726      1001        1e+10    1e+07     86.2                      keys=self.keys(recursive=recursive),
  1727      1001     705000.0    704.3      0.0                      file_path=self._file.file_path,
  1728      1001    8427000.0   8418.6      0.1                      object_path=self.object_path,
  1729                                                           )
  1730                                           
  1731                                                   else:
  1732                                                       raise uproot.KeyInFileError(
  1733                                                           original_where,
  1734                                                           keys=self.keys(recursive=recursive),
  1735                                                           file_path=self._file.file_path,
  1736                                                           object_path=self.object_path,
  1737                                                       )

Total time: 0.230938 s

As you can see, most of the time is spent on computing keys, but the real issue is that _get_recursive is always returning None because where is get('branchname') instead of just branchname.

In the meeting we found that this was done in #116, so I'll look into that.

@ariostas
Copy link
Collaborator Author

I implemented some simpler approach to compute indices instead of having some post creation hook. Let me know what you think.

Also, I reverted to using the getter when the branch name is not alphanumeric. So I think this PR is ready for review now.

@ariostas ariostas requested review from ianna and nsmith- June 12, 2025 15:44
Copy link
Collaborator

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ariostas - looks great! Thanks for fixing it!

@ariostas ariostas marked this pull request as draft June 15, 2025 12:57
@ariostas ariostas marked this pull request as ready for review June 16, 2025 08:07
@ariostas
Copy link
Collaborator Author

This is ready for review again, but there's a bunch of tests failing for an unrelated reason.

@ariostas ariostas force-pushed the ariostas/faster_branch_getting branch from 5ff1382 to 6d55f75 Compare June 19, 2025 12:19
@ianna
Copy link
Collaborator

ianna commented Jun 19, 2025

@ariostas - Great! Thanks! Please, merge it if you finished. Thanks!

@ariostas ariostas merged commit bba889a into main Jun 19, 2025
30 of 96 checks passed
@ariostas ariostas deleted the ariostas/faster_branch_getting branch June 19, 2025 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Poor performance reading file with many branches
4 participants