Conversation

@pfackeldey
Collaborator

@pfackeldey pfackeldey commented Aug 6, 2025

This PR adds a trace function that lets you trace e.g. Processor.process with awkward's typetracer to infer the necessary columns. For example:

def analysis(events):
    # COPIED FROM AGC: https://github.com/iris-hep/calver-coffea-agc-demo/blob/2025_IRISHEP_Training/agc-coffea-2025-virtual-arrays-and-executors.ipynb
    import awkward as ak
    import numpy as np
    
    # pT > 30 GeV for leptons, > 25 GeV for jets
    selected_electrons = events.Electron[
        (events.Electron.pt > 30) & (np.abs(events.Electron.eta) < 2.1)
    ]
    selected_muons = events.Muon[
        (events.Muon.pt > 30) & (np.abs(events.Muon.eta) < 2.1)
    ]
    selected_jets = events.Jet[(events.Jet.pt > 25) & (np.abs(events.Jet.eta) < 2.4)]

    # single lepton requirement
    event_filters = (
        ak.count(selected_electrons.pt, axis=1) + ak.count(selected_muons.pt, axis=1)
    ) == 1
    # at least four jets
    event_filters = event_filters & (ak.count(selected_jets.pt, axis=1) >= 4)
    # at least two b-tagged jets ("tag" means score above threshold)
    B_TAG_THRESHOLD = 0.5
    event_filters = event_filters & (
        ak.sum(selected_jets.btagDeepFlavB > B_TAG_THRESHOLD, axis=1) >= 2
    )

    # apply filters
    selected_jets = selected_jets[event_filters]

    trijet = ak.combinations(
        selected_jets, 3, fields=["j1", "j2", "j3"]
    )  # trijet candidate
    trijet["p4"] = trijet.j1 + trijet.j2 + trijet.j3  # four-momentum of tri-jet system

    trijet["max_btag"] = np.maximum(
        trijet.j1.btagDeepFlavB, np.maximum(trijet.j2.btagDeepFlavB, trijet.j3.btagDeepFlavB)
    )
    trijet = trijet[
        trijet.max_btag > B_TAG_THRESHOLD
    ]  # require at least one b-tag among the trijet candidates
    # pick trijet candidate with largest pT and calculate mass of system
    trijet_mass = trijet["p4"][ak.argmax(trijet.p4.pt, axis=1, keepdims=True)].mass

    # ensure we can handle cross-references
    # just touch them so they land in the report
    _ = ak.flatten(events.Electron.matched_jet)

    return ak.flatten(trijet_mass)
    
    
from coffea.nanoevents import NanoEventsFactory
from coffea.nanoevents.trace import trace

events = NanoEventsFactory.from_root(
  {"nanoaod.root": "Events"},
  mode="virtual",
  access_log=(access_log := []),
).events()

necessary_columns = trace(analysis, events)

print(necessary_columns)
# frozenset({
#   'Electron_eta',
#   'Electron_jetIdx',
#   'Electron_pt',
#   'Jet_btagDeepFlavB',
#   'Jet_eta',
#   'Jet_mass',
#   'Jet_phi',
#   'Jet_pt',
#   'Muon_eta',
#   'Muon_phi',
#   'Muon_pt',
#   'nElectron',
#   'nJet',
#   'nMuon',
# })

If analysis is not traceable, you can set throw=False in trace, i.e. trace(analysis, events, throw=False), and it will trace as much as it can.

This feature works well together with #1387 as you can then do:

needed = trace(analysis, events)

preload = lambda b: b.name in needed

events = NanoEventsFactory.from_root(
  {"nanoaod.root": "Events"}, 
  mode="virtual", 
  access_log=(access_log := []), 
  preload=preload,
).events()

in order to preload everything the tracer could figure out automatically.

@pfackeldey pfackeldey requested review from ikrommyd and lgray August 6, 2025 21:52
@pfackeldey
Collaborator Author

pfackeldey commented Aug 6, 2025

I should add that the access_log gives the same information when running analysis(events) directly, but that involves actually loading data. For small chunks the runtime is likely similar to the typetracer if the data is more-or-less local, and it would not fail on data-dependent operations that the typetracer cannot trace. If the data has to be streamed over the network, it's likely better to trace instead.
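For illustration, the access-log mechanism can be sketched as a toy attribute-logging wrapper (pure Python with hypothetical names; this is not coffea's actual implementation):

```python
# Toy illustration of the access-log idea: record which columns an
# analysis touches by intercepting attribute access. Hypothetical
# names throughout -- NOT coffea's implementation.
class LoggedEvents:
    def __init__(self, columns, log):
        self._columns = columns  # column name -> data
        self._log = log          # shared list collecting accessed names

    def __getattr__(self, name):
        if name in self._columns:
            self._log.append(name)
            return self._columns[name]
        raise AttributeError(name)

def toy_analysis(events):
    # touches only two of the three available columns
    return [pt for pt in events.Electron_pt if pt > 30] + events.Muon_pt

log = []
events = LoggedEvents(
    {"Electron_pt": [10.0, 45.0], "Muon_pt": [33.0], "Jet_pt": [50.0]},
    log,
)
toy_analysis(events)
print(set(log))  # {'Electron_pt', 'Muon_pt'} -- Jet_pt was never read
```

Unlike the typetracer, this requires real data to flow through the analysis, which is the trade-off described above.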

@pfackeldey pfackeldey mentioned this pull request Aug 7, 2025
Collaborator

@ikrommyd ikrommyd left a comment


I think this is great overall @pfackeldey!

I would probably add a test comparing tracing against the access log, just like you mention in the PR description. We have test processors in https://github.com/scikit-hep/coffea/tree/master/src/coffea/processor/test_items

I want to add tracing with length-zero/one numpy-backed arrays to this PR if that's possible; I intend to try it out soon.

@lgray
Collaborator

lgray commented Aug 27, 2025

@pfackeldey Could you address @ikrommyd's comments? This otherwise looks great to me; thanks for the contribution. Happy to merge once the changes are implemented.

@ikrommyd
Collaborator

@lgray I will take a shot at length-one/zero tracing; I've just been busy.

@ikrommyd
Collaborator

ikrommyd commented Aug 30, 2025

@pfackeldey @lgray what do you think of something like this? Length one would be more complicated if we did it like this, because it's not just a null-byte buffer (b"\x00\x00\x00\x00\x00\x00\x00\x00") everywhere. I don't feel super comfortable with how buffer keys work, so Peter, let me know if there is something I can do to simplify this.

from coffea.nanoevents import NanoEventsFactory
import awkward as ak
from functools import partial

events = NanoEventsFactory.from_root({"tests/samples/nano_dy.root": "Events"}).events()
form = ak.forms.from_dict(events.attrs["@form"])
buffer_key = events.attrs["@buffer_key"]

def generate(materialized, buffer_key):
    materialized.add(buffer_key)
    return b"\x00\x00\x00\x00\x00\x00\x00\x00"

materialized = set()
container = {
    k: partial(generate, materialized=materialized, buffer_key=k)
    for k in form.expected_from_buffers(buffer_key=buffer_key)
}

array = ak.from_buffers(
    form=form,
    length=0,
    container=container,
    buffer_key=buffer_key,
    backend=ak.backend(events),
    byteorder=ak._util.native_byteorder,
    allow_noncanonical_form=False,
    highlevel=True,
    behavior=events.behavior,
    attrs=events.attrs,
)
print(materialized)

print(array.Electron.pt + 1)
print(materialized)

# Should print:
# {'a9490124-3648-11ea-89e9-f5b55c90beef/%2FEvents%3B1/0-40/data/Electron_pt%2C%21load%2C%21content',
#  'a9490124-3648-11ea-89e9-f5b55c90beef/%2FEvents%3B1/0-40/offsets/nElectron%2C%21load%2C%21counts2offsets%2C%21skip%2C%21offsets'}
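The buffer keys above percent-encode a branch name plus its DSL instructions; decoding them back to column names can be sketched with the stdlib (a self-contained illustration using urllib.parse.unquote in place of coffea.nanoevents.util.unquote):

```python
from urllib.parse import unquote

# The two example buffer keys printed above.
buffer_keys = [
    "a9490124-3648-11ea-89e9-f5b55c90beef/%2FEvents%3B1/0-40/data/Electron_pt%2C%21load%2C%21content",
    "a9490124-3648-11ea-89e9-f5b55c90beef/%2FEvents%3B1/0-40/offsets/nElectron%2C%21load%2C%21counts2offsets%2C%21skip%2C%21offsets",
]

columns = set()
for key in buffer_keys:
    # the last path segment holds "<branch>,!load,..." once percent-decoded;
    # the element right before each "!load" instruction is the branch name
    elements = unquote(key.split("/")[-1]).split(",")
    columns |= {elements[i - 1] for i, instr in enumerate(elements) if instr == "!load"}

print(columns)  # {'Electron_pt', 'nElectron'}
```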

@ikrommyd
Collaborator

ikrommyd commented Aug 30, 2025

I think I can do length-zero/one support like this. What I'm not sure about is that, because form.length_zero_array() doesn't let you specify buffer keys, I'm assuming the order of the returned buffers stays intact so I can do zip(buffer_keys, buffers) later.

from coffea.nanoevents import NanoEventsFactory
import awkward as ak
from functools import partial
from coffea.nanoevents.util import unquote

events = NanoEventsFactory.from_root({"tests/samples/nano_dy.root": "Events"}).events()

length = 0
form = ak.forms.from_dict(events.attrs["@form"])
buffer_key = events.attrs["@buffer_key"]
buffer_keys = form.expected_from_buffers(buffer_key=buffer_key).keys()
if length == 0:
    buffers = ak.to_buffers(form.length_zero_array())[2].values()
elif length == 1:
    buffers = ak.to_buffers(form.length_one_array())[2].values()
else:
    raise ValueError

materialized = set()

def generate(buffer, materialized, buffer_key):
    materialized.add(buffer_key)
    return buffer
    
container = {}
for key, buffer in zip(buffer_keys, buffers):
    container[key] = partial(generate, buffer=buffer, materialized=materialized, buffer_key=key)

array = ak.from_buffers(
    form=form,
    length=length,
    container=container,
    buffer_key=buffer_key,
    backend=ak.backend(events),
    byteorder=ak._util.native_byteorder,
    allow_noncanonical_form=False,
    highlevel=True,
    behavior=events.behavior,
    attrs=events.attrs,
)
print(materialized)
array.Electron.pt + 1
print(materialized)

keys = set()
for _buffer_key in materialized:
    elements = unquote(_buffer_key.split("/")[-1]).split(",")
    keys |= {
        elements[idx - 1] for idx, instr in enumerate(elements) if instr == "!load"
    }
print(frozenset(keys))

@ikrommyd
Collaborator

I'm pasting here a notebook with Peter's example above with length zero/one tracing. I get correct results.
The typetracer actually overtouches Muon_pt.
Untitled3.html

@ikrommyd
Collaborator

@lgray @pfackeldey, I'm open to comments about the interface. Also, I fixed a "bug": we should add the @original_array attribute to the tracer; otherwise, it's going to load from the original actual events to perform methods that use a global index.

@pfackeldey
Collaborator Author

pfackeldey commented Sep 2, 2025

@ikrommyd this does look correct to me; it should be safe to assume that you can zip them together again (this is valid because Python 3.7+ dicts preserve insertion order).
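The ordering assumption can be checked in plain Python (hypothetical buffer-key names, just for illustration):

```python
# Dicts preserve insertion order (guaranteed since Python 3.7), so two
# dicts built from the same form traversal can safely be re-paired with
# zip(). Key names here are made up for the demo.
expected = {"node0-offsets": "spec0", "node1-data": "spec1", "node2-data": "spec2"}
produced = {"node0-offsets": b"\x00", "node1-data": b"\x01", "node2-data": b"\x02"}

repaired = dict(zip(expected.keys(), produced.values()))
assert repaired == produced  # pairing by position reproduces the mapping
print(list(repaired))  # ['node0-offsets', 'node1-data', 'node2-data']
```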

To me, this tracer logic is a bit weird (neither in a negative nor a positive sense): because we can generate something, we can now follow control flow, or escape the awkward-array world while still recognizing what has been materialized. However, what we generate is kind of nonsense (e.g. a length-0 array), so if one does data-dependent logic, or logic based on lengths, this tracing will likely go wrong.
I'm wondering how much benefit this gives compared to just using normal typetracing and catching a tracer error, i.e. tracing as much as we can.

The biggest benefit is probably that we're able to escape the awkward world and recognize necessary columns when going into pure NumPy or an ML evaluation. If that is useful for analysts, the logic with which we trace should be selectable, maybe something like this:

from coffea.nanoevents.trace import (
  trace, 
  form_keys_to_columns,
  len0_virtual_array, 
  len1_virtual_array, 
  typetracer_array,
)

# tracer = len0_virtual_array(events)
# tracer = len1_virtual_array(events)
tracer = typetracer_array(events)

materialized_form_keys = trace(tracer, throw=False)

print(form_keys_to_columns(materialized_form_keys))
# {"Electron_pt", "nElectron"}

I doubt, though, that it will be clear to any physicist when they should use which; these are highly technical tracing details. Maybe it depends on whether this should become an automatic preprocessing step (hidden from users), or whether it is supposed to be used by physicists directly. In the latter case it's probably best to make this a best-effort one-liner; in the former case we probably want more freedom/granularity. I don't know exactly how you'd like to incorporate this into coffea.

PS: it's also interesting to see that we do better at recognizing necessary columns with virtual arrays than with the typetracer approach. This is likely because there are some 'hacks' that register touching at the layout level for some operations, whereas virtual arrays operate purely at the buffer level.

edit: I just saw your updated logic with the mode kwarg. That's probably better than the explicit way I wrote above, because we can choose a reasonably good default that physicists get when just running the one-liner 👍

@ikrommyd
Collaborator

ikrommyd commented Sep 2, 2025

I was also debating whether I want a mode kwarg or separate functions. I wouldn't make form_keys_to_columns a separate function, though. I would probably have maybe 3 functions like trace_with_typetracer, trace_with_..., and have them all return the frozenset of branches. I don't want to publicly expose the DSL jargon names of form keys 🤣.

It's true that it's technical and indeed I think the main way of interacting with this would be through pre-processing with a good default.

The only people who can probably decide which one to use are the people who have been exposed to the manual map_partitions with `if ak.backend(..) == "typetracer"` branching. In that case, they manually create length-zero or length-one arrays, so they may know better what a length-zero/one array can do. Length zero vs. length one is very difficult to choose between, and I think there are only a few edge cases where length zero doesn't work.
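One generic failure mode for a length-zero stand-in can be sketched in plain Python (not coffea-specific; just illustrating why length one sometimes succeeds where length zero fails):

```python
# Any code path that inspects an actual entry works on a length-one
# stand-in array but raises on a length-zero one.
def needs_first_entry(pts):
    return pts[0] > 30  # data-dependent: reads a real value

length_one = [45.0]
length_zero = []

print(needs_first_entry(length_one))  # True

try:
    needs_first_entry(length_zero)
except IndexError:
    print("length-zero stand-in fails here")
```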

@ikrommyd ikrommyd dismissed their stale review September 3, 2025 12:20

Can't review my own stuff now

@ikrommyd
Collaborator

ikrommyd commented Sep 3, 2025

@pfackeldey @lgray what do y'all think now? I added separate functions for tracing with the typetracer / a length-one array / a length-zero array, and one function trace that attempts all three and takes the set union of their results.

The trace function currently attempts all 3 in order until one succeeds; if the typetracer succeeds, it doesn't attempt the other two. Do you think it should run all 3 anyway?
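The fallback behavior described here amounts to a "try strategies in order, return the first success" pattern. A generic sketch with hypothetical tracer names (not coffea's actual implementation):

```python
# Sketch of "attempt tracers most-traceable first, stop at the first
# success". Strategy names and bodies are made up for illustration.
def trace_best_effort(strategies):
    errors = {}
    for name, fn in strategies:
        try:
            return name, fn()
        except Exception as exc:  # this tracer failed; fall through
            errors[name] = exc
    raise RuntimeError(f"all tracing strategies failed: {errors}")

def typetracer():
    raise TypeError("data-dependent operation not traceable")

def length_zero():
    return frozenset({"Electron_pt", "nElectron"})

def length_one():
    return frozenset({"Electron_pt", "nElectron"})

name, columns = trace_best_effort(
    [("typetracer", typetracer), ("length-zero", length_zero), ("length-one", length_one)]
)
print(name, sorted(columns))  # length-zero ['Electron_pt', 'nElectron']
```

Running all strategies and taking the union (as the comment above asks) would instead accumulate results across the loop rather than returning early.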

@lgray
Collaborator

lgray commented Sep 3, 2025

I'll take a look today.

Quick comments:

  • We should rank the "tries" in order of traceability, with the full typetracer being the most traceable and length one the least. This is what you're doing, so we're good. A more traceable tracer passing should hopefully imply that the less traceable ones also pass.
  • For running all of them and taking the set union: it would also be interesting, as a diagnostic, to compare what passed against what's remaining, perhaps as a debug mode?

@ikrommyd
Collaborator

ikrommyd commented Sep 3, 2025

> I'll take a look today.

As soon as we agree on the API, I will add tests.

@ikrommyd
Collaborator

ikrommyd commented Sep 8, 2025

It's now tested...if the API is fine, it's good to go from my side.

@pfackeldey
Collaborator Author

> It's now tested...if the API is fine, it's good to go from my side.

LGTM!

@lgray lgray enabled auto-merge (squash) September 8, 2025 14:04
@lgray lgray disabled auto-merge September 8, 2025 14:04
@lgray
Collaborator

lgray commented Sep 8, 2025

Ah, do we want to scooch the awkward pin in this PR?

@ikrommyd
Collaborator

ikrommyd commented Sep 8, 2025

> Ah, do we want to scooch the awkward pin in this PR?

It's in master already, so no.

@ikrommyd ikrommyd merged commit b766e51 into master Sep 8, 2025
23 checks passed
@ikrommyd ikrommyd deleted the pfackeldey/trace branch September 8, 2025 14:51