-
Notifications
You must be signed in to change notification settings - Fork 135
feat: add typetracing to autodetect which columns are needed #1388
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I should add that the |
ikrommyd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is great overall @pfackeldey!
I would probably add a test to test tracing versus access log just like you mention in the PR description. We have test processors in https://github.com/scikit-hep/coffea/tree/master/src/coffea/processor/test_items
I want to add tracing with length-zero/one numpy backed arrays if that's possible to this PR which I intend to try out soon.
|
@pfackeldey Could you address @ikrommyd's comments, this otherwise looks great to me, thanks for the contribution. Happy to merge once changes implemented. |
|
@lgray I will take a shot at length one/zero tracing. I've just been busy |
|
@pfackeldey @lgray what do you think of something like this? Length one would be more complicated if we did it like this because it's not just a null byte buffer ( from coffea.nanoevents import NanoEventsFactory
import awkward as ak
from functools import partial
events = NanoEventsFactory.from_root({"tests/samples/nano_dy.root": "Events"}).events()
form = ak.forms.from_dict(events.attrs["@form"])
buffer_key = events.attrs["@buffer_key"]
def generate(materialized, buffer_key):
materialized.add(buffer_key)
return b"\x00\x00\x00\x00\x00\x00\x00\x00"
materialized = set()
partial(generate, materialized=materialized)
container = {k: partial(generate, materialized=materialized, buffer_key=k) for k in form.expected_from_buffers(buffer_key=buffer_key)}
array = ak.from_buffers(
form=form,
length=0,
container=container,
buffer_key=buffer_key,
backend=ak.backend(events),
byteorder=ak._util.native_byteorder,
allow_noncanonical_form=False,
highlevel=True,
behavior=events.behavior,
attrs=events.attrs,
)
print(materialized)
print(array.Electron.pt + 1)
print(materialized)
# Should print
{'a9490124-3648-11ea-89e9-f5b55c90beef/%2FEvents%3B1/0-40/data/Electron_pt%2C%21load%2C%21content',
'a9490124-3648-11ea-89e9-f5b55c90beef/%2FEvents%3B1/0-40/offsets/nElectron%2C%21load%2C%21counts2offsets%2C%21skip%2C%21offsets'} |
|
I think I can do length zero/one support like this. What I'm not sure of is because from coffea.nanoevents import NanoEventsFactory
import awkward as ak
from functools import partial
from coffea.nanoevents.util import unquote
events = NanoEventsFactory.from_root({"tests/samples/nano_dy.root": "Events"}).events()
length = 0
form = ak.forms.from_dict(events.attrs["@form"])
buffer_key = events.attrs["@buffer_key"]
buffer_keys = form.expected_from_buffers(buffer_key=buffer_key).keys()
if length == 0:
buffers = ak.to_buffers(form.length_zero_array())[2].values()
elif length == 1:
buffers = ak.to_buffers(form.length_one_array())[2].values()
else:
raise ValueError
materialized = set()
def generate(buffer, materialized, buffer_key):
materialized.add(buffer_key)
return buffer
container = {}
for key, buffer in zip(buffer_keys, buffers):
container[key] = partial(generate, buffer=buffer, materialized=materialized, buffer_key=key)
array = ak.from_buffers(
form=form,
length=0,
container=container,
buffer_key=buffer_key,
backend=ak.backend(events),
byteorder=ak._util.native_byteorder,
allow_noncanonical_form=False,
highlevel=True,
behavior=events.behavior,
attrs=events.attrs,
)
print(materialized)
array.Electron.pt + 1
print(materialized)
keys = set()
for _buffer_key in materialized:
elements = unquote(_buffer_key.split("/")[-1]).split(",")
keys |= {
elements[idx - 1] for idx, instr in enumerate(elements) if instr == "!load"
}
print(frozenset(keys)) |
|
I'm pasting here a notebook with Peter's example above with length zero/one tracing. I get correct results. |
|
@lgray @pfackeldey, I'm open to comments about the interface. Also fixed a "bug". We should add the |
|
@ikrommyd this does look correct to me, it should be correct to assume that you can zip them together again (I think this is valid because python3 dicts are ordered by default). To me, this tracer logic is a bit weird (not negative, also not in a positive sense): Because we can generate something we can follow control flow now, or escape the awkward array world while still recognizing what has been materialized. However, what we generate is kind of nonsense (e.g. 0 length array), so if one does data-dependent logic, or logic based on lengths, this tracing will likely go wrong. The biggest benefit is probably that we're able to escape the awkward world and recognize necessary columns when going into pure NumPy or an ML evaluation. If that is useful for analysts, the logic with that we're tracing should be able to be chosen, maybe something like this: from coffea.nanoevents.trace import (
trace,
form_keys_to_columns,
len0_virtual_array,
len1_virtual_array,
typetracer_array,
)
# tracer = len0_virtual_array(events)
# tracer = len1_virtual_array(events)
tracer = typetracer_array(events)
materialized_form_keys = trace(tracer, throw=False)
print(form_keys_to_columns(materialized_form_keys))
>> {"Electron_pt", "nElectron"}I doubt though that it will be clear to any physicists when they should use what - this is highly technical and tracing details. Maybe it depends on if this should become an automatic preprocessing step (hidden from users), or if this is supposed to be used by physicists directly. In the latter case it's probably best to stick this a one-liner that does it best-effort, in the former case we probably want more freedom/granularity. I don't know exactly how you'd like to incorporate this into coffea. PS: also interesting to see that we're doing better in recognizing necessary columns with virtual arrays than the typetracer approach. This is likely because there are some 'hacks' to recognize touching on the layout level for some operations, where virtual arrays fully operate on buffer-level only. edit: just say your update logic with the |
|
I was also debating whether I want a mode kwarg or separate functions. I wouldn't make It's true that it's technical and indeed I think the main way of interacting with this would be through pre-processing with a good default. The only people that can probably decide with one to use are the people who have been exposed to the manual |
|
@pfackeldey @lgray what do y'all think now? I added separate functions for tracing with typetracer/length one array/length zero array and one function The |
|
I'll take a look today. Quick comments:
|
As soon as we agree on the API, I will add tests. |
|
It's now tested...if the API is fine, it's good to go from my side. |
LGTM! |
|
Ah, do we want to skooch the awkward pin in this PR? |
It’s in master already so no |
This PR adds a
tracefunction that lets you trace e.g.Processor.processwith awkward's typetracer to infer the necessary columns, e.g.:If
analysisis not traceable, you can setthrow=Falseintrace, i.e.trace(analysis, events, throw=False), and it will trace as much as it can.This feature works well together with #1387 as you can then do:
in order to preload everything the tracer could figure out automatically.