Add support for AwkwardArray structures #450

Open
Tracked by #552
danielballan opened this issue Jun 13, 2023 · 2 comments

Comments

@danielballan
Member

@tacaswell and I have wanted to add support for AwkwardArrays from the start, but I think we do not have a GH Issue for it yet.

Notes from chat with @jpivarski a couple weeks ago...

Requirements:

  • We would like to support upload and download of AwkwardArray structures.
  • In the Python client we would like the option to access the data with or without dask-awkward.
  • In normal Tiled fashion, we would like to be able to download a specific slice of interest, and we would like the Tiled server to read, serialize, and transmit only that slice.

Proposed Approach:

  • We considered using Arrow to transport AwkwardArrays between client and server. However, representing Awkward in Arrow blurs out detailed form information. Specifically, it loses the form_key that can be used to address specific buffers.
  • Instead, we will operate directly on AwkwardArray's own representation, which comprises JSON-encodable form, outer length (an integer), and a dict-like container whose keys are referenced in the form and whose values are buffers.
  • By reusing the typetracer machinery in awkward (which was developed to support dask-awkward) we can project a slice into a form and get a "projected form". The example below illustrates this, and uses only one piece of internal awkward API (_touch_data). This could conceivably be made into a public method.
  • HTTP endpoints may look roughly like:
    /api/v1/awkward/full/{path_to_dataset}?slice=...
    /api/v1/awkward/buffers/{path_to_dataset}?form_key=...&form_key=...&form_key=...&slice=...
    
  • Multiple buffers may be encoded in a container format like TAR or ZIP (not necessarily compressed, just used as a container). @jakirkham pointed out an advantage of ZIP: web browsers understand it.
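
A minimal sketch of the container idea above, assuming the buffers are already available as raw bytes keyed by name. The helper names `pack_buffers` and `unpack_buffers` and the example buffer names are hypothetical, not part of Tiled or awkward. `ZIP_STORED` gives a ZIP with no compression, used purely as a container:

```python
import io
import zipfile


def pack_buffers(container):
    """Pack a {name: bytes} mapping into an uncompressed ZIP archive (hypothetical helper)."""
    stream = io.BytesIO()
    # ZIP_STORED means "no compression": the archive is only a container.
    with zipfile.ZipFile(stream, mode="w", compression=zipfile.ZIP_STORED) as archive:
        for name, payload in container.items():
            archive.writestr(name, payload)
    return stream.getvalue()


def unpack_buffers(blob):
    """Recover the {name: bytes} mapping from the archive (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Stand-in buffers; the names here are illustrative only.
example_buffers = {"node0-offsets": b"\x00" * 8, "node2-data": b"\x01\x02\x03"}
blob = pack_buffers(example_buffers)
roundtrip = unpack_buffers(blob)
```

Because each archive member is named after its buffer, a client can pick out individual buffers without parsing anything beyond the ZIP directory.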

Code snippet:

import numpy as np
import awkward as ak

# The array we want to talk about.
array = ak.Array(
    [[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}], [], [{"x": 3.3, "y": [1, 2, 3]}]]
)

# On the server, you separately store form, length, and all the named buffers.
form, length, container = ak.to_buffers(array)

# When a client wants a lazy (slice-only) object, send them the form+length
# and keep a type-tracer array in your TiledObject's metadata.
meta_step1 = ak.Array(
    form.length_zero_array(highlevel=False).to_typetracer(forget_length=True)
)

# Second type-tracer array will tell us the set of buffers that the result of
# the slice will need, so that the client can make multiple requests.
typetracer, report = ak.typetracer.typetracer_with_report(
    form,
    forget_length=True,
)
meta_step2 = ak.Array(typetracer)

# You can test the slice on meta_step1 or meta_step2, but meta_step2 will also
# tell you which buffers of the *sliced* array you'll need.
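try:
    meta_step2[0, "y", 1:].layout._touch_data(recursive=True)
except Exception:
    # An invalid slice raises here; avoid a bare except so that
    # KeyboardInterrupt and SystemExit are not swallowed.
    print("Nope, you can't do it!")
else:
    print("Yes, you can.")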

# This is a list of nodes (prefixes of form_keys) in the *sliced* array.
print(report.data_touched)
form_keys_touched = set(report.data_touched)

# Having decided that a slice is okay, serialize it and send it to the server.
# Maybe send one HTTP request per node/expected form_key, but maybe not.

# On the server, get an array to slice. We only want to read the parts that
# will survive after slicing. Do it by making a meta_step2, slice, and look
# at the report.

# Let's assume at this point that we have a report with the nodes that are touched.

# Project the form onto a smaller form that doesn't have record fields that won't
# survive the slice.

def project_form(form):
    if isinstance(form, ak.forms.RecordForm):
        if form.fields is None:
            original_fields = [None] * len(form.contents)
        else:
            original_fields = form.fields

        fields = []
        contents = []
        for field, content in zip(original_fields, form.contents):
            projected = project_form(content)
            if projected is not None:
                fields.append(field)
                # Keep the projected child, not the original, so that
                # nested records are pruned as well.
                contents.append(projected)

        if form.fields is None:
            fields = None

        return form.copy(fields=fields, contents=contents)

    elif isinstance(form, ak.forms.UnionForm):
        raise NotImplementedError

    elif isinstance(form, (ak.forms.NumpyForm, ak.forms.EmptyForm)):
        if form.form_key in form_keys_touched:
            return form.copy()
        else:
            return None

    else:
        if form.form_key in form_keys_touched:
            return form.copy(content=project_form(form.content))
        else:
            return None

projected_form = project_form(form)

print(form)
print(projected_form)

projected_container = container

projected_array = ak.from_buffers(projected_form, length, projected_container)

print(repr(projected_array))

print(repr(projected_array[0, "y", 1:]))

# Send that!
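
The snippet above reuses the full container (`projected_container = container`). A rough sketch of narrowing it to the touched nodes follows, assuming buffer keys have the shape `"{form_key}-{attribute}"`. The helper `project_container` and the example keys and touched set are hypothetical stand-ins, not output captured from awkward:

```python
def project_container(container, form_keys_touched):
    """Keep only buffers whose form_key prefix was touched (hypothetical helper)."""
    projected = {}
    for buffer_key, buffer in container.items():
        # Assumption: keys look like "<form_key>-<attribute>", e.g. "node0-offsets".
        form_key, _, _attribute = buffer_key.rpartition("-")
        if form_key in form_keys_touched:
            projected[buffer_key] = buffer
    return projected


# Illustrative stand-ins for `container` and `report.data_touched`.
example_container = {
    "node0-offsets": b"\x00",
    "node2-data": b"\x01",
    "node3-offsets": b"\x02",
    "node4-data": b"\x03",
}
example_touched = {"node0", "node3", "node4"}
example_projected = project_container(example_container, example_touched)
```

With a filter like this, the server would only need to read the buffers that survive the slice.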
@danielballan
Member Author

Notes from discussion today:

  • When Awkward arrays are uploaded via HTTP, a good storage format is directory-of-buffers, where the filename is the form key. This enables a future enhancement where buffers can be added (and removed and updated) without copying all the unchanged buffers. More standard formats like Parquet would not enable this.
  • The form itself, and the length, will be in the tiled "structure" in the database.
  • Structures are often repeated. A run of many root files may have an identical structure. There could be benefit in the future to storing this in a separate table, with a foreign key. It matters more for awkward than for array because the form JSON can be comparatively large.
  • Serving a directory of existing root files is a sensible thing to try. But, to start, the whole root file will have to be marshaled from disk. Grabbing selected columns (or form keys...) would require detailed knowledge of root. This is a kerchunk-like optimization. Note that for PB-scale root files the offsets alone (the table of contents, so to speak) are TB-scale. JSON encoding is not the way.
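
The directory-of-buffers layout from the first bullet could look roughly like this. It is a sketch only: `write_buffers` and `read_buffers` are hypothetical helpers, the buffer names are illustrative, and the payloads are stand-in bytes rather than real awkward buffers.

```python
import tempfile
from pathlib import Path


def write_buffers(directory, container):
    """Write each buffer to its own file, named by its key (hypothetical helper)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for key, payload in container.items():
        (directory / key).write_bytes(payload)


def read_buffers(directory, keys):
    """Read back only the requested buffers (hypothetical helper)."""
    directory = Path(directory)
    return {key: (directory / key).read_bytes() for key in keys}


with tempfile.TemporaryDirectory() as tmp:
    write_buffers(tmp, {"node0-offsets": b"\x00\x01", "node1-data": b"\x02\x03"})
    # Update one buffer; the other file is not rewritten or copied.
    write_buffers(tmp, {"node1-data": b"\xff"})
    selected = read_buffers(tmp, ["node1-data"])
    stored_names = sorted(path.name for path in Path(tmp).iterdir())
```

One file per buffer is what makes adding, removing, or updating a single buffer cheap, which a single-file format would not allow without rewriting.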

@danielballan
Member Author

This is well begun and released in v0.1.0a107, but there are some interesting ideas above I want to address or capture in separate GH issues before closing this.
