-
Thanks for the note. The issue is still being able to write graph kernels in a single library that can work with any backend. Right now it doesn't seem Awkward can do that. Am I missing something?
-
Hi @swamidass, could you please give a simple example of such a function that performs a graph calculation using ragged tensors? Does it rely on representing the graphs as ragged tensors?
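For context, here is a minimal sketch of what "representing a graph as a ragged tensor" can mean: each node's variable-length neighbor list is one row, stored as a flat buffer plus offsets. This is an illustrative example in plain NumPy (not the answer given in the thread, and not any particular library's API):

```python
import numpy as np

# A 4-node graph as a ragged "list of neighbors per node".
# Node i's neighbors are neighbors[offsets[i]:offsets[i+1]].
neighbors = np.array([1, 2, 0, 0, 3, 2])  # flattened adjacency lists
offsets = np.array([0, 2, 3, 5, 6])       # row boundaries for 4 nodes

# A trivial "graph kernel": per-node degree is just the row lengths.
degree = np.diff(offsets)

# One message-passing step: sum a per-node feature over each node's neighbors.
feature = np.array([10.0, 20.0, 30.0, 40.0])
gathered = feature[neighbors]                     # feature of every neighbor
summed = np.add.reduceat(gathered, offsets[:-1])  # segment-sum per node
```

The same content/offsets layout is what ragged-tensor types in the libraries discussed below wrap, so kernels written against it translate directly.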
-
I'm moving a question from pydata/xarray#7988 (reply in thread) into this forum because that thread is about ragged xarrays.
Here's the comment from @swamidass:
I'll start with a response because I think Awkward Array already covers this—you can tell me if I'm still missing something.
The strict (can't-be-installed-without) dependencies for Awkward Array are `numpy` and `packaging`, plus `importlib_metadata` and `typing_extensions` if the Python version is not the latest. It's deliberately a small list. The flip side of that is that if you use even basic functionality like writing to Parquet, Awkward will complain that `pyarrow` isn't installed, so the workflow of trying something, finding out that you need to install something else, then trying it again may be annoying. But the alternative would be to make Awkward difficult to install for some users, and we chose the conservative approach.

Depending on what you mean by metadata, that may be a new feature: #2757 (in `main`) added a top-level `attrs` dict (that gets propagated through all operations) and #2794 (still-open PR) adds per-field attributes. This was inspired by an issue that compared Awkward with xarray (#1391). I said there that we're not attempting to displace xarray (or any array library for rectilinear data), but sometimes you'll get data from a metadata-rich source and at least want to preserve that metadata through pre-processing to the next step, which could be xarray.

dask-awkward is a Dask container type, like `dask.array` and `dask.bag`, but for Awkward Arrays. Everything is lazy up to the `compute()` call. ✔️
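To make "lazy up to `compute()`" concrete, here is a toy sketch of the deferred-evaluation model. This is purely illustrative, not dask-awkward's actual machinery:

```python
# Toy deferred computation: building the graph does no work;
# only compute() walks the dependencies and evaluates them.
class Lazy:
    def __init__(self, fn, *deps):
        self.fn = fn
        self.deps = deps

    def compute(self):
        args = [d.compute() if isinstance(d, Lazy) else d for d in self.deps]
        return self.fn(*args)

data = Lazy(lambda: [1, 2, 3])                         # nothing runs yet
doubled = Lazy(lambda xs: [2 * x for x in xs], data)   # still nothing
result = doubled.compute()                             # now both steps run
```

In dask-awkward the same idea applies per-partition, so large ragged datasets can be processed out of core.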
#1466 is for the set of functions that convert between Awkward and TensorFlow RaggedTensor and Torch NestedTensor. If it would be helpful to add that, we can get back to it. I think I saw that TensorFlow exposes the offsets and content views, so it can be an easy O(1) function in both directions.
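The O(1) claim follows from a shared layout: Awkward's list type and TensorFlow's `RaggedTensor` both describe a ragged array as a flat contents buffer plus an offsets (row-splits) buffer, so conversion only reinterprets buffers. A pure-NumPy sketch of that layout (the `tf.RaggedTensor.from_row_splits` call mentioned in the comment is a real TensorFlow API; the rest is illustrative):

```python
import numpy as np

# The ragged array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] as two flat buffers.
content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
offsets = np.array([0, 3, 3, 5])  # list i spans content[offsets[i]:offsets[i+1]]

def to_lists(content, offsets):
    """Materialize the nested lists (O(n); only needed for display)."""
    return [content[start:stop].tolist()
            for start, stop in zip(offsets[:-1], offsets[1:])]

# A conversion to TensorFlow would hand over the same two buffers, e.g.
#   tf.RaggedTensor.from_row_splits(values=content, row_splits=offsets)
# which is why it can be O(1) in both directions.
```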
That's the default.
Awkward has a backend for JAX, specifically for the purpose of supporting autodiff. It is experimental—requested for autodiff in particle physics (https://github.com/gradhep), but not widely used yet.
For JAX's JIT-compilation, there doesn't seem to be a way to support it. Even with PyTrees, we run into issues in which we need to create arrays whose shapes are determined by values in other arrays, and that's forbidden in the XLA model. We have Numba and cppyy (and soon Julia) for compiled backends, but not JAX.
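The "shapes determined by values" restriction is easy to see with a boolean mask, whose output length depends on the data. This example uses plain NumPy (where it's allowed) and describes the JAX behavior in comments:

```python
import numpy as np

x = np.array([3, -1, 4, -1, 5])
positives = x[x > 0]  # output length depends on the *values* in x

# Under jax.jit, every array shape must be known at trace time, so XLA
# rejects this kind of boolean-mask indexing. JAX's workaround is to ask
# for a static size up front (e.g. jnp.where(cond, size=...)), which is
# exactly the kind of constraint that ragged, data-dependent structures
# cannot satisfy in general.
```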
Ah, I just realized that you meant for the buffers in an Awkward Array to be backed by TensorFlow or Torch, which is not what #1466 will do—it's for conversions. For backing arrays (see `ak.to_backend`), we only have NumPy for main memory and CuPy for GPUs because once that choice is made, any Python library can view the data without copying.
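The "view without copying" point rests on the buffer protocol: any library that understands it can see the same memory. A small NumPy-only illustration:

```python
import numpy as np

a = np.arange(5, dtype=np.int64)

# Reinterpret the same memory through the buffer protocol, the way
# another library could; np.frombuffer makes no copy.
b = np.frombuffer(memoryview(a), dtype=np.int64)

a[0] = 99  # mutate through the original array...
# ...and the view observes the change, confirming the buffer is shared.
```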
We don't advertise the CuPy backend yet because we have not implemented the full API on GPUs. Support for Awkward Arrays in `@numba.cuda.jit`-compiled functions is complete (and was presented in a tutorial), but not the `ak.*` functions, and we need those to consider Awkward Arrays feature-complete on GPUs. This project should be finished next summer. (There's a fixed set of cpu-kernels to rewrite as cuda-kernels.)[^1]

Each `ak.Array` has its own backend, so it wouldn't be a global context switch. CPU calculations on NumPy-backed Awkward Arrays can be happening at the same time as CUDA calculations on CuPy-backed Awkward Arrays.

A lot of the discussion about ragged arrays and xarray has focused on keeping the xarray interface, which I'm in favor of—xarray users should have a familiar interface, even if that means restricting to only ragged arrays, not the full typesystem. But it sounds like your needs are different, and I don't know of any blockers to using Awkward Array for your task.
Footnotes

[^1]: I'm being cagey about the distinction between GPUs and CUDA because we pass this handling, down to the compilation itself, on to CuPy. I don't know if CuPy has or will have the capability to cross-compile to ROCm, etc. We're writing very generic CUDA in the hope that auto-translation will become possible.