Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sketch support for writing, reading sliced AwkwardArrays. #549

Merged
merged 29 commits into from
Sep 28, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
830e312
Sketch support for writing, reading sliced AwkwardArrays.
danielballan Aug 8, 2023
662bf00
Use py38-compatibility type hints.
danielballan Aug 9, 2023
e821abb
Add awkward to some dependency sets.
danielballan Aug 9, 2023
22825a1
Vendor zipfile from Python 3.9.17.
danielballan Aug 17, 2023
1531002
Extract form keys with attributes from form.
danielballan Aug 17, 2023
7fedb72
Pack before upload.
danielballan Aug 17, 2023
ebe737f
Remove redundant default setting; param may soon be deprecated
danielballan Aug 25, 2023
e65ab66
Use public API added in awkward v2.4.0.
danielballan Sep 5, 2023
560b8b5
Handle UnionForm
danielballan Sep 5, 2023
8ec2fd5
Raise clear error for unsupported UnionForms.
danielballan Sep 5, 2023
063f718
Add minimum version pin for awkward.
danielballan Sep 5, 2023
3733c91
Refactor to use Form.expected_form_buffers.
danielballan Sep 5, 2023
80aa360
Move buffers to dedicated route.
danielballan Sep 7, 2023
395e03c
Support /awkward/full
danielballan Sep 7, 2023
4100c0f
Add support for JSON, Feather, Parquet.
danielballan Sep 7, 2023
0a2ccd6
Test use of returned client and client obtained via lookup.
danielballan Sep 7, 2023
71b524c
Include 'buffers' link.
danielballan Sep 7, 2023
b45c05b
Remove slicing options from export, not yet supported.
danielballan Sep 7, 2023
93bf734
Add file ext aliases for parquet, feather, arrow.
danielballan Sep 7, 2023
0bf9ba4
Remove outdated comment.
danielballan Sep 7, 2023
a30f3ef
Fix URLs.
danielballan Sep 28, 2023
4acbe42
Add Awkward to structure lists in docs.
danielballan Sep 28, 2023
3c0b643
Refactor AwkwardBuffersAdapter into AwkwardAdapter.
danielballan Sep 28, 2023
6c3f94c
Add awkward example.
danielballan Sep 28, 2023
cbb58b3
Document all writing methods, including new awkward ones.
danielballan Sep 28, 2023
5ed5659
Rename AwkwardArrayClient -> AwkwardClient.
danielballan Sep 28, 2023
d5e9ef9
Remove outdated plans.
danielballan Sep 28, 2023
86d8859
Fix bugs that failed tests
danielballan Sep 28, 2023
109592c
Use allow_noncanonical_form.
danielballan Sep 28, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 4 additions & 3 deletions .flake8
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,8 @@ exclude =
dist,
versioneer.py,
tiled/_version.py,
docs/source/conf.py
share
web-frontend
tiled/serialization/_zipfile_py39.py,
docs/source/conf.py,
share,
web-frontend,
max-line-length = 115
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ Tiled puts an emphasis on **structures** rather than formats, including:
* N-dimensional strided arrays (i.e. numpy-like arrays)
* Sparse arrays
* Tabular data (e.g. pandas-like "dataframes")
* Nested, variable-sized data (as implemented by [AwkwardArray](https://awkward-array.org/))
* Hierarchical structures thereof (e.g. xarrays, HDF5-compatible structures like NeXus)

Tiled implements extensible **access control enforcement** based on web security
Expand Down
78 changes: 70 additions & 8 deletions docs/source/explanations/structures.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,13 +10,11 @@ potentially any language.
The structure families are:

* array --- a strided array, like a [numpy](https://numpy.org) array
* awkward --- nested, variable-sized data (as implemented by [AwkwardArray](https://awkward-array.org/))
* container --- a of other structures, akin to a dictionary or a directory
* sparse --- a sparse array (i.e. an array which is mostly zeros)
* table --- tabular data, as in [Apache Arrow](https://arrow.apache.org) or
[pandas](https://pandas.pydata.org/)
* container --- a of other structures, akin to a dictionary or a directory

Support for sparse arrays and [Awkward Array](https://awkward-array.org/) are
planned.

## How structure is encoded

Expand Down Expand Up @@ -57,7 +55,7 @@ a string label for each dimension.
This `(10, 10)`-shaped array fits in a single `(10, 10)`-shaped chunk.

```
$ http :8000/metadata/small_image | jq .data.attributes.structure
$ http :8000/api/v1/metadata/small_image | jq .data.attributes.structure
```

```json
Expand Down Expand Up @@ -91,7 +89,7 @@ This `(10000, 10000)`-shaped array is subdivided into 4 × 4 = 16 chunks,
which is why the size of each chunk is given explicitly.

```
$ http :8000/metadata/big_image | jq .data.attributes.structure
$ http :8000/api/v1/metadata/big_image | jq .data.attributes.structure
```

```json
Expand Down Expand Up @@ -130,7 +128,7 @@ This is a 1D array where each item has internal structure,
as in numpy's [strucuted data types](https://numpy.org/doc/stable/user/basics.rec.html)

```
$ http :8000/metadata/structured_data/pets | jq .data.attributes.structure
$ http :8000/api/v1/metadata/structured_data/pets | jq .data.attributes.structure
```

```json
Expand Down Expand Up @@ -180,6 +178,70 @@ $ http :8000/metadata/structured_data/pets | jq .data.attributes.structure
}
```

### Awkward

[AwkwardArrays](https://awkward-array.org/) express nested, variable-sized
data, including arbitrary-length lists, records, mixed types, and missing data.
This often comes up in the context of event-based data, such as is used in
high-energy physics, neutron experiments, quantum computing, and very high-rate
detectors.

AwkwardArrays are specified by:

* An outer `length` (always an integer)
* A JSON `form` (specified by AwkwardArray, giving the internal layout)
* Named buffers of bytes, whose names match information in the `form`

The first two are included in the structure.

```
$ http :8000/api/v1/metadata/awkward_array | jq .data.attributes.structure
```

```json
{
"length": 3,
"form": {
"class": "ListOffsetArray",
"offsets": "i64",
"content": {
"class": "RecordArray",
"fields": [
"x",
"y"
],
"contents": [
{
"class": "NumpyArray",
"primitive": "float64",
"inner_shape": [],
"parameters": {},
"form_key": "node2"
},
{
"class": "ListOffsetArray",
"offsets": "i64",
"content": {
"class": "NumpyArray",
"primitive": "int64",
"inner_shape": [],
"parameters": {},
"form_key": "node4"
},
"parameters": {},
"form_key": "node3"
}
],
"parameters": {},
"form_key": "node1"
},
"parameters": {},
"form_key": "node0"
}
}

```

### Sparse Array

There are a variety of ways to represent
Expand Down Expand Up @@ -227,7 +289,7 @@ order, but we cannot make requests like "rows 100-200". (Dask has the same
limitation, for the same reason.)

```
$ http :8000/metadata/long_table | jq .data.attributes.structure
$ http :8000/api/v1/metadata/long_table | jq .data.attributes.structure
```

```json
Expand Down
58 changes: 57 additions & 1 deletion docs/source/reference/python-client.md
Original file line number Diff line number Diff line change
Expand Up @@ -89,6 +89,30 @@ structure_families, and specs of its children along with their counts.
tiled.client.container.Container.distinct
```

And, finally, there are convenience methods for writing:


```{eval-rst}
.. autosummary::
:toctree: generated

tiled.client.container.Container.create_container
tiled.client.container.Container.write_array
tiled.client.container.Container.write_awkward
tiled.client.container.Container.write_dataframe
tiled.client.container.Container.write_sparse
```

and a low-level method for creating a new node to write into:


```{eval-rst}
.. autosummary::
:toctree: generated

tiled.client.container.Container.new
```

## Structure Clients

For each *structure family* ("array", "table", etc.) there is a client
Expand Down Expand Up @@ -133,6 +157,32 @@ Tiled currently includes two clients for each structure family:
tiled.client.array.DaskArrayClient.read_block
tiled.client.array.DaskArrayClient.read
tiled.client.array.DaskArrayClient.export
tiled.client.array.DaskArrayClient.write
tiled.client.array.DaskArrayClient.write_block
```

```{eval-rst}
.. autosummary::
:toctree: generated

tiled.client.array.ArrayClient
tiled.client.array.ArrayClient.read_block
tiled.client.array.ArrayClient.read
tiled.client.array.ArrayClient.export
tiled.client.array.ArrayClient.write
tiled.client.array.ArrayClient.write_block
```

### Awkward

```{eval-rst}
.. autosummary::
:toctree: generated

tiled.client.awkward.AwkwardClient
tiled.client.awkward.AwkwardClient.read
tiled.client.awkward.AwkwardClient.write
tiled.client.awkward.AwkwardClient.export
```

```{eval-rst}
Expand All @@ -154,6 +204,8 @@ Tiled currently includes two clients for each structure family:
tiled.client.sparse.SparseClient
tiled.client.sparse.SparseClient.read
tiled.client.sparse.SparseClient.export
tiled.client.sparse.SparseClient.write
tiled.client.sparse.SparseClient.write_block
```

### DataFrame
Expand All @@ -166,6 +218,8 @@ Tiled currently includes two clients for each structure family:
tiled.client.dataframe.DaskDataFrameClient.read_partition
tiled.client.dataframe.DaskDataFrameClient.read
tiled.client.dataframe.DaskDataFrameClient.export
tiled.client.dataframe.DaskDataFrameClient.write
tiled.client.dataframe.DaskDataFrameClient.write_partition
```

```{eval-rst}
Expand All @@ -175,7 +229,9 @@ Tiled currently includes two clients for each structure family:
tiled.client.dataframe.DataFrameClient
tiled.client.dataframe.DataFrameClient.read_partition
tiled.client.dataframe.DataFrameClient.read
tiled.client.dataframe.DaskDataFrameClient.export
tiled.client.dataframe.DataFrameClient.export
tiled.client.dataframe.DataFrameClient.write
tiled.client.dataframe.DataFrameClient.write_partition
```

### Xarray Dataset
Expand Down
6 changes: 5 additions & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,7 @@ all = [
"anyio",
"asyncpg",
"appdirs",
"awkward >=2.4.3",
"blosc",
"cachetools",
"cachey",
Expand Down Expand Up @@ -102,6 +103,7 @@ array = [
# This is the "kichen sink" fully-featured client dependency set.
client = [
"appdirs",
"awkward >=2.4.3",
"blosc",
"click !=8.1.0",
"dask[array]",
Expand Down Expand Up @@ -215,8 +217,9 @@ server = [
"aiosqlite",
"alembic",
"anyio",
"asyncpg",
"appdirs",
"asyncpg",
"awkward >=2.4.3",
"blosc",
"cachetools",
"cachey",
Expand Down Expand Up @@ -310,6 +313,7 @@ exclude = '''
)/
| versioneer.py
| tiled/_version.py
| tiled/serialization/_zipfile_py39.py
| docs/source/conf.py
| share
| web-frontend
Expand Down
118 changes: 118 additions & 0 deletions tiled/_tests/test_awkward.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,118 @@
import io

import awkward
import pyarrow.feather
import pyarrow.parquet

from ..catalog import in_memory
from ..client import Context, from_context, record_history
from ..server.app import build_app
from ..utils import APACHE_ARROW_FILE_MIME_TYPE


def test_slicing(tmpdir):
catalog = in_memory(writable_storage=tmpdir)
app = build_app(catalog)
with Context.from_app(app) as context:
client = from_context(context)

# Write data into catalog. It will be stored as directory of buffers
# named like 'node0-offsets' and 'node2-data'.
array = awkward.Array(
[
[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
[],
[{"x": 3.3, "y": [1, 2, 3]}],
]
)
returned = client.write_awkward(array, key="test")
# Test with client returned, and with client from lookup.
for aac in [returned, client["test"]]:
# Read the data back out from the AwkwardArrrayClient, progressively sliced.
assert awkward.almost_equal(aac.read(), array)
assert awkward.almost_equal(aac[:], array)
assert awkward.almost_equal(aac[0], array[0])
assert awkward.almost_equal(aac[0, "y"], array[0, "y"])
assert awkward.almost_equal(aac[0, "y", :1], array[0, "y", :1])

# When sliced, the serer sends less data.
with record_history() as h:
aac[:]
assert len(h.responses) == 1 # sanity check
full_response_size = len(h.responses[0].content)
with record_history() as h:
aac[0, "y"]
assert len(h.responses) == 1 # sanity check
sliced_response_size = len(h.responses[0].content)
assert sliced_response_size < full_response_size


def test_export_json(tmpdir):
catalog = in_memory(writable_storage=tmpdir)
app = build_app(catalog)
with Context.from_app(app) as context:
client = from_context(context)

# Write data into catalog. It will be stored as directory of buffers
# named like 'node0-offsets' and 'node2-data'.
array = awkward.Array(
[
[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
[],
[{"x": 3.3, "y": [1, 2, 3]}],
]
)
aac = client.write_awkward(array, key="test")

file = io.BytesIO()
aac.export(file, format="application/json")
actual = bytes(file.getbuffer()).decode()
assert actual == awkward.to_json(array)


def test_export_arrow(tmpdir):
catalog = in_memory(writable_storage=tmpdir)
app = build_app(catalog)
with Context.from_app(app) as context:
client = from_context(context)

# Write data into catalog. It will be stored as directory of buffers
# named like 'node0-offsets' and 'node2-data'.
array = awkward.Array(
[
[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
[],
[{"x": 3.3, "y": [1, 2, 3]}],
]
)
aac = client.write_awkward(array, key="test")

filepath = tmpdir / "actual.arrow"
aac.export(str(filepath), format=APACHE_ARROW_FILE_MIME_TYPE)
actual = pyarrow.feather.read_table(filepath)
expected = awkward.to_arrow_table(array)
assert actual == expected


def test_export_parquet(tmpdir):
catalog = in_memory(writable_storage=tmpdir)
app = build_app(catalog)
with Context.from_app(app) as context:
client = from_context(context)

# Write data into catalog. It will be stored as directory of buffers
# named like 'node0-offsets' and 'node2-data'.
array = awkward.Array(
[
[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
[],
[{"x": 3.3, "y": [1, 2, 3]}],
]
)
aac = client.write_awkward(array, key="test")

filepath = tmpdir / "actual.parquet"
aac.export(str(filepath), format="application/x-parquet")
actual = pyarrow.parquet.read_table(filepath)
expected = awkward.to_arrow_table(array)
assert actual == expected
Loading