Support a fallback structure family for "opaque bytes" #434

danielballan · 2023-04-18T13:23:34Z

This is another idea that arose during the NIST visit.

Tiled's data model constrains everything to be one of its recognized structure families (array, dataframe, sparse, node) or JSON-encodable metadata sitting alongside one of those types. There will be cases where there is binary (not JSON-encodable) information that is relevant and that some client programs will know what to do with.

Our line on this so far has been, "If you have files, use a static file server or Globus or another file-based solution, and link to that from the metadata in Tiled." And for cases where you have a lot of unstructured data (a directory of PowerPoint documents or PDFs) I think that's the right call. But John Henry at NIST articulated a compelling argument that Tiled should be able to carry binary data in-line when that is useful.

I think this would take the shape of a new structure family, perhaps opaque_bytes. Its structure would simply be a length, and it would be sliceable by byte range. Tiled would not be able to transcode it, only send it in its original representation as bytes. Any context necessary to interpret the bytes would have to be known a priori to the client. The (JSON-encodable) metadata attached to the opaque_bytes node may provide helpful information in this regard for a client, but it would be "opaque" to Tiled itself.
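A minimal sketch of that idea, under stated assumptions: the class and method names below (OpaqueBytesStructure, OpaqueBytesNode, read) are hypothetical and do not exist in Tiled. It only illustrates a structure that is nothing more than a length, plus slicing by byte range with no transcoding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OpaqueBytesStructure:
    """Hypothetical structure: the only thing known is the payload length."""

    length: int  # number of bytes


class OpaqueBytesNode:
    """Hypothetical node wrapping a payload that the server cannot interpret."""

    def __init__(self, payload: bytes, metadata: Optional[dict] = None):
        self._payload = payload
        # JSON-encodable hints for clients; never inspected by the server.
        self.metadata = metadata or {}

    def structure(self) -> OpaqueBytesStructure:
        return OpaqueBytesStructure(length=len(self._payload))

    def read(self, start: int = 0, stop: Optional[int] = None) -> bytes:
        """Return payload[start:stop] unchanged; no transcoding is attempted."""
        length = len(self._payload)
        stop = length if stop is None else stop
        if not (0 <= start <= stop <= length):
            raise ValueError(f"byte range {start}:{stop} is outside 0:{length}")
        return self._payload[start:stop]


node = OpaqueBytesNode(b"%PDF-1.7 example payload", {"hint": "a PDF fragment"})
header = node.read(0, 8)  # first eight bytes, exactly as stored
```

The range check rejects out-of-bounds requests rather than silently truncating them, which is one possible choice; a real server might instead follow HTTP range-request semantics.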

We are all in agreement that if you have mostly unstructured / opaque data, then Tiled is not adding value and you should just use a static file server. But if you have a little unstructured / opaque data and you want to place it logically alongside structured data, there is an argument that Tiled should enable this.

danielballan · 2023-07-10T13:54:31Z

One possible name for this proposed new structure family is unstructured. I think I like that better than my original suggestion, opaque_bytes.

danielballan · 2023-07-10T14:03:04Z

Like any other node in Tiled, an unstructured node can include metadata which a client or user may rely on to open/interpret it.

It's worth considering the trade-offs of promoting certain special fields in structure:

text vs binary
text encoding
mimetype
length (i.e. Content-Length)

The only one we will always know is the length (byte size). The rest will need to be optional.

danielballan · 2023-07-10T14:12:06Z

We have discussed various bars that data can clear:

1. We technically have the bytes, but everything is totally unlabeled.
2. We have the bytes and some metadata.
3. We know how to open the file and interpret the bytes as numbers.
4. We have a schema that tells us in a "machine-actionable" way the significance of the numbers.

Tiled currently insists that you start at (2). When @jmaruland and I first visited NIST, John Henry made the case that we should actually start at (1) --- that we should accept files that we cannot open. That leaves a lot of Tiled's capability on the table. Interfaces like these rely on Tiled being able to open the file and provide it in a known format:

https://tiled-demo.blueskyproject.io/ui/browse/generated/short_table
https://tiled-demo.blueskyproject.io/ui/browse/fxi/raw/1b0b4d73-6d87-43ab-8d62-ed035c51b9b4/primary/data/Andor_image

Given "unstructured" data, Tiled would have to fall back to showing only a "Download" button and leave it to the client/user to figure out how to open the data.

padraic-shafer · 2023-07-15T20:36:21Z

> One possible name for this proposed new structure family is unstructured. I think I like that better than my original suggestion, opaque_bytes.

I'll add a few more into the mix for consideration...because, well, naming things is hard. :)

- Simplicity: bytes are what you have and the term is readily understood. To avoid a naming clash with the Python built-in, raw_bytes or raw might work instead.
- unstructured is a nice foil to structure_family/StructureFamily but it's a bit of a misnomer; one of these might be closer to the meaning in this context: unknown, unrecognized, or undefined.
  - Caveat: This is a different intent than the KeyError, UnknownStructureFamily; hopefully these would not get confused down the road.
- Express that the user/client is responsible: custom or user_defined

I hadn't intended to complicate a simple keyword choice, but for some reason felt compelled to do so anyway. :)

danielballan · 2023-07-31T13:54:23Z

I think you've convinced me that unstructured is not quite right. From the point of view of the user, the data has a structure; it just has not been described to Tiled in a way that Tiled understands.

Of those options I think I like unknown best. Comments:

Since everything is bytes, including data with a known structure family, I think there's potential for confusion there. And we may want to reserve the term "raw" for Support access to raw encoded chunks #277.
I think I want to avoid custom or user_defined because I consider structure family to be intentionally not an extension point in Tiled. I think a custom, user-defined structure family would involve configuring the server and client(s)---possibly in multiple programming languages---to understand it, at least as far as getting from bytes to numbers, if not all the way to the meaning of the numbers. What is being proposed in this issue is not that. It is an escape hatch that says, "Tiled will send this data as is, and it's up to the client(s) to have some a priori knowledge of how to decode it. Tiled's existing mechanisms---the structure field and the transcoding mechanisms---cannot help."

danielballan · 2023-07-31T14:02:32Z

I think the current proposal to beat is:

structure_family: unknown
structure:
  mimetype: "..."  # e.g. "application/octet-stream", "text/plain;charset=utf-8"
  length: ...  # number of bytes

Maybe unspecified should also be considered?

padraic-shafer · 2023-07-31T21:29:00Z

> I think the current proposal to beat is …

Agreed. I don’t have strong feelings re: unknown vs. unspecified. Maybe @prjemian and @dylanmcreynolds have a preference?

dylanmcreynolds · 2023-07-31T21:38:48Z

Slight preference for unspecified...unknown has a slight negative connotation. What about bytearray? I know there's a collision with python types, but isn't it precisely what we're describing? I think the term is pretty common across many languages.

danielballan · 2023-08-22T21:04:59Z

I had no idea that bytearray was a common term beyond Python. I'm open to it. I agree we want to avoid attaching a negative connotation to this.

dylanmcreynolds · 2023-08-22T22:00:58Z

octet stream ?

is used to indicate that a body contains arbitrary binary data

padraic-shafer · 2023-08-22T23:09:11Z

> octet stream ?
>
> is used to indicate that a body contains arbitrary binary data

That makes sense. Stream has its own baggage of expectations, but "application/octet-stream" is so commonly used that it's hard to argue against.

danielballan · 2023-08-23T00:54:08Z

We use application/octet-stream as a MIME type when we send numpy arrays (or chunks of numpy arrays) as C-ordered buffers:

$ git grep "application/octet-stream"
docs/source/explanations/compression.md:content-type: application/octet-stream
docs/source/tutorials/export.md:* C-ordered memory buffer `application/octet-stream`
share/tiled/static/default_ui_settings.yml:      - mimetype: application/octet-stream
tiled/_tests/test_writing.py:        assert value.startswith("data:application/octet-stream;base64,")
tiled/client/array.py:        media_type = "application/octet-stream"
tiled/client/array.py:                headers={"Content-Type": "application/octet-stream"},
tiled/client/array.py:                headers={"Content-Type": "application/octet-stream"},
tiled/media_type_registration.py:            if media_type in {"application/octet-stream", "text/plain"}:
tiled/media_type_registration.py:    "application/octet-stream",
tiled/media_type_registration.py:        "application/octet-stream",
tiled/media_type_registration.py:        "application/octet-stream",
tiled/media_type_registration.py:        "application/octet-stream",
tiled/media_type_registration.py:    for media_type in ["application/octet-stream", APACHE_ARROW_FILE_MIME_TYPE]:
tiled/serialization/array.py:    "application/octet-stream",
tiled/serialization/array.py:    "application/octet-stream",
tiled/server/core.py:    StructureFamily.array: {"*/*": "application/octet-stream", "image/*": "image/png"},
tiled/utils.py:            content = f"data:application/octet-stream;base64,{base64.b64encode(content).decode('utf-8')}"

Unlike TIFF or PNG or Arrow, the context necessary to interpret the C-ordered buffers (their data type and shape) is not inlined into the payload itself---it's in the structure JSON from a different endpoint. That's why we went with application/octet-stream, meaning, "If you don't already know what this binary data is, I can't help you here." A web browser, for example, would not be able to make sense of that as anything but "arbitrary binary data". It takes a Tiled-aware application to join this with the structure info and interpret it.
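To illustrate why such a payload is meaningless without the structure, here is a rough sketch of a client joining a C-ordered buffer with the structure JSON. The function decode_c_ordered is hypothetical (it is not part of the Tiled client), it uses only the standard library, and it maps just a few (kind, itemsize) pairs for illustration.

```python
import struct


def decode_c_ordered(payload: bytes, structure: dict) -> list:
    """Turn a C-ordered buffer into nested lists using the structure JSON."""
    data_type = structure["data_type"]
    shape = structure["shape"]
    byte_order = {"little": "<", "big": ">"}[data_type["endianness"]]
    # Illustrative subset of dtypes only; a real client would cover many more.
    format_char = {("f", 8): "d", ("f", 4): "f", ("i", 8): "q", ("i", 4): "i"}[
        (data_type["kind"], data_type["itemsize"])
    ]
    count = 1
    for n in shape:
        count *= n
    flat = struct.unpack(f"{byte_order}{count}{format_char}", payload)

    def reshape(values, dims):
        # C order (row-major): the last axis varies fastest.
        if len(dims) == 1:
            return list(values)
        step = len(values) // dims[0]
        return [
            reshape(values[i * step:(i + 1) * step], dims[1:])
            for i in range(dims[0])
        ]

    return reshape(flat, shape)


# A made-up 2x3 float64 array; the payload alone carries no dtype or shape.
structure = {
    "data_type": {"endianness": "little", "kind": "f", "itemsize": 8},
    "shape": [2, 3],
}
payload = struct.pack("<6d", 0, 1, 2, 3, 4, 5)
rows = decode_c_ordered(payload, structure)
```

The same 48 bytes could equally be read as twelve 32-bit integers; only the structure from the other endpoint says which reading is intended.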

For the category of use cases addressed by this GH issue, we may actually know a specific MIME type. Use cases include things like Word documents, MATLAB scripts, and PDFs, probably associated with some more structured scientific data. Tiled will not be able to transcode or slice into these nodes, but it can give the client a good hint by saying, "The person who gave me this said it was application/pdf. I hope that means something to you! Good luck!" And for a browser, that will be a great hint.

So my initial reaction is that adding a MIME type like application/octet-stream to the StructureFamily enum would be mixing things that shouldn't be mixed. We should pick a name that is not a MIME type because the node will also have a MIME type.

padraic-shafer · 2023-08-23T01:14:50Z

> it's hard to argue against.

OK, I stand corrected. 😆

danielballan · 2023-08-23T12:46:36Z

it's hard to argue against.

dylanmcreynolds · 2023-08-23T15:24:05Z

This is getting silly, but the more I think about it, the more I think I like plain old bytes, even with the Python type naming collision. What is it? It's bytes. What do we know about it? Nothing, other than that it's bytes.

danielballan · 2023-08-23T15:25:14Z

That's pretty compelling.

prjemian · 2023-08-23T15:25:19Z

Simplicity

padraic-shafer · 2023-08-30T12:15:54Z

It seems like we have a winner. Should we proceed with using bytes?

danielballan · 2023-08-30T14:16:28Z

Let's do it. #450 is a good reference for which parts of the codebase need to be touched to add a new StructureFamily.

Some design things to nail down before we write code.

What will be in the structure? Re-reading the discussion so far, I think we want just mimetype (required) and length (required). If the MIME type is unknown, we can use MIME's own catch-all (application/octet-stream). A MIME type also has a way to convey text-vs-binary and encoding.
Will there be a new route for this? Perhaps /bytes/full/{path}?
Note that, unlike all the other structures, this will not go through (de)serialization---we'll just send the bytes.
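As a sketch of how a single mimetype string can carry the text-vs-binary and encoding information mentioned above, the standard library's email.message.Message can parse it. The helper name describe_mimetype is hypothetical, and treating only the "text" main type as text is a simplifying assumption.

```python
from email.message import Message


def describe_mimetype(mimetype: str) -> dict:
    """Split a MIME type string into media type, text flag, and charset."""
    msg = Message()
    msg["Content-Type"] = mimetype
    return {
        "media_type": msg.get_content_type(),
        # Simplification: only the "text" main type is treated as text.
        "is_text": msg.get_content_maintype() == "text",
        # None when the MIME type has no charset parameter.
        "charset": msg.get_param("charset"),
    }


text_info = describe_mimetype("text/plain;charset=utf-8")
binary_info = describe_mimetype("application/octet-stream")
```

With this approach, separate "text vs binary" and "text encoding" fields in the structure would be redundant, since both can be derived from mimetype.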

dylanmcreynolds · 2023-08-30T15:23:38Z

Is there any reason that an adapter can't define structure-family = bytes but their own mime-type? It's hard for my brain to escape the notion that specific mime types could be very useful to clients.

danielballan · 2023-08-30T15:26:52Z

I think we're on the same page. Compare to this array example, which has a structure_family ("array") and a structure (see JSON below).

$ http https://tiled-demo.blueskyproject.io/api/v1/metadata/generated/small_image/ | jq .data.attributes.structure_family
"array"

$ http https://tiled-demo.blueskyproject.io/api/v1/metadata/generated/small_image/ | jq .data.attributes.structure
{
  "data_type": {
    "endianness": "little",
    "kind": "f",
    "itemsize": 8
  },
  "chunks": [
    [
      300
    ],
    [
      300
    ]
  ],
  "shape": [
    300,
    300
  ],
  "dims": null,
  "resizable": false
}

This proposal is that the structure_family would be "bytes" and the structure would be {"mimetype": "...", "length": N}.
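For comparison with the array example, a sketch of what the proposed attributes might look like for a "bytes" node. The helper bytes_node_attributes is hypothetical, and falling back to application/octet-stream when no MIME type is given follows the suggestion earlier in the thread rather than any existing Tiled behavior.

```python
import json
from typing import Optional

DEFAULT_MIMETYPE = "application/octet-stream"  # MIME's own catch-all


def bytes_node_attributes(payload: bytes, mimetype: Optional[str] = None) -> dict:
    """Build the proposed structure_family/structure pair for a payload."""
    return {
        "structure_family": "bytes",
        "structure": {
            "mimetype": mimetype or DEFAULT_MIMETYPE,
            "length": len(payload),  # number of bytes
        },
    }


pdf_attrs = bytes_node_attributes(b"%PDF-1.7\n", mimetype="application/pdf")
unknown_attrs = bytes_node_attributes(b"\x00\x01\x02")
print(json.dumps(pdf_attrs, indent=2))
```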

padraic-shafer · 2023-08-30T15:51:31Z

I'd be interested in drafting a PR for this, along with some follow up discussions.

It would be great to have a companion for this. @jmaruland are you interested in working on this together?

jmaruland · 2023-08-30T17:16:10Z

@padraic-shafer Yes, I would love to. I worked on a very similar issue a while ago when we were trying to move away from JSONSchema models to Pydantic models. It will be fun to revisit this topic.

padraic-shafer · 2023-08-30T17:26:11Z

Fantastic! I'll find a time later this week for us to discuss where to start, and how to proceed.

danielballan · 2023-08-30T20:59:45Z

Follow-up thoughts here:

We already plan to add a route for accessing underlying files, discussed in Run client against catalog in-process #473, something like /asset/{id}. The route we want for this issue has the same meaning: "Get me the underlying file." It should probably be the same route.
Keeping in mind the rule, "Illegal or nonsensical states should be unrepresentable," I think we may not want to put the mimetype in the structure because it's already in the data_source:

$ http :8000/api/v1/metadata/example?show_sources=true 'Authorization:Apikey secret' | jq .data.attributes.data_sources
[
  {
    "id": 2,
    "structure": {
      "data_type": {
        "endianness": "little",
        "kind": "i",
        "itemsize": 8
      },
      "chunks": [
        [
          3
        ]
      ],
      "shape": [
        3
      ],
      "dims": null,
      "resizable": false
    },
    "mimetype": "application/x-zarr",
    "parameters": {},
    "assets": [
      {
        "data_uri": "file://localhost/tmp/tmpp0dp686u/data/example",
        "is_directory": true,
        "id": 2
      }
    ],
    "management": "writable"
  }
]

And there is space for a size under "assets". It's in the SQL database, just not exposed in the API yet. Maybe better to just refer to those as the truth and let the structure be null, same as it is for "container" structure family.
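A rough sketch of how a client might read those facts from the data source when the structure is null. The function describe_underlying_file is hypothetical, the sample attributes are made up for illustration, and the "size" key on an asset is an assumption, since the size is not exposed in the API yet.

```python
def describe_underlying_file(attributes: dict) -> dict:
    """Summarize the file behind a node whose structure is null."""
    source = attributes["data_sources"][0]
    asset = source["assets"][0]
    return {
        "mimetype": source["mimetype"],
        "data_uri": asset["data_uri"],
        # Assumed key: size is stored in the SQL database but not yet exposed.
        "size": asset.get("size"),
    }


# Illustrative attributes only; values are invented for this example.
attributes = {
    "structure_family": "bytes",
    "structure": None,
    "data_sources": [
        {
            "id": 2,
            "structure": None,
            "mimetype": "application/pdf",
            "parameters": {},
            "assets": [
                {
                    "data_uri": "file://localhost/tmp/example/report.pdf",
                    "is_directory": False,
                    "id": 2,
                }
            ],
            "management": "writable",
        }
    ],
}
summary = describe_underlying_file(attributes)
```

Because the MIME type lives in exactly one place, the structure and the data source cannot disagree about it, which is the point of the "unrepresentable states" rule.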

This work will probably overlap a bit with Respect range requests. #521 and should be loosely coordinated with it.

danielballan mentioned this issue Aug 21, 2023

Roadmap for v0.1.0 #552

Open

10 tasks

padraic-shafer mentioned this issue Sep 5, 2023

Support for reading and writing data as simple "bytes" #570

Draft

danielballan added this to the v0.1.0 release milestone Nov 3, 2023
