
Update to work with new array schema and LoL pydantic generator #14

Merged
rly merged 20 commits into main from update_metamodel on Sep 19, 2024

Conversation

@rly (Collaborator) commented on Jul 6, 2024

  • Update dumpers and loaders to work with the new "array" element in the LinkML metamodel
  • Create a new abstract base class for dumping a model into a YAML file plus files to store the individual arrays (e.g., npy, hdf5) (see the sketch after this list)
  • Update tests to use the new pydantic generator for arrays that uses list of lists (numpydantic will be a later PR)
  • Update example inputs to tests
  • Use pytest-style tests
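
To make the dumper idea in the second bullet concrete, here is a minimal sketch, with hypothetical class and method names, of how arrays could be written to sidecar .npy files and replaced with file: references in the YAML; it is not the actual implementation in this PR:

from pathlib import Path

import numpy as np
import yaml
from pydantic import BaseModel


class YamlNumpyDumper:
    def dumps(self, model: BaseModel, out_dir: str) -> str:
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        data = model.model_dump()  # pydantic v2; use .dict() on v1
        self._replace_arrays(data, out, prefix=type(model).__name__.lower())
        return yaml.safe_dump(data, sort_keys=False)

    def _replace_arrays(self, node: dict, out_dir: Path, prefix: str) -> None:
        # Recursively swap "values" entries (lists of lists from the LoL generator)
        # for a reference to a sidecar .npy file.
        for key, value in node.items():
            if key == "values" and isinstance(value, (list, np.ndarray)):
                path = out_dir / f"{prefix}.{key}.npy"
                np.save(path, np.asarray(value))
                node[key] = f"file:./{path.as_posix()}"
            elif isinstance(value, dict):
                self._replace_arrays(value, out_dir, f"{prefix}.{key}")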

@sneakers-the-rat commented on Jul 6, 2024

I see the values: file:./path/to/thing.h5 notation and i like it.

I have considered something similar, and i basically worry about the expressiveness and future-proofing (warning: this will be more "structural vs. nominal typing" discourse).

  • Sometimes the file extension is not enough to know what something is. It works 99% of the time, but a major exception is format versioning - what version is that .npz file? zarr is relatively self-describing, but other formats might not be.
  • Sometimes we want to access things that are within a file rather than the file itself. HDF5 is a natural example here. A format like file:./my/file.h5:/some/subpath starts to get a little verbose and also becomes its own microsyntax.
  • Sometimes we want to provide multiple sources for something - in particular i'm interested in being able to say something like "this comes from this local file, but also can be identified by this content hash or that URL"

So this is nice and tidy and concise:

temperature_dataset:
  date:
    values: file:./out/my_temperature.date.values.npy
  day_in_d:
    reference_date: '2020-01-01'
    values: file:./out/my_temperature.day_in_d.values.npy

what about this?

values: { file: ./out/my_temperature.npy }

which is equivalent to

values:
  file: ./out/my_temperature.npy

as a generic "best effort" kind of loading.

that could be optionally expanded like:

values:
  file: ./out/my_temperature.npy
  type: numpy
  version: 1

for example, where each type has its own corresponding model like (pseudocode)

classes:
  Numpy:
    is_a: FileReference
    attributes:
      id: numpy
      file: ...
      version: ...

and then that could also work like

values:
  - file: ./out_my_temperature.npy
  - hash: sha256:awrguiertgsrgsergioul
  - file: ./out_my_temp.h5
    type: hdf5
    dataset: /some/subpath

and then importantly each of those makes room for plugins - to make a new type, one just needs to implement a loader class with the relevant schema and whatever kind of plugin hooks we want to provide (a decorator or similar). The loader then tries each of the possible sources, etc.
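
A rough sketch of that plugin hook, with made-up names (a READERS registry and a reader decorator), assuming each source is a dict with at least a file key:

from typing import Callable

import numpy as np

READERS: dict[str, Callable[[dict], np.ndarray]] = {}


def reader(type_id: str):
    # Decorator-style plugin hook: register a reader for a source "type" id.
    def register(func: Callable[[dict], np.ndarray]) -> Callable[[dict], np.ndarray]:
        READERS[type_id] = func
        return func
    return register


@reader("numpy")
def read_numpy(source: dict) -> np.ndarray:
    return np.load(source["file"])


@reader("hdf5")
def read_hdf5(source: dict) -> np.ndarray:
    import h5py  # optional dependency, only needed for hdf5 sources
    with h5py.File(source["file"], "r") as f:
        return f[source["dataset"]][()]

A generic loader could then look up source.get("type"), fall back to guessing from the file extension for the "best effort" case, and try each registered reader in turn.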

that would be useful for e.g. a case like DANDI, where on upload they might automatically augment the file to include the archival links like

values:
  - file: ./my_temperature.npy
  - file: https://dandiarchive.org/some/url/my_temperature.npy

where if you just download the yaml file without the rest of the data, a loader would be able to grab it on access.

i know that kind of spec runs counter to the nominal typing of linkml, which would probably want to do something like

values:
  numpy_file: ./my_temperature.npy
  file_url: https://...

but i think the need for introspection is worth it for the ability to define plugins and extend the system

@rly (Collaborator, Author) commented on Jul 7, 2024

@sneakers-the-rat I agree 100% on all points! I actually have a draft for something very similar to

values:
  - file: ./out_my_temperature.npy
    hash: sha256:awrguiertgsrgsergioul
  - file: ./out_my_temp.h5
    type: hdf5
    dataset: /some/subpath
  - file: https://dandiarchive.org/some/url/my_temperature.npy

in a private repo reimagining HDMF (I know, it's not very useful to have ideas and not share them XD). I especially like this form because an API can try to resolve the values list in order and one could set up a backup remote file location as the second file in case the first file is not available / cached locally.
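
As a sketch of that in-order resolution (a hypothetical helper, assuming each source carries a file key that is either a local path or a URL):

import tempfile
import urllib.request
from pathlib import Path

import numpy as np


def resolve_values(sources: list[dict]) -> np.ndarray:
    # Try each candidate source in order; a remote URL can back up a missing local file.
    last_error: Exception | None = None
    for source in sources:
        ref = source["file"]
        try:
            if ref.startswith(("http://", "https://")):
                # Remote fallback: download on access, then load as .npy.
                with urllib.request.urlopen(ref) as response:
                    with tempfile.NamedTemporaryFile(suffix=".npy", delete=False) as tmp:
                        tmp.write(response.read())
                        ref = tmp.name
            elif not Path(ref).exists():
                raise FileNotFoundError(ref)
            return np.load(ref)
        except Exception as err:  # move on to the next candidate source
            last_error = err
    raise RuntimeError(f"no source could be resolved: {last_error}")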

We could further encode the dtype, shape, compression parameters, and byte locations of each individual chunk of an array so that 1) we can introspect the dataset without loading the file, and 2) retrieve each chunk from a remote data store very quickly. This is what the dataset spec of LINDI does. I think it would be awesome to merge the features of linkml-arrays, numpydantic, NWB, and LINDI all together, somehow, though I am not quite sure how to do the LINDI part just yet...

A values entry might look something like this:

values:
  dtype: "<i2"  # move fields that *should* be the same across all file formats here
  shape: [203973840, 64]
  source:
    - file: ./out_my_temperature.npy
      hash: sha256:awrguiertgsrgsergioul
    - file: ./out_my_temp.h5
      type: hdf5
      dataset: /some/subpath
    - file: https://dandiarchive.org/some/url/my_temperature.npy  # remote file
    - file: https://dandiarchive.org/some/url/my_temperature.bin  # flat binary file
      type: binary
      order: C
    - chunks: [99597,1]  # lindi format (OR just point to a LINDI JSON file and provide a key just like the "dataset" key for the hdf5 file? but this data YAML serves the same purpose as the LINDI JSON file)
      fill_value: 0
      filters:
        - id: zlib
          level: 4
      order: C
      references:
        - "0.0": ["{{u1}}", 247075087, 157629],
        - "0.1": ["{{u1}}", 247232716, 156928],
        - ...
      templates:
        - u1: https://api.dandiarchive.org/api/dandisets/000939/versions/0.240327.2229/assets/56d875d6-a705-48d3-944c-53394a389c85/download/

The reference "0.0": ["{{u1}}", 247075087, 157629] says that chunk (0,0) can be accessed by going to the dandi url specified in the u1 template, seeking to byte 247075087, reading 157629 bytes, and decompressing the data with zlib (gzip) level 4. The file referred to by u1 can be any file that encodes the array in a compatible format, and it can be local or remote. This form of data specification can be its own LINDI plugin.
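
Read concretely, and assuming the zlib filter yields a raw zlib stream that Python's zlib module can decompress, fetching chunk (0,0) would look roughly like this (dtype, chunk shape, offset, and length taken from the example above):

import urllib.request
import zlib

import numpy as np

url = "https://api.dandiarchive.org/api/dandisets/000939/versions/0.240327.2229/assets/56d875d6-a705-48d3-944c-53394a389c85/download/"
offset, length = 247075087, 157629

# HTTP range request: seek to the byte offset and read that many bytes of the remote file.
request = urllib.request.Request(url, headers={"Range": f"bytes={offset}-{offset + length - 1}"})
with urllib.request.urlopen(request) as response:
    compressed = response.read()

# Decompress the zlib-compressed chunk and reinterpret it as the declared dtype and chunk shape.
chunk = np.frombuffer(zlib.decompress(compressed), dtype="<i2").reshape(99597, 1)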

I agree - this does not feel very linkml-y, but I think it is worth it and necessary to handle the variety of array data formats that people want to use and the variety of ways to optimize data access given the large data sizes and use cases.

It might be a lot to propose all that at once, but I think we should, so we have a public target vision that we can iterate on.

I suggest we keep this PR simple though. I'll change values to be a list of dictionaries with the file key, so we don't have to change too much as we move forward. What do you think?

All to say, I am all in favor of what you proposed and would like to hash these ideas out further!

@rly merged commit bacaf30 into main on Sep 19, 2024
3 checks passed
@rly deleted the update_metamodel branch on September 19, 2024, 16:20