Update to work with new array schema and LoL pydantic generator #14
Conversation
rly commented Jul 6, 2024
- Update dumpers and loaders to work with the new "array" element in the LinkML metamodel
- Create a new abstract base class for dumping a model into a YAML file plus files that store the individual arrays (e.g., npy, hdf5); see the sketch after this list
- Update tests to use the new pydantic generator for arrays, which uses lists of lists (numpydantic will come in a later PR)
- Update example inputs to tests
- Use pytest-style tests
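For orientation, here is a minimal sketch of what the abstract base class in the second bullet could look like. This is a sketch under assumed names (`YamlArrayFileDumper`, `NumpyDumper`, and the file-naming scheme are all illustrative), not the PR's actual API:

```python
# Hypothetical sketch of the abstract dumper described above; names and
# structure are illustrative, not the actual linkml-arrays API.
from abc import ABC, abstractmethod
from pathlib import Path

import numpy as np
import yaml
from pydantic import BaseModel


class YamlArrayFileDumper(ABC):
    """Dump a pydantic model to YAML, writing each array to its own file."""

    @classmethod
    @abstractmethod
    def write_array(cls, array: np.ndarray, path_stem: Path) -> Path:
        """Write one array to disk (e.g., .npy, .h5) and return the file path."""

    @classmethod
    def dumps(cls, model: BaseModel, out_dir: Path) -> str:
        data = model.model_dump()
        # Replace top-level in-memory arrays with "file:..." references.
        for key, value in data.items():
            if isinstance(value, (list, np.ndarray)):
                path = cls.write_array(np.asarray(value), out_dir / f"{key}.values")
                data[key] = f"file:{path}"
        return yaml.safe_dump(data)


class NumpyDumper(YamlArrayFileDumper):
    """Writes each array as a .npy file."""

    @classmethod
    def write_array(cls, array: np.ndarray, path_stem: Path) -> Path:
        path = Path(str(path_stem) + ".npy")
        np.save(path, array)
        return path
```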
sneakers-the-rat commented

I see the idea here. I have considered something similar, and I basically worry about the expressiveness and future-proofing (warning: this will be more "structural vs. nominal typing" discourse).
So this is nice and tidy and concise:

```yaml
temperature_dataset:
  date:
    values: file:./out/my_temperature.date.values.npy
  day_in_d:
    reference_date: '2020-01-01'
    values: file:./out/my_temperature.day_in_d.values.npy
```

What about this?

```yaml
values: { file: ./out/my_temperature.npy }
```

which is equivalent to

```yaml
values:
  file: ./out/my_temperature.npy
```

as a generic "best effort" kind of loading. That could be optionally expanded like:

```yaml
values:
  file: ./out/my_temperature.npy
  type: numpy
  version: 1
```

for example, where each type is defined in the schema like:

```yaml
classes:
  Numpy:
    is_a: FileReference
    attributes:
      id: numpy
      file: ...
      version: ...
```
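(As a rough illustration, and purely an assumption about generator output rather than what it actually emits, pydantic classes generated from a schema like that might look like the following.)

```python
# Hypothetical sketch of pydantic classes for the schema above; the LinkML
# pydantic generator's real output will differ in details.
from typing import Literal, Optional

from pydantic import BaseModel


class FileReference(BaseModel):
    file: str


class Numpy(FileReference):
    id: Literal["numpy"] = "numpy"
    version: Optional[int] = None
```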
and then that could also work like:

```yaml
values:
  - file: ./out_my_temperature.npy
    hash: sha256:awrguiertgsrgsergioul
  - file: ./out_my_temp.h5
    type: hdf5
    dataset: /some/subpath
```

and then, importantly, each of those makes room for plugins to define a new type. That would be useful for, e.g., a case like DANDI, where on upload they might automatically augment the file to include the archival links, like:

```yaml
values:
  - file: ./my_temperature.npy
  - file: https://dandiarchive.org/some/url/my_temperature.npy
```

where if you just download the YAML you can still resolve the data from the archival copy. I know that kind of spec runs counter to the nominal typing of LinkML, which would probably want to do something like:

```yaml
values:
  numpy_file: ./my_temperature.npy
  file_url: https://...
```

but I think the need for introspection is worth the ability to define plugins and extend the system.
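To make the plugin-and-fallback idea concrete, here is a sketch of how a loader could implement it. The registry, function names, and behavior are all assumptions for illustration, not an existing linkml-arrays API:

```python
import hashlib
import urllib.request
from pathlib import Path

import numpy as np

# Hypothetical registry mapping a "type" id to a loader callable; plugins
# would register new entries (hdf5, zarr, ...) without touching core code.
LOADERS = {
    "numpy": np.load,
}


def resolve(sources: list[dict]) -> np.ndarray:
    """Try each source in order: local files first, remote URLs as fallback."""
    for src in sources:
        file = src["file"]
        if file.startswith(("http://", "https://")):
            local, _ = urllib.request.urlretrieve(file)  # download remote copy
        elif Path(file).exists():
            local = file
        else:
            continue  # missing local file; fall through to the next source
        if "hash" in src:
            algo, _, digest = src["hash"].partition(":")
            actual = hashlib.new(algo, Path(local).read_bytes()).hexdigest()
            if actual != digest:
                continue  # corrupt copy; try the next source
        loader = LOADERS[src.get("type", "numpy")]
        return loader(local)
    raise FileNotFoundError("no source could be resolved")
```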
rly commented

@sneakers-the-rat I agree 100% on all points! I actually have a draft for something very similar to

```yaml
values:
  - file: ./out_my_temperature.npy
    hash: sha256:awrguiertgsrgsergioul
  - file: ./out_my_temp.h5
    type: hdf5
    dataset: /some/subpath
  - file: https://dandiarchive.org/some/url/my_temperature.npy
```

in a private repo reimagining HDMF (I know, it's not very useful to have ideas and not share them XD). I especially like this form because an API can try to resolve the sources in order. We could further encode the dtype, shape, compression parameters, and byte locations of each individual chunk of an array so that 1) we can introspect the dataset without loading the file, and 2) we can retrieve each chunk from a remote data store very quickly. This is what the dataset spec of LINDI does. I think it would be awesome to merge the features of linkml-arrays, numpydantic, NWB, and LINDI all together, somehow, though I am not quite sure how to do the LINDI part just yet... A sketch:

```yaml
values:
dtype: "<i2" # move fields that *should* be the same across all file formats here
shape: [203973840, 64]
source:
- file: ./out_my_temperature.npy
hash: sha256:awrguiertgsrgsergioul
- file: ./out_my_temp.h5
type: hdf5
dataset: /some/subpath
- file: https://dandiarchive.org/some/url/my_temperature.npy # remote file
- file: https://dandiarchive.org/some/url/my_temperature.bin # flat binary file
type: binary
order: C
- chunks: [99597,1] # lindi format (OR just point to a LINDI JSON file and provide a key just like the "dataset" key for the hdf5 file? but this data YAML serves the same purpose as the LINDI JSON file)
fill_value: 0
filters:
- id: zlib
level: 4
order: C
      references:
        "0.0": ["{{u1}}", 247075087, 157629]
        "0.1": ["{{u1}}", 247232716, 156928]
        # ...
      templates:
        u1: https://api.dandiarchive.org/api/dandisets/000939/versions/0.240327.2229/assets/56d875d6-a705-48d3-944c-53394a389c85/download/
```

The references follow the LINDI format. I agree that this does not feel very linkml-y, but I think it is worth it, and necessary to handle both the variety of array data formats that people want to use and the variety of ways to optimize data access given the large data sizes and use cases. It might be a lot to propose all of that at once, but I think we should, so that we have a public target vision that we can iterate on. I suggest we keep this PR simple, though; I'll make that change separately. All to say, I am all in favor of what you proposed and would like to hash these ideas out further!
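To make the chunk-reference idea above concrete, here is a sketch of how a reader could fetch one chunk from the LINDI-style source. It assumes zlib-compressed, C-order chunks as in the YAML; requests is used for the ranged GET, and all names are illustrative rather than LINDI's actual API:

```python
import zlib

import numpy as np
import requests


def read_chunk(entry: dict, key: str, dtype: str = "<i2") -> np.ndarray:
    """Fetch one chunk ("0.0", "0.1", ...) described by a LINDI-style entry.

    `dtype` would come from the top-level `dtype` field of the values block.
    """
    template, offset, length = entry["references"][key]
    # Expand "{{u1}}" against the templates table to get the real URL.
    url = entry["templates"][template.strip("{}")]
    # Ranged GET: read only this chunk's bytes, not the whole remote file.
    resp = requests.get(url, headers={"Range": f"bytes={offset}-{offset + length - 1}"})
    resp.raise_for_status()
    raw = zlib.decompress(resp.content)  # filters: [{id: zlib, level: 4}]
    return np.frombuffer(raw, dtype=dtype).reshape(entry["chunks"], order="C")
```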