Function for converting NWB file to AssetMeta instance #226
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #226      +/-   ##
==========================================
- Coverage   82.26%   82.05%   -0.22%
==========================================
  Files          55       56       +1
  Lines        4623     5020     +397
==========================================
+ Hits         3803     4119     +316
- Misses        820      901      +81
```
@jwodder - kind of. unless we change get_metadata to already do the mapping, you will likely need a transform or function that computes the asset field given the metadata. so perhaps something like:
and then after it is done you can check validation by:
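A hedged sketch of what such a transform and validation step could look like; the helper name `metadata2asset` and the exact field mapping are illustrative, and the validation check relies on pydantic models validating at instantiation:

```python
from dandi.models import AssetMeta

def metadata2asset(metadata: dict) -> AssetMeta:
    """Illustrative transform from get_metadata() output to an AssetMeta."""
    fields = {
        "path": metadata.get("path"),
        "contentSize": metadata.get("size"),
        # ... one entry per AssetMeta field that has an NWB counterpart ...
    }
    # Constructing the pydantic model runs its validators, so this doubles
    # as the validation check:
    return AssetMeta(**{k: v for k, v in fields.items() if v is not None})
```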
FWIW: I merged #206 into master.
So, overall flow I think should be
I think having an explicit adapter specification makes sense. Somewhat of a "tricky" part is that e.g. Digests we estimate during upload (not on the fly yet, like we already do in download) and enhance the metadata record with them, so we should indeed allow for that. As for the mapping itself: instead of

```python
contentSize: str = Field(nskey="schema")
path: str = Field(None, nskey="dandi")
```

have smth like

```python
contentSize: str = Field(nskey="schema", adapter="size")
path: str = Field(None, nskey="dandi", adapter="path")
```

for basic 1-to-1 mappings, but also allow for callables like

```python
def adapter_age(metadata):
    age = metadata["age"]
    # convert age to ISO 8601
    # create property value
    return property_value
```

and then have

```python
age: Optional[PropertyValue] = Field(..., adapter=adapter_age)  # keeping all current arguments
```
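One way such `adapter` declarations could be consumed, sketched under the assumption of pydantic v1, where unrecognized `Field()` keyword arguments land in `field_info.extra`; the `MiniAsset` model is a trimmed stand-in, not the real `AssetMeta`:

```python
from typing import Any, Dict, Optional
from pydantic import BaseModel, Field

class MiniAsset(BaseModel):
    # trimmed stand-in for AssetMeta, with adapter declarations as proposed
    contentSize: Optional[str] = Field(None, nskey="schema", adapter="size")
    path: Optional[str] = Field(None, nskey="dandi", adapter="path")

def apply_adapters(model_cls, metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Compute model field values from raw metadata via per-field adapters."""
    out = {}
    for name, field in model_cls.__fields__.items():
        adapter = field.field_info.extra.get("adapter")
        if callable(adapter):
            out[name] = adapter(metadata)   # callables transform the record
        elif isinstance(adapter, str) and adapter in metadata:
            out[name] = metadata[adapter]   # strings are 1-to-1 key renames
    return out

asset = MiniAsset(**apply_adapters(MiniAsset, {"size": "18 kB", "path": "sub-01.nwb"}))
```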
makes sense to me. the only reason i would suggest the standalone function for now instead of an adapter is that we may want to push nwb to align as closely as possible in the future. and if ever we move to attrs from pydantic, it already has a notion of converter. the only reason we are using pydantic is to stay close to jsonschema and fastapi for now.
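For reference, the attrs converter mechanism mentioned above; the `Asset` class and the age normalization here are purely illustrative:

```python
import attr

def age_to_iso8601(value):
    # toy conversion: "30 days" -> "P30D"; real parsing would be more careful
    number, unit = value.split()
    return f"P{number}{unit[0].upper()}"

@attr.s
class Asset:
    age = attr.ib(converter=age_to_iso8601)

assert Asset(age="30 days").age == "P30D"
```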
I have enhanced the code according to the above suggestions. However, I would appreciate it if someone could provide a comprehensive example of NWB metadata (as extracted with `get_metadata`).
@jwodder - a starting point would be to extract the nwb metadata for a few files from each of the datasets we have access to rather than something completely comprehensive. to save time, i think @yarikoptic was going to do this on his backup system.
@satra I'm interested in the fields listed at lines 9 to 45 in 00414f5 ... where should everything end up in the resulting `AssetMeta`?
@jwodder - you kind of have to go field by field and ask whether this information is somewhere in the base metadata. and remember some fields are themselves structures - like biosample. easiest place to start is likely line 22 in 00414f5, most of which go into biosample. we may also need to augment the model to allow for other things like number of electrodes, etc.; there is no specific place for this in the model. other aspects such as experimenter could be mapped to contributor with role Researcher. I think most of the fields have some description, but feel free to ask if something is unclear.
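A hedged sketch of that experimenter-to-contributor mapping, using plain dicts since the exact model types and role identifiers may differ:

```python
def experimenters_to_contributors(metadata):
    """Map NWB `experimenter` entries to contributor records carrying the
    Researcher role, as suggested above (the record shape is illustrative)."""
    return [
        {"name": name, "roleName": ["dandi:Researcher"]}
        for name in metadata.get("experimenter") or []
    ]
```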
@satra Specific questions, then:
identifier should be the corresponding url from an ontology, name would be the label. more generally we can translate a few of those to enum types as we have done here: https://github.com/dandi/dandi-cli/blob/master/dandi/models.py#L87
assayType should come from OBI (which is a large ontology for biomedical investigations). the question here is how to map nwb "datatypes" to this.
correct, with unitText set to "Years from birth" for now, or "Weeks from conception" for gestational weeks.
let's ask a few folks for hints. @tgbugs - we are back to "where do we find enumerations for all kinds of things", and @bendichter - we are looking at how we get people to put the right info into nwb, or get it out of existing nwb (which does not have any ontology attached yet). the specific considerations here are:
all types are listed here: https://github.com/dandi/dandi-cli/blob/master/dandi/models.py#L87 and here: https://github.com/dandi/dandi-cli/blob/master/dandi/models.py#L159
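A minimal sketch of the age rule above, assuming the `PropertyValue` model in dandi/models.py accepts schema.org-style `value` and `unitText` fields:

```python
from dandi.models import PropertyValue

def age_to_property_value(age, gestational=False):
    """Wrap a numeric age into a PropertyValue with the unit text above."""
    unit = "Weeks from conception" if gestational else "Years from birth"
    return PropertyValue(value=age, unitText=unit)
```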
I think modality and measurementTechnique would mostly come from which neurodata_types are being used in the NWB file; the presence of particular types would each point to a particular modality and technique. Many experiments are combinations of these. Does this schema allow multiple modalities/measurementTechniques?
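One way to sketch that idea is a lookup keyed on the neurodata types found in the file; the table entries below are illustrative examples, not an agreed-upon mapping:

```python
NEURODATA_TYPE_TO_MODALITY = {
    "ElectricalSeries": "ecephys",   # extracellular electrophysiology
    "PatchClampSeries": "icephys",   # intracellular electrophysiology
    "TwoPhotonSeries": "ophys",      # optical physiology
}

def guess_modalities(neurodata_types):
    """Collect the distinct modalities implied by the types in a file."""
    return sorted(
        {
            NEURODATA_TYPE_TO_MODALITY[t]
            for t in neurodata_types
            if t in NEURODATA_TYPE_TO_MODALITY
        }
    )
```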
Would you be able to provide an example of assayType or dataType? I'm not sure if NWB stores this information.
yes, https://github.com/dandi/dandi-cli/blob/master/dandi/models.py#L574 -- all those are

```python
modality: List[ModalityType] = Field(readonly=True, nskey="dandi")
measurementTechnique: List[MeasurementTechniqueType] = Field(
    readonly=True, nskey="schema"
)
variableMeasured: List[PropertyValue] = Field(readonly=True, nskey="schema")
```

I think we have decided to postpone dealing with assets which have no useful data recorded (thus not optional - requires at least a single entry in the list), and I guess it would help if we provide heuristics for all the ones we care about ATM ;) (since we cannot just leave the list empty if we find no relevant one)
All the techniques and modalities are in InterLex now. @tmsincomb have you had a chance to ingest the subClassOf for the modalities (the modelling of the techniques isn't amenable to ingesting those right at the moment)?
@tgbugs
@satra I've added a test case for conversion of a metadata dict.
@tgbugs The modality superclasses are in now.
@tmsincomb thanks! The subClassOf closure for modalities (experimental approaches) is visible here.
@bendichter at the technique level the ontology distinguishes between intracellular and extracellular.
FWIW: I have submitted information to get an RRID for dandi-cli. I will report back whenever it gets registered.
I think it might be a good time to also introduce here RF to
* upstream/master: (116 commits)
  ENH: basic duecredit support
  DOC: strip away duplicate with the handbook information
  Update CHANGELOG.md [skip ci]
  Tweak pubschemata commit message
  Adjust workflow path triggers
  Add comment to numpy API warning ignore setting
  Workflow for publishing model schemata to dandi/schema
  [#275] Ignore numpy API warning
  Support h5py 3.0
  Add healthchecks for the Postgres and minio Docker containers
  Include item path in "Multiple files found for item" message
  Copy files with `cp --reflink=auto` where supported
  Test the rest of keyring_lookup()
  More keyring_lookup() tests
  Test askyesno()
  Test that keyring backends specified with env vars take precedence over config files
  Test keyring_lookup() when backend is set via config file
  Test asking for an API key twice via input()
  Test keyring_lookup() when backend is set via env var
  Basic test of getting an API key via input()
  ...
* upstream/master:
  Update CHANGELOG.md [skip ci]
  change from disease to disorder
  Fix publish-schemata workflow
  updated just models
  BF: add h5py.__version__ into the list of tokens for caching
Some schema updates
@jwodder @yarikoptic - do you think we should add the dandiset metadata mapper from old version to new in this PR or a different PR?
This PR already tunes
Let's right away make mapping a bit more "robust" by at least catching ambiguous items.
Alternative would be:

- establish some `schema_mappings.yaml` which would be hosted within this repo and contain exact values we encounter in data, to be mapped to the corresponding term. We will update it whenever we encounter new loose values
- make `dandi-cli` fetch the updated version once in a while (or forcefully if requested) and cache it locally on the user's drive to be used instead of the copy shipped within. This way we could let people update the mapping without updating dandi-cli (although since its releases are quite automated now, we could even not bother about this feature for now)

edit: let's forget about the `.yaml` file idea, it probably would not be scalable etc, so IMHO we should keep it a `.py`, but maybe RF later on so we could indeed fetch a fresher "mapping" file to be used instead of the shipped one.
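A small sketch of the "catching ambiguous items" idea with a species lookup; the NCBITaxon URLs are real, but the table and matching logic are illustrative:

```python
SPECIES_MAP = {
    "mouse": "http://purl.obolibrary.org/obo/NCBITaxon_10090",
    "house mouse": "http://purl.obolibrary.org/obo/NCBITaxon_10090",
    "rat": "http://purl.obolibrary.org/obo/NCBITaxon_10116",
}

def map_species(value):
    """Resolve a free-form species string, refusing to guess on ambiguity."""
    matches = {url for key, url in SPECIES_MAP.items() if key in value.lower()}
    if len(matches) > 1:
        raise ValueError(f"Ambiguous species value: {value!r}")
    return matches.pop() if matches else None
```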
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
add dandimeta migration
i have added the converter for dandiset metadata as well now.
@jwodder please
@yarikoptic I've satisfied the linters and added an
Great, thank you @jwodder -- let's proceed
This PR requires PR #206.
@satra Is this what you had in mind?