Function for converting NWB file to AssetMeta instance #226
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #226      +/-   ##
==========================================
- Coverage   82.26%   82.05%   -0.22%
==========================================
  Files          55       56       +1
  Lines        4623     5020     +397
==========================================
+ Hits         3803     4119     +316
- Misses        820      901      +81
```
@jwodder - kind of. unless we change get_metadata to already do the mapping, you will likely need a transform or function that computes the asset field given the metadata. so perhaps something like:
and then after it is done you can check validation by:
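A hedged sketch of what such a transform and validation step could look like; the helper name `metadata2asset` and the exact field mapping are illustrative, and the validation check relies on pydantic models validating at instantiation:

```python
from dandi.models import AssetMeta

def metadata2asset(metadata: dict) -> AssetMeta:
    """Illustrative transform from get_metadata() output to an AssetMeta."""
    fields = {
        "path": metadata.get("path"),
        "contentSize": metadata.get("size"),
        # ... one entry per AssetMeta field that has an NWB counterpart ...
    }
    # Constructing the pydantic model runs its validators, so this doubles
    # as the validation check:
    return AssetMeta(**{k: v for k, v in fields.items() if v is not None})
```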
FWIW: I merged #206 into master.
So, overall flow I think should be
I think having an explicit adapter specification makes sense. Somewhat of a "tricky" part is that e.g. Digests we estimate during upload (not on the fly yet, like we already do in download) and enhance the metadata record with them, so we should indeed allow for that. As for the mapping itself: instead of

```python
contentSize: str = Field(nskey="schema")
path: str = Field(None, nskey="dandi")
```

have smth like

```python
contentSize: str = Field(nskey="schema", adapter="size")
path: str = Field(None, nskey="dandi", adapter="path")
```

for basic 1-to-1 mappings, but also allow for callables like

```python
def adapter_age(metadata):
    age = metadata["age"]
    # convert age to ISO 8601
    # create property value
    return property_value
```

and then have

```python
age: Optional[PropertyValue] = Field(..., adapter=adapter_age)  # keeping all current arguments
```
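One way such `adapter` declarations could be consumed, sketched under the assumption of pydantic v1, where unrecognized `Field()` keyword arguments land in `field_info.extra`; the `MiniAsset` model is a trimmed stand-in, not the real `AssetMeta`:

```python
from typing import Any, Dict, Optional
from pydantic import BaseModel, Field

class MiniAsset(BaseModel):
    # trimmed stand-in for AssetMeta, with adapter declarations as proposed
    contentSize: Optional[str] = Field(None, nskey="schema", adapter="size")
    path: Optional[str] = Field(None, nskey="dandi", adapter="path")

def apply_adapters(model_cls, metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Compute model field values from raw metadata via per-field adapters."""
    out = {}
    for name, field in model_cls.__fields__.items():
        adapter = field.field_info.extra.get("adapter")
        if callable(adapter):
            out[name] = adapter(metadata)   # callables transform the record
        elif isinstance(adapter, str) and adapter in metadata:
            out[name] = metadata[adapter]   # strings are 1-to-1 key renames
    return out

asset = MiniAsset(**apply_adapters(MiniAsset, {"size": "18 kB", "path": "sub-01.nwb"}))
```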
makes sense to me. the only reason i would suggest the standalone function for now instead of an adapter is that we may want to push nwb to align as closely as possible in the future. and if ever we move to attrs from pydantic, it already has a notion of converter. the only reason we are using pydantic is to stay close to jsonschema and fastapi for now.
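For reference, the attrs converter mechanism mentioned above; the `Asset` class and the age normalization here are purely illustrative:

```python
import attr

def age_to_iso8601(value):
    # toy conversion: "30 days" -> "P30D"; real parsing would be more careful
    number, unit = value.split()
    return f"P{number}{unit[0].upper()}"

@attr.s
class Asset:
    age = attr.ib(converter=age_to_iso8601)

assert Asset(age="30 days").age == "P30D"
```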
I have enhanced the code according to the above suggestions. However, I would appreciate it if someone could provide a comprehensive example of NWB metadata (as extracted with `get_metadata`).
@jwodder - a starting point would be to extract the nwb metadata for a few files from each of the datasets we have access to rather than something completely comprehensive. to save time, i think @yarikoptic was going to do this on his backup system.
@satra I'm interested in the fields listed at lines 9 to 45 in 00414f5 ... where should everything end up in the resulting `AssetMeta`?
@jwodder - you kind of have to go field by field and ask whether this information is somewhere in the base metadata. and remember some fields are themselves structures - like biosample. easiest place to start is likely line 22 in 00414f5, most of which go into biosample. we may also need to augment the model to allow for other things like number of electrodes, etc.; there is no specific place for this in the model. other aspects such as experimenter could be mapped to contributor with role Researcher. I think most of the fields have some description, but feel free to ask if something is unclear.
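A hedged sketch of that experimenter-to-contributor mapping, using plain dicts since the exact model types and role identifiers may differ:

```python
def experimenters_to_contributors(metadata):
    """Map NWB `experimenter` entries to contributor records carrying the
    Researcher role, as suggested above (the record shape is illustrative)."""
    return [
        {"name": name, "roleName": ["dandi:Researcher"]}
        for name in metadata.get("experimenter") or []
    ]
```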
@satra Specific questions, then:
identifier should be the corresponding url from an ontology, name would be the label. more generally we can translate a few of those to enum types as we have done here: https://github.com/dandi/dandi-cli/blob/master/dandi/models.py#L87
assayType should come from OBI (which is a large ontology for biomedical investigations). the question here is how to map nwb "datatypes" to this.
correct, with unitText set to "Years from birth" for now, or "Weeks from conception" for gestational weeks.
let's ask a few folks for hints. @tgbugs - we are back to "where do we find enumerations for all kinds of things", and @bendichter - we are looking at how we get people to put the right info into nwb, or get it out of existing nwb (which does not have any ontology attached yet). the specific considerations here are:
all types are listed here: https://github.com/dandi/dandi-cli/blob/master/dandi/models.py#L87 and here: https://github.com/dandi/dandi-cli/blob/master/dandi/models.py#L159
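A minimal sketch of the age rule above, assuming the `PropertyValue` model in dandi/models.py accepts schema.org-style `value` and `unitText` fields:

```python
from dandi.models import PropertyValue

def age_to_property_value(age, gestational=False):
    """Wrap a numeric age into a PropertyValue with the unit text above."""
    unit = "Weeks from conception" if gestational else "Years from birth"
    return PropertyValue(value=age, unitText=unit)
```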
I think modality and measurementTechnique would mostly come from which neurodata_types are being used in the NWB file; the presence of particular types would each point to a particular modality and technique. Many experiments are combinations of these. Does this schema allow multiple modalities/measurementTechniques?
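One way to sketch that idea is a lookup keyed on the neurodata types found in the file; the table entries below are illustrative examples, not an agreed-upon mapping:

```python
NEURODATA_TYPE_TO_MODALITY = {
    "ElectricalSeries": "ecephys",   # extracellular electrophysiology
    "PatchClampSeries": "icephys",   # intracellular electrophysiology
    "TwoPhotonSeries": "ophys",      # optical physiology
}

def guess_modalities(neurodata_types):
    """Collect the distinct modalities implied by the types in a file."""
    return sorted(
        {
            NEURODATA_TYPE_TO_MODALITY[t]
            for t in neurodata_types
            if t in NEURODATA_TYPE_TO_MODALITY
        }
    )
```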
Would you be able to provide an example of assayType or dataType? I'm not sure if NWB stores this information.
yes, https://github.com/dandi/dandi-cli/blob/master/dandi/models.py#L574 -- all those are

```python
modality: List[ModalityType] = Field(readonly=True, nskey="dandi")
measurementTechnique: List[MeasurementTechniqueType] = Field(
    readonly=True, nskey="schema"
)
variableMeasured: List[PropertyValue] = Field(readonly=True, nskey="schema")
```

I think we have decided to postpone dealing with assets which have no useful data recorded (thus not optional - requires at least a single entry in the list), and I guess it would help if we provide heuristics for all the ones we care about ATM ;) (since we cannot just leave the list empty if we find no relevant one)
All the techniques and modalities are in InterLex now. @tmsincomb have you had a chance to ingest the subClassOf for the modalities (the modelling of the techniques isn't amenable to ingesting those right at the moment)?
@tgbugs
@satra I've added a test case for conversion of a metadata dict.
@tgbugs The modality superclasses are in now.
@tmsincomb thanks! The subClassOf closure for modalities (experimental approaches) is visible here.
@bendichter at the technique level the ontology distinguishes between intracellular and extracellular.
FWIW: I have submitted information to get an RRID for dandi-cli. I will report back whenever it gets registered.
I think it might be a good time to also introduce here RF to
* upstream/master: (116 commits)
  ENH: basic duecredit support
  DOC: strip away duplicate with the handbook information
  Update CHANGELOG.md [skip ci]
  Tweak pubschemata commit message
  Adjust workflow path triggers
  Add comment to numpy API warning ignore setting
  Workflow for publishing model schemata to dandi/schema
  [#275] Ignore numpy API warning
  Support h5py 3.0
  Add healthchecks for the Postgres and minio Docker containers
  Include item path in "Multiple files found for item" message
  Copy files with `cp --reflink=auto` where supported
  Test the rest of keyring_lookup()
  More keyring_lookup() tests
  Test askyesno()
  Test that keyring backends specified with env vars take precedence over config files
  Test keyring_lookup() when backend is set via config file
  Test asking for an API key twice via input()
  Test keyring_lookup() when backend is set via env var
  Basic test of getting an API key via input()
  ...
* upstream/master:
  Update CHANGELOG.md [skip ci]
  change from disease to disorder
  Fix publish-schemata workflow
  updated just models
  BF: add h5py.__version__ into the list of tokens for caching
Some schema updates
@jwodder @yarikoptic - do you think we should add the dandiset metadata mapper from old version to new in this PR or a different PR?
This PR already tunes
Let's right away make mapping a bit more "robust" by at least catching ambiguous items.
Alternative would be:

- establish some `schema_mappings.yaml` which would be hosted within this repo and contain exact values we encounter in data, to be mapped to the corresponding term. We will update it whenever we encounter new loose values
- make `dandi-cli` fetch the updated version once in a while (or forcefully if requested) and cache it locally on the user's drive to be used instead of the copy shipped within. This way we could let people update the mapping without updating dandi-cli (although since its releases are quite automated now, we could even not bother about this feature for now)

edit: let's forget about the `.yaml` file idea, it probably would not be scalable etc, so IMHO we should keep it a `.py`, but maybe RF later on so we could indeed fetch a fresher "mapping" file to be used instead of the shipped one.
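A small sketch of the "catching ambiguous items" idea with a species lookup; the NCBITaxon URLs are real, but the table and matching logic are illustrative:

```python
SPECIES_MAP = {
    "mouse": "http://purl.obolibrary.org/obo/NCBITaxon_10090",
    "house mouse": "http://purl.obolibrary.org/obo/NCBITaxon_10090",
    "rat": "http://purl.obolibrary.org/obo/NCBITaxon_10116",
}

def map_species(value):
    """Resolve a free-form species string, refusing to guess on ambiguity."""
    matches = {url for key, url in SPECIES_MAP.items() if key in value.lower()}
    if len(matches) > 1:
        raise ValueError(f"Ambiguous species value: {value!r}")
    return matches.pop() if matches else None
```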
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
add dandimeta migration
i have added the converter for dandiset metadata as well now.
@jwodder please
@yarikoptic I've satisfied the linters and added an
Great, thank you @jwodder -- let's proceed
This PR requires PR #206.
@satra Is this what you had in mind?