Struct codec #601
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master     #601      +/-   ##
==========================================
+ Coverage   87.31%   87.40%   +0.08%
==========================================
  Files          22       22
  Lines       16758    16904     +146
  Branches     3281     3315      +34
==========================================
+ Hits        14633    14775     +142
- Misses       1040     1042       +2
- Partials     1085     1087       +2
==========================================
```

Continue to review the full report at Codecov.
I'm not parsing exactly what's allowed here - will SLiM's metadata fit? It sounds like yes, except maybe the mutation metadata, which is a variable-length list of fixed structs, but I can't tell?
The only things that are not allowed are objects that have missing or extra keys, so a variable-length list of fixed objects that all have all their keys is fine.
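To make the constraint above concrete, here is a minimal sketch of a variable-length list of fixed structs using Python's `struct` module. The length-prefix layout, field names, and record format are assumptions for illustration only, not tskit's actual wire format.

```python
import struct

# Hypothetical layout: an 8-byte little-endian count ("<Q"), followed by
# that many fixed-size records, each packed as "<if" (int32 + float32).
# Every record has exactly the same keys, which is what the codec requires.
record_fmt = "<if"
record_size = struct.calcsize(record_fmt)

def encode_records(records):
    buf = struct.pack("<Q", len(records))
    for r in records:
        buf += struct.pack(record_fmt, r["id"], r["value"])
    return buf

def decode_records(buf):
    (n,) = struct.unpack_from("<Q", buf, 0)
    offset = struct.calcsize("<Q")
    out = []
    for _ in range(n):
        i, v = struct.unpack_from(record_fmt, buf, offset)
        out.append({"id": i, "value": v})
        offset += record_size
    return out

data = encode_records([{"id": 1, "value": 0.5}, {"id": 2, "value": 1.5}])
print(decode_records(data))  # [{'id': 1, 'value': 0.5}, {'id': 2, 'value': 1.5}]
```

Because keys and formats live in the schema, only the array count appears in the encoded bytes.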
@jeromekelleher Just needs docs. So I don't have to rewrite docs after changes maybe review the code now? |
Looks great, thanks @benjeffery! Some comments above, mostly about testing and breaking the code up a little bit.
One thought - can/should we include padding bytes? I can foresee problems if we don't make some allowance for them. For example, someone might dump a C struct directly into the metadata, and if we can't handle padding we won't be able to work with it.
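The padding concern above can be illustrated with Python's `struct` module: native mode applies C-style alignment padding, while the standard modes (`<`, `>`, `=`) pack with no padding, so a dumped C struct and a standard-mode format can disagree on size unless explicit `x` pad bytes are added. A small demonstration:

```python
import struct

# A C struct {char c; int i;} typically occupies 8 bytes natively
# (3 alignment-padding bytes after the char), but only 5 bytes when
# packed in standard little-endian mode with no padding.
print(struct.calcsize("@ci"))   # native mode: usually 8 on common platforms
print(struct.calcsize("<ci"))   # standard mode: always 5

# Explicit "x" pad bytes let a standard-mode format mirror a padded C layout:
print(struct.calcsize("<c3xi"))  # 8
```

So one way to allow for padding is to support `x` characters in `binaryFormat` strings.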
python/tskit/metadata.py (Outdated)

```python
element_decoder = make_decode(sub_schema["items"])
array_length_size = struct.calcsize("<Q")

def array_decode(bytes_, offset):
```
Just a comment really. I find `bytes_` a bit awkward and would tend to use `buff` or `buffer` for variables like this. No biggie either way.
Buffer is better, renamed.
python/tskit/metadata.py (Outdated)

```python
elif sub_schema["type"] == "string":
    size = struct.calcsize("<" + sub_schema["binaryFormat"])
    return lambda bytes_, offset: (
```
These lambdas and private functions are getting a bit cryptic. Maybe we could make these into a set of small module-level functions instead, like `_struct_codec_decode_string` or something? This would also make it easier to do a few small unit tests on them.
Broken up and individually tested.
python/tskit/metadata.py
Outdated
except KeyError: | ||
raise exceptions.MetadataSchemaValidationError( | ||
f"Unrecognised metadata codec '{schema['codec']}'. " | ||
f"Valid options are {str(list(codec_registry.keys()))}." | ||
) | ||
codec_instance = codec_cls(schema) | ||
# If the codec has modified the schema on __init__, get it back. |
This is strange - need a bit more explanation of what's going on in the comment here.
Refactored to make intent clear in 0617b60
Oh yeah, another nice test: take some SLiM metadata encoded as bytes (we can just put the value directly into the source), hand-craft a schema to decode it, and then compare the decoded values against the known values. Should be easy enough to get an example and the known values using pyslim. This is exciting!!! Genuine interoperability across programs and platforms!!!
From a call today: for interop we need both padding and byte order - use …
For the string type, add an encoding to the schema (default utf-8).
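A per-field encoding setting might look like the following sketch. The key name `stringEncoding` and the helper are illustrative assumptions, not tskit's settled API.

```python
import struct

def decode_string_field(sub_schema, buffer, offset):
    """Decode a fixed-size string using the field's declared encoding.

    Hypothetical schema key "stringEncoding" falls back to utf-8.
    """
    fmt = "<" + sub_schema["binaryFormat"]
    size = struct.calcsize(fmt)
    (raw,) = struct.unpack_from(fmt, buffer, offset)
    encoding = sub_schema.get("stringEncoding", "utf-8")
    return raw.rstrip(b"\x00").decode(encoding), offset + size

field = {"binaryFormat": "10s", "stringEncoding": "latin-1"}
buf = struct.pack("<10s", "café".encode("latin-1"))
print(decode_string_field(field, buf, 0))  # ('café', 10)
```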
I've managed to get some SLiM-generated metadata decoded in b73c80f. However, there is an issue: as detailed at https://github.com/tskit-dev/pyslim/blob/master/pyslim/slim_metadata.py#L66, SLiM uses the length of the metadata buffer to encode the length of the mutations array. This is not supported in the current metaschema. We could support this by adding a …
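Length-free decoding of the kind SLiM uses can be sketched as follows: instead of reading a stored count, keep consuming fixed-size records until the buffer is exhausted. The record format here is illustrative.

```python
import struct

record_fmt = "<if"  # hypothetical fixed record: int32 + float32
record_size = struct.calcsize(record_fmt)

def decode_exhaust(buffer, offset=0):
    """Read records until the end of the buffer; the buffer length
    itself determines the array length, so no count is stored."""
    records = []
    while offset < len(buffer):
        records.append(struct.unpack_from(record_fmt, buffer, offset))
        offset += record_size
    return records

buf = struct.pack(record_fmt, 1, 0.5) + struct.pack(record_fmt, 2, 1.5)
print(decode_exhaust(buf))  # [(1, 0.5), (2, 1.5)]
```

Note that such an array has to be the last thing read from the buffer, since it consumes everything that remains.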
@jeromekelleher I think this is ready for another pass. I at least need a break from it to be able to come back and tell if it is any good! Some aspects we might like to make issues for, which I think are best not rolled into this PR as it is already large!
Ping @bhaller: #601 (comment)
I've already extended the schema, see the SLiM tests in the diff of this PR: https://github.com/tskit-dev/tskit/pull/601/files#diff-f5f11b0325511ec01c11038c469c103fR1316 EDIT: GitHub won't link to the exact line as it has collapsed the diff. The code looks like this:

```python
schema = {
    "codec": "struct",
    "type": "array",
    "noLengthEncodingExhaustBuffer": True,
    "items": {
        "type": "object",
        "properties": {
            "mutationTypeID": {"type": "integer", "binaryFormat": "i", "index": 1},
            "selectionCoeff": {"type": "number", "binaryFormat": "f", "index": 2},
            "subpopulationID": {"type": "integer", "binaryFormat": "i", "index": 3},
            "originGeneration": {"type": "integer", "binaryFormat": "i", "index": 4},
            "nucleotide": {"type": "integer", "binaryFormat": "b", "index": 5},
        },
    },
}
```
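Concatenating the `binaryFormat` characters of the schema above in `index` order gives the per-record layout, which can be sanity-checked with Python's `struct` module (the sample values below are made up for illustration):

```python
import struct

# Fields in index order: i (mutationTypeID), f (selectionCoeff),
# i (subpopulationID), i (originGeneration), b (nucleotide).
record_fmt = "<ifiib"
assert struct.calcsize(record_fmt) == 17  # 4 + 4 + 4 + 4 + 1 bytes

raw = struct.pack(record_fmt, 3, 0.25, 0, 100, -1)
mut_type, sel_coeff, subpop, origin_gen, nucleotide = struct.unpack(record_fmt, raw)
print(mut_type, sel_coeff, nucleotide)  # 3 0.25 -1
```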
Hi @petrelharp, @benjeffery. Great; that seems like the right choice to me – we don't want to change the SLiM metadata format unless we really have to, as it would just introduce headaches for backward compatibility etc. I would be interested to have a bit of background about this whole thing. At the top, it says it "fixes #601", but if I click on that link it just takes me back to this PR; so this PR apparently fixes itself. :-> Not sure what's going on with that. Also at the top it says "Implementation of the struct codec for metadata." I guess "the struct codec" is a Python thing? What will this ultimately provide to the user? Anyway, no need to spend time explaining to me now; I imagine all will become clear to me once the doc for this has been written. :->
@bhaller some docs here: https://tskit.readthedocs.io/en/latest/metadata.html (I do need to make better descriptions in my PRs though!)
Nice solution. Minor suggestion: how about …
I wanted to make it long and scary-looking, as it should only be used in specific legacy situations like this.
Why is that? It seems pretty natural to allow the number of entries to be determined by the length of the buffer; why wouldn't new users of this stuff want to utilize this option?
There's a couple of reasons. Firstly, an array with this option needs to be the last object read from the buffer. It's easy to configure this wrong: the error message you get in such a situation is not obvious, and even worse, bad data could be read back silently without error if the user sets this option on two arrays. Secondly, a key use-case for the metadata schemas is adding metadata to a tree sequence that already has some, say in a downstream tool or user code. In this scenario the schema is read, modified to add the new fields, and then written back. If the existing schema has a …

@jeromekelleher This second point reminded me of a thought. Maybe we should force the top-level type, so that going from `tables.mutations[0].metadata[index]` to `tables.mutations[0].metadata['SLiM'][index]` is just a change of schema, not of binary representation. I think I just convinced myself we should force this; happy to talk it over tomorrow.
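The downstream-tool workflow described above (read an existing schema, add fields, write it back) can be sketched as follows. The schema dict and helper are illustrative; this does not call tskit's actual API.

```python
import copy

def add_field(schema, name, field_def):
    """Return a copy of `schema` with one extra property added.

    A downstream tool would read the schema from the tree sequence,
    apply a change like this, and write it back.
    """
    new_schema = copy.deepcopy(schema)
    new_schema["properties"][name] = field_def
    return new_schema

base = {
    "codec": "struct",
    "type": "object",
    "properties": {
        "fitness": {"type": "number", "binaryFormat": "d", "index": 1},
    },
}
extended = add_field(
    base, "label", {"type": "string", "binaryFormat": "8s", "index": 2}
)
print(sorted(extended["properties"]))  # ['fitness', 'label']
print(sorted(base["properties"]))      # ['fitness']
```

Appending a new field this way only works if nothing in the existing layout consumes the rest of the buffer, which is the hazard with the exhaust-buffer option.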
Thanks for the explanation, makes sense. |
Looks solid to me! The docs could do with another pass, but I say we merge now and get some feedback from potential users like @bhaller and @molpopgen
We're not seeing codecov reports - can you confirm the diff coverage is 100% here @benjeffery?
They are at the top of the thread, but yes, 100%: #601 (comment)
@jeromekelleher Agreed about the docs; the initial structure has broken down with all the extra parameters that were added. Shall I go ahead and squash, merge, and file the 5 issues at #601 (comment), plus an issue for a metadata docs pass? (Sorry this PR has a high R0!)
Yes, please go ahead and squash/merge this one and file issues for any stuff that's remaining. |
Implementation of the `struct` codec for metadata. Supports encoding an arbitrary nesting of variable-length arrays and fixed-key-set objects, up to the Python stack depth. The only overhead is array lengths, as keys and types are not stored in the encoded bytes but read from the schema.

TODOs:
- `binaryFormat` strings by regex
- Schema: enforce `index` to be unique and numeric
- `null` type handling
- `index` with alphabetic key sort (No! Forbid in schema)
- `array` type in JSONSchema allows heterogeneous arrays, support?
- Support variable length strings? Deferring to another PR if wanted: Variable length strings in struct codec? #613
- Perf pass: deferring to another PR, Struct codec perf #614
- `p` type needing an extra byte
- `enum` encoding, bytes, and compression.