
Struct codec #601

Merged 1 commit into tskit-dev:master on May 22, 2020
Conversation

@benjeffery (Member) commented May 11, 2020

Implementation of the struct codec for metadata. Supports encoding an arbitrary nesting of variable-length arrays and fixed-key-set objects, up to the Python stack depth. The only overhead is the array lengths, as keys and types are not stored in the encoded bytes but are read from the schema.
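For concreteness, a minimal usage sketch of what this enables (the field names here are hypothetical; tskit.MetadataSchema, validate_and_encode_row, and decode_row are the entry points this work feeds into):

    import tskit

    # Hypothetical two-field schema; keys and types live only in the schema,
    # so the encoded bytes are just the packed values.
    schema = tskit.MetadataSchema({
        "codec": "struct",
        "type": "object",
        "properties": {
            "id": {"type": "integer", "binaryFormat": "i"},
            "fitness": {"type": "number", "binaryFormat": "d"},
        },
        "required": ["id", "fitness"],
        "additionalProperties": False,
    })

    encoded = schema.validate_and_encode_row({"id": 42, "fitness": 0.5})
    assert len(encoded) == 12  # 4-byte int + 8-byte double, zero key overhead
    assert schema.decode_row(encoded) == {"id": 42, "fitness": 0.5}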

TODOs:

  • Schema: Enforce that all object keys are required and that no additional properties are allowed
  • Schema: Enforce binaryFormat strings by regex
  • Schema: Enforce that index values are unique and numeric
  • Schema: No union types
  • JSONSchema null type handling
  • Investigate replacing index with alphabetic key sort
  • array type in JSONSchema allows heterogeneous arrays - support? (No! Forbid in the schema)
  • Force endianness
  • Support variable-length strings? Deferring to another PR if wanted: Variable length strings in struct codec? #613
  • Perf pass, deferring to another PR: Struct codec perf #614
  • Docs
  • Use iterator in decode
  • Support padding
  • Reinstate index as an optional parameter
  • Allow specifying array length format
  • Add string encoding parameter
  • Test with SLiM metadata
  • Extract encode/decode functions and test individually
  • Don't catch all types; error on unsupported ones.
  • Add note about p type needing an extra byte
  • File issues about numpy encoding, variable length strings, efficient enum encoding, bytes, and compression.

codecov bot commented May 11, 2020

Codecov Report

Merging #601 into master will increase coverage by 0.08%.
The diff coverage is 97.51%.


@@            Coverage Diff             @@
##           master     #601      +/-   ##
==========================================
+ Coverage   87.31%   87.40%   +0.08%     
==========================================
  Files          22       22              
  Lines       16758    16904     +146     
  Branches     3281     3315      +34     
==========================================
+ Hits        14633    14775     +142     
- Misses       1040     1042       +2     
- Partials     1085     1087       +2     
Flag              Coverage Δ
#c_tests          88.64% <97.51%> (+0.10%) ⬆️
#python_c_tests   90.62% <97.51%> (+0.12%) ⬆️
#python_tests     98.96% <97.51%> (-0.08%) ⬇️

Impacted Files             Coverage Δ
python/tskit/metadata.py   98.32% <97.51%> (-1.68%) ⬇️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 610392d...e406ab4.

@petrelharp (Contributor) commented May 11, 2020

I'm not parsing exactly what's allowed here - will SLiM's metadata fit? It sounds like yes, except maybe the mutation metadata, which is a variable-length list of fixed structs, but I can't tell?

@benjeffery (Member, Author):

The only thing that is not allowed is objects with missing or extra keys, so a variable-length list of fixed objects that each have all the keys is fine.

@benjeffery (Member, Author):

@jeromekelleher Just needs docs. So that I don't have to rewrite the docs after changes, maybe review the code now?

@benjeffery marked this pull request as ready for review on May 14, 2020 00:28
@jeromekelleher (Member) left a comment:


Looks great, thanks @benjeffery! Some comments above, mostly about testing and breaking the code up a little bit.

One thought - can/should we include padding bytes? I can foresee problems if we don't make some allowance for them. For example, someone might dump a C struct directly into the metadata, and if we can't handle padding we won't be able to work with it.

Review threads (resolved): docs/metadata.rst (×3), docs/tutorial.rst, python/tests/test_metadata.py, python/tskit/metadata.py (×2)
element_decoder = make_decode(sub_schema["items"])
array_length_size = struct.calcsize("<Q")

def array_decode(bytes_, offset):
@jeromekelleher (Member):


Just a comment really. I find bytes_ a bit awkward and would tend to use buff or buffer for variables like this. No biggie either way.

@benjeffery (Member, Author):


Buffer is better, renamed.
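For context, the array-decoding pattern under discussion looks roughly like this (a minimal sketch, not the PR's exact code): arrays are prefixed with an unsigned 64-bit little-endian element count, followed by the encoded elements.

    import struct

    def make_array_decoder(element_decoder):
        # Each decoder takes (buffer, offset) and returns (value, new_offset).
        array_length_size = struct.calcsize("<Q")

        def array_decode(buffer, offset):
            (length,) = struct.unpack_from("<Q", buffer, offset)
            offset += array_length_size
            elements = []
            for _ in range(length):
                element, offset = element_decoder(buffer, offset)
                elements.append(element)
            return elements, offset

        return array_decode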


elif sub_schema["type"] == "string":
size = struct.calcsize("<" + sub_schema["binaryFormat"])
return lambda bytes_, offset: (
@jeromekelleher (Member):


These lambdas and private functions are getting a bit cryptic. Maybe we could make these into a set of small module-level functions instead, like _struct_codec_decode_string or something? This would also make it easier to do a few small unit tests on them.

@benjeffery (Member, Author):


Broken up and individually tested.
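A sketch of what one such extracted function might look like (assuming fixed-length strings with a format like "10s" and the utf-8 default discussed later in the thread; not the PR's exact code):

    import struct

    def make_string_decoder(sub_schema):
        fmt = "<" + sub_schema["binaryFormat"]
        size = struct.calcsize(fmt)

        def string_decode(buffer, offset):
            # Unpack the fixed-size field, strip NUL padding, decode.
            (raw,) = struct.unpack_from(fmt, buffer, offset)
            return raw.rstrip(b"\x00").decode("utf-8"), offset + size

        return string_decode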

except KeyError:
raise exceptions.MetadataSchemaValidationError(
f"Unrecognised metadata codec '{schema['codec']}'. "
f"Valid options are {str(list(codec_registry.keys()))}."
)
codec_instance = codec_cls(schema)
# If the codec has modified the schema on __init__, get it back.
@jeromekelleher (Member):


This is strange - need a bit more explanation of what's going on in the comment here.

@benjeffery (Member, Author):


Refactored to make intent clear in 0617b60

@jeromekelleher (Member) commented May 14, 2020

Oh yeah, another nice test: take some SLiM metadata encoded as bytes (you can just put the value directly into the source), hand-craft a schema to decode it, and then compare the decoded values against the known values. Should be easy enough to get an example and the known values using pyslim.

This is exciting!!! Genuine interoperability across programs and platforms!!!

@benjeffery (Member, Author):

From a call today: for interop we need both padding and ordering - use index to force ordering. Also, the array length format should be specifiable.
Extract the decoders and test them individually.

@benjeffery (Member, Author):

For the string type, add an encoding parameter to the schema (default utf-8).

@benjeffery (Member, Author) commented May 21, 2020

I've managed to get some SLiM-generated metadata decoded in b73c80f

However, there is an issue: as detailed at https://github.com/tskit-dev/pyslim/blob/master/pyslim/slim_metadata.py#L66, SLiM uses the length of the metadata buffer to encode the length of the mutations array. This is not supported in the current metaschema. We could support it by adding a noLengthEncoding or similarly-named parameter to the array type in the schema, which causes the codec to read elements until the buffer is exhausted.

@benjeffery (Member, Author):

@jeromekelleher I think this is ready for another pass. I at least need a break from it to be able to come back and tell if it is any good!

Some aspects we might like to make issues for, which I think are best not rolled into this PR as it is already large!

  1. Variable-length strings - easily supported, but they carry an irreducible perf hit due to an extra pack/unpack call: any structure containing a variable-length string cannot be decoded in one chunk.
  2. Enumerated types - could support these with the schema defining the values and the codec using the minimum number of bytes to represent them. Note that some of the SLiM metadata is already of this type, e.g. nucleotide, sex.
  3. Storage of bytes. There is no way in the schema to round-trip a Python bytes string. This is also not possible in JSON (TypeError: Object of type 'bytes' is not JSON serializable). We could add a bytes type to the metaschema, or add encoding: None to the string type.
  4. Support for numpy arrays.
  5. Compression - the codec could support compression.

@petrelharp (Contributor) commented May 21, 2020

Ping @bhaller: #601 (comment)
The choice is between extending the metadata schema or changing how SLiM records metadata, since we'll definitely want to use this (fantastic) stuff to decode (and encode!) SLiM metadata when implemented.

@benjeffery (Member, Author) commented May 21, 2020

since the choice is between extending the metadata schema....

I've already extended the schema - see the SLiM tests in the diff of this PR: https://github.com/tskit-dev/tskit/pull/601/files#diff-f5f11b0325511ec01c11038c469c103fR1316

EDIT: GitHub won't link to the exact line as it has collapsed the diff. The code looks like this:

schema = {
    "codec": "struct",
    "type": "array",
    "noLengthEncodingExhaustBuffer": True,
    "items": {
        "type": "object",
        "properties": {
            "mutationTypeID": {"type": "integer", "binaryFormat": "i", "index": 1},
            "selectionCoeff": {"type": "number", "binaryFormat": "f", "index": 2},
            "subpopulationID": {"type": "integer", "binaryFormat": "i", "index": 3},
            "originGeneration": {"type": "integer", "binaryFormat": "i", "index": 4},
            "nucleotide": {"type": "integer", "binaryFormat": "b", "index": 5},
        },
    },
}
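To connect this to the buffer-length discussion above, a sketch of decoding a hand-packed SLiM-style buffer with this schema (the record values are made up; assumes the codec's little-endian, unpadded layout, so each record is "<ifiib", i.e. 17 bytes):

    import struct
    import tskit

    record = struct.Struct("<ifiib")
    buffer = record.pack(1, 0.1, 0, 5, -1) + record.pack(2, -0.5, 0, 7, 2)

    # noLengthEncodingExhaustBuffer: the element count is not stored;
    # the decoder reads records until the buffer is exhausted.
    decoded = tskit.MetadataSchema(schema).decode_row(buffer)
    assert len(decoded) == 2
    assert decoded[0]["mutationTypeID"] == 1
    assert decoded[1]["originGeneration"] == 7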

@bhaller commented May 21, 2020

since the choice is between extending the metadata schema....

I've already extended the schema, see the SLiM tests in the diff of this PR.

Hi @petrelharp, @benjeffery. Great; that seems like the right choice to me – we don't want to change the SLiM metadata format unless we really have to, as it would just introduce headaches for backward compatibility etc.

I would be interested in a bit of background about this whole thing. At the top it says it "fixes #601", but if I click on that link it just takes me back to this PR; so this PR apparently fixes itself. :-> Not sure what's going on with that. Also at the top it says "Implementation of the struct codec for metadata." I guess "the struct codec" is a Python thing? What will this ultimately provide to the user? Anyway, no need to spend time explaining to me now; I imagine all will become clear once the doc for this has been written. :->

@benjeffery (Member, Author) commented May 21, 2020

@bhaller some docs here: https://tskit.readthedocs.io/en/latest/metadata.html

(I do need to make better descriptions in my PRs though!)

@petrelharp (Contributor):

Nice solution. Minor suggestion: how about lengthFromBuffer rather than noLengthEncodingExhaustBuffer?

@benjeffery (Member, Author):

I wanted to make it long and scary-looking, as it should only be used in specific legacy situations like this.

@bhaller commented May 21, 2020

I wanted to make it long and scary looking as it should only be used in specific legacy situations like this.

Why is that? It seems pretty natural to allow the number of entries to be determined by the length of the buffer; why wouldn't new users of this stuff want to utilize this option?

@benjeffery (Member, Author):

Why is that? It seems pretty natural to allow the number of entries to be determined by the length of the buffer; why wouldn't new users of this stuff want to utilize this option?

There are a couple of reasons.

Firstly, an array with this option needs to be the last object read from the buffer. It's easy to configure this wrong - the error message you get in that situation is not obvious, and even worse, bad data could be read back silently without error if the user sets this option on two arrays.

Secondly, a key use case for the metadata schemas is adding metadata to a tree sequence that already has some, say in a downstream tool or user code. In this scenario the schema is read, modified to add the new fields, and then written back. If the existing schema has a noLengthEncodingExhaustBuffer array, then either the option needs to be removed or the index of the array (and all its parent objects) must be forced to be last; this is brittle and not straightforward.

@jeromekelleher This second point reminded me of a thought. Maybe we should force the top-level type to be object, as that would mean extra metadata can be added by other tools without breaking code designed for the unmodified metadata, and without the other tool needing to pick a key for the original tool's metadata. One of the cool things about this codec is that the binary representation of an object with one property is identical to the representation of that property alone, so no change is needed for SLiM, for example, for this to be enforced. For example, accessing metadata like this:

tables.mutations[0].metadata[index]

and

tables.mutations[0].metadata['SLiM'][index] 

is just a change of schema, not of binary representation. I think I just convinced myself we should force this; happy to talk it over tomorrow.
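For example, a quick sketch of the identical-bytes claim (assuming a bare non-object top level is allowed, as the array example above suggests; the wrapper key "SLiM" is illustrative):

    import tskit

    int_prop = {"type": "integer", "binaryFormat": "i"}

    bare = tskit.MetadataSchema({"codec": "struct", **int_prop})
    wrapped = tskit.MetadataSchema({
        "codec": "struct",
        "type": "object",
        "properties": {"SLiM": int_prop},
        "required": ["SLiM"],
        "additionalProperties": False,
    })

    # Keys live only in the schema, so wrapping adds zero bytes.
    assert bare.validate_and_encode_row(7) == wrapped.validate_and_encode_row({"SLiM": 7})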

@bhaller commented May 21, 2020

Thanks for the explanation, makes sense.

@jeromekelleher (Member) left a comment:


Looks solid to me! The docs could do with another pass, but I say we merge now and get some feedback from potential users like @bhaller and @molpopgen.

We're not seeing codecov reports - can you confirm the diff coverage is 100% here @benjeffery?

Review thread (resolved): python/tskit/metadata.py
@benjeffery (Member, Author):

We're not seeing codecov reports - can you confirm the diff coverage is 100% here @benjeffery?

They are at the top of the thread, but yes, 100%: #601 (comment)

@benjeffery (Member, Author):

@jeromekelleher Agreed about the docs; the initial structure has broken down with all the extra parameters that were added. Shall I go ahead and squash, merge, and file the 5 issues at #601 (comment), plus an issue for a metadata docs pass? (Sorry this PR has a high R0!)

@jeromekelleher (Member):

Yes, please go ahead and squash/merge this one and file issues for any stuff that's remaining.

@benjeffery added the AUTOMERGE-REQUESTED (Ask Mergify to merge this PR) label on May 22, 2020
mergify bot merged commit 14b90fe into tskit-dev:master on May 22, 2020
mergify bot removed the AUTOMERGE-REQUESTED label on May 22, 2020