
Struct codec #601

Merged 1 commit into tskit-dev:master on May 22, 2020
Conversation

@benjeffery (Member) commented May 11, 2020

Implementation of the struct codec for metadata. Supports encoding an arbitrary nesting of variable-length arrays and fixed-key-set objects, up to the Python stack depth. The only overhead is the array lengths, as keys and types are not stored in the encoded bytes but are read from the schema.
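For concreteness, a minimal usage sketch of what this enables (the field names here are hypothetical; tskit.MetadataSchema, validate_and_encode_row, and decode_row are the entry points this work feeds into):

    import tskit

    # Hypothetical two-field schema; keys and types live only in the schema,
    # so the encoded bytes are just the packed values.
    schema = tskit.MetadataSchema({
        "codec": "struct",
        "type": "object",
        "properties": {
            "id": {"type": "integer", "binaryFormat": "i"},
            "fitness": {"type": "number", "binaryFormat": "d"},
        },
        "required": ["id", "fitness"],
        "additionalProperties": False,
    })

    encoded = schema.validate_and_encode_row({"id": 42, "fitness": 0.5})
    assert len(encoded) == 12  # 4-byte int + 8-byte double, zero key overhead
    assert schema.decode_row(encoded) == {"id": 42, "fitness": 0.5}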

TODOs:

  • Schema: Enforce that all object keys are required and that no additional properties are allowed
  • Schema: Enforce binaryFormat strings by regex
  • Schema: Enforce that index values are unique and numeric
  • Schema: No union types
  • JSONSchema null type handling
  • Investigate replacing index with alphabetic key sort
  • array type in JSONSchema allows heterogeneous arrays - support? (No! Forbid in the schema)
  • Force endianness
  • Support variable-length strings? Deferring to another PR if wanted: Variable length strings in struct codec? #613
  • Perf pass, deferring to another PR: Struct codec perf #614
  • Docs
  • Use iterator in decode
  • Support padding
  • Reinstate index as an optional parameter
  • Allow specifying array length format
  • Add string encoding parameter
  • Test with SLiM metadata
  • Extract encode/decode functions and test individually
  • Don't catch all types; error on unsupported ones.
  • Add note about p type needing an extra byte
  • File issues about numpy encoding, variable length strings, efficient enum encoding, bytes, and compression.

codecov bot commented May 11, 2020

Codecov Report

Merging #601 into master will increase coverage by 0.08%.
The diff coverage is 97.51%.


@@            Coverage Diff             @@
##           master     #601      +/-   ##
==========================================
+ Coverage   87.31%   87.40%   +0.08%     
==========================================
  Files          22       22              
  Lines       16758    16904     +146     
  Branches     3281     3315      +34     
==========================================
+ Hits        14633    14775     +142     
- Misses       1040     1042       +2     
- Partials     1085     1087       +2     
Flag              Coverage Δ
#c_tests          88.64% <97.51%> (+0.10%) ⬆️
#python_c_tests   90.62% <97.51%> (+0.12%) ⬆️
#python_tests     98.96% <97.51%> (-0.08%) ⬇️

Impacted Files             Coverage Δ
python/tskit/metadata.py   98.32% <97.51%> (-1.68%) ⬇️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 610392d...e406ab4.

@petrelharp (Contributor) commented May 11, 2020

I'm not parsing exactly what's allowed here - will SLiM's metadata fit? It sounds like yes, except maybe the mutation metadata, which is a variable-length list of fixed structs, but I can't tell?

@benjeffery (Member, Author):

The only thing that is not allowed is objects with missing or extra keys, so a variable-length list of fixed objects that each have all the keys is fine.

@benjeffery (Member, Author):

@jeromekelleher Just needs docs. So that I don't have to rewrite the docs after changes, maybe review the code now?

@benjeffery marked this pull request as ready for review on May 14, 2020 00:28
@jeromekelleher (Member) left a comment:


Looks great, thanks @benjeffery! Some comments above, mostly about testing and breaking the code up a little bit.

One thought - can/should we include padding bytes? I can foresee problems if we don't make some allowance for them. For example, someone might dump a C struct directly into the metadata, and if we can't handle padding we won't be able to work with it.

Review threads (resolved): docs/metadata.rst (×3), docs/tutorial.rst, python/tests/test_metadata.py, python/tskit/metadata.py (×2)
element_decoder = make_decode(sub_schema["items"])
array_length_size = struct.calcsize("<Q")

def array_decode(bytes_, offset):
@jeromekelleher (Member):


Just a comment really. I find bytes_ a bit awkward and would tend to use buff or buffer for variables like this. No biggie either way.

@benjeffery (Member, Author):


Buffer is better, renamed.
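For context, the array-decoding pattern under discussion looks roughly like this (a minimal sketch, not the PR's exact code): arrays are prefixed with an unsigned 64-bit little-endian element count, followed by the encoded elements.

    import struct

    def make_array_decoder(element_decoder):
        # Each decoder takes (buffer, offset) and returns (value, new_offset).
        array_length_size = struct.calcsize("<Q")

        def array_decode(buffer, offset):
            (length,) = struct.unpack_from("<Q", buffer, offset)
            offset += array_length_size
            elements = []
            for _ in range(length):
                element, offset = element_decoder(buffer, offset)
                elements.append(element)
            return elements, offset

        return array_decode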


elif sub_schema["type"] == "string":
size = struct.calcsize("<" + sub_schema["binaryFormat"])
return lambda bytes_, offset: (
@jeromekelleher (Member):


These lambdas and private functions are getting a bit cryptic. Maybe we could make these into a set of small module-level functions instead, like _struct_codec_decode_string or something? This would also make it easier to do a few small unit tests on them.

@benjeffery (Member, Author):


Broken up and individually tested.
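A sketch of what one such extracted function might look like (assuming fixed-length strings with a format like "10s" and the utf-8 default discussed later in the thread; not the PR's exact code):

    import struct

    def make_string_decoder(sub_schema):
        fmt = "<" + sub_schema["binaryFormat"]
        size = struct.calcsize(fmt)

        def string_decode(buffer, offset):
            # Unpack the fixed-size field, strip NUL padding, decode.
            (raw,) = struct.unpack_from(fmt, buffer, offset)
            return raw.rstrip(b"\x00").decode("utf-8"), offset + size

        return string_decode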

except KeyError:
raise exceptions.MetadataSchemaValidationError(
f"Unrecognised metadata codec '{schema['codec']}'. "
f"Valid options are {str(list(codec_registry.keys()))}."
)
codec_instance = codec_cls(schema)
# If the codec has modified the schema on __init__, get it back.
@jeromekelleher (Member):


This is strange - need a bit more explanation of what's going on in the comment here.

@benjeffery (Member, Author):


Refactored to make intent clear in 0617b60

@jeromekelleher (Member) commented May 14, 2020

Oh yeah, another nice test: take some SLiM metadata encoded as bytes (you can just put the value directly into the source), hand-craft a schema to decode it, and then compare the decoded values against the known values. Should be easy enough to get an example and the known values using pyslim.

This is exciting!!! Genuine interoperability across programs and platforms!!!

@benjeffery (Member, Author):

From a call today: for interop we need both padding and ordering - use index to force ordering. Also, the array length format should be specifiable.
Extract the decoders and test them individually.

@benjeffery (Member, Author):

For the string type, add an encoding parameter to the schema (default utf-8).

@benjeffery (Member, Author) commented May 21, 2020

I've managed to get some SLiM-generated metadata decoded in b73c80f

However, there is an issue: as detailed at https://github.com/tskit-dev/pyslim/blob/master/pyslim/slim_metadata.py#L66, SLiM uses the length of the metadata buffer to encode the length of the mutations array. This is not supported in the current metaschema. We could support it by adding a noLengthEncoding or similarly-named parameter to the array type in the schema, which causes the codec to read elements until the buffer is exhausted.

@benjeffery (Member, Author):

@jeromekelleher I think this is ready for another pass. I at least need a break from it to be able to come back and tell if it is any good!

Some aspects we might like to make issues for, which I think are best not rolled into this PR as it is already large!

  1. Variable-length strings - easily supported, but they carry an irreducible perf hit due to an extra pack/unpack call: any structure containing a variable-length string cannot be decoded in one chunk.
  2. Enumerated types - could support these with the schema defining the values and the codec using the minimum number of bytes to represent them. Note that some of the SLiM metadata is already of this type, e.g. nucleotide, sex.
  3. Storage of bytes. There is no way in the schema to round-trip a Python bytes string. This is also not possible in JSON (TypeError: Object of type 'bytes' is not JSON serializable). We could add a bytes type to the metaschema, or add encoding: None to the string type.
  4. Support for numpy arrays.
  5. Compression - the codec could support compression.

@petrelharp (Contributor) commented May 21, 2020

Ping @bhaller: #601 (comment)
The choice is between extending the metadata schema or changing how SLiM records metadata, since we'll definitely want to use this (fantastic) stuff to decode (and encode!) SLiM metadata when implemented.

@benjeffery (Member, Author) commented May 21, 2020

since the choice is between extending the metadata schema....

I've already extended the schema - see the SLiM tests in the diff of this PR: https://github.com/tskit-dev/tskit/pull/601/files#diff-f5f11b0325511ec01c11038c469c103fR1316

EDIT: GitHub won't link to the exact line as it has collapsed the diff. The code looks like this:

schema = {
    "codec": "struct",
    "type": "array",
    "noLengthEncodingExhaustBuffer": True,
    "items": {
        "type": "object",
        "properties": {
            "mutationTypeID": {"type": "integer", "binaryFormat": "i", "index": 1},
            "selectionCoeff": {"type": "number", "binaryFormat": "f", "index": 2},
            "subpopulationID": {"type": "integer", "binaryFormat": "i", "index": 3},
            "originGeneration": {"type": "integer", "binaryFormat": "i", "index": 4},
            "nucleotide": {"type": "integer", "binaryFormat": "b", "index": 5},
        },
    },
}
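To connect this to the buffer-length discussion above, a sketch of decoding a hand-packed SLiM-style buffer with this schema (the record values are made up; assumes the codec's little-endian, unpadded layout, so each record is "<ifiib", i.e. 17 bytes):

    import struct
    import tskit

    record = struct.Struct("<ifiib")
    buffer = record.pack(1, 0.1, 0, 5, -1) + record.pack(2, -0.5, 0, 7, 2)

    # noLengthEncodingExhaustBuffer: the element count is not stored;
    # the decoder reads records until the buffer is exhausted.
    decoded = tskit.MetadataSchema(schema).decode_row(buffer)
    assert len(decoded) == 2
    assert decoded[0]["mutationTypeID"] == 1
    assert decoded[1]["originGeneration"] == 7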

@bhaller commented May 21, 2020

since the choice is between extending the metadata schema....

I've already extended the schema, see the SLiM tests in the diff of this PR.

Hi @petrelharp, @benjeffery. Great; that seems like the right choice to me – we don't want to change the SLiM metadata format unless we really have to, as it would just introduce headaches for backward compatibility etc.

I would be interested in a bit of background about this whole thing. At the top it says it "fixes #601", but if I click on that link it just takes me back to this PR; so this PR apparently fixes itself. :-> Not sure what's going on with that. Also at the top it says "Implementation of the struct codec for metadata." I guess "the struct codec" is a Python thing? What will this ultimately provide to the user? Anyway, no need to spend time explaining to me now; I imagine all will become clear once the doc for this has been written. :->

@benjeffery (Member, Author) commented May 21, 2020

@bhaller some docs here: https://tskit.readthedocs.io/en/latest/metadata.html

(I do need to make better descriptions in my PRs though!)

@petrelharp (Contributor):

Nice solution. Minor suggestion: how about lengthFromBuffer rather than noLengthEncodingExhaustBuffer?

@benjeffery (Member, Author):

I wanted to make it long and scary-looking, as it should only be used in specific legacy situations like this.

@bhaller commented May 21, 2020

I wanted to make it long and scary looking as it should only be used in specific legacy situations like this.

Why is that? It seems pretty natural to allow the number of entries to be determined by the length of the buffer; why wouldn't new users of this stuff want to utilize this option?

@benjeffery (Member, Author):

Why is that? It seems pretty natural to allow the number of entries to be determined by the length of the buffer; why wouldn't new users of this stuff want to utilize this option?

There are a couple of reasons.

Firstly, an array with this option needs to be the last object read from the buffer. It's easy to configure this wrong - the error message you get in that situation is not obvious, and even worse, bad data could be read back silently without error if the user sets this option on two arrays.

Secondly, a key use case for the metadata schemas is adding metadata to a tree sequence that already has some, say in a downstream tool or user code. In this scenario the schema is read, modified to add the new fields, and then written back. If the existing schema has a noLengthEncodingExhaustBuffer array, then either the option needs to be removed or the index of the array (and all its parent objects) must be forced to be last; this is brittle and not straightforward.

@jeromekelleher This second point reminded me of a thought. Maybe we should force the top-level type to be object, as that would mean extra metadata can be added by other tools without breaking code designed for the unmodified metadata, and without the other tool needing to pick a key for the original tool's metadata. One of the cool things about this codec is that the binary representation of an object with one property is identical to the representation of that property alone, so no change is needed for SLiM, for example, for this to be enforced. For example, accessing metadata like this:

tables.mutations[0].metadata[index]

and

tables.mutations[0].metadata['SLiM'][index] 

is just a change of schema, not of binary representation. I think I just convinced myself we should force this; happy to talk it over tomorrow.
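For example, a quick sketch of the identical-bytes claim (assuming a bare non-object top level is allowed, as the array example above suggests; the wrapper key "SLiM" is illustrative):

    import tskit

    int_prop = {"type": "integer", "binaryFormat": "i"}

    bare = tskit.MetadataSchema({"codec": "struct", **int_prop})
    wrapped = tskit.MetadataSchema({
        "codec": "struct",
        "type": "object",
        "properties": {"SLiM": int_prop},
        "required": ["SLiM"],
        "additionalProperties": False,
    })

    # Keys live only in the schema, so wrapping adds zero bytes.
    assert bare.validate_and_encode_row(7) == wrapped.validate_and_encode_row({"SLiM": 7})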

@bhaller commented May 21, 2020

Thanks for the explanation, makes sense.

@jeromekelleher (Member) left a comment:


Looks solid to me! The docs could do with another pass, but I say we merge now and get some feedback from potential users like @bhaller and @molpopgen.

We're not seeing codecov reports - can you confirm the diff coverage is 100% here @benjeffery?

Review thread (resolved): python/tskit/metadata.py
@benjeffery (Member, Author):

We're not seeing codecov reports - can you confirm the diff coverage is 100% here @benjeffery?

They are at the top of the thread, but yes, 100%: #601 (comment)

@benjeffery (Member, Author):

@jeromekelleher Agreed about the docs; the initial structure has broken down with all the extra parameters that were added. Shall I go ahead and squash, merge, and file the 5 issues at #601 (comment), plus an issue for a metadata docs pass? (Sorry this PR has a high R0!)

@jeromekelleher (Member):

Yes, please go ahead and squash/merge this one and file issues for any stuff that's remaining.

@benjeffery added the AUTOMERGE-REQUESTED (Ask Mergify to merge this PR) label on May 22, 2020
mergify bot merged commit 14b90fe into tskit-dev:master on May 22, 2020
mergify bot removed the AUTOMERGE-REQUESTED label on May 22, 2020