Conversation
missioncontrol_client/__init__.py
Outdated
| l.append(self._compressor.flush()) | ||
| return b''.join(l) | ||
|
|
||
| def _calculate_hash(self): |
There was a problem hiding this comment.
This is used in super().__init__()
I've added the comment below to explain why we need to override it (to hash the raw file, not the gzip stream).
missioncontrol_client/__init__.py
Outdated
|
|
||
| def _calculate_hash(self): | ||
| '''ensure the hash is over the raw file, not the gzip steam''' | ||
| b2 = blake2b(digest_size=16) |
There was a problem hiding this comment.
Is digest_size required here? Or just to ensure we're consistent?
There was a problem hiding this comment.
yeah, I we need both a hashing algorithm and a standard way to call it, otherwise we might have the same file twice with different hash lengths.
Actually, I'll probably encode the hash type and length with something like pymultihash.
missioncontrol_client/__init__.py
Outdated
| import zlib | ||
| import socket | ||
| import datetime | ||
| from pyblake2 import blake2b |
There was a problem hiding this comment.
Why not builtin? https://docs.python.org/3/library/hashlib.html#hashlib.blake2b
There was a problem hiding this comment.
I'm relying on planetlabs/datalake, which uses pyblake2
There was a problem hiding this comment.
I guess I could vendor this functionality actually and make the modifications directly.
missioncontrol_client/__init__.py
Outdated
| start = UTC("now").iso | ||
|
|
||
| if where is None: | ||
| where = socket.gethostname() |
There was a problem hiding this comment.
good point - I'll change to getfqdn
missioncontrol_client/__init__.py
Outdated
|
|
||
| cid = f.metadata["hash"] | ||
|
|
||
| fleetmeta = { |
There was a problem hiding this comment.
I'm inheriting from planetlabs/datalake File class, but we've changed our metadata structure, so this converts to the "fleet" version of metadata.
Ideally we'd fork/enhance this library with metadata 2.0 and distribute it, but using it as is provides a lot of value without much effort in the near term.
There was a problem hiding this comment.
Ah right.
At this point I don't actually see what this File class adds to us? Can you explain a bit?
There was a problem hiding this comment.
Yeah - I'm trying to leverage as much of the datalake infrastructure as possible, as it includes lessons learned and features developed over several years.
For example, datalake files have a tar bundle format, and an inotify-based auto-uploaded service. If we can re-use a lot of this work, we won't have to rewrite it from scratch.
There was a problem hiding this comment.
Except currently we're just doing a POST directly anyway?
I'm wondering what specifically this PR users from the File class of datalake
Maybe the answer is 'nothing yet'?
There was a problem hiding this comment.
I think the mismatch is that I was overriding the methods used in this diff to get streaming gzip working.
I've since moved that into a datalake fork. I think it's worth factoring out the metadata and tooling as it's a bit more complex than just normal REST api stuff.
Ideally this library would be very lean and not doing too much magic.
36689b5 to
c19b4db
Compare
This adds client-side functionality for interfacing with the missioncontrol files endpoints.
Note this requires us to land the changes into missioncontrol first.