Conversation
Thanks Chris, I'll comment as I try it out. For now I think the datalake APIs shouldn't be async, as that forces users to manage an async context. E.g.: putting `await` in front of `datalake.initialize()` is not ideal, because the calling function/method is then forced to be async and run through `asyncio.run()` etc. So whenever someone needs to call the datalake, they need to create/manage an async context (e.g. within `__init__`). If there are truly async methods like add_datum that could benefit from async, I think we can create a sync wrapper for them or expose an async version too.
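A minimal sketch of the sync-wrapper pattern suggested above: keep the async implementation internal and expose a synchronous facade, so callers never have to manage an event loop. The class and method names here are illustrative assumptions, not the real datalake API.

```python
import asyncio


class Datalake:
    """Hypothetical datalake exposing both async and sync entry points."""

    def __init__(self):
        self.ready = False

    async def initialize_async(self) -> None:
        # Placeholder for real async setup (DB connections, registry, etc.)
        await asyncio.sleep(0)
        self.ready = True

    def initialize(self) -> None:
        # Sync facade: creates and tears down an event loop for the caller,
        # so the call site needs no `await` and no asyncio.run() of its own.
        asyncio.run(self.initialize_async())


lake = Datalake()
lake.initialize()   # callable from plain synchronous code
print(lake.ready)   # -> True
```

One caveat with this pattern: `asyncio.run()` raises if it is called from inside an already-running event loop, so the sync wrapper is only safe for genuinely synchronous callers.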
| - "bbox": Must be a dict with "bbox" key containing a list of lists of 4 floats | ||
| - "default": Any data type | ||
| metadata: Metadata dictionary associated with the datum | ||
| contract: Optional contract type specifying the data format. If None, defaults to "default". |
"type" might be better naming for this usage?
For me, "type" has a precise meaning in Python. These "contracts" are very closely related to types, but not exactly the same, so it's useful to use a different word?
oh ok, I thought contracts were type primitives (classification etc.), hence the suggestion. In that case, does it make sense to couple several things under a single contract? In cases where a model produces multiple outputs, should we commit multiple datums, or aggregate them within a single contract and data field? E.g. if a visual inspection model produces 3 classification fields and 1 severity field, what should we do in general?
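To make the question above concrete, here is a hypothetical sketch of the two options for a visual-inspection model emitting 3 classifications and 1 severity value. The contract names and field layout are assumptions for illustration, not the real API.

```python
# Option A: one datum per model output, each with its own primitive contract.
datums_a = [
    {"contract": "classification", "data": {"label": "crack"}},
    {"contract": "classification", "data": {"label": "rust"}},
    {"contract": "classification", "data": {"label": "dent"}},
    {"contract": "regression", "data": {"value": 0.8}},
]

# Option B: one aggregated datum under a single composite contract,
# keeping all outputs of one inference together.
datum_b = {
    "contract": "inspection_report",  # hypothetical composite contract
    "data": {
        "classifications": ["crack", "rust", "dent"],
        "severity": 0.8,
    },
}

print(len(datums_a), datum_b["data"]["severity"])  # -> 4 0.8
```

Option A keeps contracts simple and reusable; Option B keeps related outputs atomic and queryable as one record. Which is right probably depends on how downstream datasets want to join the fields.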
> Exception: If database initialization fails
> """
> await self.datum_database.initialize()
> await self.dataset_database.initialize()
hmm, I'm not super clear on how the Database module works on the back end. Having a single DB with two different docs seems like what we want, but not what's currently doable with the MT Database?
@canelbirlik Sorry, just seen this comment. The `initialize` is deprecated now; it still works, but it's deprecated. Will make sure to look into this and support it.
I have commented on individual parts. Conceptually I think the core logic is good, but it's a bit too general imo. I think if we put more opinions into certain parts, it could be much easier (for our use cases).
Overall looks good. My suggestions are about the overall direction, not this PR specifically. I'll play with it some more and report issues/bugs if I find any related to the current PR. I think once we have the GCS registry in, the easiest acid test would be to field-test it with some big dataset that's internally used, and try to see if we cover the whole lifecycle, which would be:
> Exception: If database or registry operations fail
> """
>
> if contract is None:
I think on the datum we can have a required project_id/line_id key to help discriminate between unrelated tasks, so we don't accidentally pick them up while querying, or potentially to speed up queries. Datasets would have them too, I'm assuming.
yep, keep meaning to do this, done.
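A small sketch of the required-discriminator idea from the exchange above: every datum carries a project_id, and queries filter on it so unrelated tasks never mix. The field names and the in-memory "query" are assumptions for illustration only.

```python
# Hypothetical in-memory store of datums, each with a required project_id.
datums = [
    {"datum_id": "d1", "project_id": "line-a", "data": 1},
    {"datum_id": "d2", "project_id": "line-b", "data": 2},
    {"datum_id": "d3", "project_id": "line-a", "data": 3},
]


def query_data(store, project_id):
    # Scoping every query by project_id avoids accidental cross-task hits;
    # a real backend could also index this field to speed up queries.
    return [d for d in store if d["project_id"] == project_id]


print([d["datum_id"] for d in query_data(datums, "line-a")])  # -> ['d1', 'd3']
```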
> ValueError: If datums in the same column have different contracts
> Exception: If data retrieval from the datalake fails
> """
> return await asyncio.gather(*[self.load_row(datalake, row) for row in self.datum_ids])
this might be too many calls, as load_row calls get_data per row on the datalake, which itself is

> return await asyncio.gather(*[self.get_datum(datum_id) for datum_id in datum_ids])

-> I think we essentially need an outer-join equivalent here, which may be possible with a batch get_data?
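A sketch of the batch idea raised above: fetch all datum ids in one call and assemble rows locally, instead of one get_datum() call per id. Missing ids resolve to None, which gives the outer-join behaviour. `batch_get_data` and the dict-backed store are assumptions, not the real datalake API.

```python
import asyncio


class Datalake:
    """Hypothetical datalake backed by a simple dict (datum_id -> data)."""

    def __init__(self, store):
        self._store = store

    async def batch_get_data(self, datum_ids):
        # One round trip for all ids; absent ids map to None rather than
        # raising, so partially-populated rows survive (outer-join style).
        return {d: self._store.get(d) for d in datum_ids}


async def load_rows(datalake, rows):
    # rows: list of rows, each row a list of datum ids.
    unique_ids = {d for row in rows for d in row}   # deduplicate across rows
    fetched = await datalake.batch_get_data(list(unique_ids))
    return [[fetched[d] for d in row] for row in rows]


lake = Datalake({"a": 1, "b": 2})
rows = asyncio.run(load_rows(lake, [["a", "b"], ["b", "missing"]]))
print(rows)  # -> [[1, 2], [2, None]]
```

Compared with gathering one coroutine per datum, this makes the number of backend calls independent of dataset size, which matters once datasets get big.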
This PR adds the concept of a "Dataset", the ability to create one from the outputs of Datalake.query_data(), and the ability to convert a Dataset to HF format. To do this it adds the concept of a "contract" to the Datalake Datum, guaranteeing the data form so HF can be sure what to expect. For now these are hard-coded into the datalake, but perhaps in the future it could be possible for users to specify contracts.
The specific forms of the contracts are also not set in stone; they should be decided based on the models we use, the data formats they want, and the data formats our input data come in. I wasn't sure what format the contour segmentation and regression contracts need to take, but this is an easy thing to add.
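A minimal sketch of what hard-coded contracts might look like: each contract name maps to a validator that guarantees the data form before a datum is committed. The specific contracts and checks here are illustrative assumptions, loosely modelled on the "bbox"/"default" contracts quoted earlier in the review.

```python
def _check_bbox(data):
    # "bbox" contract: a dict with a "bbox" key holding a list of
    # 4-element boxes.
    boxes = data.get("bbox") if isinstance(data, dict) else None
    return isinstance(boxes, list) and all(
        isinstance(b, list) and len(b) == 4 for b in boxes
    )


CONTRACTS = {
    "bbox": _check_bbox,
    "default": lambda data: True,  # "default" accepts any data type
}


def validate(data, contract="default"):
    checker = CONTRACTS.get(contract)
    if checker is None:
        raise ValueError(f"unknown contract: {contract}")
    return checker(data)


print(validate({"bbox": [[0, 0, 10, 10]]}, "bbox"))  # -> True
print(validate({"bbox": "oops"}, "bbox"))            # -> False
```

Because the contract table is a plain mapping, letting users register their own contracts later would just mean exposing a registration hook over it.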
There are definitely things that need doing before this could be merged, but I'm creating the PR so Can can confirm that the general structure is suitable.