-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More features around Beam. #731
Conversation
marcenacp
commented
Sep 4, 2024
•
edited
Loading
edited
- Simplify pickling/hashing for nodes.
- Better public API for ReadFromCroissant.
- Add indices to each example for TFDS to use.
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
52dc04b
to
14f3e8c
Compare
0ce56ca
to
f9d15fe
Compare
f9d15fe
to
0d11618
Compare
Only streamable datasets can be used with Beam. A streamable dataset is a dataset | ||
that can be generated by a linear sequence of operations - without joins for | ||
example. This is the case for Hugging Face datasets. If there are branches, we'd | ||
need a more complex Beam pipeline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So IIUC we will change this in future work to work with non-streamable datasets (e.g. future versions of HF croissants) -- I feel if this is right, this would deserve a mention here, or a link to an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
"WARNING: This was very unlikely, but it seems we just hit this limit" | ||
" in the code. Find another way to optimize execute_operations_in_beam." | ||
" Please, open a PR on GitHub to make the maintainers aware of this" | ||
" issue. An actual easy fix is to compute the actual shard_sizes rather" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rm first "actual"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -304,8 +304,8 @@ def there_exists_at_least_one_property(self, possible_properties: list[str]): | |||
return False | |||
|
|||
def __hash__(self): | |||
"""Hashes all immutable arguments.""" | |||
return hash(self.uuid) | |||
"""Re-uses parent's hash function.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why we would use the parent's hash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, the parent's object.
while queue_of_operations: | ||
operation = queue_of_operations.popleft() | ||
if isinstance(operation, ReadFields): | ||
beam_operation = beam.ParDo(operation) | ||
beam_operation = beam.ParDo( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are 100% sure that only ReadFields operations can be leaves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
But it's more linked to the fact that ReadFields is a generator.
|
||
# We don't know in advance the number of records per shards. So we just allocate the | ||
# maximum number which is `sys.maxsize // num_shards`. For a large evenly | ||
# distributed dataset, we have: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: Maybe:
"Taking the practical case of large evenly distributed datasets on HuggingFace, we can compute the following:"
or something similar? Currently it is a bit confusing for a user reading about very specific HF numbers in the context of a generic function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!!!