Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More features around Beam. #731

Merged
merged 4 commits into from
Sep 5, 2024
Merged

Conversation

marcenacp
Copy link
Contributor

@marcenacp marcenacp commented Sep 4, 2024

  • Simplify pickling/hashing for nodes.
  • Better public API for ReadFromCroissant.
  • Add indices to each example for TFDS to use.

@marcenacp marcenacp requested a review from a team as a code owner September 4, 2024 12:55
Copy link

github-actions bot commented Sep 4, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@marcenacp marcenacp force-pushed the feature/make-operations-pickable branch from 52dc04b to 14f3e8c Compare September 4, 2024 13:50
@ccl-core ccl-core self-requested a review September 4, 2024 13:55
@marcenacp marcenacp force-pushed the feature/make-operations-pickable branch 2 times, most recently from 0ce56ca to f9d15fe Compare September 4, 2024 14:14
@marcenacp marcenacp force-pushed the feature/make-operations-pickable branch from f9d15fe to 0d11618 Compare September 4, 2024 14:15
Only streamable datasets can be used with Beam. A streamable dataset is a dataset
that can be generated by a linear sequence of operations - without joins for
example. This is the case for Hugging Face datasets. If there are branches, we'd
need a more complex Beam pipeline.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So IIUC we will change this in future work to work with non-streamable datasets (e.g. future versions of HF croissants) -- I feel if this is right, this would deserve a mention here, or a link to an issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"WARNING: This was very unlikely, but it seems we just hit this limit"
" in the code. Find another way to optimize execute_operations_in_beam."
" Please, open a PR on GitHub to make the maintainers aware of this"
" issue. An actual easy fix is to compute the actual shard_sizes rather"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rm first "actual"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -304,8 +304,8 @@ def there_exists_at_least_one_property(self, possible_properties: list[str]):
return False

def __hash__(self):
"""Hashes all immutable arguments."""
return hash(self.uuid)
"""Re-uses parent's hash function."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we would use the parent's hash?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, the parent's object.

while queue_of_operations:
operation = queue_of_operations.popleft()
if isinstance(operation, ReadFields):
beam_operation = beam.ParDo(operation)
beam_operation = beam.ParDo(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are 100% sure that only ReadFields operations can be leaves?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

But it's more linked to the fact that ReadFields is a generator.


# We don't know in advance the number of records per shards. So we just allocate the
# maximum number which is `sys.maxsize // num_shards`. For a large evenly
# distributed dataset, we have:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Maybe:

"Taking the practical case of large evenly distributed datasets on HuggingFace, we can compute the following:"

or something similar? Currently it is a bit confusing for a user reading about very specific HF numbers in the context of a generic function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@ccl-core ccl-core left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!!!

@marcenacp marcenacp merged commit 3a0974a into main Sep 5, 2024
14 checks passed
@marcenacp marcenacp deleted the feature/make-operations-pickable branch September 5, 2024 08:25
@github-actions github-actions bot locked and limited conversation to collaborators Sep 5, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants