Implement filtering in the case of filename regular expression and add a test for this feature. #716

Merged
marcenacp merged 3 commits into main from poc/splits
Jul 19, 2024

@marcenacp marcenacp commented Jul 18, 2024

In this PR:

  • I introduce a light version of filtering in the mlcroissant API: callers can pass a dictionary that maps a field's ID to the value to filter on.
  • I add a new test dataset that contains 2 Parquet files (train.parquet for train, test.parquet for test). The regular expression is still *.parquet, so we test that passing the filter {'data/split': 'train'} reads elements only from train.parquet.
  • For now, the coverage is very limited (basically only the cases we could encounter in Hugging Face datasets), so unsupported cases simply raise NotImplementedError.
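
The filtering idea above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `select_files`, `SPLIT_FIELD`, and the `split_regex` argument are hypothetical names, the `*.parquet` pattern is treated as a glob, and the split is assumed to be extracted from the filename by the first capture group of a regular expression.

```python
import fnmatch
import re

SPLIT_FIELD = "data/split"  # field ID taken from the PR description


def select_files(filenames, glob_pattern, split_regex, filters=None):
    """Keep files matching glob_pattern whose extracted split equals the filter.

    Hypothetical helper for illustration only; it is not part of mlcroissant.
    """
    filters = filters or {}
    unsupported = set(filters) - {SPLIT_FIELD}
    if unsupported:
        # Same spirit as the PR: fail loudly on cases that are not covered yet.
        raise NotImplementedError(
            f"Filtering on {sorted(unsupported)} is not supported."
        )

    matched = [n for n in filenames if fnmatch.fnmatchcase(n, glob_pattern)]
    if SPLIT_FIELD not in filters:
        return matched

    wanted = filters[SPLIT_FIELD]
    selected = []
    for name in matched:
        # Assumption: the first capture group holds the split name.
        match = re.search(split_regex, name)
        if match and match.group(1) == wanted:
            selected.append(name)
    return selected


files = ["train.parquet", "test.parquet", "README.md"]
print(select_files(files, "*.parquet", r"(train|test)\.parquet$",
                   {"data/split": "train"}))
```

With the two-file test dataset described above, only train.parquet survives the filter even though both files match the pattern.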

@marcenacp marcenacp requested a review from a team as a code owner July 18, 2024 14:57

github-actions bot commented Jul 18, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@ccl-core

That's great, thanks Pierre!

NIT: IIUC, this currently works only for single values (filters can't be {'split': ['test1', 'test2']}), so maybe it makes sense to add some checks to ensure only single values are passed? That way it doesn't fail at _regex_from_value when you call re.escape(value).
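
For illustration, such a check could look roughly like the sketch below. The name `checked_regex_from_value` is hypothetical and the real `_regex_from_value` signature is not shown in this thread; the only library fact relied on is that `re.escape` accepts a `str` (or `bytes`) and raises `TypeError` on a list. Raising `NotImplementedError` follows the PR description and is an assumption about the preferred error type.

```python
import re


def checked_regex_from_value(value):
    """Build an exact-match regex from one filter value (hypothetical helper)."""
    if not isinstance(value, str):
        # Reject lists and other non-string values up front instead of
        # letting re.escape fail with a less helpful TypeError.
        raise NotImplementedError(
            f"Only single string values are supported, got {type(value).__name__}."
        )
    return re.compile(re.escape(value))


pattern = checked_regex_from_value("train")
print(bool(pattern.fullmatch("train")))

try:
    checked_regex_from_value(["test1", "test2"])
except NotImplementedError as error:
    print(error)
```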

@ccl-core ccl-core self-requested a review July 19, 2024 08:38

@ccl-core ccl-core left a comment

Thanks!

@marcenacp

> NIT: IIUC, this currently works only for single values (filters can't be {'split': ['test1', 'test2']}), so maybe it makes sense to add some checks to ensure only single values are passed? That way it doesn't fail at _regex_from_value when you call re.escape(value).

Done.

@marcenacp marcenacp merged commit e1a8380 into main Jul 19, 2024
12 of 14 checks passed
@marcenacp marcenacp deleted the poc/splits branch July 19, 2024 09:24
@github-actions github-actions bot locked and limited conversation to collaborators Jul 19, 2024
2 participants