Fix discrepancies with the specs #742

ccl-core · 2024-09-23T13:09:45Z

ids and names are now the same for Fields and RecordSets (see migration 202409231500.py); metadata and output are updated accordingly;
in the get_column method of Source, we return the node's uuid if no extract method is specified;
for RecordSets specifying data, we look at field.id, not field.name, to get the expected keys;
also added a filters flag to load.py.
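
To illustrate the third point, here is a minimal sketch of checking inline RecordSet data against field ids rather than field names. It works on plain JSON-LD dictionaries; the helper names `expected_keys` and `check_inline_data` are hypothetical and are not part of the mlcroissant API.

```python
def expected_keys(fields):
    """Collect the keys inline data rows are expected to use: the field ids."""
    return {field["@id"] for field in fields}


def check_inline_data(record_set):
    """Raise ValueError if any inline data row does not use exactly the field ids as keys."""
    keys = expected_keys(record_set.get("field", []))
    for index, row in enumerate(record_set.get("data", [])):
        if set(row) != keys:
            raise ValueError(
                f"Row {index} of {record_set.get('@id')} has keys {sorted(row)}, "
                f"expected {sorted(keys)}"
            )
```

With this check, a row keyed by `persons/name` passes, while a row keyed by the bare `name` is rejected when the field id is `persons/name`.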

github-actions · 2024-09-23T13:09:57Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

marcenacp

Thanks!

marcenacp · 2024-09-24T06:57:54Z

python/mlcroissant/mlcroissant/scripts/migrations/previous/202409231500.py

+    json_copy = json_ld.copy()
+    for _, record_set in enumerate(json_copy.get("recordSet", [])):
+        if record_set["@id"] != record_set["name"]:
+            record_set["name"] = record_set["@id"]


nit: this is not a requirement of the specs.

You are right, it is not a requirement, but it seems to be the recommended way, given that all the given examples do it :)
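
For reference, a self-contained sketch of the kind of normalization discussed in this thread. The hunk above only shows the RecordSet part; handling fields the same way is an assumption based on the PR description, and the function name `align_names_with_ids` is hypothetical. The sketch uses `copy.deepcopy` so that the nested dictionaries of the input are left untouched.

```python
import copy


def align_names_with_ids(json_ld):
    """Return a copy of json_ld where each RecordSet and Field name equals its @id."""
    json_copy = copy.deepcopy(json_ld)
    for record_set in json_copy.get("recordSet", []):
        # Only rewrite the name when an @id is actually present.
        if "@id" in record_set:
            record_set["name"] = record_set["@id"]
        for field in record_set.get("field", []):
            if "@id" in field:
                field["name"] = field["@id"]
    return json_copy
```

Since name == @id is a convention followed by the examples and not a requirement of the specs, a migration like this is optional for datasets that are otherwise valid.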

marcenacp · 2024-09-24T07:02:47Z

datasets/1.0/simple-parquet/output/persons.jsonl

-{"name": "person7", "age": 7}
-{"name": "person8", "age": 8}
-{"name": "person9", "age": 9}
+{"persons/name": "person0", "persons/age": 0}


I think it was a feature not to have the RecordSet ID here; otherwise, it's repeated:

The user is asking for the persons RecordSet

=> All fields will start with persons

Maybe it's a consequence of my other comment, where you set name == @id for all datasets. That is the case for Hugging Face datasets, but it may not be the case per the specs.

So it could be good to keep at least one dataset with name != @id for testing purposes. What do you think?

Sure! I updated all datasets because of the migration script, but good point to keep one different.

Updated the comment in the migration script and restored one dataset with names != ids.
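
A small sketch of why the output keys changed in this thread, assuming records are keyed either by field id or by field name. The helpers `record_keys` and `to_record` are hypothetical and only illustrate the difference; they are not how mlcroissant builds records.

```python
def record_keys(fields, use_id=True):
    """Return the output keys for a record: field ids, or field names if use_id is False."""
    return [field["@id"] if use_id else field["name"] for field in fields]


def to_record(fields, values, use_id=True):
    """Pair each value with its output key, in field order."""
    return dict(zip(record_keys(fields, use_id), values))


# A RecordSet whose field names differ from their ids, as in the restored test dataset.
fields = [
    {"@id": "persons/name", "name": "name"},
    {"@id": "persons/age", "name": "age"},
]
by_id = to_record(fields, ["person0", 0])
by_name = to_record(fields, ["person0", 0], use_id=False)
```

Here `by_id` carries the `persons/` prefix on every key, while `by_name` does not, which is the repetition pointed out above. Keeping one dataset with name != @id lets the tests tell the two behaviors apart.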

data in recordsets is treated according to specs

6c05d10

ccl-core requested a review from a team as a code owner September 23, 2024 13:09

ccl-core added 6 commits September 23, 2024 13:10

remove stats

4da67de

Update metadata and output of the datasets.

5618d4b

Fix formatting errors.

05c2382

Fix mypy

5dbb496

Updated huggingface-c4. Filters are still broken.

6eb21e2

Fix typo

04e2c9e

ccl-core changed the title ~~data in recordsets is treated according to specs~~ Fix discrepancies with the specs Sep 23, 2024

Flake tests.

868b912

ccl-core requested a review from marcenacp September 23, 2024 21:21

marcenacp approved these changes Sep 24, 2024

Address comments.

50b34a4

ccl-core merged commit 6c79dc0 into main Sep 24, 2024
14 checks passed

github-actions bot locked and limited conversation to collaborators Sep 24, 2024
