Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix discrepancies with the specs #742

Merged
merged 9 commits into from
Sep 24, 2024
Merged

Fix discrepancies with the specs #742

merged 9 commits into from
Sep 24, 2024

Conversation

ccl-core
Copy link
Contributor

@ccl-core ccl-core commented Sep 23, 2024

  1. ids and names are the same for Fields and RecordSets (see migration 202409231500.py): updated metadata and output;
  2. In the get_column method of Source, we return the node's uuid if no extract method is specified;
  3. for RecordSet specifying data, we look at field.id and not field.name to get the expected keys;
  4. also added a filters flag to load.py.

@ccl-core ccl-core requested a review from a team as a code owner September 23, 2024 13:09
Copy link

github-actions bot commented Sep 23, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@ccl-core ccl-core changed the title data in recordsets is treated according to specs Fix discrepancies with the specs Sep 23, 2024
Copy link
Contributor

@marcenacp marcenacp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

json_copy = json_ld.copy()
for _, record_set in enumerate(json_copy.get("recordSet", [])):
if record_set["@id"] != record_set["name"]:
record_set["name"] = record_set["@id"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is not a requirement of the specs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, it is not a requirement... but it seems to be the recommended way though, given that it is so for all given examples :)

{"name": "person7", "age": 7}
{"name": "person8", "age": 8}
{"name": "person9", "age": 9}
{"persons/name": "person0", "persons/age": 0}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was feature to not have the RecordSet ID here - otherwise, it's repeated:

  • The user is asking for the persons RecordSet
  • => All fields will start with persons

Maybe, it's a consequence of my other comment where you set name == @id for all datasets. It's the case for Hugging Face datasets, but it may also not be the case per the specs.

So it could be good to keep at least one dataset with name != @id for testing purposes. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! I updated all datasets because of the migration script, but good point to keep one different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated comment in the migration script and restore one dataset with names != ids.

@ccl-core ccl-core merged commit 6c79dc0 into main Sep 24, 2024
14 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Sep 24, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants