Skip to content

name, description language-tagged support#932

Merged
benjelloun merged 12 commits intomlcommons:mainfrom
wardi:multilingual-name-desc
Aug 25, 2025
Merged

name, description language-tagged support#932
benjelloun merged 12 commits intomlcommons:mainfrom
wardi:multilingual-name-desc

Conversation

@wardi
Copy link
Copy Markdown
Contributor

@wardi wardi commented Aug 4, 2025

This proof of concept allows JSON-LD language-tagged strings for name and description fields, fixing #924

It does not add support for general JSON-LD property-based indexing such as id-maps or type-maps, nor does it add support for multiple titles. i.e.

{
  "name": [
    {"@value": "The Queen", "@language": "en"},
    {"@value": "Die Königin", "@language": "de"}
  ]
}

and

{
  "name": {"en": "The Queen", "de": "Die Königin"}
}

are supported, but

{
  "name": ["Die Königin", "Ihre Majestät"]
}

is not.

This implementation extends (and possibly abuses?) field.cardinality to add "LANGUAGE-TAGGED" as a new cardinality for name and description.

In Python and in the generated JSON-LD multilingual fields are always represented as a language map (dict) so that users can reference the language versions by their BCP-47 key and not have to iterate over a list comparing "@language" values.

Looking for comments on the approach before adding tests, updating the spec, ttl etc.

@wardi wardi requested a review from a team as a code owner August 4, 2025 23:42
@github-actions
Copy link
Copy Markdown

github-actions bot commented Aug 4, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Copy link
Copy Markdown
Contributor

@benjelloun benjelloun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this!

It would be good to have some tests for this functionality.

Side note: Looks like your CLA / joining the MLCommons org still hasn't gone through.

"data": {"@id": "cr:data", "@type": "@json"},
"dataType": {"@id": "cr:dataType", "@type": "@vocab"},
"dct": "http://purl.org/dc/terms/",
"description": {"@container": "@language"},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this force the description to be a language map, or does it make it optional?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is JSON-LD for:

when you encounter a map (container) in a description field convert it to a list with
{"@language": key, "@value": value} elements.

So it doesn't force a language map it only sets the RDF interpretation of the keys when a map is found.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conversion back from list form happens here: https://github.com/mlcommons/croissant/pull/932/files#diff-43214fc47248088a8c6b83e615b24058c137f57b158398855c109b5dc63677fbR180-R184

Again, this is optional and only happens when a list value is found for these fields.

"jsonPath": "cr:jsonPath",
"key": "sc:key" if ctx is not None and ctx.is_v0() else "cr:key",
"md5": "sc:md5" if ctx is not None and ctx.is_v0() else "cr:md5",
"name": {"@container": "@language"},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question

@wardi
Copy link
Copy Markdown
Contributor Author

wardi commented Aug 5, 2025

recheck

Copy link
Copy Markdown
Contributor

@marcenacp marcenacp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR, thanks!

if not isinstance(name, str):
self.add_error(f"The name should be a string. Got: {type(name)}.")
if not isinstance(name, (str, dict)):
self.add_error(f"The name should be a string or dict. Got: {type(name)}.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add test cases for these errors?

We use snapshot test cases:

def test_static_analysis(version, folder):

actual_jsonld_type = value.get("@type")
if actual_jsonld_type == jsonld_type:
return input_type.from_jsonld(ctx, value)
elif isinstance(value, dict) and field.cardinality == "LANGUAGE-TAGGED":
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a test case?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a snapshot test case and a specific test for this new message

"""Overloads dataclasses.field with specific attributes."""
if cardinality not in ["ONE", "MANY"]:
raise ValueError(f"cardinality should be ONE or MANY. Got {cardinality}")
if cardinality not in ["ONE", "MANY", "LANGUAGE-TAGGED"]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: do we need ONE-LANGUAGE-TAGGED and MANY-LANGUAGE-TAGGED?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, here "LANGUAGE-TAGGED" is short for "either one value or a dict of one or more language-tagged values"

"data": {"@id": "cr:data", "@type": "@json"},
"dataType": {"@id": "cr:dataType", "@type": "@vocab"},
"dct": "http://purl.org/dc/terms/",
"description": {"@container": "@language"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After changing the context, you probably want to modify all datasets in the 1.1 folder to follow it. You can do this using the migration script python/mlcroissant/mlcroissant/scripts/migrations/migrate.py.

@ccl-core
Copy link
Copy Markdown
Contributor

Great, thank you @wardi for adding this! Do we already have a specific usecase in mind? In that case, we might want to add a dataset to showcase the new feature under datasets/1.1. You can add e2e testing for new datasets in python/mlcroissant/mlcroissant/_src/datasets_test.py (hermetic) or python/mlcroissant/mlcroissant/_src/datasets_nonhermetic_test.py

@wardi
Copy link
Copy Markdown
Contributor Author

wardi commented Aug 15, 2025

@ccl-core thank you I will do that.

recheck

@wardi
Copy link
Copy Markdown
Contributor Author

wardi commented Aug 19, 2025

recheck

@benjelloun
Copy link
Copy Markdown
Contributor

Can you please take a look at the test failures in the CI?

@wardi
Copy link
Copy Markdown
Contributor Author

wardi commented Aug 21, 2025

@ccl-core I've added a dataset to showcase multilingual metadata based on the 1.0 version of recipes/minimal_recommended.json is there anything else this PR needs?

@wardi wardi changed the title Proof of concept: name, description language-tagged support name, description language-tagged support Aug 21, 2025
@ccl-core ccl-core self-requested a review August 22, 2025 08:49
@ccl-core
Copy link
Copy Markdown
Contributor

Hi @wardi , thank you for the PR! LGTM

@ccl-core ccl-core requested a review from marcenacp August 22, 2025 08:50
Copy link
Copy Markdown
Contributor

@marcenacp marcenacp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR, thanks!

@ccl-core
Copy link
Copy Markdown
Contributor

Hi @wardi , thank you for this PR! Everything looks good on my side, I think you can safely click on Squash and merge.

@wardi
Copy link
Copy Markdown
Contributor Author

wardi commented Aug 25, 2025

@ccl-core thanks, but someone with write access will need to merge.

Is it helpful for me to squash and force push to this branch first?

@benjelloun benjelloun merged commit cc68c2a into mlcommons:main Aug 25, 2025
12 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 25, 2025
@benjelloun
Copy link
Copy Markdown
Contributor

@ccl-core thanks, but someone with write access will need to merge.

Is it helpful for me to squash and force push to this branch first?

Done

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

4 participants