name, description language-tagged support by wardi · Pull Request #932 · mlcommons/croissant

wardi · 2025-08-04T23:42:39Z

This proof of concept allows JSON-LD language-tagged strings for name and description fields, fixing #924.

It does not add support for general JSON-LD property-based indexing such as id-maps or type-maps, nor does it add support for multiple titles. That is,

{
  "name": [
    {"@value": "The Queen", "@language": "en"},
    {"@value": "Die Königin", "@language": "de"}
  ]
}

and

{
  "name": {"en": "The Queen", "de": "Die Königin"}
}

are supported, but

{
  "name": ["Die Königin", "Ihre Majestät"]
}

is not.

This implementation extends (and possibly abuses?) field.cardinality to add "LANGUAGE-TAGGED" as a new cardinality for name and description.

In Python and in the generated JSON-LD, multilingual fields are always represented as a language map (dict), so that users can reference the language versions by their BCP-47 key and do not have to iterate over a list comparing "@language" values.
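
As a minimal sketch of that normalization (not the code from this PR; `to_language_map` is a hypothetical helper name used only for illustration), the list form can be folded into a language map like this:

```python
from typing import Any


def to_language_map(value: Any) -> Any:
    """Fold a list of JSON-LD language-tagged values into a {language: text} dict.

    Hypothetical helper for illustration; anything that is not a list
    (a plain string or an existing language map) is returned unchanged.
    """
    if not isinstance(value, list):
        return value
    language_map = {}
    for item in value:
        # Each item is expected to look like {"@value": ..., "@language": ...}.
        language_map[item["@language"]] = item["@value"]
    return language_map


name = to_language_map([
    {"@value": "The Queen", "@language": "en"},
    {"@value": "Die Königin", "@language": "de"},
])
print(name["de"])  # Die Königin
```

With the dict form, a caller can look up a language version directly by its key instead of scanning the list.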

Looking for comments on the approach before adding tests, updating the spec, ttl etc.

github-actions · 2025-08-04T23:42:47Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

benjelloun

Thanks for adding this!

It would be good to have some tests for this functionality.

Side note: Looks like your CLA / joining the MLCommons org still hasn't gone through.

benjelloun · 2025-08-05T08:51:24Z

python/mlcroissant/mlcroissant/_src/core/rdf.py

        "data": {"@id": "cr:data", "@type": "@json"},
        "dataType": {"@id": "cr:dataType", "@type": "@vocab"},
        "dct": "http://purl.org/dc/terms/",
+        "description": {"@container": "@language"},


Does this force the description to be a language map, or does it make it optional?

IIUC this is JSON-LD for:

when you encounter a map (container) in a description field convert it to a list with
{"@language": key, "@value": value} elements.

So it doesn't force a language map; it only sets the RDF interpretation of the keys when a map is found.

Conversion back from list form happens here: https://github.com/mlcommons/croissant/pull/932/files#diff-43214fc47248088a8c6b83e615b24058c137f57b158398855c109b5dc63677fbR180-R184

Again, this is optional and only happens when a list value is found for these fields.
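
A simplified sketch of the expansion described above, assuming every value in the map is a single string (a full JSON-LD processor handles more cases, and `expand_language_map` is a hypothetical name, not part of mlcroissant):

```python
from typing import Dict, List


def expand_language_map(language_map: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn {"en": "..."} into [{"@language": "en", "@value": "..."}].

    Illustrative only: this mirrors the interpretation a term definition
    with "@container": "@language" gives to a map, for string values.
    """
    return [
        {"@language": language, "@value": text}
        for language, text in language_map.items()
    ]


expanded = expand_language_map({"en": "The Queen", "de": "Die Königin"})
for entry in expanded:
    print(entry["@language"], entry["@value"])
```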

benjelloun · 2025-08-05T08:51:32Z

python/mlcroissant/mlcroissant/_src/core/rdf.py

        "jsonPath": "cr:jsonPath",
        "key": "sc:key" if ctx is not None and ctx.is_v0() else "cr:key",
        "md5": "sc:md5" if ctx is not None and ctx.is_v0() else "cr:md5",
+        "name": {"@container": "@language"},


Same question

wardi · 2025-08-05T14:06:03Z

recheck

marcenacp

Nice PR, thanks!

marcenacp · 2025-08-05T14:59:47Z

python/mlcroissant/mlcroissant/_src/structure_graph/base_node.py

-        if not isinstance(name, str):
-            self.add_error(f"The name should be a string. Got: {type(name)}.")
+        if not isinstance(name, (str, dict)):
+            self.add_error(f"The name should be a string or dict. Got: {type(name)}.")


Can you please add test cases for these errors?

We use snapshot test cases:

croissant/python/mlcroissant/mlcroissant/_src/datasets_test.py

Line 49 in dcc4b5d

def test_static_analysis(version, folder):
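
The type check in the diff above can be exercised on its own. This is a standalone sketch, not the snapshot test machinery referenced here: `name_errors` is a hypothetical function that only mirrors the `isinstance` check and the error message from the diff.

```python
from typing import Any, List


def name_errors(name: Any) -> List[str]:
    """Return the error messages the new check would produce for a name.

    Hypothetical stand-in for the node validation; it collects errors in a
    list instead of calling self.add_error.
    """
    errors = []
    if not isinstance(name, (str, dict)):
        errors.append(f"The name should be a string or dict. Got: {type(name)}.")
    return errors


# A plain string and a language map are both accepted; a list is rejected.
print(name_errors("The Queen"))
print(name_errors({"en": "The Queen", "de": "Die Königin"}))
print(name_errors(["Die Königin", "Ihre Majestät"]))
```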

marcenacp · 2025-08-05T15:00:47Z

python/mlcroissant/mlcroissant/_src/structure_graph/base_node.py

            actual_jsonld_type = value.get("@type")
            if actual_jsonld_type == jsonld_type:
                return input_type.from_jsonld(ctx, value)
+        elif isinstance(value, dict) and field.cardinality == "LANGUAGE-TAGGED":


Can you please add a test case?

Added a snapshot test case and a specific test for this new message.

marcenacp · 2025-08-05T15:01:25Z

python/mlcroissant/mlcroissant/_src/core/dataclasses.py

    """Overloads dataclasses.field with specific attributes."""
-    if cardinality not in ["ONE", "MANY"]:
-        raise ValueError(f"cardinality should be ONE or MANY. Got {cardinality}")
+    if cardinality not in ["ONE", "MANY", "LANGUAGE-TAGGED"]:


Question: do we need ONE-LANGUAGE-TAGGED and MANY-LANGUAGE-TAGGED?

No. Here "LANGUAGE-TAGGED" is short for "either one value or a dict of one or more language-tagged values".
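
To make that definition concrete, here is a small sketch under stated assumptions: the helper names and the error wording are hypothetical, and only the list of allowed cardinalities comes from the diff above.

```python
from typing import Any

ALLOWED_CARDINALITIES = ("ONE", "MANY", "LANGUAGE-TAGGED")


def check_cardinality(cardinality: str) -> str:
    """Reject unknown cardinalities (illustrative error message)."""
    if cardinality not in ALLOWED_CARDINALITIES:
        raise ValueError(
            f"cardinality should be one of {ALLOWED_CARDINALITIES}. Got {cardinality}"
        )
    return cardinality


def is_language_tagged_value(value: Any) -> bool:
    """True for a single string, or a non-empty dict mapping language to text."""
    if isinstance(value, str):
        return True
    if isinstance(value, dict):
        return len(value) >= 1 and all(
            isinstance(language, str) and isinstance(text, str)
            for language, text in value.items()
        )
    return False


print(is_language_tagged_value("The Queen"))
print(is_language_tagged_value({"en": "The Queen", "de": "Die Königin"}))
print(is_language_tagged_value(["Die Königin", "Ihre Majestät"]))
```

Under this reading a single cardinality covers both shapes, which is why separate ONE-LANGUAGE-TAGGED and MANY-LANGUAGE-TAGGED values are not needed.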

ccl-core · 2025-08-11T12:57:52Z

python/mlcroissant/mlcroissant/_src/core/rdf.py

        "data": {"@id": "cr:data", "@type": "@json"},
        "dataType": {"@id": "cr:dataType", "@type": "@vocab"},
        "dct": "http://purl.org/dc/terms/",
+        "description": {"@container": "@language"},


After changing the context, you probably want to modify all datasets in the 1.1 folder to follow it. You can do this using the migration script python/mlcroissant/mlcroissant/scripts/migrations/migrate.py.

ccl-core · 2025-08-11T13:00:26Z

Great, thank you @wardi for adding this! Do we already have a specific use case in mind? If so, we might want to add a dataset under datasets/1.1 to showcase the new feature. You can add e2e testing for new datasets in python/mlcroissant/mlcroissant/_src/datasets_test.py (hermetic) or python/mlcroissant/mlcroissant/_src/datasets_nonhermetic_test.py.

wardi · 2025-08-15T16:21:54Z

@ccl-core thank you, I will do that.

recheck

wardi · 2025-08-19T19:22:42Z

recheck

benjelloun · 2025-08-20T16:02:29Z

Can you please take a look at the test failures in the CI?

wardi · 2025-08-21T22:18:27Z

@ccl-core I've added a dataset to showcase multilingual metadata, based on the 1.0 version of recipes/minimal_recommended.json. Is there anything else this PR needs?

datasets/1.1/recipes/minimal_multilingual.json

ccl-core · 2025-08-22T08:50:12Z

Hi @wardi , thank you for the PR! LGTM

marcenacp

Nice PR, thanks!

ccl-core · 2025-08-25T06:30:23Z

Hi @wardi , thank you for this PR! Everything looks good on my side, I think you can safely click on Squash and merge.

wardi · 2025-08-25T12:08:51Z

@ccl-core thanks, but someone with write access will need to merge.

Is it helpful for me to squash and force push to this branch first?

benjelloun · 2025-08-25T12:11:21Z

@ccl-core thanks, but someone with write access will need to merge.

Is it helpful for me to squash and force push to this branch first?

Done

name, description language tagged support POC

13c046f

wardi requested a review from a team as a code owner August 4, 2025 23:42

benjelloun assigned wardi Aug 5, 2025

benjelloun added this to New Croissant spec features Aug 5, 2025

benjelloun reviewed Aug 5, 2025

benjelloun approved these changes Aug 5, 2025

marcenacp reviewed Aug 5, 2025

wardi added 3 commits August 5, 2025 23:27

existing tests passing

55a0730

multilingual test

9aa58cb

test formatting of language tagged values

3525826

ccl-core reviewed Aug 11, 2025

wardi added 7 commits August 20, 2025 18:26

black reformatting

1af36a3

mypy fixes

099d8c5

pytype fix

e2352de

black reformatting with --preview

d78ea7a

isort

6e2be57

multilingual dataset example

29cdc57

minimal_multilingual: source for 1.1

ab5ac04

wardi changed the title from "Proof of concept: name, description language-tagged support" to "name, description language-tagged support" Aug 21, 2025

ccl-core reviewed Aug 22, 2025

datasets/1.1/recipes/minimal_multilingual.json (outdated)

ccl-core reviewed Aug 22, 2025

datasets/1.1/recipes/minimal_multilingual.json (outdated)

ccl-core self-requested a review August 22, 2025 08:49

ccl-core requested a review from marcenacp August 22, 2025 08:50

ccl-core approved these changes Aug 22, 2025

marcenacp approved these changes Aug 22, 2025

improved translations from @ccl-core

2e381cd

benjelloun merged commit cc68c2a into mlcommons:main Aug 25, 2025
12 checks passed

github-project-automation bot moved this to Done in New Croissant spec features Aug 25, 2025

github-actions bot locked and limited conversation to collaborators Aug 25, 2025
