name, description language-tagged support#932
Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
benjelloun
left a comment
There was a problem hiding this comment.
Thanks for adding this!
It would be good to have some tests for this functionality.
Side note: Looks like your CLA / joining the MLCommons org still hasn't gone through.
| "data": {"@id": "cr:data", "@type": "@json"}, | ||
| "dataType": {"@id": "cr:dataType", "@type": "@vocab"}, | ||
| "dct": "http://purl.org/dc/terms/", | ||
| "description": {"@container": "@language"}, |
There was a problem hiding this comment.
Does this force the description to be a language map, or does it make it optional?
There was a problem hiding this comment.
IIUC this is JSON-LD for:
when you encounter a map (container) in a
descriptionfield convert it to a list with
{"@language": key, "@value": value}elements.
So it doesn't force a language map it only sets the RDF interpretation of the keys when a map is found.
There was a problem hiding this comment.
conversion back from list form happens here: https://github.com/mlcommons/croissant/pull/932/files#diff-43214fc47248088a8c6b83e615b24058c137f57b158398855c109b5dc63677fbR180-R184
Again, this is optional and only happens when a list value is found for these fields.
| "jsonPath": "cr:jsonPath", | ||
| "key": "sc:key" if ctx is not None and ctx.is_v0() else "cr:key", | ||
| "md5": "sc:md5" if ctx is not None and ctx.is_v0() else "cr:md5", | ||
| "name": {"@container": "@language"}, |
|
recheck |
| if not isinstance(name, str): | ||
| self.add_error(f"The name should be a string. Got: {type(name)}.") | ||
| if not isinstance(name, (str, dict)): | ||
| self.add_error(f"The name should be a string or dict. Got: {type(name)}.") |
There was a problem hiding this comment.
Can you please add test cases for these errors?
We use snapshot test cases:
| actual_jsonld_type = value.get("@type") | ||
| if actual_jsonld_type == jsonld_type: | ||
| return input_type.from_jsonld(ctx, value) | ||
| elif isinstance(value, dict) and field.cardinality == "LANGUAGE-TAGGED": |
There was a problem hiding this comment.
Can you please add a test case?
There was a problem hiding this comment.
added a snapshot test case and a specific test for this new message
| """Overloads dataclasses.field with specific attributes.""" | ||
| if cardinality not in ["ONE", "MANY"]: | ||
| raise ValueError(f"cardinality should be ONE or MANY. Got {cardinality}") | ||
| if cardinality not in ["ONE", "MANY", "LANGUAGE-TAGGED"]: |
There was a problem hiding this comment.
Question: do we need ONE-LANGUAGE-TAGGED and MANY-LANGUAGE-TAGGED?
There was a problem hiding this comment.
No, here "LANGUAGE-TAGGED" is short for "either one value or a dict of one or more language-tagged values"
| "data": {"@id": "cr:data", "@type": "@json"}, | ||
| "dataType": {"@id": "cr:dataType", "@type": "@vocab"}, | ||
| "dct": "http://purl.org/dc/terms/", | ||
| "description": {"@container": "@language" |
There was a problem hiding this comment.
After changing the context, you probably want to modify all datasets in the 1.1 folder to follow it. You can do this using the migration script python/mlcroissant/mlcroissant/scripts/migrations/migrate.py.
|
Great, thank you @wardi for adding this! Do we already have a specific usecase in mind? In that case, we might want to add a dataset to showcase the new feature under |
|
@ccl-core thank you I will do that. recheck |
|
recheck |
|
Can you please take a look at the test failures in the CI? |
|
@ccl-core I've added a dataset to showcase multilingual metadata based on the 1.0 version of |
|
Hi @wardi , thank you for the PR! LGTM |
|
Hi @wardi , thank you for this PR! Everything looks good on my side, I think you can safely click on |
|
@ccl-core thanks, but someone with write access will need to merge. Is it helpful for me to squash and force push to this branch first? |
Done |
This proof of concept allows JSON-LD language-tagged strings for
nameanddescriptionfields, fixing #924It does not add support for general JSON-LD property-based indexing such as id-maps or type-maps, nor does it add support for multiple titles. i.e.
{ "name": [ {"@value": "The Queen", "@language": "en"}, {"@value": "Die Königin", "@language": "de"} ] }and
{ "name": {"en": "The Queen", "de": "Die Königin"} }are supported, but
{ "name": ["Die Königin", "Ihre Majestät"] }is not.
This implementation extends (and possibly abuses?)
field.cardinalityto add"LANGUAGE-TAGGED"as a new cardinality fornameanddescription.In Python and in the generated JSON-LD multilingual fields are always represented as a language map (dict) so that users can reference the language versions by their BCP-47 key and not have to iterate over a list comparing
"@language"values.Looking for comments on the approach before adding tests, updating the spec, ttl etc.