Language-tagged strings #22

Open
dbooth-boston opened this issue Dec 7, 2018 · 7 comments
Labels
Category: language features For language features of RDF itself -- model and syntax standards Standardization should address this

Comments

@dbooth-boston
Collaborator

They currently have a special status in RDF. "RDF 1.1 Concepts and Abstract Syntax currently contains many caveats to accommodate the idiosyncratic nature of language-tagged strings"
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0090.html

"It is a real pain to create these 3 component literals and to query for different languages and datatypes in SPARQL.
And worse still, if you want to query for strings that may or may not have language tags on, you need to do some real messing about."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0098.html

"Using a general way to make statements about literals sounds good to me.
For geographical data I also see too many statements being squashed into a
single literal. It is difficult to process and to store. . . . Why have a standard provision for
indicating the language of a text string and not its pronunciation for
example?"
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0102.html

"language codes do matter, but are pretty inconvenient for multiple reasons:

  • comparability with untyped/plain strings (of course, and most obviously
    and counter-intuitive to RDF novices),
  • complexity (BCP47 defines (a) complex selection rules among ISO 639
    language tags, and (b) complex rules for composition, e.g., with script and
    region codes), and
  • confusability (having 2-letter codes aside with 3-letter codes for the
    same language can let people used to work with 3-letter codes choose
    2-letter codes, which is an easy error to make, but can result in failure
    to compare, e.g., "cat"@eng and "cat"@en. Not sure what should happen when
    you compare "рука"@sr-Cyrl with "рука"@sr. Both are identical, the first is
    just more explicit in stating that this is Cyrillic.)
  • coverage (for many applications, ISO639 simply isn't fine-grained or
    well-defined enough, and its extension is slow, bureaucratic and doubtful)."
    https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0116.html
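
The comparability and confusability points can be made concrete with a minimal Python sketch. It assumes the RDF 1.1 notion of literal term equality (same lexical form, same datatype, same language tag, with tags compared case-insensitively). The `Literal` class and its field names are hypothetical and are not any library's API.

```python
from dataclasses import dataclass

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for an RDF 1.1 literal (not a library class)."""
    lexical: str
    lang: str = ""  # empty string means "no language tag"

    def __post_init__(self):
        # Language tags are case-insensitive, so normalize to lowercase.
        object.__setattr__(self, "lang", self.lang.lower())

    @property
    def datatype(self) -> str:
        # Tagged literals are rdf:langString; untagged ones are xsd:string.
        return RDF_LANGSTRING if self.lang else XSD_STRING


# Same word, arguably the same language, yet distinct RDF terms.
print(Literal("cat", "en") == Literal("cat", "eng"))        # False
print(Literal("cat", "en") == Literal("cat"))               # False
print(Literal("cat", "en") == Literal("cat", "EN"))         # True
print(Literal("рука", "sr-Cyrl") == Literal("рука", "sr"))  # False
```

Under this reading, only case differences in the tag are forgiven; `en` versus `eng` and `sr` versus `sr-Cyrl` fail to compare, which is the behaviour the quote complains about.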

"RDF seems to violate its own doctrine by having separate
systems for data types and languages of literals."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0143.html

IDEA: Eliminate the special status of language-tagged strings

"would it be possible to do away with the special status of language-tagged strings? . . . Would it be possible to define a regular lexical space, e.g., containing "hello@en"^^rdf:langString, together with a value-2-lexical and a lexical-2-value mapping? The N3 and SPARQL notation "hello"@en will of course still be available, and will be syntactic sugar for "hello@en"^^rdf:langString."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0090.html
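
A rough Python sketch of what such a pair of mappings might look like. The encoding (text, then `@`, then the tag) is taken from the `"hello@en"` example in the quote; splitting on the last `@` and lowercasing the tag are assumptions of this sketch, and the function names are hypothetical.

```python
def to_lexical(text: str, lang: str) -> str:
    """Value-to-lexical: join text and tag with '@' (assumed encoding)."""
    if not lang or "@" in lang:
        raise ValueError("language tag must be non-empty and contain no '@'")
    return f"{text}@{lang.lower()}"


def from_lexical(lexical: str) -> tuple[str, str]:
    """Lexical-to-value: split on the LAST '@', since the text may contain '@'."""
    text, sep, lang = lexical.rpartition("@")
    if not sep or not lang:
        raise ValueError(f"not in the assumed lexical space: {lexical!r}")
    return text, lang.lower()


print(from_lexical("hello@en"))             # ('hello', 'en')
print(from_lexical("user@example.org@en"))  # ('user@example.org', 'en')
print(to_lexical("hello", "EN"))            # hello@en
```

Splitting on the last `@` is unambiguous here because BCP 47 tags consist only of letters, digits and hyphens, so the tag itself can never contain `@`.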

"Surely languages and datatypes should simply be RDF properties of Literals, which are 1 component things?
Much easier to explain to developers, and for them to use."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0098.html

"That also fits in nicely with making it easier to represent property graphs."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0101.html

"it would be much more efficient to declare the language used only once, at the class and/or metadata level. Using plain properties to indicate language enables doing that."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0145.html

CONCERN: "The RDF 1.1 WG did spend some time [on language tags] - both on putting the langtag
into the lexical space and putting the lang tag into the datatype. Both
are not so easy; in the end the rdf:langString at least meant all
literals had a datatype."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0097.html

CONCERN: "chat"@en and "chat"@fr are different.
"chat" rdf:lang "en" .
"chat" rdf:lang "fr" .
makes every use of "chat" both @en and @fr.
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0148.html
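
The concern can be reproduced with a toy triple store in Python. This is only a sketch: triples are plain tuples, `rdf:lang` is the proposed (non-standard) property from the example above, and the `ex:` names are hypothetical.

```python
# Triples as plain tuples; a literal subject is identified by its string alone.
triples = {
    ("chat", "rdf:lang", "en"),
    ("chat", "rdf:lang", "fr"),
    ("ex:doc1", "ex:title", "chat"),  # hypothetical: intended as English
    ("ex:doc2", "ex:title", "chat"),  # hypothetical: intended as French
}


def languages_of(literal: str) -> set[str]:
    """All languages asserted for a literal via the proposed rdf:lang property."""
    return {o for s, p, o in triples if s == literal and p == "rdf:lang"}


for doc in ("ex:doc1", "ex:doc2"):
    title = next(o for s, p, o in triples if s == doc and p == "ex:title")
    print(doc, title, sorted(languages_of(title)))
# ex:doc1 chat ['en', 'fr']
# ex:doc2 chat ['en', 'fr']
```

Because both titles are the same node, there is no way to tell from the graph which document meant the English word and which the French one.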

"I think the only way to avoid this would be if subject literals are
taken as a notational short-hand for a blank node that carries the literal
as an rdf:value. (And, in a separate step, a problem-specific bnode
skolemization routine could be provided to give it a proper URI.)"
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0156.html
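
One possible reading of the blank-node short-hand, sketched in Python. The `expand` helper, the `_:bN` labelling scheme and the `ex:` names are hypothetical; `rdf:lang` is the proposed property from the preceding concern, not part of RDF 1.1.

```python
import itertools

_bnode_ids = itertools.count(1)


def expand(subject: str, predicate: str, text: str, lang: str) -> list[tuple[str, str, str]]:
    """Replace one tagged literal with a fresh blank node carrying
    rdf:value and rdf:lang (rdf:lang is the proposed, non-standard property)."""
    bnode = f"_:b{next(_bnode_ids)}"
    return [
        (subject, predicate, bnode),
        (bnode, "rdf:value", text),
        (bnode, "rdf:lang", lang),
    ]


graph = expand("ex:doc1", "ex:title", "chat", "en") + expand("ex:doc2", "ex:title", "chat", "fr")
for triple in graph:
    print(triple)
```

Each occurrence of "chat" now gets its own node, so the English and French annotations no longer land on the same subject.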

"I really don't have a problem with every instance of "chat"^^xsd:string being both en and fr if someone has asserted that using rdf:lang. . . . Basically I think language tags are trying to avoid having to say in RDF what should be in the RDF."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0164.html

IDEA: Use W3C OntoLex / lemon as a basis for language tagging

"[It] is possible already [to declare language only once, at the class and/or metadata level] (using the pointers to ISO639 URIs in my earlier mail), and it is recommended practice to do so in OntoLex/lemon . . . . OntoLex is . . . a W3C community group report, but it would
be the most suitable basis for future standardization efforts in this direction."
https://www.w3.org/2016/05/ontolex/#lexicon-and-lexicon-metadata
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0145.html

IDEA: Use URIs to identify language

"A much more convenient solution would be to identify the language by means
of a URI. This can be an ISO 639 category (see under
http://id.loc.gov/vocabulary/iso639-2.html and
http://id.loc.gov/vocabulary/iso639-1.html; for ISO 639, cf.
http://www.lexvo.org/), or provided by another authority (e.g.,
https://glottolog.org/). Other properties (e.g., xsd datatypes) could also
be stated about a literal. Two strings could be considered identical if the
values are the same and the properties of one are a proper subset of the
properties of the other."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0116.html
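
The subset-based comparison proposed in the quote might look like this in Python. It is a sketch under assumptions: properties are modelled as a set of `(property, value)` pairs, the `ex:` property names are hypothetical, and "proper subset" is read loosely so that equal property sets also match.

```python
def considered_identical(value_a: str, props_a: frozenset,
                         value_b: str, props_b: frozenset) -> bool:
    """Same string value, and one property set is contained in the other."""
    if value_a != value_b:
        return False
    return props_a <= props_b or props_b <= props_a


sr = frozenset({("ex:language", "sr")})
sr_cyrl = frozenset({("ex:language", "sr"), ("ex:script", "Cyrl")})
en = frozenset({("ex:language", "en")})
fr = frozenset({("ex:language", "fr")})
plain = frozenset()

print(considered_identical("рука", sr_cyrl, "рука", sr))  # True
print(considered_identical("chat", en, "chat", fr))       # False
print(considered_identical("chat", plain, "chat", en))    # True
print(considered_identical("chat", plain, "chat", fr))    # True
```

Note that the relation is not transitive: the untagged "chat" matches both the English and the French one, which do not match each other, so it cannot serve directly as term equality.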

"a downward-compatible notation is possible:

  • take @ as a short-hand for ^^xsd:string, with language identifiers
    following
  • if the language identifier is not a URI, it must be BCP47
  • BCP47 codes can be decomposed in the background into their sub-properties
  • permit multiple language URIs/BCP47 codes (if you want to provide both a
    BCP47 code [indicating region and script] and a URI [unambiguously
    identifying the language])
  • let plain literals be untyped"
    https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0119.html
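
The "decomposed in the background" step could be sketched in Python as follows. The pattern is deliberately simplified to `language[-script][-region]`; real BCP 47 tags also allow extended language subtags, variants, extensions and private-use subtags, which this sketch rejects. The `decompose` function and its key names are hypothetical.

```python
import re

# Simplified pattern: language[-script][-region]. Anything more elaborate
# (variants, extensions, private use) is out of scope for this sketch.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def decompose(tag: str) -> dict[str, str]:
    """Split a simple BCP 47 tag into its language/script/region subtags."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported tag in this sketch: {tag!r}")
    parts = {k: v for k, v in match.groupdict().items() if v is not None}
    # Conventional casing: language lower, script Title, region UPPER.
    parts["language"] = parts["language"].lower()
    if "script" in parts:
        parts["script"] = parts["script"].title()
    if "region" in parts:
        parts["region"] = parts["region"].upper()
    return parts


print(decompose("sr-Cyrl"))     # {'language': 'sr', 'script': 'Cyrl'}
print(decompose("en-GB"))       # {'language': 'en', 'region': 'GB'}
print(decompose("zh-hant-tw"))  # {'language': 'zh', 'script': 'Hant', 'region': 'TW'}
```

Each resulting key could then be emitted as a separate property of the string, which is what makes "рука"@sr-Cyrl and "рука"@sr comparable on their shared language subtag.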

CONCERN: "No. All literals MUST have a type, so that queries can have a
unique response when they ask for the type or specify the type.
The RDF 1.1 WG spent a lot of time and effort on this. Allowing
untyped plain literals in RDF 2004 was a bug. Please do not screw
this up again. Plain literals are syntactically legal (to
preserve backward compatibility) but they now have type xsd:string."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0149.html

"But this only means that "рука" entails [a xsd:string] . . . .
As far as comparisons between strings are concerned, this makes no
difference to the example, as the subset relation between the (implicit)
properties of "рука"@sr and "рука" still holds"
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0152.html

@dbooth-boston dbooth-boston added the Category: language features For language features of RDF itself -- model and syntax label Dec 8, 2018
@kasei

kasei commented Dec 10, 2018

I'm concerned that a lot of the proposed solutions would bring with them just as many drawbacks. If a big part of the issue here is the challenge of querying language data, perhaps we should be looking first at the ergonomics of using SPARQL on this data and how that could be improved.

@iherman
Member

iherman commented Dec 14, 2018

I think that looking at language tags in isolation may not be the right approach. To be really international in nature, there are a number of things one may want to "say" about the text, and the natural language is only one of those. For example:

etc. The experience I have with other specifications that rely on RDF literals (through serializations like JSON-LD) is that these issues come and bite you all the time.

Yes, this all may converge towards the separate issue on literal as subject (#21), and may force us to fundamentally re-think how RDF treats literals.

@draggett
Member

A given word may be applicable to multiple languages, especially for loan words where one language borrows from another. Together with @iherman's points about other related kinds of properties, this suggests that we need a means to model the combination of a string literal with a given set of properties.

I suspect this also fits in with the desire to be able to model property graphs where nodes and links can be associated with sets of property-value pairs, where the values can themselves be sets of property-values and so forth recursively.

Once you have that, it is straightforward to model a value as being a word in a given language with a given pronunciation, writing direction and so forth. We would still need a small set of core data types, e.g. string, number, boolean, ID, link, but others could be layered on top with properties as annotations. A node that is used for a natural language word or phrase could have one property for the string value, another for the language, and another for the pronunciation.

I will expand on this further in another issue.

@HughGlaser
Collaborator

Hi @iherman,
If I understand you correctly, I think I agree.
The more knowledge we put into the representation of literals directly, rather than as properties, the harder it is to process it in RDF/SPARQL.
A couple of comments on your bullets:

  • base direction - I would find it strange to add that to a Literal. It seems to me that the direction is an aspect of the script. So the IANA guidelines say that the scripts are suppressed for Latn, Hebr etc., but they do have scripts, which I assume have default direction. Adding anything so that we can have right-to-left Latn or left-to-right Arab seems a bit over the top.

  • Pronunciation hints - I really find that indigestible :-). It is bad enough (in my view!) deciding that a Literal (collection of symbols) such as "chat" has a specific language associated with it at all. It is much worse to say not only is it associated with a particular language, but that it is even associated with pronouncing that symbol collection in a particular way. What about "read"@en-GB (reed & red)? Or "potato"@en? This sort of stuff really should be attached to a URI that is the word, as in Wordnet or whatever.

@iherman
Member

iherman commented Dec 14, 2018

Hi @HughGlaser

  • On the base direction: what I have learnt is that bidirectional text is sometimes insanely complicated, and that is one of the reasons why HTML keeps the dir attribute... Worth looking at the tutorials prepared by the I18N activity at W3C (there are links from that page above).
  • For the pronunciation: think of a text written in Japanese Kanji and then a version of the text in Hiragana to make it easier to see the pronunciation. Or the usage of bopomofo in traditional Chinese or ruby in Japanese... These may all be necessary for a piece of text and currently it is a mess to add them in a consistent way when RDF/JSON-LD is used for metadata.

I do not want to present myself as an i18n expert, I am very very far from it. I just think that RDF literals may have serious i18n issues, and if we want to review literals in general, we will have to seriously look at that, too...

@HughGlaser
Collaborator

Yeah @iherman no problem with any of that.
The language issues are big, and should be worried about.
My worry is if people want to push things into the language tag world, rather than a more comprehensive/cleaner solution where the knowledge about the symbols is represented in RDF.
Possibly just to avoid a triple or two that actually says the right thing.
(I've never tried to do a lot in JSON-LD, so can't comment on those issues - I tend to prefer N3 or whatever.)

@dbooth-boston
Collaborator Author

My personal hope is that we:

  • define a standard form of n-ary relation to represent language-tagged strings as a molecule of triples, consistent with features that i18n experts have found important;
  • offer a convenient syntactic sugar for writing them in a higher-level RDF language; and
  • define a mapping to/from the existing RDF 1.1 language-tagging mechanism, to retain backward compatibility.
