Specifying foreign language in case of code switching #776

Open
dan-zeman opened this issue Apr 20, 2021 · 25 comments

Comments

@dan-zeman
Member

Sometimes it is desirable to be able to say that a token is in a language different from the main language of the file, and to specify the foreign language. Some corpora have occasional code switching, others have a lot of it. And if the annotators decide to actually annotate the foreign segment following the foreign language rules, the validator needs to know that it should temporarily switch to a different list of auxiliaries and morphological features.

I have now modified the validator so that if it sees Lang=en in the MISC column, it will switch to English for the current token (it affects the auxiliary list, copula list, and feature-value-UPOS combinations). The value is the ISO 639 code as registered for the language in UD (either two-letter ISO 639-1, or three-letter ISO 639-3); it must be lowercased. I have also documented it here, here, and here. So if you need to specify the foreign language for the validator, you can now do so.
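As a rough illustration of the rule described above (this is not the validator's actual code), the per-token language could be resolved as follows. The helper names `parse_misc` and `token_language` are hypothetical, and the assumption that a code is two or three lowercase ASCII letters is a simplification of "registered in UD".

```python
import re

# Assumption: a UD language code is two or three lowercase ASCII letters.
# The real validator additionally checks that the code is registered in UD.
LANG_CODE = re.compile(r"[a-z]{2,3}")


def parse_misc(misc):
    """Split a CoNLL-U MISC string into a dict; '_' means the column is empty."""
    if not misc or misc == "_":
        return {}
    result = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        # Items without '=' are kept with a value of None.
        result[key] = value if sep else None
    return result


def token_language(misc, main_lang):
    """Return the language whose rules apply to one token (hypothetical helper)."""
    lang = parse_misc(misc).get("Lang")
    if lang is not None and LANG_CODE.fullmatch(lang):
        return lang
    # No usable Lang attribute: fall back to the main language of the file.
    return main_lang
```

With this sketch, `token_language("Lang=en", "fr")` yields `"en"`, while an uppercase value such as `Lang=EN` is ignored and the main language is used.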

Nevertheless, I also found out that various treebanks already try to indicate the language in MISC, and given the lack of standard so far, various approaches are taken:

  • French GSD and Naija NSC use the format I use for the validator (but in the case of Naija, maybe it is meant to indicate where the word has been taken from, rather than to say that it is not a Naija word).
  • Komi Zyrian uses Lang=Mixed, Lang=Rus (the values are not UD-registered ISO codes).
  • Frisian-Dutch uses lang=fy, lang=nl (the name of the attribute is not capitalized).
  • Turkish-German uses LangID=TR, LangID=DE, LangID=OTHER (different attribute name; uppercase language codes will not be recognized).
  • Hindi-English uses just hi or en, without saying that these are language codes.

It would be nice if we could harmonize these annotations in future releases. Although not every treebank needs the validator to recognize them (e.g., the massively code-switching treebanks are registered under special user-defined language codes, such as qtd for Turkish-German, so they have their own set of language-specific guidelines).
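A minimal sketch of what such a harmonization pass over the MISC column might look like, covering only the variants listed above. The mapping table and the function name `harmonize_misc` are assumptions for illustration; values with no obvious ISO equivalent (`Mixed`, `OTHER`) are deliberately left untouched.

```python
# Hypothetical mapping from the non-standard values seen in the treebanks
# above to lowercase codes; extend it per treebank as needed.
VALUE_MAP = {
    "rus": "ru", "tr": "tr", "de": "de",
    "fy": "fy", "nl": "nl", "hi": "hi", "en": "en",
}
# Attribute names (compared case-insensitively) that carry a language code.
ATTR_NAMES = {"lang", "langid"}
# Bare items that are language codes without an attribute name (Hindi-English).
BARE_CODES = {"hi", "en"}


def harmonize_misc(misc):
    """Rewrite known language annotations in a MISC string as Lang=<code>."""
    if misc == "_":
        return misc
    out = []
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        if sep and key.lower() in ATTR_NAMES:
            code = VALUE_MAP.get(value.lower())
            # Unknown values such as Mixed or OTHER are kept as they are.
            out.append("Lang=" + code if code else item)
        elif not sep and item in BARE_CODES:
            out.append("Lang=" + item)
        else:
            out.append(item)
    return "|".join(out)
```

For example, `harmonize_misc("LangID=TR")` gives `Lang=tr`, and `harmonize_misc("Lang=Mixed")` is returned unchanged.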

@KonstantinosSampanis

In the long run this will also be useful for our Cappadocian Greek (CG) Corpus (AMGiC). Right now, tentatively, I have marked all elements of Turkish origin with #. I am just wondering if it is theoretically justifiable to treat as "code-switching" any lexical item of Turkish origin already incorporated into CG as an integral component. This is actually something we will have to examine for our Corpus before we move on.

@dan-zeman
Member Author

I am just wondering if it is theoretically justifiable to treat as "code-switching" any lexical item of Turkish origin already incorporated into CG as an integral component.

I suppose there is a scale and you will have to decide for yourself where exactly to draw the line. But if a foreign word has been integrated in the host language, then it is a loanword rather than code switching, that is, we treat it the same way as native words. (You could still define another MISC attribute that would mark the origin of loanwords, but it would not affect validation. And vice versa, if you mark a foreign word with Lang=tr, you will actually have to make sure it adheres to the UD guidelines for Turkish.)

@amir-zeldes
Contributor

This same question applies to UD_Coptic: we actually have extensive source language annotations for all loan words in the data, but 99% are what you would call integrated (so just individual loan words, not a Greek sentence in the middle of everything).

We've considered making them Foreign=Yes and leaving them at that, but if there is a recommended way of including the source language, that's something we actually know, so we could add it.

@nikopartanen
Contributor

This is a very welcome development, and with the Komi treebanks we are more than happy to harmonize the tags. I'll change Lang=Rus to Lang=ru. So far I have used Mixed for situations where there is a Russian stem but Komi morphology, which also happens with words that are not very strongly established as loanwords. But we can still treat them as Komi for UD purposes. Foreign=Yes, or some variant of it, could also be an option. Or we can store that information at the lemma level somewhere else, if it doesn't belong in the treebanks.

@dan-zeman
Member Author

Yes, if it takes target language morphology (which is different from the source language morphology), then we have to treat it as a target language word.

May I propose OrigLang as the optional MISC attribute that indicates language of origin without instructing the validator to treat the word according to the source language guidelines? This is just an attempt to have a standard-ish solution in case multiple people want to encode something like that. The values could be again the lowercase language codes, but it could be also languages that are not yet covered by UD, and perhaps some other strings if needed.

@Stormur
Contributor

Stormur commented Apr 21, 2021

It is a very nice addition! We have recently uploaded a new Latin treebank where there are quite a lot of "foreign" words and pieces of sentences, and it would be nice to annotate them, too. Also, in general, in Classical Latin texts it is not so uncommon to have entire phrases in Greek, and given their "morphological proximity" it would be nice to integrate their annotation in the treebank, too.

Now, is this "code switching" limited to languages already present in UD? I wonder if it is possible to use also other codes, and if there is room too for dialectal and diachronic variation (and maybe both at the same time).

@dan-zeman
Member Author

The Lang attribute is limited to languages known by the UD infrastructure. If you use a valid but unknown ISO 639-3 code, the validator will tell you that there are no morphological features allowed in this language. However, it is not required that there is a UD treebank of this language. So if you really need this for a language not yet covered by UD (and if you are willing to take care of documentation and validation rules for that language), I can register it without a treebank.

Dialectal and diachronic variation is occasionally supported (with an unsatisfactory level of granularity) by the ISO codes, so you can distinguish Old French (fro) from French (fr). We treat them as separate languages. Any finer distinctions have to be solved by treebank providers within the language-specific guidelines. I have seen a treebank where dialect-specific or historical spelling was normalized, and the original spelling was shown only as a sentence-level attribute (# text_orig = or something like that). The other option is to keep the non-standard spelling in the FORM column, and only normalize LEMMA (+ maybe show the standard form in MISC; see also https://universaldependencies.org/u/overview/typos.html#historical-spelling). If the lemma is not normalized or if the language uses multiple writing systems, then the validator will have to be given all existing lemmas of copulas, for example (as we now have it for Bokmål/Nynorsk in Norwegian: være vs. vere).

@amir-zeldes
Contributor

OK, I'm ready to add this for UD_Coptic, but the validator says:

[Line 11 Sent shenoute_a22-a22_YA421-428_s0001]: [L4 Morpho feature-upos-not-permitted] Feature Foreign is not permitted with UPOS CCONJ in language [cop].
[Line 44 Sent shenoute_a22-a22_YA421-428_s0001]: [L4 Morpho feature-upos-not-permitted] Feature Foreign is not permitted with UPOS SCONJ in language [cop].
[Line 91 Sent shenoute_a22-a22_YA421-428_s0002]: [L4 Morpho feature-upos-not-permitted] Feature Foreign is not permitted with UPOS NOUN in language [cop].
...

I thought Foreign is a universal feature so it doesn't need to be added in a special way, no? I also noticed it's not saying it doesn't know the feature name, it says it's not allowed with each upos. @dan-zeman could you advise on how to allow Foreign for this data?

@dan-zeman
Member Author

Here. But I enabled the feature for all UPOS categories in Coptic now, so the errors should disappear.

The feature indeed is known because it is documented globally. But even universal features are now configured individually because some languages do not use some of them (e.g., Evident), some languages use only some values (e.g., English has two values of Case while Hungarian has over 20), and some languages use some values only for some UPOS categories (e.g. English uses Case only with pronouns). Foreign could be a special case (together with Abbr and Typo) because it has only one value and it makes sense with any UPOS in any language, but for simplicity it is maintained the same way as other features. It has been initialized according to the actual usage in the data when this feature registration system was introduced, that's why in Coptic it was permitted only with X until today.
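To make the three levels of configuration concrete (feature, value, UPOS), here is a minimal sketch of such a lookup. The table contents are a hypothetical excerpt, not the real registration data, and only the last message mirrors the wording of the validator output quoted earlier in this thread; the other two messages are invented for illustration.

```python
# Hypothetical excerpt: language -> feature -> value -> set of permitted UPOS.
# The Coptic entry reflects the state described above, before Foreign was
# enabled for all UPOS categories.
PERMITTED = {
    "cop": {"Foreign": {"Yes": {"X"}}},
    "en": {"Case": {"Nom": {"PRON"}, "Acc": {"PRON"}}},
}


def check_feature(lang, upos, feature, value):
    """Return an error message, or None if the combination is permitted."""
    features = PERMITTED.get(lang, {})
    if feature not in features:
        # Illustrative wording, not taken from the validator.
        return f"Feature {feature} is not permitted in language [{lang}]."
    values = features[feature]
    if value not in values:
        # Illustrative wording, not taken from the validator.
        return f"Value {value} of feature {feature} is not permitted in language [{lang}]."
    if upos not in values[value]:
        return f"Feature {feature} is not permitted with UPOS {upos} in language [{lang}]."
    return None
```

In this sketch, `check_feature("cop", "X", "Foreign", "Yes")` passes, whereas the same feature on `CCONJ` is rejected until `CCONJ` is added to the permitted set.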

@amir-zeldes
Contributor

Perfect, thanks! UD_Coptic now has Foreign and the proposed OrigLang. It looks like about 8% of tokens in the treebank are loanwords (Coptic is a contact language with strong Greek influence), so this adds a lot of information.

BTW I think I asked about this in some issue somewhere, but is there a recommendation on how to handle 'partly' loaned words? These all got the same Foreign=Yes right now, but we can actually distinguish partly foreign words in the source data for the corpus, so we could make the distinction. I mean derivations or incorporations like "Zeitgeistyness" in English (German base, English derivations). There are quite a few of these in the data.

@dan-zeman
Member Author

is there a recommendation on how to handle 'partly' loaned words?

I don't know about any recommendation in the guidelines, although there might be some discussion somewhere in the issues. For me, Zeitgeistyness is an English word (because it is no longer German), i.e., Foreign=Yes does not apply.

@amir-zeldes
Contributor

I don't need for these words to be called Foreign=Yes in particular, but they are of great interest to people working on language contact, since they are the most integrated examples of borrowing, and so I think they should be made findable (especially since we already have the annotations). Does anyone have something similar or a suggestion what to call this? And should it be in FEATS or MISC? Another option is a value Foreign=Partly or Foreign=Base, or "derived" or something else.

@KonstantinosSampanis

For the time being, i.e. tentatively, in AMGiC we mark elements of Turkish origin occurring in Cappadocian Greek with a LC = YES tag (with LC standing for 'Language Contact'), followed in turn by a tag defining the grammatical status of the loanword (we focus on morphosyntactic borrowing).

@ozlemcek
Contributor

Dan, thank you for bringing up the topic, here is my attempt for harmonisation. Sorry for being late to the party, on the bright side, it gives plenty of time for discussion and implementation before the next release.

The Turkish-German SAGT treebank uses language IDs adapted from the Code-Switching Workshop shared tasks on language identification. Each token bears one of the following labels as the value of the LangID feature:

TR: Turkish
DE: German
LANG3: A third language, e.g. English, Spanish,... in this context
MIXED: word-internal code-switching
OTHER: Punctuation, numbers, emoticons, symbols, and any other token that does not fall into the other categories

Ideally, I want to keep the information these labels convey and follow a UD-general scheme.

Option 1: Renaming LangID to Lang and mapping its values to ISO codes.

  • TR -> tr and DE -> de are straightforward, LANG3 can be converted to their respective ISO codes.

  • MIXED tokens are usually non-Turkish words with Turkish inflection.

In the example sentence "In dem dritten Semesterda Java gelecek" (In the third semester Java will come), the German "Semester" takes the Turkish locative case marker "da".

4 Semesterda Semester NOUN _ Case=Loc|Number=Sing 6 obl _ LangID=MIXED|DeGender=Neut|DeCase=Dat

In mixed words we decided to follow the morphological features of the head's language (in this example, according to German rules, Case would be Dat, we keep this information in the MISC column).

The SAGT treebank has its own language code qtd, perhaps I can use it for mixed words?

  • OTHER tokens are not language-specific, so I'm not sure how to label them with ISO codes. Any suggestions?

There could also be another issue. The validator also checks the possible deprel list of a language, right?
In the sentence "... bana gibt es schon so gute Impulse" (it already gives me such good impulses), the Turkish pronoun "bana" (to me) is annotated as iobj as its head is German, but there is no iobj in Turkish.
If the annotation is Lang=tr, would the validator complain about seeing an iobj?

11      bana    ben     PRON    _       Case=Dat|Number=Sing|Person=1|PronType=Prs      12      iobj    _       LangID=TR
12      gibt    geben   VERB    _       Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   3       conj    _       LangID=DE

Option 2: Renaming LangID with another feature name (to avoid confusions) and keeping its values. Adding the Lang feature with values as follows:

  • ISO codes for TR, DE, LANG3 tokens
  • qtd for MIXED tokens
  • no Lang feature for OTHER tokens
  • For tokens with dependencies that come up due to code-switching, using the label qtd, after all it is the ISO code for the new combined language. In this case, "bana" in the example above will have Lang=qtd although it is Turkish.
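Option 2 can be summarized as a small decision function. This is only a sketch of the proposal as listed above: the function name, the `lang3_code` parameter (the ISO code an annotator would supply for a LANG3 token) and the `cs_dependency` flag (marking a token whose dependency arises from code-switching) are all hypothetical.

```python
# Direct mappings for the two main languages of the treebank.
ISO = {"TR": "tr", "DE": "de"}


def lang_for_option2(lang_id, lang3_code=None, cs_dependency=False):
    """Return the Lang value for a token under Option 2, or None for no Lang.

    Raises KeyError for a LangID label outside the five listed above.
    """
    if lang_id == "OTHER":
        # Punctuation, numbers, symbols: no Lang attribute at all.
        return None
    if lang_id == "MIXED" or cs_dependency:
        # Word-internal switching, or a relation caused by code-switching.
        return "qtd"
    if lang_id == "LANG3":
        # The third language must be identified separately.
        return lang3_code
    return ISO[lang_id]
```

Under this sketch, "bana" from the example above would be `lang_for_option2("TR", cs_dependency=True)`, which gives `qtd` even though the word itself is Turkish.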

@ozlemcek ozlemcek reopened this Jun 18, 2021
@dan-zeman dan-zeman modified the milestones: v2.8, v2.9 Jun 18, 2021
@dan-zeman
Member Author

I did not remember whether the Lang attribute would be considered when checking the deprel but I just tried on my copy of the data marking bana as Lang=tr, and it still passes the validation, despite the iobj relation. It is not clear what exactly the validator should do with deprels in this respect because the deprel classifies a relation between two words rather than being a property of the dependent word. Perhaps it should consider the language of both words and permit a union of the known relations from the two languages. Right now it probably just takes the list from the main language, which is qtd here.

On the other hand, morphological features are language-specific. If I add Poss=Yes to the features of bana, it is accepted with Lang=qtd and Lang=de, but rejected with Lang=tr.

I find both options you propose reasonable. Option 2 would cost more space but it would allow you to preserve your original scheme more faithfully, which may be desirable if this scheme is already a de facto standard in code-switching NLP. I did not assume that the Lang attribute would be used for every token in the treebank (quite the opposite: in most treebanks it would be used only exceptionally, when a foreign phrase sneaks into a host-language sentence). Therefore, omitting the attribute for tokens where you currently have LangID=OTHER is a straightforward solution.
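The idea of permitting a union of the relations known in the two languages is not implemented; a minimal sketch of how it might work is given here. The relation inventories are hypothetical and heavily abbreviated, and the only property taken from this thread is that Turkish has no iobj.

```python
# Hypothetical, abbreviated inventories of permitted relations per language.
DEPRELS = {
    "tr": {"nsubj", "obj", "obl"},
    "de": {"nsubj", "obj", "iobj", "obl"},
}


def permitted_deprels(main_lang, dep_lang=None, head_lang=None):
    """Union of the relations known for the dependent's and the head's language.

    A token without a Lang attribute counts as being in the main language.
    """
    languages = {dep_lang or main_lang, head_lang or main_lang}
    allowed = set()
    for lang in languages:
        allowed |= DEPRELS.get(lang, set())
    return allowed
```

For a Turkish dependent attached to a German head, `permitted_deprels("tr", dep_lang="tr", head_lang="de")` would then include `iobj`, while a purely Turkish pair would not.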

@ozlemcek
Contributor

Thanks for checking the validator. qtd is really the union of tr and de relations so iobj is in its deprel list. It won't be an issue for the Turkish-German treebank then, but it could be a good solution for monolingual treebanks with some code-switching if the validator permits a union of the known relations from the two languages. Regarding the morphological features, I think using Lang=qtd would cover all non-standard morphology.

I checked Frisian-Dutch and Hindi-English treebanks which identify themselves as code-switching treebanks. They both assign a language ID for each token. In code-switching NLP, language ID prediction is a common task, therefore there is the tradition of annotating each token (although the tag sets differ). For monolingual treebanks with code-switching, it makes sense that not all tokens have a Lang feature.

I will go with Option 2 at the moment, mainly to keep language IDs for each token and to keep the OTHER class intact. It also allows backwards compatibility with the language identification and morphological analysis models we released.

@dan-zeman dan-zeman modified the milestones: v2.9, v2.11 Jun 13, 2022
@dan-zeman dan-zeman modified the milestones: v2.11, v2.13 May 31, 2023
@Stormur
Contributor

Stormur commented Oct 30, 2023

Coming back to this issue with regard to the interaction between language specification and validation.

I would like to propose that the validator only issue warnings instead of errors for words tagged with Foreign=Yes in its features and Lang in MISC. The proposal is motivated by the following occurrences:

  • it is well possible that a language appears for which there is no treebank in UD, and then every morphological feature is evaluated as an error. For example, in Latin treebanks we have Aramaic terms, so I would like to use the ISO code jpa, but as far as I know it is not present in UD. Or, there are Hebrew terms and I would like to use hbo for Ancient Hebrew: Modern Hebrew (heb) exists in UD, but it is simply not a correct choice.
  • sometimes one does not precisely know every detail of that language-specific annotation, or would like to expand the morphological/syntactic annotation a bit in compatible ways but without intervening on that language-specific documentation. I feel it does not make much sense that a whole treebank can be considered invalid for such small details, but warnings would be OK, also for further improvements.

As an aside, I report the fact that sometimes it is quite difficult to make out or find the correspondence between ISO codes and languages in UD, so maybe an indexing by ISO codes, a list of correspondences, and the specification of the code also on the various pages pertaining to that language (and on the main page) would be very welcome.

@dan-zeman
Member Author

  • it is well possible that a language appears for which there is no treebank in UD, and then every morphological feature is evaluated as an error. For example, in Latin treebanks we have Aramaic terms, so I would like to use the ISO code jpa, but as far as I know it is not present in UD. Or, there are Hebrew terms and I would like to use hbo for Ancient Hebrew: Modern Hebrew (heb) exists in UD, but it is simply not a correct choice.

This sounds reasonable to me. UD does not know Aramaic yet, and we probably don't want to add it just because of a few words in a Latin treebank. Although I'm wondering what principles you would follow when assigning Aramaic-specific features to the Aramaic word if there are no guidelines for Aramaic. There is still the default option for foreign words in UD, use UPOS=X and no features.

BTW, Ancient Hebrew (hbo) already is in UD. But that of course does not invalidate the issue in general.

@Stormur
Contributor

Stormur commented Oct 31, 2023

BTW, Ancient Hebrew (hbo) already is in UD. But that of course does not invalidate the issue in general.

Oh, I lost it! This relates to my last point moved in the new discussion.

The specific point is for example that I would like to add NameType tags to such words, but they are not (yet) defined for hbo (and most other languages). Or, I would like to add InflClass to Ancient Greek terms: this can be done "on the fly" in a very sensible way based on what is already defined for Latin. Anyway, both these "local extensions" do not impact in any way on the annotation of the treebanks of those other languages.

I would suggest that:

  • the validator should accept the annotation style of the treebank's language also for foreign terms (in addition to the style for that foreign language), possibly issuing a warning, but not invalidating it.

Although I'm wondering what principles you would follow when assigning Aramaic-specific features to the Aramaic word if there are no guidelines for Aramaic

I would stay, at least in an initial phase, on a very basic level, and I believe that some markings are quite indisputable, like Gender=Masc|Number=Sing for abba 'father', also by analogy with other related languages. At least, I think they should not be penalised but possibly just flagged for future corrections.

There is still the default option for foreign words in UD, use UPOS=X and no features.

This is really a last-resort option which cancels all the information that we might have and creates "holes" in the annotation. I would avoid it as much as possible.

@mr-martian
Contributor

The specific point is for example that I would like to add NameType tags to such words, but they are not (yet) defined for hbo (and most other languages). Or, I would like to add InflClass to Ancient Greek terms: this can be done "on the fly" in a very sensible way based on what is already defined for Latin. Anyway, both these "local extensions" do not impact in any way on the annotation of the treebanks of those other languages.

I would be open to adding NameType to hbo - I've been working on coreference annotations and could probably derive NameType from them without too much trouble. (For that matter, I'd also be open to adding InflClass to grc, though that requires more people to be on-board with it than just me.)

@amir-zeldes
Contributor

UD does not know Aramaic yet, and we probably don't want to add it just because of a few words in a Latin treebank

That's true, but if it's just about the origin of a loan word, we can also use OrigLang. UD Coptic has that and also includes Greek, Hebrew, Latin, and Aramaic words (OrigLang=arc), for example:

https://github.com/UniversalDependencies/UD_Coptic-Scriptorium/blob/master/cop_scriptorium-ud-dev.conllu#L4499

@Stormur
Contributor

Stormur commented Oct 31, 2023

UD does not know Aramaic yet, and we probably don't want to add it just because of a few words in a Latin treebank

That's true, but if it's just about the origin of a loan word, we can also use OrigLang. UD Coptic has that and also includes Greek, Hebrew, Latin, and Aramaic words (OrigLang=arc), for example:

https://github.com/UniversalDependencies/UD_Coptic-Scriptorium/blob/master/cop_scriptorium-ud-dev.conllu#L4499

Oh, this is nice and I think I am going for this one at the moment. But then one gets the complementary problem of using a feature for this language which is not part of the main language's annotation and thus still raises an error.

Anyway, what should be the rationale for preferring OrigLang over Lang, or vice versa?

I would be open to adding NameType to hbo - I've been working on coreference annotations and could probably derive NameType from them without too much trouble. (For that matter, I'd also be open to adding InflClass to grc, though that requires more people to be on-board with it than just me.)

That's great! But it would still be nice to be able to go a little further within one's own treebank! 😬

By the way, NameType is something that does not really fit morphological features and that should be moved at another level of annotation, and transitorily maybe in MISC. So for the moment I am keeping it there, curating it as far as possible and having some fun with it, but it is something that will need a major change. It has its niche interest, though.

@dan-zeman
Member Author

By the way, NameType is something that does not really fit morphological features and that should be moved at another level of annotation

Yes. Use Entity in MISC.

@Stormur
Contributor

Stormur commented Oct 31, 2023

By the way, NameType is something that does not really fit morphological features and that should be moved at another level of annotation

Yes. Use Entity in MISC.

For the time being I am leaving it there as a kind of "placeholder", but Entity will be the next step (2.14 milestone, assigned to self 😬 ).

@dan-zeman
Member Author

UD does not know Aramaic yet, and we probably don't want to add it just because of a few words in a Latin treebank

That's true, but if it's just about the origin of a loan word, we can also use OrigLang. UD Coptic has that and also includes Greek, Hebrew, Latin, and Aramaic words (OrigLang=arc), for example:
https://github.com/UniversalDependencies/UD_Coptic-Scriptorium/blob/master/cop_scriptorium-ud-dev.conllu#L4499

Oh, this is nice and I think I am going for this one at the moment. But then one gets the complementary problem of using a feature for this language which is not part of the main language's annotation and thus still raises an error.

Leaving this issue open to remind myself that eventually the validator should not check features if MISC Lang points to a language that does not yet have UD documentation.
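A minimal sketch of the intended behavior, assuming a set of language codes for which UD documentation exists. The set contents and the function name are hypothetical; this describes a planned change, not what the validator currently does.

```python
# Hypothetical set of languages that have UD documentation.
# Per this thread, hbo is in UD and jpa is not.
DOCUMENTED = {"la", "grc", "hbo"}


def should_check_features(main_lang, misc_lang=None):
    """Decide whether morphological features of a token should be validated.

    Features are checked only if the token's effective language (MISC Lang
    if present, otherwise the main language) has UD documentation.
    """
    lang = misc_lang or main_lang
    return lang in DOCUMENTED
```

In a Latin treebank, a token with `Lang=jpa` would then skip the feature check, while a token with `Lang=hbo` would still be validated against the rules for Ancient Hebrew.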

@dan-zeman dan-zeman modified the milestones: v2.13, v2.14 Nov 15, 2023
@dan-zeman dan-zeman modified the milestones: v2.14, v2.15 May 15, 2024
@dan-zeman dan-zeman self-assigned this Nov 16, 2024
@dan-zeman dan-zeman modified the milestones: v2.15, v2.16 Nov 16, 2024
7 participants