Specifying foreign language in case of code switching #776
Comments
In the long run this will also be useful for our Cappadocian Greek (CG) Corpus (AMGiC). Right now, tentatively, I have marked all elements of Turkish origin with #. I am just wondering whether it is theoretically justifiable to treat as "code-switching" any lexical item of Turkish origin that is already incorporated into CG as an integral component. This is actually something we will have to examine for our corpus before we move on. |
I suppose there is a scale and you will have to decide for yourself where exactly to draw the line. But if a foreign word has been integrated in the host language, then it is a loanword rather than code switching, that is, we treat it the same way as native words. (You could still define another MISC attribute that would mark the origin of loanwords, but it would not affect validation. And vice versa, if you mark a foreign word with |
This same question applies to UD_Coptic: we actually have extensive source language annotations for all loan words in the data, but 99% are what you would call integrated (so just individual loan words, not a Greek sentence in the middle of everything). We've considered making them |
This is a very welcome development, and with the Komi treebanks we are more than happy to harmonize the tags. I'll change |
Yes, if it takes target language morphology (which is different from the source language morphology), then we have to treat it as a target language word. May I propose |
It is a very nice addition! We have recently uploaded a new Latin treebank where there are quite a lot of "foreign" words and pieces of sentences, and it would be nice to annotate them, too. Also, in general, in Classical Latin texts it is not so uncommon to have entire phrases in Greek, and given their "morphological proximity" it would be nice to integrate their annotation in the treebank, too. Now, is this "code switching" limited to languages already present in UD? I wonder if it is also possible to use other codes, and if there is room for dialectal and diachronic variation as well (and maybe both at the same time). |
Dialectal and diachronic variation is occasionally supported (with an unsatisfactory level of granularity) by the ISO codes, so you can distinguish Old French ( |
OK, I'm ready to add this for UD_Coptic, but the validator says:
I thought |
Here. But I enabled the feature for all UPOS categories in Coptic now, so the errors should disappear. The feature indeed is known because it is documented globally. But even universal features are now configured individually because some languages do not use some of them (e.g., |
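As a rough illustration of why a globally documented feature can still be rejected for one language, the per-language configuration can be pictured as a lookup table. This is only a sketch, not the validator's actual code or data format: the table `PERMITTED`, the function `feature_permitted`, and the use of `Foreign=Yes` as the example feature are assumptions made for the example.

```python
# The 17 universal POS tags defined by UD.
UPOS_TAGS = frozenset(
    "ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM VERB X".split()
)

# Hypothetical permission table (invented for this sketch):
# language code -> (feature, value) -> set of UPOS tags it may occur with.
PERMITTED = {
    # Mirrors "enabled for all UPOS categories in Coptic".
    "cop": {("Foreign", "Yes"): set(UPOS_TAGS)},
}


def feature_permitted(lang, upos, feature, value):
    """Return True if feature=value is enabled for this language and UPOS."""
    allowed_upos = PERMITTED.get(lang, {}).get((feature, value), set())
    return upos in allowed_upos
```

With a table like this, a feature that is documented universally still fails for any language whose entry does not list it, which is the situation described above.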
Perfect, thanks! UD_Coptic now has BTW I think I asked about this in some issue somewhere, but is there a recommendation on how to handle 'partly' loaned words? These all got the same Foreign=Yes right now, but we can actually distinguish partly foreign words in the source data for the corpus, so we could make the distinction. I mean derivations or incorporations like "Zeitgeistyness" in English (German base, English derivations). There are quite a few of these in the data. |
I don't know about any recommendation in the guidelines, although there might be some discussion somewhere in the issues. For me, Zeitgeistyness is an English word (because it is no longer German), i.e., |
I don't need for these words to be called |
For the time being, i.e. tentatively, in AMGiC we mark elements of Turkish origin occurring in Cappadocian Greek with a LC = YES tag (with LC standing for 'Language Contact'), followed in turn by a tag defining the grammatical status of the loanword (we focus on morphosyntactic borrowing). |
Dan, thank you for bringing up the topic; here is my attempt at harmonisation. Sorry for being late to the party; on the bright side, it gives plenty of time for discussion and implementation before the next release.
The Turkish-German SAGT treebank uses language IDs adapted from the Code-Switching Workshop shared tasks on language identification. Each token bears one of the following labels as the value of the TR: Turkish
Ideally, I want to keep the information these labels convey and follow a UD-general scheme.
Option 1: Renaming
In the example sentence "In dem dritten Semesterda Java gelecek" (In the third semester Java will come), the German "Semester" takes the Turkish locative case marker "da".
In mixed words we decided to follow the morphological features of the head's language (in this example, according to German rules, Case would be The SAGT treebank has its own language code
There could also be another issue. The validator also checks the possible deprel list of a language, right?
Option 2: Renaming
|
I did not remember whether the On the other hand, morphological features are language-specific. If I add I find both the options you propose as reasonable. Option 2 would cost more space but it would allow you to preserve more faithfully your original scheme, which may be desirable if this scheme is already a de-facto standard in code-switching NLP. I did not assume that the |
Thanks for checking the validator. I checked Frisian-Dutch and Hindi-English treebanks which identify themselves as code-switching treebanks. They both assign a language ID for each token. In code-switching NLP, language ID prediction is a common task, therefore there is the tradition of annotating each token (although the tag sets differ). For monolingual treebanks with code-switching, it makes sense that not all tokens have a I will go with Option 2 at the moment, mainly to keep language IDs for each token and to keep the OTHER class intact. It also allows backwards compatibility with the language identification and morphological analysis models we released. |
Coming back to this issue with regard to the interaction between language specification and validation. I would like to propose that the validator only issue warnings instead of errors for words tagged with
As an aside, it is sometimes quite difficult to find the correspondence between ISO codes and languages in UD, so an index by ISO code, a list of correspondences, and the specification of the code on the various pages pertaining to each language (and on the main page) would be very welcome. |
This sounds reasonable to me. UD does not know Aramaic yet, and we probably don't want to add it just because of a few words in a Latin treebank. Although I'm wondering what principles you would follow when assigning Aramaic-specific features to the Aramaic word if there are no guidelines for Aramaic. There is still the default option for foreign words in UD, use UPOS= BTW, Ancient Hebrew ( |
Oh, I lost it! This relates to my last point moved in the new discussion. The specific point is for example that I would like to add I would suggest that:
I would stay, at least in an initial phase, on a very basic level, and I believe that some markings are quite indisputable, like
This is really a last-resort option which cancels all the information that we might have and creates "holes" in the annotation. I would avoid it as much as possible. |
I would be open to adding |
That's true, but if it's just about the origin of a loan word, we can also use |
Oh, this is nice and I think I am going for this one at the moment. But then one gets the complementary problem of using a feature for this language which is not part of the main language's annotation and thus still raises an error. Anyway, what should be the rationale for preferring
That's great! But it would still be nice to be able to go a little bit further within one's own treebank! 😬 By the way, |
Yes. Use Entity in MISC. |
For the time being I am leaving it there as a kind of "placeholder", but |
Leaving this issue open to remind myself that eventually the validator should not check features if MISC |
Sometimes it is desirable to be able to say that a token is in a language different from the main language of the file, and to specify the foreign language. Some corpora have occasional code switching, others have a lot of it. And if the annotators decide to actually annotate the foreign segment following the foreign language rules, the validator needs to know that it should temporarily switch to a different list of auxiliaries and morphological features.
I have now modified the validator so that if it sees Lang=en in the MISC column, it will switch to English for the current token (it affects the auxiliary list, copula list, and feature-value-UPOS combinations). The value is the ISO 639 code as registered for the language in UD (either two-letter ISO 639-1, or three-letter ISO 639-3); it must be lowercased. I have also documented it here, here, and here. So if you need to specify the foreign language for the validator, you can now do so.
Nevertheless, I also found out that various treebanks already try to indicate the language in MISC, and given the lack of standard so far, various approaches are taken:
- Lang=Mixed, Lang=Rus (the values are not UD-registered ISO codes).
- lang=fy, lang=nl (the name of the attribute is not capitalized).
- LangID=TR, LangID=DE, LangID=OTHER (different attribute name; uppercase language codes will not be recognized).
- hi or en, without saying that these are language codes.
It would be nice if we could harmonize these annotations in future releases, although not every treebank needs the validator to recognize them (e.g., the massively code-switching treebanks are registered under special user-defined language codes, such as qtd for Turkish-German, so they have their own set of language-specific guidelines).
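The switching rule and the non-standard variants listed above can be sketched in a few lines of Python. This is a minimal illustration, not the validator's actual code: the function names `parse_misc`, `token_language`, and `language_attribute_problems` are hypothetical, and the sketch only checks the shape of the code (two or three lowercase letters), whereas the real check is against the codes registered in UD.

```python
import re

# Shape check only: two-letter (ISO 639-1) or three-letter (ISO 639-3),
# lowercase. Whether the code is registered in UD is NOT checked here.
LANG_CODE = re.compile(r"[a-z]{2,3}")


def parse_misc(misc):
    """Split a CoNLL-U MISC value into a dict; '_' means the column is empty."""
    if misc == "_":
        return {}
    pairs = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        if sep:  # skip items that are not Key=Value pairs
            pairs[key] = value
    return pairs


def token_language(misc, default_lang):
    """Return the language whose rules apply to the token with this MISC."""
    value = parse_misc(misc).get("Lang")
    if value is not None and LANG_CODE.fullmatch(value):
        return value
    return default_lang


def language_attribute_problems(misc):
    """List reasons why a language annotation in MISC would go unrecognized."""
    problems = []
    for key, value in parse_misc(misc).items():
        if key == "Lang":
            if not LANG_CODE.fullmatch(value):
                problems.append(f"value {value!r} is not a lowercase 2- or 3-letter code")
        elif key.lower() in ("lang", "langid"):
            problems.append(f"attribute {key!r} is not recognized; expected 'Lang'")
    return problems
```

For example, `token_language("SpaceAfter=No|Lang=en", "cop")` selects English for that one token, while `lang=fy`, `LangID=TR`, and `Lang=Rus` all fall back to the main language of the file and are reported by `language_attribute_problems`.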