Use of mwe in Scandinavian languages #262

Open
jnivre opened this issue Feb 16, 2016 · 5 comments


jnivre commented Feb 16, 2016

The use of the "mwe" relation differs a lot between the Scandinavian treebanks. The frequency alone is telling:

UD_Swedish has 16.6 "mwe" relations per 1000 words.
UD_Danish has 4.9 "mwe" relations per 1000 words.
UD_Norwegian has 0 "mwe" relations per 1000 words.
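
For anyone who wants to reproduce this kind of figure, here is a minimal sketch of how a per-1000-words rate could be computed from CoNLL-U text. It is not necessarily how the numbers above were obtained: the function name `mwe_per_1000` and the toy sentence are illustrative assumptions, and the sketch counts only syntactic words (it skips multiword-token ranges and empty nodes).

```python
def mwe_per_1000(conllu_text: str, relation: str = "mwe") -> float:
    """Return the number of `relation` arcs per 1000 syntactic words."""
    words = 0
    hits = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        if not cols[0].isdigit():
            continue  # skip ranges such as "1-2" and empty nodes such as "1.1"
        words += 1
        # Column 8 (index 7) is DEPREL; drop any language-specific subtype.
        if cols[7].split(":")[0] == relation:
            hits += 1
    return 1000.0 * hits / words if words else 0.0


# Made-up four-word fragment, for illustration only (not taken from a treebank).
sample = (
    "# text = på grund av regn\n"
    "1\tpå\tpå\tADP\t_\t_\t4\tcase\t_\t_\n"
    "2\tgrund\tgrund\tNOUN\t_\t_\t1\tmwe\t_\t_\n"
    "3\tav\tav\tADP\t_\t_\t1\tmwe\t_\t_\n"
    "4\tregn\tregn\tNOUN\t_\t_\t0\troot\t_\t_\n"
)
print(mwe_per_1000(sample))
```

To apply it to a whole treebank, read each `.conllu` file as text and pass the concatenated contents to the function.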

The zero frequency in Norwegian is certainly by design, but the difference between Danish and Swedish is too large to be compatible with a consistent treatment of (fixed) MWEs. I suspect that the relation is overused in Swedish (and perhaps underused in Danish).

If we can come up with a core set of expressions that should be treated as "mwe" across the three languages, the same principles can also be applied to other languages.


jnivre commented Apr 11, 2016

Any hope of making progress on this before v1.3?


liljao commented Apr 11, 2016

Unfortunately, I don't think that is realistic for the Norwegian data, since it would require a large manual effort.


jnivre commented Apr 11, 2016

I agree. We should probably change the milestone then.

@hectormartinez

For Danish, we only inserted mwe relations for words that:
a) had been underscored_together in the original Copenhagen Dependency Treebank, and
b) worked as function words.

The only analysis we performed was to identify the form, lemma and UPOS of the components of each mwe, so that we could split them into syntactic-word tokens.
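
As a rough illustration of that splitting step (not the actual conversion code), the sketch below takes a sentence in which an MWE is still a single underscored token and expands it into its components. It assumes the head-initial "mwe" analysis, with later components attached to the first one. The `lexicon` mapping, the dictionary-based token layout, and the example entry are hypothetical; the real TSV file may be organised differently.

```python
from typing import Dict, List, Tuple

Token = Dict[str, object]          # keys: id, form, lemma, upos, head, deprel
Component = Tuple[str, str, str]   # (form, lemma, upos)


def split_mwe_tokens(sentence: List[Token],
                     lexicon: Dict[str, List[Component]]) -> List[Token]:
    """Expand underscored tokens into syntactic words linked by 'mwe'."""
    # First pass: work out the new ID of each original token (root stays 0).
    new_id = {0: 0}
    next_id = 1
    for tok in sentence:
        new_id[tok["id"]] = next_id
        next_id += len(lexicon.get(tok["form"], [None]))

    # Second pass: emit the tokens with renumbered IDs and heads.
    out: List[Token] = []
    for tok in sentence:
        first = new_id[tok["id"]]
        parts = lexicon.get(tok["form"])
        if parts is None:
            out.append({**tok, "id": first, "head": new_id[tok["head"]]})
            continue
        for offset, (form, lemma, upos) in enumerate(parts):
            if offset == 0:
                # The first component inherits the original attachment.
                head, deprel = new_id[tok["head"]], tok["deprel"]
            else:
                # Later components attach to the first one with 'mwe'.
                head, deprel = first, "mwe"
            out.append({"id": first + offset, "form": form, "lemma": lemma,
                        "upos": upos, "head": head, "deprel": deprel})
    return out


# Hypothetical lexicon entry and sentence, for illustration only.
lexicon = {"på_grund_af": [("på", "på", "ADP"),
                           ("grund", "grund", "NOUN"),
                           ("af", "af", "ADP")]}
sentence = [
    {"id": 1, "form": "aflyst", "lemma": "aflyse", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 2, "form": "på_grund_af", "lemma": "på_grund_af", "upos": "ADP", "head": 3, "deprel": "case"},
    {"id": 3, "form": "regn", "lemma": "regn", "upos": "NOUN", "head": 1, "deprel": "nmod"},
]
for tok in split_mwe_tokens(sentence, lexicon):
    print(tok["id"], tok["form"], tok["upos"], tok["head"], tok["deprel"], sep="\t")
```

The two-pass design keeps head indices consistent when a token that comes later in the sentence is the head of a split token.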

Here is a Dropbox link to the file we created for the conversion; it may be a useful reference:

https://www.dropbox.com/s/e2pv7gr7i8mrc3j/danishCDT_2_UD_mwe-info.tsv?dl=0



jnivre commented Apr 13, 2016

@hectormartinez: Thanks! This is very useful. At some point, we should definitely try to harmonise this across the three languages (to begin with), but there is no way we can do this for 1.3. I will change the milestone to postpone this until 1.4.

jnivre modified the milestones: later, lg-specific v1.3 (Apr 13, 2016)