Filenames and other computery entities #666
If there are no spaces, I would keep ".doc" together with the main name in one token. Then it seems natural to treat the filename as one word with spaces, although personally I am not a big fan of words with spaces. The dot (adjacent to a letter on both sides) makes it recognizable as a validation exception; without extension, it would be tokenized and analyzed like movie/book titles. Or maybe we could do without words with spaces completely and only keep the last word together with ".doc" while the other words would be separate tokens. |
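The dot criterion in the comment above can be sketched in Python. This is only an illustration, not part of any UD validator: the name `looks_like_filename` and the ASCII-only pattern are assumptions made for the example.

```python
import re

# Hypothetical pattern: a dot with an ASCII letter directly on both sides,
# as in "Guaranty.doc" (a sentence-final period has no letter after it).
DOT_BETWEEN_LETTERS = re.compile(r"[A-Za-z]\.[A-Za-z]")


def looks_like_filename(token: str) -> bool:
    """Return True if the token contains a dot flanked by letters."""
    return DOT_BETWEEN_LETTERS.search(token) is not None
```

Note that a purely orthographic test like this also fires on abbreviations such as "e.g.", so it could at best flag candidates for an exception, not decide the analysis.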
I could see a case for using |
@amir-zeldes If you know that a given filename does not include a space, but there is a typo in the text (e.g. "auto exec.bat" or "~/.bash rc") then you can use |
I think you definitely obtain a better file name by deleting spaces :) But I see your point! |
Another question: should these be |
I think PROPN makes sense. In EWT xpos could also be either NNP or ADD, by analogy to URLs (I guess they are all like URIs?) |
Another reason to be skeptical about Here's another example:
# sent_id = email-enronsent32_01-0039
# text = - GPSA Guaranty.doc
1 - - PUNCT NFP _ 2 punct 2:punct _
2 GPSA gpsa NOUN GW _ 0 root 0:root _
3 Guaranty.doc guaranty.doc X NN Number=Sing 2 flat 2:flat _
The last part, "Guaranty.doc", has an odd combination of X and Number=Sing. Should it be treated as a NOUN? Should ".doc" be split off as a separate word and tagged as X? |
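To look for other tokens with the same X plus Number combination, a rough Python sketch could be used. The function names are hypothetical, and the code assumes token lines with exactly ten whitespace-separated columns as quoted in this thread; real CoNLL-U files are tab-separated and would need a proper reader.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_token_line(line: str) -> dict:
    """Split one 10-column token line into named fields (whitespace split is an assumption)."""
    cols = line.split()
    if len(cols) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
    return dict(zip(FIELDS, cols))


def forms_with_x_and_number(lines: list[str]) -> list[str]:
    """Return the forms of tokens tagged X that still carry a Number feature."""
    flagged = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        tok = parse_token_line(line)
        has_number = any(f.startswith("Number=") for f in tok["feats"].split("|"))
        if tok["upos"] == "X" and has_number:
            flagged.append(tok["form"])
    return flagged
```

Run on the three token lines of the sentence above, this would flag only "Guaranty.doc".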
What about parsing it as a single token? There's precedent for tokens with spaces in French for example when they represent a single concept |
What are some of the French examples? I was only aware of this being done for numbers where the space separator is merely for readability. |
I haven't had to deal with them before, but I think my inclination would be to use |
IMHO, better than |
I'm wary of removing the And I don't see any particular problem with noting e.g. that "Releases" in the long filename I posted above is a plural noun attaching as Curious to hear @dan-zeman's opinion. |
What if there were a requirement that |
Interesting idea...what would be the criterion for "and such"? :D I.e. what are the characteristics of expressions that this strategy should be used for, beyond filenames? |
I guess that could work for phone numbers too? |
Perhaps an inappropriate "and such" on my part, but I suppose that would cover any other tokens with spaces that aren't mistakes, though I have no examples ready to hand. |
So...named entities correctly including spaces but lacking regular internal syntax? I thought that's what flat was for—how to draw the boundary? |
No, that's not how I understood it - I thought the idea was to use it for things we consider to be single 'words', which I guess could be things that have a single lexical category. For example, I think phone numbers are just numbers, so they have the single category NUM, and if they happen to be spelled with internal spaces, we could use |
In general, how would we tell that though? If we're stepping away from the idea that wordhood, absent morphosyntactic cues, is defined by orthography, it seems like opening a can of worms....e.g. one could argue that a telephone number is made up of individual digits, each of which is in principle a word regardless of the spacing. Or one could argue that a foreign expression written with a space (et cetera) is actually a single word of English. In my interpretation, |
Hm, OK - I don't urgently need anything to happen here, but it sounded like this was already being done for numbers with spaces, so in as far as someone had a criterion for why they used spaces in tokens, I think it would be the same criterion applying to this suggestion. Concretely regarding filenames with spaces, they feel like the same sort of things as phone numbers with spaces to me. If a guideline is formulated which explicitly covers only phone numbers and files (or maybe URIs in general?), then I don't see the danger of a slippery slope. For me spaces in tokens are worse than almost any other solution! |
Yeah, maybe somebody can weigh in on what warranted that exception—I assume because it's routine in some orthographic styles to use spaces for thousands separators whereas we'd use commas, and numerals are so frequent in many genres that it would be cumbersome to break them up. But space separation for special numeric entities (like telephone numbers) does NOT warrant this exception. |
Every language can have some strange orthographic conventions for a couple of words. For instance, French has at least one: parce que 'because'. Nobody wants to have parce as a word, because it doesn't exist without que. But it would be costly and dangerous to relax the rule forbidding tokens with spaces just for this word and |
I find I also like the flexibility that if a file name has spaces and is tokenized into multiple tokens, these may or may not get morphological analysis depending on what makes more sense in individual cases. |
Exactly. Spaces in numbers are regulated by the standardized spelling in Czech (as well as some other languages). Telephone numbers are not (and some people, like me, use hyphens instead of spaces in them). But at least telephone numbers are still "numbers" (plus punctuation), so I would not mind treating them the same way as normal numbers if the latter already can have spaces in the language. I would definitely not treat alphanumeric file names this way. And if the language does not have an exception for numbers, I would cluster telephone numbers with file names. |
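One way to read the preference in the comment above is as a small decision rule. The Python sketch below is only that reading, not an agreed guideline; the function name, the `kind` labels, and the boolean flag are all hypothetical.

```python
def may_be_token_with_spaces(kind: str, language_allows_spaces_in_numbers: bool) -> bool:
    """Sketch of one reading of the comment above; not an official UD rule.

    kind is one of the hypothetical labels "number", "phone_number", "file_name".
    """
    if kind == "number":
        # Ordinary numbers follow the language-specific exception, if there is one.
        return language_allows_spaces_in_numbers
    if kind == "phone_number":
        # Phone numbers are still "numbers", so they go with ordinary numbers;
        # without the exception they cluster with file names.
        return language_allows_spaces_in_numbers
    if kind == "file_name":
        # Alphanumeric file names are never treated as one token with spaces.
        return False
    raise ValueError(f"unknown kind: {kind}")
```

Under this reading, a phone number written with spaces would stay one token in a language with the number exception, while a file name with spaces would always be split.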
I think the standard solution we already have for this is |
@dan-zeman But fixed is for MWEs, no? parce que is a word, not an MWE. It is a word written with a space. As I said, parce is not a word of French, just a strange orthographic form. |
No, fixed is precisely for words with spaces (not for MWEs in general).
|
Right, I think of the breakdown as follows:
|
I would be in favour of keeping them as tokens with internal spaces. If not, I am not sure we really want to use |
The Core Group discussed this and decided on
(In retrospect, perhaps instead of flat/fixed/goeswith it would have been better to have one relation for multi-token words and another relation for headless multi-word expressions. Something to consider for a potential UDv3.) |
I think we are completely losing the meaning of the UD syntactic relations, or at least I am completely lost. In the other way, By the way, I am also a bit confused by the answers of @dan-zeman, @jnivre, and @nschneid about |
With regard to Tokenisation over spaces would be the opposite and complementary option of multiword tokens. I think it might be very useful to recognise that spaces are actually often used to separate things which are at an intermediate level between what we identify as syntactic words and phrases, but, like punctuation marks, cannot be an ultimate tokenisation criterion themselves. If this really has an impact on current parsers needs to be investigated, but from a machine point of view a space is just a character like any other. I do not think I put forward a too strong interpretation of By the way, |
I think the point that “flat” indicates a construction but “goeswith” does not is a good one. I hadn’t thought of that. On the other hand, the main use of “goeswith” also carries the implication that it is accidental and erroneous, which doesn’t apply to the filename case (presumably), so one would have to decide which is the most important criterion.
When it comes to “fixed”, I do maintain that it should be restricted to “words with spaces”, as stated in the documentation, but its application across languages and treebanks is currently quite inconsistent. This is not least true about the Swedish treebanks, as pointed out by my colleague Lars Ahrenberg in a paper at this year’s UD workshop. In addition, I think there may be different conceptions of what a “word with spaces” is. You mention the example “parce que” in French and the fact that “parce” is only used in that combination. This is clearly a good indication that it is a word with spaces, but I don’t think the occurrence of such an element is a necessary condition.
Let me give the example of expressions referring to days in Swedish. The equivalent of “today” is “i dag” or “idag” (both orthographies are common and accepted as correct); the equivalent of “yesterday” is “i går” or “igår”. It so happens that “går” is like “parce”, that is, it only occurs in this combination (disregarding the homonymous verb form meaning “walk”), while “dag” is a regular noun meaning “day”. However, I would argue that both expressions are equally frozen in modern Swedish and should be analyzed as “fixed” when written with a space.
|
I think @sylvainkahane is suggesting a primary distinction between multi-token words (words-with-spaces) and headless phrases (where individual elements might be omissible, modifiable, etc.). That sounds perfectly sensible to me, it's just not what UDv2 has given its narrow definitions of Some treebanks are using If there is a UDv3 I do think the goeswith/fixed/flat relations should be reconsidered, also because many people expect the term "fixed" to cover morphosyntactically fixed expressions in general, whereas it is only intended for a small list of grammatical ones.
The current list of English |
I would like to express my support for @nschneid's suggestion that
It is obvious from this discussion that many long-time UD experts have different intuitions on how these relations should be used. And although the guidelines for fixed have been updated, they are still not detailed enough. What is actually meant by 'the most grammaticalized cases'? In the paper @jnivre refers to, I try to identify (in Swedish) what I call rigid expressions, i.e. those showing no variation at all. But they are still too numerous to qualify as 'a closed class'. I find interesting the comment by @sylvainkahane that he sees flat as a relation for headless constructions. The problem is that UD currently only recognizes one such construction, i.e. names. Currently, fixed is used for many expressions that have an internal head, such as ADP + NOUN, which we may call 'headed constructions' with the noun as the head even if it is non-determined. If UD keeps only one deprel for headless constructions, the distinction between names and fixed non-headed expressions (and typos) could instead be made with features, say in the MISC column. And with a feature for fixedness, the headed fixed expressions could have both their syntax annotated (with deprels) and their status as fixed expressions represented. |
The email genre of English-EWT lists file attachments, e.g. "Constellation Power (GISB draft).doc".
compound relations and an appos relation for the parenthetical. I'm not sure how ".doc" should attach: flat?