Filenames and other computery entities #666

nschneid · 2019-10-31T15:09:11Z

The email genre of English-EWT lists file attachments, e.g. "Constellation Power (GISB draft).doc".

Should filenames always be tokenized into discernible linguistic words ("ConstellationPower(GISB_draft).doc"), or only when there are spaces?
- What about filesystem paths and URLs containing spaces?
- Presumably we would never tokenize email addresses, hashtags, or variable names in code, as these never contain spaces.

To what extent should annotators attempt to infer internal structure, as in titles of artistic works (Long titles of works of art #664)? E.g. the above could include two compound relations and an appos relation for the parenthetical. I'm not sure how ".doc" should attach: as flat?

dan-zeman · 2019-11-09T04:32:29Z

If there are no spaces, I would keep ".doc" together with the main name in one token.

Then it seems natural to treat the filename as one word with spaces, although personally I am not a big fan of words with spaces. The dot (adjacent to a letter on both sides) makes it recognizable as a validation exception; without extension, it would be tokenized and analyzed like movie/book titles.

Or maybe we could do without words with spaces completely and only keep the last word together with ".doc" while the other words would be separate tokens.

amir-zeldes · 2019-11-11T17:58:38Z

I could see a case for using goeswith here - if you believe that filenames are 'single words' then in some sense they should be spelled together, but there is a space here. So it's somewhat similar to a single word broken up into two tokens because of a space?

martinpopel · 2019-11-12T07:27:56Z

@amir-zeldes If you know that a given filename does not include a space, but there is a typo in the text (e.g. "auto exec.bat" or "~/.bash rc") then you can use goeswith. However, nowadays there are many filenames containing spaces (e.g. "Constellation Power (GISB draft).doc" mentioned by @nschneid) and I think we should not use goeswith here. We should not break the rule that goeswith is reserved only for text that is not well edited and that by deleting the extra space you obtain a better edited text.

amir-zeldes · 2019-11-12T14:46:22Z

I think you definitely obtain a better file name by deleting spaces :)

But I see your point!

nschneid · 2021-01-01T16:41:30Z

Another question: should these be PROPN?

amir-zeldes · 2021-01-03T16:54:31Z

I think PROPN makes sense. In EWT xpos could also be either NNP or ADD, by analogy to URLs (I guess they are all like URIs?)

nschneid · 2024-05-25T13:47:54Z

Another reason to be skeptical about goeswith is that filenames-with-spaces are compositional and we don't think of them as having a single lemma in the language. So I think flat is the better choice.

Here's another example:

# sent_id = email-enronsent32_01-0039
# text = - GPSA Guaranty.doc
1	-	-	PUNCT	NFP	_	2	punct	2:punct	_
2	GPSA	gpsa	NOUN	GW	_	0	root	0:root	_
3	Guaranty.doc	guaranty.doc	X	NN	Number=Sing	2	flat	2:flat	_

The last part, "Guaranty.doc", has an odd combination of X and Number=Sing. Should it be treated as a NOUN? Should ".doc" be split off as a separate word and tagged as X?
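To make the odd combination easier to spot across a treebank, here is a minimal sketch that reads the fragment above and lists tokens tagged X that still carry morphological features. The helper names (`parse_tokens`, `x_with_feats`) are hypothetical, and the check only illustrates the question raised here; it is not a rule taken from the UD validator.

```python
# Sketch: flag tokens whose UPOS is X but whose FEATS column is not empty.
CONLLU = """\
# sent_id = email-enronsent32_01-0039
# text = - GPSA Guaranty.doc
1\t-\t-\tPUNCT\tNFP\t_\t2\tpunct\t2:punct\t_
2\tGPSA\tgpsa\tNOUN\tGW\t_\t0\troot\t0:root\t_
3\tGuaranty.doc\tguaranty.doc\tX\tNN\tNumber=Sing\t2\tflat\t2:flat\t_
"""

# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_tokens(text):
    """Yield one dict per token line, skipping comments and blank lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        yield dict(zip(FIELDS, line.split("\t")))


def x_with_feats(text):
    """Return (id, form, feats) for tokens tagged X with non-empty FEATS."""
    return [(t["id"], t["form"], t["feats"])
            for t in parse_tokens(text)
            if t["upos"] == "X" and t["feats"] != "_"]


print(x_with_feats(CONLLU))
```

On this fragment the only token reported is "Guaranty.doc", which is the case in question.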

AngledLuffa · 2024-05-27T03:26:35Z

What about parsing it as a single token? There's precedent for tokens with spaces, in French for example, when they represent a single concept.

nschneid · 2024-05-27T11:21:13Z

What are some of the French examples? I was only aware of this being done for numbers where the space separator is merely for readability.

amir-zeldes · 2024-05-28T17:37:40Z

I haven't had to deal with them before, but I think my inclination would be to use goeswith but without Typo=Yes. In other words, I think of them as single tokens that unfortunately happen to have spaces, so they need to be linked with goeswith. Normally this is the result of a typo (space in the middle of a word), but in this case I wouldn't say it's a 'mistake', so I would just refrain from using Typo. I'm aware the goeswith guidelines say it's for badly spelled text, but I would prefer to extend the documentation to include files with spaces, rather than have multiple 'true tokens' with tags and deprels in there.

arademaker · 2024-05-28T17:44:24Z

IMHO, better than flat!

nschneid · 2024-05-28T17:45:38Z

I'm wary of removing the Typo=Yes requirement that we established for goeswith as (1) it's a reversal of a guidelines amendment and (2) it would create confusion as to whether Typo=Yes is appropriate for the vast majority of goeswith units (if it can't be checked for, people will forget to provide it).

And I don't see any particular problem with noting e.g. that "Releases" in the long filename I posted above is a plural noun attaching as flat (I am guessing; seems more likely than VERB). Words that are hard to decide a tag for can simply be X in this context.

Curious to hear @dan-zeman's opinion.
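The point about checkability can be illustrated with a small sketch. It assumes the convention that Typo=Yes is marked on the head of a goeswith unit; the function name and the simplified row format are hypothetical and do not reflect the actual UD validator.

```python
def goeswith_heads_missing_typo(rows):
    """rows: (id, form, feats, head, deprel) tuples for one sentence.

    Returns the ids of tokens that head a goeswith relation but do not
    carry Typo=Yes (assumed convention: the feature sits on the head).
    """
    feats_by_id = {tok_id: feats for tok_id, _, feats, _, _ in rows}
    missing = []
    for _, _, _, head, deprel in rows:
        if deprel != "goeswith":
            continue
        head_feats = feats_by_id.get(head, "_").split("|")
        if "Typo=Yes" not in head_feats and head not in missing:
            missing.append(head)
    return missing


# A stray space in a name that has none ("auto exec.bat"): passes the check.
typo_case = [(1, "auto", "Typo=Yes", 0, "root"),
             (2, "exec.bat", "_", 1, "goeswith")]

# A filename that legitimately contains a space, annotated with goeswith
# but without Typo=Yes: flagged by the check.
filename_case = [(1, "GPSA", "_", 0, "root"),
                 (2, "Guaranty.doc", "_", 1, "goeswith")]

print(goeswith_heads_missing_typo(typo_case))
print(goeswith_heads_missing_typo(filename_case))
```

If goeswith were allowed without Typo=Yes for filenames, a check of this kind could no longer distinguish a deliberate omission from a forgotten feature.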

mr-martian · 2024-05-28T17:54:35Z

What if there was a requirement that Typo= accompany goeswith, but filenames and such were marked with Typo=No?

nschneid · 2024-05-28T17:58:35Z

Interesting idea...what would be the criterion for "and such"? :D I.e. what are the characteristics of expressions that this strategy should be used for, beyond filenames?

amir-zeldes · 2024-05-28T18:00:19Z

I guess that could work for phone numbers too?

mr-martian · 2024-05-28T18:00:19Z

Perhaps an inappropriate "and such" on my part, but I suppose that would cover any other tokens with spaces that aren't mistakes, though I have no examples ready to hand.

nschneid · 2024-05-28T18:06:06Z

So...named entities correctly including spaces but lacking regular internal syntax? I thought that's what flat was for—how to draw the boundary?

amir-zeldes · 2024-05-28T18:12:51Z

No, that's not how I understood it - I thought the idea was to use it for things we consider to be single 'words', which I guess could be things that have a single lexical category. For example, I think phone numbers are just numbers, so they have the single category NUM, and if they happen to be spelled with internal spaces, we could use goeswith to mean we think they are functioning as a single lexical item, but use Typo=No to indicate the spelling with space is expected/canonical.

nschneid · 2024-05-28T18:24:56Z

> I thought the idea was to use it [broadened goeswith] for things we consider to be single 'words'

In general, how would we tell that though? If we're stepping away from the idea that wordhood, absent morphosyntactic cues, is defined by orthography, it seems like opening a can of worms....e.g. one could argue that a telephone number is made up of individual digits, each of which is in principle a word regardless of the spacing. Or one could argue that a foreign expression written with a space (et cetera) is actually a single word of English.

In my interpretation, flat and X already give us the fudge factor we need to deal with real data. Introducing an entirely new kind of wordhood seems risky unless there is a clear test.

amir-zeldes · 2024-05-28T18:34:11Z

Hm, OK - I don't urgently need anything to happen here, but it sounded like this was already being done for numbers with spaces, so insofar as someone had a criterion for why they used spaces in tokens, I think the same criterion would apply to this suggestion.

Concretely regarding filenames with spaces, they feel like the same sort of things as phone numbers with spaces to me. If a guideline is formulated which explicitly covers only phone numbers and files (or maybe URIs in general?), then I don't see the danger of a slippery slope. For me spaces in tokens are worse than almost any other solution!

nschneid · 2024-05-28T18:43:55Z

> numbers with spaces

Yeah, maybe somebody can weigh in on what warranted that exception—I assume because it's routine in some orthographic styles to use spaces for thousands separators whereas we'd use commas, and numerals are so frequent in many genres that it would be cumbersome to break them up.

But space separation for special numeric entities (like telephone numbers) does NOT warrant this exception.

sylvainkahane · 2024-05-29T13:43:42Z

Every language can have some strange orthographic conventions for a couple of words. For instance, French has at least one: parce que 'because'. Nobody wants to have parce as a word, because it doesn't exist without que. But it would be costly and dangerous to relax the rule forbidding tokens with spaces just for this word and goeswith and Typo=No could be a better solution, I think.

dan-zeman · 2024-05-29T15:52:00Z

> Curious to hear @dan-zeman's opinion.

I find flat better than goeswith. Also, if flat is the policy, it will require just a small clarification somewhere, while if goeswith is the policy, it will be an amendment and we will have to carefully scan the guidelines for places that talk about goeswith and say it is used only for ill-edited text.

I also like the flexibility that if file name has spaces and is tokenized into multiple tokens, these may or may not get morphological analysis depending on what makes more sense in individual cases.

dan-zeman · 2024-05-29T15:59:39Z

> > numbers with spaces
>
> Yeah, maybe somebody can weigh in on what warranted that exception - I assume because it's routine in some orthographic styles to use spaces for thousands separators whereas we'd use commas, and numerals are so frequent in many genres that it would be cumbersome to break them up.
>
> But space separation for special numeric entities (like telephone numbers) does NOT warrant this exception.

Exactly. Spaces in numbers are regulated by the standardized spelling in Czech (as well as some other languages). Telephone numbers are not (and some people, like me, use hyphens instead of spaces in them). But at least telephone numbers are still "numbers" (plus punctuation), so I would not mind treating them the same way as normal numbers if the latter already can have spaces in the language. I would definitely not treat alphanumeric file names this way. And if the language does not have an exception for numbers, I would cluster telephone numbers with file names.

dan-zeman · 2024-05-29T16:00:51Z

> Every language can have some strange orthographic conventions for a couple of words. For instance, French has at least one: parce que 'because'. Nobody wants to have parce as a word, because it doesn't exist without que. But it would be costly and dangerous to relax the rule forbidding tokens with spaces just for this word and goeswith and Typo=No could be a better solution, I think.

I think the standard solution we already have for this is fixed. No need for goeswith here.

sylvainkahane · 2024-05-29T16:55:45Z

@dan-zeman But fixed is for MWEs, no? parce que is a word, not an MWE. It is a word written with a space. As I said, parce is not a word of French, just a strange orthographic form.

jnivre · 2024-05-29T17:08:10Z

No, fixed is precisely for words with spaces (not for MWEs in general).

nschneid · 2024-05-29T17:25:37Z

Right, I think of the breakdown as follows:

- goeswith is for incorrectly added spaces (typos)
- fixed is for words-with-spaces (grammatical elements of a language that are conventionally spelled with a space for historical reasons)
- flat is for other cases where no single syntactic head can be identified; typical examples are named entities that aren't structured by general syntactic relations, foreign/borrowed phrases, and repetitions or sound sequences.
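The three-way breakdown above can be sketched as a simple decision procedure. The function name and its two boolean flags are hypothetical; they stand in for annotator judgments and are not properties that can be read off the text automatically.

```python
def multi_token_relation(spurious_space: bool, grammaticalized: bool) -> str:
    """Pick a relation for a headless multi-token unit.

    spurious_space:  the space is a typo; deleting it gives better-edited text.
    grammaticalized: the unit is a grammatical element that is conventionally
                     spelled with a space.
    """
    if spurious_space:
        return "goeswith"
    if grammaticalized:
        return "fixed"
    # Everything else without an identifiable syntactic head.
    return "flat"


# Examples taken from this thread, with the judgments supplied by hand.
print(multi_token_relation(True, False))   # "auto exec.bat" written for autoexec.bat
print(multi_token_relation(False, True))   # "parce que"
print(multi_token_relation(False, False))  # "Constellation Power (GISB draft).doc"
```

The order of the tests matters only in the sense that a typo is checked first; under this breakdown a filename that legitimately contains spaces falls through to flat.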

Stormur · 2024-05-30T16:47:47Z

I would be in favour of keeping them as tokens with internal spaces. If not, I am not sure we really want to use flat, since this would mean that we would always have to analyse all the elements of all such file names as if they were "actual words". This seems really difficult to me, as these strings are mostly placeholders which occasionally contain strings looking like well-formed phrases, but this is misleading. For this reason, as discussed under another issue, I would argue for SYM as their part of speech. In this context, fixed might be the better choice in the end, even if in my personal opinion it seems to say something different from a token with spaces.

nschneid · 2024-05-31T01:05:58Z

The Core Group discussed this and decided on flat. I understand there is a concern about treating a filename as having multiple words that are in some sense linguistically independent units, but I think that's too strong of an interpretation of flat. Like fixed for grammatical expressions and goeswith for misspellings, flat can apply in some cases where the morphosyntactic notion of word contains multiple tokens per the tokenization. And tokenizing on (at minimum) spaces is a very strong convention for languages where the primary function of spaces is to show a word boundary.

X is available for the UPOS of tokens regarded as something smaller than a syntactic word (or not an "actual word", in line with @Stormur's concern). At the discretion of treebanks, a filename might be analyzed as containing some recognizable words with substantive UPOS/feats, or they might all be labeled X. The syntactic category of the whole filename can be signaled with ExtPos=PROPN.
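As a rough illustration of this decision, the sketch below builds CoNLL-U rows for a filename that contains spaces. It assumes the simplest option mentioned above: tokenize on spaces only, tag every part X, attach later parts to the first one as flat, and put ExtPos=PROPN in the FEATS column of the first part. The function name, the choice to copy the form into the lemma column, and the decision not to split punctuation such as parentheses are all assumptions made for the example, not treebank policy.

```python
def filename_rows(filename, start_id=1, unit_head=0, unit_deprel="root"):
    """Return CoNLL-U token lines for a filename, split on spaces only.

    The first part heads the unit and attaches to `unit_head` with
    `unit_deprel`; every later part attaches to the first part as flat.
    """
    parts = filename.split()
    rows = []
    for offset, form in enumerate(parts):
        if offset == 0:
            # ExtPos marks the category of the whole multi-token unit,
            # so it is only added when there is more than one part.
            feats = "ExtPos=PROPN" if len(parts) > 1 else "_"
            head, deprel = unit_head, unit_deprel
        else:
            feats, head, deprel = "_", start_id, "flat"
        columns = [str(start_id + offset), form, form, "X", "_",
                   feats, str(head), deprel, "_", "_"]
        rows.append("\t".join(columns))
    return rows


for row in filename_rows("GPSA Guaranty.doc"):
    print(row)
```

A treebank that prefers substantive tags for recognizable words would replace the uniform X with per-token UPOS and features while keeping the same flat structure.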

(In retrospect, perhaps instead of flat/fixed/goeswith it would have been better to have one relation for multi-token words and another relation for headless multi-word expressions. Something to consider for a potential UDv3.)

sylvainkahane · 2024-05-31T08:54:55Z

I think we are completely losing the meaning of the UD syntactic relations, or at least I am completely lost. flat is used for headless constructions, such as the "first name - second name" construction. They are particular constructions in the sense of CxG, for instance. It is true that flat:foreign is also used for foreign expressions, and in this case does not really refer to a headless construction, but OK. For the cases we are discussing here, I don't think they are headless constructions in any acceptable sense.

On the other hand, goeswith means 'goes with', that is, two tokens that should be together. It can be because of a misspelling or, as proposed, because of a strange orthographic convention. Contrary to flat, goeswith clearly indicates that there is no construction in this case. I think we should clearly separate dependency labels referring to syntactic constructions from non-linguistic dependency labels.

By the way, I am also a bit confused by @dan-zeman, @jnivre, @nschneid answers about fixed, saying that "fixed is for words-with-spaces". Do you really consider that whether or not, according to, all but, etc. are words of English? (see https://universal.grew.fr/?custom=66598aedccf83).

Stormur · 2024-05-31T09:48:17Z

With regard to fixed, there clearly is a problem in how it is used more than in how it is defined.

Tokenisation over spaces would be the opposite and complementary option of multiword tokens. I think it might be very useful to recognise that spaces are actually often used to separate things which are at an intermediate level between what we identify as syntactic words and phrases, but, like punctuation marks, cannot be an ultimate tokenisation criterion themselves. Whether this really has an impact on current parsers needs to be investigated, but from a machine point of view a space is just a character like any other.

I do not think I put forward too strong an interpretation of flat: it is defined to be used for "flat" phrases, so it entails a linguistic interpretation. A filename has no such interpretation, and neither does an email address, a phone number, or any number expressed by means of symbols... so I think it should be avoided, because a file name, i.e. a single block of alphanumeric + other characters, is really different from a personal name with many components, which all by themselves are morphosyntactically analysable words.

By the way, flat is dangerously close to conj up to the point one wonders where the difference is, but this is another story...

jnivre · 2024-05-31T10:15:55Z

I think the point that “flat” indicates a construction but “goeswith” does not is a good one. I hadn’t thought of that. On the other hand, the main use of “goeswith” also carries the implication that it is accidental and erroneous, which doesn’t apply to the filename case (presumably), so one would have to decide which is the most important criterion.

When it comes to “fixed”, I do maintain that it should be restricted to “words with spaces”, as stated in the documentation, but its application across languages and treebanks is currently quite inconsistent. This is not least true of the Swedish treebanks, as pointed out by my colleague Lars Ahrenberg in a paper at this year’s UD workshop. In addition, I think there may be different conceptions of what a “word with spaces” is. You mention the example “parce que” in French and the fact that “parce” is only used in that combination. This is clearly a good indication that it is a word with spaces, but I don’t think the occurrence of such an element is a necessary condition.

Let me give the example of expressions referring to days in Swedish. The equivalent of “today” is “i dag” or “idag” (both orthographies are common and accepted as correct); the equivalent of “yesterday” is “i går” or “igår”. It so happens that “går” is like “parce”, that is, it only occurs in this combination (disregarding the homonymous verb form meaning “walk”), while “dag” is a regular noun meaning “day”. However, I would argue that both expressions are equally frozen in modern Swedish and should be analyzed as “fixed” when written with a space.

nschneid · 2024-05-31T11:35:41Z

I think @sylvainkahane is suggesting a primary distinction between multi-token words (words-with-spaces) and headless phrases (where individual elements might be omissible, modifiable, etc.). That sounds perfectly sensible to me; it's just not what UDv2 has, given its narrow definitions of goeswith and fixed and its broad definition of flat.

Some treebanks are using flat:foreign as a way to acknowledge that foreign expressions are a bit different in this regard from the flat expressions that are headless phrases. What about another subtype that would apply to the telephone numbers and filenames, e.g. flat:mtw for "multi-token word"?

If there is a UDv3 I do think the goeswith/fixed/flat relations should be reconsidered, also because many people expect the term "fixed" to cover morphosyntactically fixed expressions in general, whereas it is only intended for a small list of grammatical ones.

> By the way, I am also a bit confused by @dan-zeman, @jnivre, @nschneid answers about fixed, saying that "fixed is for words-with-spaces". Do you really consider that whether or not, according to, all but, etc. are words of English? (see https://universal.grew.fr/?custom=66598aedccf83).

The current list of English fixed expressions is documented here. It is largely inherited from the Stanford Dependencies annotation of EWT, and there are definitely debatable cases in this list, as well as others that maybe should be added to the list (UniversalDependencies/UD_English-EWT#400). I'm happy to discuss those separately, but for purposes of the present discussion, we should go by the universal definition at https://universaldependencies.org/u/dep/fixed.html.

LarsAhrenberg · 2024-05-31T14:15:47Z

I would like to express my support for @nschneid's suggestion that

> If there is a UDv3 I do think the goeswith/fixed/flat relations should be reconsidered,

It is obvious from this discussion that so many long-time UD experts have different intuitions on how these relations should be used. And although the guidelines for fixed have been updated they are still not detailed enough. What is actually meant by 'the most grammaticalized cases'? In the paper @jnivre refers to, I try to identify (in Swedish) what I call rigid expressions, i.e. those showing no variation at all. But they are still too numerous to qualify as 'a closed class'.

I find the comment by @sylvainkahane that he sees flat as a relation for headless constructions interesting. The problem is that UD currently only recognizes one such construction, i.e. names. Currently, fixed is used for many expressions that have an internal head, such as ADP + NOUN, which we may call 'headed constructions' with the noun as the head even if it is non-determined. If UD keeps only one deprel for headless constructions, the distinction between names and fixed non-headed expressions (and typos) could instead be made with features, say in the MISC column. And with a feature for fixedness, the headed fixed expressions could have both their syntax annotated (with deprels) and their status as fixed expressions represented.

nschneid added tokenization dependencies labels Oct 31, 2019

dan-zeman added this to the v2.6 milestone Nov 9, 2019

dan-zeman added the standard needed label May 14, 2020

dan-zeman modified the milestones: v2.6, v2.7 May 14, 2020

dan-zeman modified the milestones: v2.7, v2.8 Nov 14, 2020

nschneid referenced this issue in UniversalDependencies/UD_English-EWT Jan 1, 2021

fixes to POS annotation of goeswith

23cf42d

dan-zeman modified the milestones: v2.8, v2.9 Jun 17, 2021

dan-zeman modified the milestones: v2.9, v2.11 Jun 13, 2022

dan-zeman closed this as completed May 31, 2023

nschneid mentioned this issue Nov 10, 2023

Lemmas for typos UniversalDependencies/UD_English-EWT#471

Closed

nschneid reopened this Nov 10, 2023

dan-zeman modified the milestones: v2.11, later Nov 10, 2023

nschneid added a commit that referenced this issue May 31, 2024

flat: filenames (#666)

fe1538e

nschneid mentioned this issue May 31, 2024

UPOS "X" UniversalDependencies/UD_English-EWT#440

Closed

5 tasks

nschneid added a commit to UniversalDependencies/UD_English-EWT that referenced this issue May 31, 2024

FlatType=Phone and FlatType=Filename (#440, UniversalDependencies/doc…

628e146

…s#666)

nschneid added a commit to UniversalDependencies/UD_English-EWT that referenced this issue May 31, 2024

add ExtPos=PROPN for FlatType=Phone and FlatType=Filename (#440, Univ…

3a50b36

…ersalDependencies/docs#666)
