Filenames and other computery entities #666

Open
nschneid opened this issue Oct 31, 2019 · 40 comments

Comments

@nschneid
Contributor

The email genre of English-EWT lists file attachments, e.g. "Constellation Power (GISB draft).doc".

  1. Should filenames always be tokenized into discernible linguistic words ("ConstellationPower(GISB_draft).doc"), or only when there are spaces?
    • What about filesystem paths and URLs containing spaces?
    • Presumably we would never tokenize email addresses, hashtags, or variable names in code as these never contain spaces
  2. To what extent should annotators attempt to infer internal structure, like in titles of artistic works (Long titles of works of art #664)? E.g. the above could include two compound relations and an appos relation for the parenthetical. I'm not sure how ".doc" should attach—flat?
@dan-zeman
Member

If there are no spaces, I would keep ".doc" together with the main name in one token.

Then it seems natural to treat the filename as one word with spaces, although personally I am not a big fan of words with spaces. The dot (adjacent to a letter on both sides) makes it recognizable as a validation exception; without extension, it would be tokenized and analyzed like movie/book titles.

Or maybe we could do without words with spaces completely and only keep the last word together with ".doc" while the other words would be separate tokens.

@dan-zeman dan-zeman added this to the v2.6 milestone Nov 9, 2019
@amir-zeldes
Contributor

I could see a case for using goeswith here - if you believe that filenames are 'single words' then in some sense they should be spelled together, but there is a space here. So it's somewhat similar to a single word broken up into two tokens because of a space?

@martinpopel
Member

@amir-zeldes If you know that a given filename does not include a space, but there is a typo in the text (e.g. "auto exec.bat" or "~/.bash rc") then you can use goeswith. However, nowadays there are many filenames containing spaces (e.g. "Constellation Power (GISB draft).doc" mentioned by @nschneid) and I think we should not use goeswith here. We should not break the rule that goeswith is reserved only for text that is not well edited and that by deleting the extra space you obtain a better edited text.
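
The distinction drawn in this comment can be sketched in code. This is only an illustration, not part of any UD tool: the helper name `is_goeswith_candidate` and the `known_names` inventory are hypothetical, and it assumes we already have a list of filenames known to be written without spaces.

```python
def is_goeswith_candidate(tokens, known_names):
    """Return True if deleting the spaces between tokens yields a known filename.

    tokens: the space-separated pieces as they appear in the text.
    known_names: hypothetical set of filenames known to contain no spaces.
    """
    if len(tokens) < 2:
        return False  # nothing was split, so goeswith cannot apply
    return "".join(tokens) in known_names


# Hypothetical inventory of canonical, space-free filenames.
KNOWN = {"autoexec.bat", ".bashrc"}

# A stray space inside a known name: a goeswith candidate.
print(is_goeswith_candidate(["auto", "exec.bat"], KNOWN))
# The space belongs to the filename itself: not a goeswith candidate.
print(is_goeswith_candidate(["GPSA", "Guaranty.doc"], KNOWN))
```

In practice the judgment of whether a space is a typo is made by the annotator; the lookup above merely stands in for that judgment.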

@amir-zeldes
Contributor

I think you definitely obtain a better file name by deleting spaces :)

But I see your point!

@dan-zeman dan-zeman modified the milestones: v2.6, v2.7 May 14, 2020
@dan-zeman dan-zeman modified the milestones: v2.7, v2.8 Nov 14, 2020
@nschneid
Contributor Author

nschneid commented Jan 1, 2021

Another question: should these be PROPN?

nschneid referenced this issue in UniversalDependencies/UD_English-EWT Jan 1, 2021
@amir-zeldes
Contributor

I think PROPN makes sense. In EWT xpos could also be either NNP or ADD, by analogy to URLs (I guess they are all like URIs?)

@dan-zeman dan-zeman modified the milestones: v2.8, v2.9 Jun 17, 2021
@dan-zeman dan-zeman modified the milestones: v2.9, v2.11 Jun 13, 2022
@nschneid nschneid reopened this Nov 10, 2023
@dan-zeman dan-zeman modified the milestones: v2.11, later Nov 10, 2023
@nschneid
Contributor Author

Another reason to be skeptical about goeswith is that filenames-with-spaces are compositional and we don't think of them as having a single lemma in the language. So I think flat is the better choice.

Here's another example:

# sent_id = email-enronsent32_01-0039
# text = - GPSA Guaranty.doc
1	-	-	PUNCT	NFP	_	2	punct	2:punct	_
2	GPSA	gpsa	NOUN	GW	_	0	root	0:root	_
3	Guaranty.doc	guaranty.doc	X	NN	Number=Sing	2	flat	2:flat	_

The last part, "Guaranty.doc", has an odd combination of X and Number=Sing. Should it be treated as a NOUN? Should ".doc" be split off as a separate word and tagged as X?
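
Combinations like the one questioned here (UPOS `X` together with morphological features) can be found mechanically. The following is a minimal sketch using only the standard library; the function names `parse_conllu` and `x_with_feats` are made up for illustration, and the check reflects only the oddity raised in this comment, not an official validation rule.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(block):
    """Parse the token lines of one CoNLL-U sentence into dicts keyed by column name."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}: {line!r}")
        tokens.append(dict(zip(FIELDS, cols)))
    return tokens


def x_with_feats(tokens):
    """Return (form, feats) for tokens tagged X that nevertheless carry features."""
    return [(t["form"], t["feats"]) for t in tokens if t["upos"] == "X" and t["feats"] != "_"]


# The sentence quoted above, rebuilt with explicit tab separators.
EXAMPLE = "\n".join([
    "# sent_id = email-enronsent32_01-0039",
    "# text = - GPSA Guaranty.doc",
    "\t".join(["1", "-", "-", "PUNCT", "NFP", "_", "2", "punct", "2:punct", "_"]),
    "\t".join(["2", "GPSA", "gpsa", "NOUN", "GW", "_", "0", "root", "0:root", "_"]),
    "\t".join(["3", "Guaranty.doc", "guaranty.doc", "X", "NN", "Number=Sing", "2", "flat", "2:flat", "_"]),
])

print(x_with_feats(parse_conllu(EXAMPLE)))
```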

@AngledLuffa

What about parsing it as a single token? There's precedent for tokens with spaces, in French for example, when they represent a single concept.

@nschneid
Contributor Author

What are some of the French examples? I was only aware of this being done for numbers where the space separator is merely for readability.

@amir-zeldes
Contributor

I haven't had to deal with them before, but I think my inclination would be to use goeswith but without Typo=Yes. In other words, I think of them as single tokens that unfortunately happen to have spaces, so they need to be linked with goeswith. Normally this is the result of a typo (space in the middle of a word), but in this case I wouldn't say it's a 'mistake', so I would just refrain from using Typo. I'm aware the goeswith guidelines say it's for badly spelled text, but I would prefer to extend the documentation to include files with spaces, rather than have multiple 'true tokens' with tags and deprels in there.

@arademaker
Contributor

IMHO, better than flat!

@nschneid
Contributor Author

I'm wary of removing the Typo=Yes requirement that we established for goeswith as (1) it's a reversal of a guidelines amendment and (2) it would create confusion as to whether Typo=Yes is appropriate for the vast majority of goeswith units (if it can't be checked for, people will forget to provide it).

And I don't see any particular problem with noting e.g. that "Releases" in the long filename I posted above is a plural noun attaching as flat (I am guessing; seems more likely than VERB). Words that are hard to decide a tag for can simply be X in this context.

Curious to hear @dan-zeman's opinion.

@mr-martian
Contributor

What if there was a requirement that Typo= accompany goeswith, but filenames and such were marked with Typo=No?

@nschneid
Contributor Author

Interesting idea...what would be the criterion for "and such"? :D I.e. what are the characteristics of expressions that this strategy should be used for, beyond filenames?

@amir-zeldes
Contributor

I guess that could work for phone numbers too?

@mr-martian
Contributor

Perhaps an inappropriate "and such" on my part, but I suppose that would cover any other tokens with spaces that aren't mistakes, though I have no examples ready to hand.

@nschneid
Contributor Author

So...named entities correctly including spaces but lacking regular internal syntax? I thought that's what flat was for—how to draw the boundary?

@amir-zeldes
Contributor

No, that's not how I understood it - I thought the idea was to use it for things we consider to be single 'words', which I guess could be things that have a single lexical category. For example, I think phone numbers are just numbers, so they have the single category NUM, and if they happen to be spelled with internal spaces, we could use goeswith to mean we think they are functioning as a single lexical item, but use Typo=No to indicate the spelling with space is expected/canonical.

@nschneid
Contributor Author

I thought the idea was to use it [broadened goeswith] for things we consider to be single 'words'

In general, how would we tell that, though? If we're stepping away from the idea that wordhood, absent morphosyntactic cues, is defined by orthography, it seems like opening a can of worms. E.g. one could argue that a telephone number is made up of individual digits, each of which is in principle a word regardless of the spacing. Or one could argue that a foreign expression written with a space (et cetera) is actually a single word of English.

In my interpretation, flat and X already give us the fudge factor we need to deal with real data. Introducing an entirely new kind of wordhood seems risky unless there is a clear test.

@amir-zeldes
Contributor

Hm, OK - I don't urgently need anything to happen here, but it sounded like this was already being done for numbers with spaces, so insofar as someone had a criterion for why they used spaces in tokens, I think the same criterion would apply to this suggestion.

Concretely regarding filenames with spaces, they feel like the same sort of things as phone numbers with spaces to me. If a guideline is formulated which explicitly covers only phone numbers and files (or maybe URIs in general?), then I don't see the danger of a slippery slope. For me spaces in tokens are worse than almost any other solution!

@nschneid
Contributor Author

nschneid commented May 28, 2024

numbers with spaces

Yeah, maybe somebody can weigh in on what warranted that exception—I assume because it's routine in some orthographic styles to use spaces for thousands separators whereas we'd use commas, and numerals are so frequent in many genres that it would be cumbersome to break them up.

But space separation for special numeric entities (like telephone numbers) does NOT warrant this exception.

@sylvainkahane
Contributor

Every language can have some strange orthographic conventions for a couple of words. For instance, French has at least one: parce que 'because'. Nobody wants to have parce as a word, because it doesn't exist without que. But it would be costly and dangerous to relax the rule forbidding tokens with spaces just for this word and goeswith and Typo=No could be a better solution, I think.

@dan-zeman
Member

dan-zeman commented May 29, 2024

Curious to hear @dan-zeman's opinion.

I find flat better than goeswith. Also, if flat is the policy, it will require just a small clarification somewhere, while if goeswith is the policy, it will be an amendment and we will have to carefully scan the guidelines for places that talk about goeswith and say it is used only for ill-edited text.

I also like the flexibility that if a file name has spaces and is tokenized into multiple tokens, these may or may not get morphological analysis depending on what makes more sense in individual cases.

@dan-zeman
Member

dan-zeman commented May 29, 2024

numbers with spaces

Yeah, maybe somebody can weigh in on what warranted that exception—I assume because it's routine in some orthographic styles to use spaces for thousands separators whereas we'd use commas, and numerals are so frequent in many genres that it would be cumbersome to break them up.

But space separation for special numeric entities (like telephone numbers) does NOT warrant this exception.

Exactly. Spaces in numbers are regulated by the standardized spelling in Czech (as well as some other languages). Telephone numbers are not (and some people, like me, use hyphens instead of spaces in them). But at least telephone numbers are still "numbers" (plus punctuation), so I would not mind treating them the same way as normal numbers if the latter already can have spaces in the language. I would definitely not treat alphanumeric file names this way. And if the language does not have an exception for numbers, I would cluster telephone numbers with file names.

@dan-zeman
Member

Every language can have some strange orthographic conventions for a couple of words. For instance, French has at least one: parce que 'because'. Nobody wants to have parce as a word, because it doesn't exist without que. But it would be costly and dangerous to relax the rule forbidding tokens with spaces just for this word and goeswith and Typo=No could be a better solution, I think.

I think the standard solution we already have for this is fixed. No need for goeswith here.

@sylvainkahane
Contributor

@dan-zeman But fixed is for MWEs, no? parce que is a word, not an MWE. It is a word written with a space. As I said, parce is not a word of French, just a strange orthographic form.

@jnivre
Contributor

jnivre commented May 29, 2024 via email

@nschneid
Contributor Author

Right, I think of the breakdown as follows:

  • goeswith is for incorrectly added spaces (typos)
  • fixed is for words-with-spaces (grammatical elements of a language that are conventionally spelled with a space for historical reasons)
  • flat is for other cases where no single syntactic head can be identified—typical examples are named entities that aren't structured by general syntactic relations, foreign/borrowed phrases, and repetitions or sound sequences.
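
The three-way breakdown above can be written as a small decision function. This is only an illustrative sketch: `choose_relation` and its boolean parameters are hypothetical names, the two judgments they encode still have to be made by a human annotator, and the function assumes it is only called for token sequences in which no single syntactic head can be identified.

```python
def choose_relation(space_is_typo, is_grammaticalized):
    """Pick a relation for a headless multi-token unit, following the breakdown above.

    space_is_typo: the space was added incorrectly (ill-edited text).
    is_grammaticalized: a grammatical element conventionally spelled with a space.
    """
    if space_is_typo:
        return "goeswith"
    if is_grammaticalized:
        return "fixed"
    # Everything else without an identifiable head: names, foreign phrases, etc.
    return "flat"


print(choose_relation(space_is_typo=True, is_grammaticalized=False))   # e.g. "auto exec.bat"
print(choose_relation(space_is_typo=False, is_grammaticalized=True))   # e.g. French "parce que"
print(choose_relation(space_is_typo=False, is_grammaticalized=False))  # e.g. "GPSA Guaranty.doc"
```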

@Stormur
Contributor

Stormur commented May 30, 2024

I would be in favour of keeping them as tokens with internal spaces. If not, I am not sure we really want to use flat, since this would mean that we would always like to analyse all the elements of all such file names as if they were "actual words". This seems really difficult to me, as these strings are mostly placeholders which occasionally contain strings looking like well-formed phrases, but this is misleading. For this reason, as discussed under another issue, I would argue for SYM as their part of speech. In this context, fixed might be the better choice in the end, even if in my personal opinion it seems to say something different from a token with spaces.

nschneid added a commit that referenced this issue May 31, 2024
@nschneid
Contributor Author

The Core Group discussed this and decided on flat. I understand there is a concern about treating a filename as having multiple words that are in some sense linguistically independent units, but I think that's too strong of an interpretation of flat. Like fixed for grammatical expressions and goeswith for misspellings, flat can apply in some cases where the morphosyntactic notion of word contains multiple tokens per the tokenization. And tokenizing on (at minimum) spaces is a very strong convention for languages where the primary function of spaces is to show a word boundary.

X is available for the UPOS of tokens regarded as something smaller than a syntactic word (or not an "actual word", in line with @Stormur's concern). At the discretion of treebanks, a filename might be analyzed as containing some recognizable words with substantive UPOS/feats, or they might all be labeled X. The syntactic category of the whole filename can be signaled with ExtPos=PROPN.
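
As a concrete illustration of this decision, here is a minimal sketch that produces CoNLL-U rows for a filename containing spaces. Several choices are assumptions for illustration rather than part of the decision itself: the function name `annotate_filename` is hypothetical, the filename is split on spaces only, every token is labeled X (one of the options left to treebanks), the lemma is copied from the form, and XPOS, DEPS and MISC are left empty. `ExtPos=PROPN` is placed in the FEATS column of the first token, which heads the flat structure.

```python
def annotate_filename(filename, first_id=1, head=0, deprel="root"):
    """Return CoNLL-U token lines for a filename, attaching later tokens as flat.

    first_id: ID of the first token of the filename in its sentence.
    head, deprel: attachment of the filename as a whole (defaults: sentence root).
    """
    forms = filename.split()  # tokenize on spaces only (the minimum convention)
    if not forms:
        raise ValueError("empty filename")
    rows = []
    for offset, form in enumerate(forms):
        if offset == 0:
            # The first token heads the expression and carries ExtPos.
            feats, tok_head, tok_deprel = "ExtPos=PROPN", head, deprel
        else:
            feats, tok_head, tok_deprel = "_", first_id, "flat"
        cols = [str(first_id + offset), form, form, "X", "_",
                feats, str(tok_head), tok_deprel, "_", "_"]
        rows.append("\t".join(cols))
    return rows


for row in annotate_filename("GPSA Guaranty.doc"):
    print(row)
```

A treebank that prefers substantive tags for recognizable words would replace the blanket X with per-token UPOS and features while keeping the same flat attachment.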

(In retrospect, perhaps instead of flat/fixed/goeswith it would have been better to have one relation for multi-token words and another relation for headless multi-word expressions. Something to consider for a potential UDv3.)

@sylvainkahane
Contributor

I think we are completely losing the meaning of the UD syntactic relations, or at least I am completely lost. flat is used for headless constructions, such as the "first name - second name" construction. They are particular constructions in the sense of CxG, for instance. It is true that flat:foreign is also used for foreign expressions, and in this case does not really refer to a headless construction, but ok. For the cases we are discussing here, I don't think they are headless constructions in any acceptable sense.

On the other hand, goeswith means 'goes with', that is, two tokens that should be together. It can be because of a misspelling or, as proposed, because of a strange orthographic convention. Contrary to flat, goeswith clearly indicates that there is no construction in this case. I think we should clearly separate dependency labels referring to syntactic constructions from non-linguistic dependency labels.

By the way, I am also a bit confused by @dan-zeman's, @jnivre's, and @nschneid's answers about fixed, saying that "fixed is for words-with-spaces". Do you really consider that whether or not, according to, all but, etc. are words of English? (see https://universal.grew.fr/?custom=66598aedccf83).

@Stormur
Contributor

Stormur commented May 31, 2024

With regard to fixed, there clearly is a problem in how it is used more than in how it is defined.

Tokenisation over spaces would be the opposite and complementary option of multiword tokens. I think it might be very useful to recognise that spaces are actually often used to separate things which are at an intermediate level between what we identify as syntactic words and phrases, but, like punctuation marks, cannot be an ultimate tokenisation criterion themselves. Whether this really has an impact on current parsers needs to be investigated, but from a machine point of view a space is just a character like any other.

I do not think I put forward too strong an interpretation of flat: it is defined to be used for "flat" phrases, so it entails a linguistic interpretation. A filename has no such interpretation, and neither does an email address, a phone number, or any number expressed by means of symbols... so I think it should be avoided, because a file name, i.e. a single block of alphanumeric + other characters, is really different from a personal name with many components, which all by themselves are morphosyntactically analysable words.

By the way, flat is dangerously close to conj up to the point one wonders where the difference is, but this is another story...

@jnivre
Contributor

jnivre commented May 31, 2024 via email

@nschneid
Contributor Author

I think @sylvainkahane is suggesting a primary distinction between multi-token words (words-with-spaces) and headless phrases (where individual elements might be omissible, modifiable, etc.). That sounds perfectly sensible to me; it's just not what UDv2 has, given its narrow definitions of goeswith and fixed and its broad definition of flat.

Some treebanks are using flat:foreign as a way to acknowledge that foreign expressions are a bit different in this regard from the flat expressions that are headless phrases. What about another subtype that would apply to the telephone numbers and filenames, e.g. flat:mtw for "multi-token word"?

If there is a UDv3 I do think the goeswith/fixed/flat relations should be reconsidered, also because many people expect the term "fixed" to cover morphosyntactically fixed expressions in general, whereas it is only intended for a small list of grammatical ones.

By the way, I am also a bit confused by @dan-zeman's, @jnivre's, and @nschneid's answers about fixed, saying that "fixed is for words-with-spaces". Do you really consider that whether or not, according to, all but, etc. are words of English? (see https://universal.grew.fr/?custom=66598aedccf83).

The current list of English fixed expressions is documented here. It is largely inherited from the Stanford Dependencies annotation of EWT, and there are definitely debatable cases in this list, as well as others that maybe should be added to the list (UniversalDependencies/UD_English-EWT#400). I'm happy to discuss those separately, but for purposes of the present discussion, we should go by the universal definition at https://universaldependencies.org/u/dep/fixed.html.

@LarsAhrenberg
Contributor

I would like to express my support for @nschneid's suggestion that

If there is a UDv3 I do think the goeswith/fixed/flat relations should be reconsidered,

It is obvious from this discussion that many long-time UD experts have different intuitions on how these relations should be used. And although the guidelines for fixed have been updated, they are still not detailed enough. What is actually meant by 'the most grammaticalized cases'? In the paper @jnivre refers to, I try to identify (in Swedish) what I call rigid expressions, i.e. those showing no variation at all. But they are still too numerous to qualify as 'a closed class'.

I find interesting the comment by @sylvainkahane that he sees flat as a relation for headless constructions. The problem is that UD currently only recognizes one such construction, i.e. names. Currently, fixed is used for many expressions that have an internal head, such as ADP + NOUN, which we may call 'headed constructions' with the noun as the head even if it is non-determined. If UD keeps only one deprel for headless constructions, the distinction between names and fixed non-headed expressions (and typos) could instead be made with features, say in the MISC column. And with a feature for fixedness, the headed fixed expressions could have both their syntax annotated (with deprels) and their status as fixed expressions represented.

nschneid added a commit to UniversalDependencies/UD_English-EWT that referenced this issue May 31, 2024
nschneid added a commit to UniversalDependencies/UD_English-EWT that referenced this issue May 31, 2024