Handling quotation marks #42
Adding a WORDSEP affix class seems reasonable. Any of the usual white-space marks are reasonable separators, of course. To get fancy, the non-breaking space should be among them, as it's a typesetting thing.
This was discussed again in issue #632 -- I am copying parts of that discussion here.
There are (at least) three distinct types of quote usage:
Both case 3 and case 4 can probably be handled the same way. Case 5 suggests that there should be a generic mechanism that treats a quoted phrase as if it were a noun. That is, the internal grammatical structure of the phrase should be ignored. The quotation marks form a wall (like right-wall, left-wall), preventing links from crossing over the wall.
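To make the "wall" idea concrete, here is a minimal sketch (nothing like this exists in the link-grammar code; the helper name and the word positions are made up for the demo) of how a quote token could block candidate links from crossing it, the way left-wall and right-wall do:

```c
#include <stdbool.h>
#include <stdio.h>

/* Reject a candidate link whose endpoints lie on opposite sides of any
 * quote "wall".  lword/rword and walls[] are word positions in the
 * sentence (hypothetical numbering, just for this demo). */
static bool link_crosses_wall(int lword, int rword,
                              const int *walls, int nwalls)
{
    for (int i = 0; i < nwalls; i++)
    {
        if (lword < walls[i] && walls[i] < rword) return true;
    }
    return false;
}

int main(void)
{
    /* He said " stop right there " loudly
     *  1   2  3   4    5     6   7    8     (hypothetical positions) */
    const int walls[] = { 3, 7 };
    printf("%d\n", link_crosses_wall(2, 5, walls, 2)); /* 1: crosses the opening quote */
    printf("%d\n", link_crosses_wall(4, 6, walls, 2)); /* 0: entirely inside the quotes */
    return 0;
}
```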
From issue #632:
I added it to my TODO list.
Thank you.
I have a very ill-defined, vague comment - something to try to think about and understand: is there some (elegant) way of reformulating tokenization (and related issues) into a collection of rules (that could be encoded in a file)? For example, capitalization: we have a rule (coded in C++) that if a word is at the beginning of a sentence, then we should search for a lower-case version of it. ... or if the word is after a semicolon, then we should search for a lower-case version of it. ... or if the word is after a quote, then we should search for a lower-case version of it.

In the language learning code, I don't downcase any data in advance. Instead, the system eventually learns that certain words behave the same way, grammatically, whether they are uppercased or not. The system is blind to uppercasing: it just sees two different UTF8 strings that happen to fall into the same grammatical class.

To "solve" this problem, one can imagine three steps. First, a "morphological" analysis: given a certain grammatical class, compare pairs of strings to see if they have a common substring - for example, if the whole string matches except for the first letter. This would imply that some words have a "morphology", where the first letter can be either one of two, while the rest of the word is the same. The second step is to realize that there is a meta-morphology-rule, which states that there are many words, all of which have the property that they can begin with either one of two different initial letters. The correct choice of the initial letter depends on whether the preceding token was a semicolon, a quote, or the left-wall. The third step is to realize that the meta-morphology-rule can be factored into approximately 26 different classes. That is, in principle, there are 52-squared/2 = 1352 possible sets containing two (initial) letters. Of these, only 26 are seen: {A, a}, {B, b}, {C, c} ... and one never ever sees {P, Q} or {Z, a}.

As long as we write C code, and know in advance that we are dealing with capital letters, then we can use pre-defined POSIX locales for capitalization. I'm trying to take two or three steps backwards here. One is to treat capitalization as a kind of morphology, just like any other kind of morphology. The second is to create morphology classes - the pseudo-morpheme

The meta-meta-meta issue is that I want to expand the framework beyond just language written as UTF8 strings, to language more generally, with associated intonation, affect, facial expressions, or "language" from other domains (biology, spatial relations, etc.)
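As a concrete picture of the kind of hard-coded rule being described, here is a minimal sketch (purely illustrative; the function name is invented and this is not the actual library code) of a "capitalizable position" test that would trigger an additional lower-case dictionary lookup after the sentence start, a semicolon, or a quote:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* A word is in a "capitalizable position" if the previous token is the
 * sentence start (represented here as NULL), a semicolon, or a quote;
 * in that case a lower-case variant of the word should also be looked
 * up in the dictionary. */
static bool capitalizable_position(const char *prev_token)
{
    if (prev_token == NULL) return true;            /* start of sentence */
    if (strcmp(prev_token, ";") == 0) return true;  /* after a semicolon */
    if (strcmp(prev_token, "\"") == 0) return true; /* after a quote */
    return false;
}

int main(void)
{
    printf("%d\n", capitalizable_position(NULL));  /* 1 */
    printf("%d\n", capitalizable_position(";"));   /* 1 */
    printf("%d\n", capitalizable_position("dog")); /* 0 */
    return 0;
}
```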
I created issue #690 to track this capitalization-as-pseudo-morphology idea. The meta issue is, again:
For the first pass, it's just fine to write this as pure C code that "does the right thing". The meta-issue is to identify these kinds of rules in the tokenizer algorithm, and capture them as "generic rules".
I have an abandoned project that does just that...
I sent a very detailed proposal, and even had a prototype implementation. However, I somehow understood that such an approach was too complex and actually not needed, maybe from your response:
I then sent a very detailed answer (that you didn't address). I also started to investigate another project, a "zero-knowledge" tokenizer. The idea was that the dict would include the information needed for tokenization. Here is the relevant post (you didn't respond).
We can continue the said discussions if desired.
I'll continue in #690.
Sorry. The proposal did not seem quite right, and figuring out how to do it correctly takes a lot of time and effort, and I ran out of energy. The generic "problem" still needs to be solved. I recall two problems that bugged me:
As I write this, it occurs to me that perhaps parsing and tokenization really truly are "adjoint functors". However, the term "adjoint functor" is a rather complex and abstract concept, and so I would have to think long and hard about how to demonstrate that adjointness directly, and how it could be used to guide the design of the tokenizer.
I think a main misunderstanding between us was that I didn't intend at all to use "prefix" and "suffix" according to their linguistic meanings, and I repeatedly mentioned that. But I understand that I failed to clarify this point. I just defined "prefix" and "suffix" in a way that is effective for tokenizing, disregarding (entirely, as I pointed out) their linguistic role, since for tokenizing purposes this is not important - it is only important to break words into morphemes in all the possible ways. Instead I could call these parts e.g. ISSP and ESSP for "initial sentence string parts" and "ending sentence string parts", or anything else ("sentence string" and not "word", to further avoid a possible clash with linguistic terms). If you think it is clearer to use, for the purpose of the tokenization discussion, other terms than the specially defined "prefix" and "suffix", then I have no objection to that.
I think that my definitions, which are especially tailored for the purpose of tokenization (and hence different from the linguistic terms), do not constrain tokenization, and words can still get broken in all the possible ways without missing any. But if any constraint is discovered, these definitions can be fine-tuned as needed, because there is no need for them to match any linguistic concept. But it occurred to me (and I also posted about this in detail) that even these terms are not needed if we just mark the possible internal links somehow, because the tokenizer can then infer everything from the dict. And as usual, for many of the things that I said, including this, I also wrote (before I said them...) a demo program to check them.
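For illustration only (this is not the demo program mentioned above), here is a tiny sketch of what "break words into morphemes in all the possible ways" can look like at the string level: enumerate every two-way split point, and leave it to the dictionary and affix lists to decide which parts actually exist:

```c
#include <stdio.h>
#include <string.h>

/* Print every two-way split of a token; which parts are real morphemes
 * is then a question for the dictionary / affix lists, not for this code. */
static void print_all_two_way_splits(const char *word)
{
    size_t len = strlen(word);
    for (size_t i = 1; i < len; i++)
    {
        printf("%.*s + %s\n", (int)i, word, word + i);
    }
}

int main(void)
{
    print_all_two_way_splits("walked"); /* w+alked, wa+lked, ..., walke+d */
    return 0;
}
```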
Again, sorry for the misunderstanding. A part of that involved the use of the equals sign. I think I understand the English and Russian examples above, but I don't quite understand
I also don't understand how to write rules for Lithuanian, where the "prefixes" (which can occur in the middle of a word) are drawn from a "closed class" (there are maybe 20 of them; one can list them all, exhaustively), the "infixes" are another closed class (maybe five of them total), and the "suffixes" are again closed (maybe 50 or 100 of them) but, again, can occur in the middle of a word. The stems are then open class (thousands), which cannot be exhaustively listed and have to be drawn from the dictionary (and a word might have two stems inside of it).

So the idea was that if it's "closed class", viz. a small, finite number of them, completely well-known by all speakers, then it's an affix. If it's not closed class, then it's "open class", because it's impossible to create a complete list; most speakers do not know (will never know) all of them - they are like words you never heard of before, never use, don't recognize. It's only because the total number of closed-class affixes is small that it makes sense to list them in one file. It's a good thing that the total number is small, as otherwise morphology would be very difficult, requiring the lookup of huge numbers - a combinatoric explosion - of alternatives in the dictionary.

In English, the closed-class words are pronouns (he, she, ...), determiners (this, that, ...), prepositions (as, in, of, next, by), and all speakers know all of them, and new ones are never invented / created / coined even in slang (with exceptions: xyr, xe, ... - closed-class words are very difficult to invent and popularize). The closed-class morphemes in Lithuanian are somewhat similar. Again, sorry for the misunderstandings I cause. I make mistakes, I'm short-tempered and have a large variety of human failings :-)
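To illustrate the closed-class / open-class split (a toy sketch only; the affix strings below are just examples, not a real Lithuanian affix inventory): the closed-class prefixes and suffixes can live in small fixed lists, while whatever remains in the middle is the open-class stem candidate that has to be looked up in the dictionary:

```c
#include <stdio.h>
#include <string.h>

/* Toy closed-class lists: small, fixed, exhaustively listable. */
static const char *prefixes[] = { "ne", "pa", "te" };
static const char *suffixes[] = { "as", "ti", "mas" };

int main(void)
{
    const char *word = "nedarbas"; /* example word, splitting as ne + darb + as */
    size_t wlen = strlen(word);

    for (size_t p = 0; p < sizeof prefixes / sizeof prefixes[0]; p++)
    {
        size_t plen = strlen(prefixes[p]);
        if (strncmp(word, prefixes[p], plen) != 0) continue;

        for (size_t s = 0; s < sizeof suffixes / sizeof suffixes[0]; s++)
        {
            size_t slen = strlen(suffixes[s]);
            if (wlen <= plen + slen) continue;
            if (strcmp(word + wlen - slen, suffixes[s]) != 0) continue;

            /* Whatever is left in the middle is an open-class stem
             * candidate, to be looked up in the dictionary rather than
             * in any fixed list. */
            printf("prefix=%s stem=%.*s suffix=%s\n",
                   prefixes[p], (int)(wlen - plen - slen), word + plen,
                   suffixes[s]);
        }
    }
    return 0;
}
```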
In the current code, quotation marks are removed from the sentence.
Actually, they are converted to whitespace. This means they serve as a word separator.
For example:
This"is a test"
is converted to:
This is a test
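A minimal sketch of this separator behavior (illustrative only, handling just the ASCII double quote; this is not the actual library code):

```c
#include <stdio.h>

/* Overwrite (ASCII) double quotes with blanks, so that they act as word
 * separators when the sentence is later split on whitespace. */
static void quotes_to_whitespace(char *sentence)
{
    for (char *p = sentence; *p != '\0'; p++)
    {
        if (*p == '"') *p = ' ';
    }
}

int main(void)
{
    char sentence[] = "This\"is a test\"";
    quotes_to_whitespace(sentence);
    printf("[%s]\n", sentence); /* [This is a test ] -- note the trailing blank */
    return 0;
}
```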
In addition, a word just after a quotation mark is considered to be in a capitalizable position.
This is true even for a closing quotation mark... including when it is a "right mark".
In my new tokenization code, quotes are tokenized. They are defined in RPUNC and LPUNC, and thus they get stripped off words from their LHS and RHS. However, this doesn't preserve the "separator" behavior that exists in the current code.
In English (and I guess some other languages, maybe many) this seems to be a desired behavior,
because if we see qwerty"yuiop" we may guess it is actually qwerty "yuiop".
However, doing this in general would be wrong for Hebrew, as the double quote U+0022 is a de facto replacement for the Hebrew character "gershayim", which can be an integral part of Hebrew words (as gershayim is not found on the Hebrew keyboard). It is also a de facto replacement for Hebrew quotation marks (for the same reason). So a general tokenization code cannot blindly use it as a word separator.
In order to solve this, I would like to introduce an affix class WORDSEP, which will be a list of characters to be used as word separators, with blank being the default. Characters listed there will still be able to be listed in other affix classes and thus also serve as tokens.
Is this solution sensible?
Another option is just not to use it as a word separator, at least in the first version of "quotation mark as token" (this is what my current code does).
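A rough sketch of what the WORDSEP idea could look like at the code level (WORDSEP is the affix-class name proposed above; the functions below are invented for illustration and are not an existing API). The separator set becomes per-language data: an English-like setting lists the ASCII double quote, while a Hebrew-like setting leaves it out so that gershayim stays word-internal:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* The extra separator characters come from data (imagined here as the
 * contents of a WORDSEP affix class), not from hard-coded logic. */
static bool is_word_separator(char c, const char *wordsep)
{
    if (c == ' ' || c == '\t' || c == '\n') return true; /* blanks: always */
    return (wordsep != NULL) && (strchr(wordsep, c) != NULL);
}

static void split_words(const char *sentence, const char *wordsep)
{
    const char *start = NULL;
    for (const char *p = sentence; ; p++)
    {
        if (*p != '\0' && !is_word_separator(*p, wordsep))
        {
            if (start == NULL) start = p;               /* a word begins */
        }
        else if (start != NULL)
        {
            printf("[%.*s] ", (int)(p - start), start); /* a word ends */
            start = NULL;
        }
        if (*p == '\0') break;
    }
    printf("\n");
}

int main(void)
{
    /* English-like affix data: '"' listed as a separator. */
    split_words("qwerty\"yuiop\"", "\""); /* [qwerty] [yuiop] */

    /* Hebrew-like affix data: '"' not listed, so it stays word-internal. */
    split_words("qwerty\"yuiop\"", "");   /* [qwerty"yuiop"] */
    return 0;
}
```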
Regarding the capitalizable position after a closing quote, for now I will mostly preserve this behavior in the hard-coded capitalization handling, because we are going to try to implement capitalization using the dict.