Zero (phantom, unvoiced) word support. #224
The "(You) go home" variant is particularly important, as it provides a subject for directives/imperatives. |
One way to implement this would be with certain special link types. For example:
The
that is, the
so it looks as if
The second point is the more serious one: a main reason to entertain the idea of phantom words is the hope that the grammar would be simplified. However, in this proposal the dicts would need to contain rules for Wi as well as for Wd, Sp and WV, so the post-parsing conversion does not simplify the grammar. The post-parsing stage could be carried out using some graph re-write rules, e.g. in relex or with the opencog pattern matcher. Since this happens post-parsing, there is no particular point in putting it "inside" of LG itself.
One possible solution is to perform a one-point compactification. The dictionary contains the phantom words, and their connectors. Ordinary disjuncts can link to these, but should do so using a special initial lower-case letter (say, 'z', in addition to 'h' and 't' as currently implemented). The parser, as it works, examines the initial letter of each connector: if it is 'z', then the usual pruning rules no longer apply, and one or more phantom words are selected out of the bucket of phantom words. (This bucket is kept out-of-line; it is not yet placed into the sentence word-sequence order, which is why the usual pruning rules get modified.) Otherwise, parsing continues as normal. At the end of parsing, if any phantom words are linked, then all of the connectors on the disjunct must be satisfied (of course!), or else the linkage is invalid. After parsing, the phantom words can be inserted into the sentence, with the location deduced from link lengths.
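Here is a minimal sketch of the connector test that proposal implies, assuming a connector is just a string whose leading lower-case letters are the marker prefixes ('h', 't', and the proposed 'z'). The function names and the simplified pruning decision are hypothetical, not the actual LG internals:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* A connector name such as "zWd+" carries marker letters in front of the
 * upper-case link type; 'h' and 't' exist today, 'z' is the proposed
 * "may link to a phantom word" marker. */
static bool has_z_marker(const char *connector)
{
    for (const char *p = connector; *p != '\0' && islower((unsigned char)*p); p++)
        if (*p == 'z') return true;
    return false;
}

/* Hypothetical pruning decision: a 'z' connector may be satisfied by a
 * phantom word from the out-of-line bucket, so the ordinary position-based
 * pruning must not throw it away even when no in-sentence word can match. */
static bool prune_may_discard(const char *connector, bool no_in_sentence_match)
{
    if (has_z_marker(connector)) return false;
    return no_in_sentence_match;
}

int main(void)
{
    printf("%d\n", prune_may_discard("zWd+", true)); /* 0: kept, phantom link possible */
    printf("%d\n", prune_may_discard("hSs-", true)); /* 1: pruned as usual */
    return 0;
}
```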
Issue #1234 describes a special case.
@linas,
Implementing a concept of "optional alternatives" could help handle phantom words. The same idea could also solve issue #224 "spell-guessing mis-handles capitalized words". One obstacle is the case of more than one location for an "optional alternative" in a sentence, when the lower-cost alternatives at the different locations are mutually exclusive.
The phantom word addition may be redundant, i.e. the sentence may get parsed without it. Of course, this cannot be known in advance (before parsing).
As an example, I tried to inspect this algo for the sentence
There seem to be several plausible solutions (including inserting words during tokenization). I don't know which is best or easiest. Let's explore each in greater detail.

So, this #224 (comment) is rather cryptic. Let me try to reconstruct what I was thinking. Given the sentence
For now, I don't care about the specific format: besides

The existing
During dictionary read, a special class of zero-words is created: these can be spotted in one of two ways: (1) they are surrounded by square brackets e.g.

Counting is done as usual, with a slight modification: if a disjunct has a connector starting with
During linkage generation, look at the chosen disjuncts, and see if any of them contain

That is, while generating the above linkage, it will be discovered that

Well, almost done. Some open questions:
In the above,
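To make the reconstructed steps above a bit more concrete, here is a rough sketch under two explicit assumptions: that the zero-words are spotted at dictionary-read time by a bracket-wrapped spelling (the actual surface form is left open above), and that the post-parse step simply scans the chosen disjuncts for connectors carrying the proposed 'z' marker. All names and data structures here are invented for illustration; they are not the real LG ones.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumption: a zero-word is written bracket-wrapped in the dict, e.g. "[it]".
 * The dictionary reader would put such entries into the phantom-word bucket. */
static bool is_zero_word(const char *dict_word)
{
    size_t n = strlen(dict_word);
    return n >= 3 && dict_word[0] == '[' && dict_word[n - 1] == ']';
}

/* The proposed 'z' marker: a leading lower-case 'z' on a connector name
 * says "this connector may be satisfied by a phantom word". */
static bool connector_has_z(const char *c)
{
    for (; *c != '\0' && islower((unsigned char)*c); c++)
        if (*c == 'z') return true;
    return false;
}

/* Hypothetical chosen-disjunct record: just the connector names it uses. */
typedef struct {
    const char **connectors;
    size_t n_connectors;
} ChosenDisjunct;

/* Post-parse scan: if any chosen disjunct used a 'z' connector, a phantom
 * word has to be spliced into the sentence, at a position deduced later
 * from the link lengths. */
static bool linkage_needs_phantom(const ChosenDisjunct *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < d[i].n_connectors; j++)
            if (connector_has_z(d[i].connectors[j])) return true;
    return false;
}

int main(void)
{
    printf("%d\n", is_zero_word("[it]"));   /* 1 */
    printf("%d\n", is_zero_word("looks"));  /* 0 */

    const char *used[] = { "Wd-", "zSs+" };
    ChosenDisjunct chosen[] = { { used, 2 } };
    printf("%d\n", linkage_needs_phantom(chosen, 1)); /* 1 */
    return 0;
}
```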
The above comment points out that we have an API issue. The current API is driven by word indexes: all words have a fixed location in a sentence, which means that location-independent linkages are not currently possible. Here is a futuristic example of a desirable linkage diagram:
Here, I would like to do things like the above, but the current infrastructure isn't suitable for that. We can kind-of do multiple sentences, today, but it's ad hoc:
but there is no way to draw links to references:
where the
I have deep misgivings about 3 and 4, since it seems one must sacrifice huge amounts of representational power in exchange for a grab-bag of mostly silly tools. However, there are vast numbers of programmers (and corporate executives) who love these frameworks and view them as the answer to all problems. The problem with 2. is that it is too abstract: it's not programmer-friendly, it's missing graph visualization tools (despite over a decade of failed efforts), and there aren't any API utilities, e.g. no easy way to jam it into a computer-game non-player-character chatbot. There are companies that make good money creating fancy chatbots for computer games... it's frustrating that we can't demo that. The problem with 1. is that... is there a problem? Well, it's not 2, 3 or 4. I guess that's a problem. But I know that you personally would have a lot more fun working on 1. than on 2, 3 or 4, and so... that's a good thing. Let's push the boundary. See how far we can go.
Edited to complete it after a premature posting. I need several more clarifications...
It seems
The "islands" state should be the same for counting.
But it is not so; the parsing with islands allowed is actually:
The word
How many words would need

I said above:
It seems phantom words will get inserted in many places even though they are not needed.
Moreover, they may get inserted in places that make invalid sentences parsable. For example:
BTW, I noted that for the sentence
I pressed ENTER by mistake outside the comment box, and the default "Comment" button got triggered...
I have now completed that post... I have more things to ask, but they may be redundant (or may change) depending on the clarifications.
It should be extended anyway, so this seems like a good direction.
It may be an addition, not a replacement.
I guess it would be a good idea to learn the API of NLTK and mimic it when applicable.
Why keep them? Why not discard them? Right now I see no reason to keep them, but everything is a bit murky...
Turn on "island" only if pruning left behind disjuncts with
I think you did it wrong. Try again with this dict:
The first diff emulates
Note that all of the
This will happen only if the dictionary contains a transitive phantom:
without the second line, it won't parse. This is a generic dictionary maintenance headache: poorly structured disjuncts allow crazy parses; it is challenging to set them up so that they work well. You can try it:
Given that it seems to be immensely popular, I suppose so. I am spread far too thin, working on too many projects already, so this is not something I could undertake. But, sure, looking at how other people do things, and then stealing the best ideas, is usually a good thing.
Since non-
Note that the current pruning looks at "Islands" for possible optimization (skipping parsing altogether in case there are more nulls than requested). This optimization can make a difference only when parsing with a null_count>0. So if any disjunct contains a
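Putting that observation together with the earlier suggestion to turn islands on only when pruning leaves behind disjuncts with the proposed 'z' connectors, the combined decision might look roughly like the sketch below. This is purely illustrative and only my reading of the suggestion; the function and flag names are invented, not the real pruning code:

```c
#include <stdbool.h>
#include <stdio.h>

/* Sketch of the proposed control flow, reduced to two inputs: whether pruning
 * left behind any disjunct carrying a 'z' connector, and the null count the
 * caller asked to parse with. */
static bool use_islands(bool z_disjunct_survived, int requested_null_count)
{
    /* Islands (and the related skip-parsing optimization) only matter when
     * nulls are allowed at all. */
    if (requested_null_count == 0) return false;

    /* The suggestion above: only bother with islands when a phantom word
     * could be the thing that joins the pieces back together. */
    return z_disjunct_survived;
}

int main(void)
{
    printf("%d\n", use_islands(true, 1));   /* 1: islands on  */
    printf("%d\n", use_islands(false, 1));  /* 0: islands off */
    printf("%d\n", use_islands(true, 0));   /* 0: islands off */
    return 0;
}
```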
What is the purpose of parsing with islands? Is this for finding the exact location of the island words?
It's a flag that dates back to the original code. It says, basically "I can't parse the whole thing, but here are a bunch of phrases I understand, I just can't join them together." It is an alternative to saying "I can't parse the whole thing, but if I ignore these words, then I can". For this example, the two islands "make sense", in a way:

That said, the historic default has been skipped words instead of islands; I have no idea why that's the default. I kind of like islands better. They're usually less crazy.
I wrote above:
This was a strange typo - my intention was
Additional questions:
It sure seems like it, doesn't it?
Possibly! I have repeatedly noticed that, when I repair the English dict to handle some new case, there is a matching version with a phantom word that does parse correctly. Having explicit phantom-word support could lead to simplifications of the dictionary, or so it seems: I keep having to add complexity to handle those cases; this is hard to do, and it creates yet more disjuncts. Obviously, having fewer disjuncts would be better. The psychological lesson here is that "newspaper English" is well-written, articulate and precise. But when people talk, they are sloppy, imprecise, and drop words all the time. Non-native speakers drop words simply because they don't know what they should be. It seems that phantom words restore these, or "fill in the blanks", literally. Interesting...
Why not just actually insert it in the sentence (to show how the sentence got parsed)?
Not sure. I guess all of the above examples do have an explicit location for the phantom word. An interesting exception is #1240 where the missing word forces a subject-verb inversion.
Zero/phantom words: Expressions such as "Looks good" have an implicit "it" (also called a zero-it or phantom-it) in them; that is, the sentence should really parse as "(it) looks good". The dictionary could be simplified by admitting such phantom words explicitly, rather than modifying the grammar rules to allow such constructions.
Other examples, with the phantom word in parentheses, include:
One possible solution to the unvoiced-word problem might be to allow the LG rules to insert alternatives during the early culling stages. This avoids the need to pre-insert all possible alternatives during tokenization...