Provide context #58

gjtorikian · 2013-12-03T02:43:51Z

This is a start to resolving #25.

Right now it seems to be pretty easy to find the original position of a matching token, as I've done here. What I'm having a hard time doing is getting the original, untokenized document back. It seems to be impossible currently. I'm very new to the code, though, and I may be overlooking something.

What I'd like to do is this:

Given a text of: Professor Plumb has a green plant in his study

And a search of: green plant

I want to get the index of the matched token (in this case, 24).

From there, I'd like to devise some sort of extraction, so that the result returned to the user is actually something like a summary: ...has a green plant in his.... This way, the reader can kind of get a "preview" of what the search yielded in the doc. (This is probably best handled outside the core code.)
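A minimal sketch of the extraction step described above, assuming the match position and the original text are both available. `extractPreview` and its `radius` parameter are hypothetical names used only for illustration and are not part of lunr; the `indexOf` lookup is a naive stand-in for a position returned by a real search.

```javascript
// Hypothetical helper, not part of lunr: builds a short preview around a match.
// `start` is the character offset of the match in `text`, `length` is the
// match length, and `radius` is how many characters of context to keep.
function extractPreview(text, start, length, radius) {
  var from = Math.max(0, start - radius)
  var to = Math.min(text.length, start + length + radius)

  // Only add an ellipsis on a side where text was actually cut off.
  var prefix = from > 0 ? '...' : ''
  var suffix = to < text.length ? '...' : ''

  return prefix + text.slice(from, to).trim() + suffix
}

var text = 'Professor Plumb has a green plant in his study'
var query = 'green plant'

// Naive exact-substring lookup, for illustration only.
var start = text.indexOf(query)
var preview = extractPreview(text, start, query.length, 7)
```

With a radius of 7 characters, `preview` comes out as `...has a green plant in his...`. A real implementation would probably snap to word boundaries rather than cut at a fixed character count.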

olivernn · 2013-12-03T20:16:07Z

Thanks for taking a look at this.

The way I see it there are a couple of things that need to change before this feature can be added. Trying to get the positions of tokens is a little harder than it first seems. You have to assume that by the time the incoming document has passed through the pipeline it bears little resemblance to the original text of the document. This is due to stemming, stop word removal or any other form of text processing the user has added to the pipeline.

This means that the start index of each token in the original string must be tracked as part of the pipeline, probably as the very first step. Currently tokens have no metadata attached to them as they pass through the pipeline. This would need to change, and the change is potentially incompatible with any pipeline function anyone else has created. There are two ways (that I can think of now) to attach this metadata to a token.
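To illustrate what tracking the start index at the very first step could look like, here is a rough sketch of a whitespace tokenizer that records each token's offset. `tokenizeWithPositions` and the `str`/`start` field names are assumptions made for this example; this is not lunr's tokenizer.

```javascript
// Hypothetical tokenizer sketch, not lunr's actual tokenizer: splits on
// whitespace and records where each token started in the original string.
function tokenizeWithPositions(text) {
  var tokens = []
  var pattern = /\S+/g
  var match

  while ((match = pattern.exec(text)) !== null) {
    tokens.push({
      str: match[0].toLowerCase(), // the text later pipeline steps would transform
      start: match.index           // offset in the original, untouched document
    })
  }

  return tokens
}

var tokens = tokenizeWithPositions('Professor Plumb has a green plant')
```

Because `start` is captured before any stemming or stop word removal, it still points into the original document no matter how much the token text changes later.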

Instead of just passing around strings, create a lunr.Token type that wraps this string. A lunr.Token could then have any amount of metadata, including its position, attached to it. Alternatively we could just take advantage of JavaScript's openness and add some properties to the existing token strings.

The first way, creating a lunr.Token type, seems like the Right Way™: tokens are a fairly important aspect of the library, so it makes sense to give them their own type. The downside is that it will require more change, both in lunr and in any user code in pipelines.

Just adding extra properties onto strings would allow us to attach the position metadata, but it feels a bit sloppy to me. On the upside, it would remain backwards compatible with any end-user code in the pipeline.
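The two options can be sketched side by side. Everything here is hypothetical: the `Token` constructor, its `update` method, and the `stripPlural` step are illustrative names, not an existing lunr API. One detail worth noting for the second option is that JavaScript does not keep properties assigned to primitive strings, so the tokens would have to be `String` objects.

```javascript
// Option 1: a hypothetical Token wrapper (the name and shape are assumptions).
function Token(str, metadata) {
  this.str = str
  this.metadata = metadata || {}
}

Token.prototype.toString = function () {
  return this.str
}

// Returns a new token with transformed text but the same metadata, so a
// pipeline step such as a stemmer cannot lose the original position.
Token.prototype.update = function (fn) {
  return new Token(fn(this.str), this.metadata)
}

// A toy pipeline step standing in for a real stemmer.
function stripPlural(token) {
  return token.update(function (s) { return s.replace(/s$/, '') })
}

var wrapped = stripPlural(new Token('plants', { start: 28 }))

// Option 2: properties on the string itself. This needs a String object,
// because properties assigned to a primitive string are not kept.
var boxed = new String('plants')
boxed.start = 28

// String methods return plain primitives, so `start` is gone after this call.
var stemmed = boxed.replace(/s$/, '')
```

In this sketch the wrapper carries `metadata.start` through the transformation, while the string-property version loses `start` as soon as an existing pipeline function returns a new string. Such functions would keep running unchanged, but the position would need to be copied over by hand.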

Currently I'm leaning more towards creating a lunr.Token type…

With the correct position of the matching token returned from the search, some additional code, perhaps a plugin for lunr, could then handle storing the original document, and then using the position to generate a useful preview of the search result.

I'd be very interested in your thoughts on this, or any other suggestions you might have.

gjtorikian · 2013-12-06T19:43:22Z

I think I follow what you're saying. You're basically against polluting the String.prototype namespace and in favor of generating another data structure to handle all the metadata, right?

While I think backwards compatibility is important, I think progress is more important (:grinning:), and not mucking around with a language's internals is best. A bump of a major revision should basically guarantee not to break < 0.4.x releases of lunr.js.

I'll see if I can sit down and figure out some way to achieve this. Thanks for your feedback!

gjtorikian added 4 commits December 2, 2013 17:18

Update documentation about running tests

83d82be

Include necessary node packages to run locally

dbf1d3c

Start tracking position of tokens

6fdd058

Revert accidental change to package.json

3da1054

cambridgemike mentioned this pull request Apr 1, 2014

Creates a TokenMetadataStore to return startPosition of tokens in results #79

Open

olivernn mentioned this pull request Jun 16, 2015

Get surrounding text for search. #158

Closed
