LuceneSearcher.from_prebuilt_index returns empty contents #1250

joelrorseth · 2022-08-15T20:13:00Z

Hi there, I'm using one of the pre-built indexes via LuceneSearcher.from_prebuilt_index('msmarco-passage'). I noticed that the contents field of each JLuceneSearcherResult returned by the search() method is not populated. The raw field, however, is populated (and appears to contain the entire contents encoded as JSON).

Do the prebuilt indexes store the full document contents? If so, is there any way to access them without decoding the raw JSON myself? I noticed there is a -storeContents arg for the actual indexer, but I figure that isn't exposed in the pre-built methods.

Thanks!

Answered by lintool


lintool · 2022-08-16T11:45:29Z

Maintainer

raw stores the raw document in its original format; contents stores the "parsed" document. So, for example, raw might give the original HTML doc, and contents provides what's actually indexed after tag cleanup. Thus, you're always able to reconstruct contents from raw (i.e., just re-parse the document), but not vice versa. For this reason, we only store raw in the prebuilt indexes.

In this case, you get contents from raw by parsing out the JSON and pulling out the right field.
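
As a rough sketch of that parsing step (not an official recipe): the helper names contents_from_raw and search_with_contents below are made up for illustration, and the code assumes the raw JSON for msmarco-passage carries a "contents" key and that each hit exposes a raw field as described in this thread. Other Pyserini versions may expose raw differently, so treat the attribute access as an assumption.

```python
import json
from typing import Optional


def contents_from_raw(raw: str, field: str = "contents") -> Optional[str]:
    """Parse a hit's raw JSON string and return one field, or None if it is absent."""
    return json.loads(raw).get(field)


def search_with_contents(query: str, k: int = 10):
    """Hypothetical helper: run a search and recover contents from each hit's raw field."""
    # Imported lazily so contents_from_raw can be used without Pyserini installed.
    from pyserini.search.lucene import LuceneSearcher

    searcher = LuceneSearcher.from_prebuilt_index("msmarco-passage")
    # Assumption: hit.raw holds the JSON string, as described in this thread.
    return [
        (hit.docid, hit.score, contents_from_raw(hit.raw))
        for hit in searcher.search(query, k=k)
    ]


# Stand-in for hit.raw, so the parsing step can be tried offline.
sample_raw = '{"id": "0", "contents": "An example passage."}'
print(contents_from_raw(sample_raw))
```

Keeping the JSON parsing in its own small function means it can be checked without downloading the index; search_with_contents only adds the Pyserini call around it.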

If you want an index with contents but not raw, you'll have to build a fresh index yourself.

Hope this helps!

1 reply

joelrorseth Aug 16, 2022
Author

Thanks @lintool, that clears things up!
