Cord-19 index and result #1153
As a first debugging step, why don't you start with the pre-built indexes? After that we'll need more details: what OS, Java version, etc.
I tried to get the pre-built indexes, but it seems Dropbox is not reachable from mainland China, even over a VPN. So I built the indexes like this:

```sh
DATE=2020-05-01
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/comm_use_subset.tar.gz -P "${DATA_DIR}"
sh target/appassembler/bin/IndexCollection
```

and ran retrieval like this:

```sh
target/appassembler/bin/SearchCollection -index lucene-index-covid-paragraph-2020-05-01
```

My OS is Ubuntu 16.04, Java 11.0.6.
Well, you're trying to evaluate retrieval against the 5/1 corpus using qrels from the 4/10 corpus, so of course your numbers are going to be lower. TREC-COVID round 1 is against the 4/10 corpus, so to replicate the results you'll have to use that.
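One quick way to spot this kind of corpus/qrels mismatch is to check what fraction of your top-ranked docids are judged at all. This is a toy sketch with inline data, not the actual TREC-COVID qrels or run files:

```python
# Toy qrels keyed by (topic, docid), as in TREC qrels files, built against
# the 4/10 corpus snapshot. Values are relevance judgments.
qrels = {("1", "doc-a"): 2, ("1", "doc-b"): 1, ("1", "doc-c"): 0}

# Toy run produced against the 5/1 corpus: some retrieved docids
# (doc-x, doc-y) don't exist in the 4/10 qrels, so they count as unjudged.
run = {"1": ["doc-a", "doc-x", "doc-y"]}

for topic, docids in run.items():
    judged = sum(1 for d in docids if (topic, d) in qrels)
    print(f"topic {topic}: {judged}/{len(docids)} retrieved docs are judged")
```

A low judged fraction is a strong hint that the run and the qrels were built against different corpus snapshots.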
Sorry, I didn't make it clear.
Our indexes are mirrored here: https://git.uwaterloo.ca/jimmylin/cord19-indexes. Can you try with the pre-built indexes?
I think I understand the issue now. After constructing the pre-built indexes, I manually did some data cleaning to blacklist a few outlier documents; see: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/generator/Cord19Generator.java#L95 For example: 37491d1

Note that this was done independently of search results; in fact, you can see from the commit id that my manual cleaning predates the release of the round 1 results. Thus, if you use the latest HEAD to go back and index the corpus from 4/10, you'll get slightly different document counts. This changes term and document statistics slightly, apparently enough to have an impact on effectiveness. Small changes, though. Hope this clarifies things.

To be clear, this explains:
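The effect described above — a handful of blacklisted documents shifting term and document statistics, and hence every score slightly — can be illustrated with a toy sketch. The corpus numbers here are made up and the idf formula is the Lucene-style BM25 idf, used only for illustration:

```python
import math

def bm25_idf(N, df):
    # Lucene-style BM25 idf: ln(1 + (N - df + 0.5) / (df + 0.5)),
    # where N is the corpus size and df the term's document frequency.
    return math.log(1 + (N - df + 0.5) / (df + 0.5))

# Hypothetical corpus: 1000 docs; some term appears in 50 of them.
N, df = 1000, 50
before = bm25_idf(N, df)

# Blacklist 5 outlier docs, 2 of which contained the term.
after = bm25_idf(N - 5, df - 2)

print(f"idf before cleaning: {before:.4f}")
print(f"idf after cleaning:  {after:.4f}")
# The idf (along with avgdl and document counts) shifts a little,
# which nudges every BM25 score and can move nDCG@10 slightly.
```

The per-term shift is tiny, but aggregated over every query term and every document it is enough to change reported effectiveness in the third decimal place or so.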
I followed the instructions exactly to build the CORD-19 index and run retrieval. The nDCG@10 (query-udel, round 1, 04-10) on the full-text index is 0.4996, but the claimed value is 0.5407. Why?
Also, when I build the paragraph index (05-01), only 1.72M documents are indexed, while 1.76M is claimed.
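For reference, nDCG@10 compares the discounted gain of the ranking against the ideal ordering of the judged documents, so a judged-relevant document that is missing from the indexed corpus directly lowers the score. A minimal self-contained sketch with toy judgments (not the actual TREC-COVID qrels):

```python
import math

def dcg_at_k(gains, k=10):
    # DCG: gain discounted by log2 of the (1-based) rank + 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, all_gains, k=10):
    ideal = sorted(all_gains, reverse=True)
    return dcg_at_k(ranked_gains, k) / dcg_at_k(ideal, k)

# Toy qrels: four relevant docs with graded judgments.
qrels = {"d1": 2, "d2": 2, "d3": 1, "d4": 1}

# Run over the matching corpus snapshot retrieves every judged doc.
run_full = ["d1", "d2", "d3", "d4"]

# Same system over a later snapshot where d2 was removed: a
# judged-relevant doc can no longer be retrieved (dX is unjudged).
run_later = ["d1", "d3", "d4", "dX"]

gains = lambda run: [qrels.get(d, 0) for d in run]
all_gains = list(qrels.values())

print(ndcg_at_k(gains(run_full), all_gains))            # 1.0
print(round(ndcg_at_k(gains(run_later), all_gains), 4))
```

Losing a single highly relevant document from the corpus drops this toy query's nDCG@10 by roughly a quarter; spread over 30 topics, even small corpus differences show up in the averaged metric.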