Cord-19 index and result #1153
As a first debugging step, why don't you start with the pre-built indexes? After that we'll need more details: what OS, Java version, etc.
I tried to get the pre-built indexes, but it seems Dropbox is not reachable from mainland China, even over a VPN. So I built the indexes like this:

```sh
DATE=2020-05-01
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/comm_use_subset.tar.gz -P "${DATA_DIR}"
sh target/appassembler/bin/IndexCollection
```

and ran retrieval like this:

```sh
target/appassembler/bin/SearchCollection -index lucene-index-covid-paragraph-2020-05-01
```

My OS is Ubuntu 16.04, Java 11.0.6.
Well, you're trying to evaluate retrieval against the 5/1 corpus using qrels from the 4/10 corpus, so of course your numbers are going to be lower. TREC-COVID round 1 is against the 4/10 corpus, so to replicate the results you'll have to use that.
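One quick way to spot this kind of corpus/qrels mismatch is to check what fraction of your top-ranked docids are judged at all. This is a toy sketch with inline data, not the actual TREC-COVID qrels or run files:

```python
# Toy qrels keyed by (topic, docid), as in TREC qrels files, built against
# the 4/10 corpus snapshot. Values are relevance judgments.
qrels = {("1", "doc-a"): 2, ("1", "doc-b"): 1, ("1", "doc-c"): 0}

# Toy run produced against the 5/1 corpus: some retrieved docids
# (doc-x, doc-y) don't exist in the 4/10 qrels, so they count as unjudged.
run = {"1": ["doc-a", "doc-x", "doc-y"]}

for topic, docids in run.items():
    judged = sum(1 for d in docids if (topic, d) in qrels)
    print(f"topic {topic}: {judged}/{len(docids)} retrieved docs are judged")
```

A low judged fraction is a strong hint that the run and the qrels were built against different corpus snapshots.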
Sorry, I didn't make it clear.
Our indexes are mirrored here: https://git.uwaterloo.ca/jimmylin/cord19-indexes. Can you try with the pre-built indexes?
I think I understand the issue now. After constructing the pre-built indexes, I manually did some data cleaning to blacklist a few outlier documents; see: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/generator/Cord19Generator.java#L95 For example: 37491d1

Note that this was done independently of search results; in fact, you can see from the commit id that my manual cleaning predates the release of the round 1 results. Thus, if you use the latest HEAD to go back and index the corpus from 4/10, you'll get slightly different document counts. This changes term and document statistics slightly, apparently enough to have an impact on effectiveness. Small changes, though. Hope this clarifies things.

To be clear, this explains:
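The effect described above — a handful of blacklisted documents shifting term and document statistics, and hence every score slightly — can be illustrated with a toy sketch. The corpus numbers here are made up and the idf formula is the Lucene-style BM25 idf, used only for illustration:

```python
import math

def bm25_idf(N, df):
    # Lucene-style BM25 idf: ln(1 + (N - df + 0.5) / (df + 0.5)),
    # where N is the corpus size and df the term's document frequency.
    return math.log(1 + (N - df + 0.5) / (df + 0.5))

# Hypothetical corpus: 1000 docs; some term appears in 50 of them.
N, df = 1000, 50
before = bm25_idf(N, df)

# Blacklist 5 outlier docs, 2 of which contained the term.
after = bm25_idf(N - 5, df - 2)

print(f"idf before cleaning: {before:.4f}")
print(f"idf after cleaning:  {after:.4f}")
# The idf (along with avgdl and document counts) shifts a little,
# which nudges every BM25 score and can move nDCG@10 slightly.
```

The per-term shift is tiny, but aggregated over every query term and every document it is enough to change reported effectiveness in the third decimal place or so.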
I followed the instructions exactly to build the CORD-19 index and run retrieval. The nDCG@10 (query-udel, round 1, 04-10) on the full-text index is 0.4996, but the claimed value is 0.5407. Why?
Also, when I build the paragraph index (05-01), only 1.72M documents are indexed, while 1.76M is claimed.
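For reference, nDCG@10 compares the discounted gain of the ranking against the ideal ordering of the judged documents, so a judged-relevant document that is missing from the indexed corpus directly lowers the score. A minimal self-contained sketch with toy judgments (not the actual TREC-COVID qrels):

```python
import math

def dcg_at_k(gains, k=10):
    # DCG: gain discounted by log2 of the (1-based) rank + 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, all_gains, k=10):
    ideal = sorted(all_gains, reverse=True)
    return dcg_at_k(ranked_gains, k) / dcg_at_k(ideal, k)

# Toy qrels: four relevant docs with graded judgments.
qrels = {"d1": 2, "d2": 2, "d3": 1, "d4": 1}

# Run over the matching corpus snapshot retrieves every judged doc.
run_full = ["d1", "d2", "d3", "d4"]

# Same system over a later snapshot where d2 was removed: a
# judged-relevant doc can no longer be retrieved (dX is unjudged).
run_later = ["d1", "d3", "d4", "dX"]

gains = lambda run: [qrels.get(d, 0) for d in run]
all_gains = list(qrels.values())

print(ndcg_at_k(gains(run_full), all_gains))            # 1.0
print(round(ndcg_at_k(gains(run_later), all_gains), 4))
```

Losing a single highly relevant document from the corpus drops this toy query's nDCG@10 by roughly a quarter; spread over 30 topics, even small corpus differences show up in the averaged metric.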