New MS MARCO (V1) doc regressions #1721
Comments
Yes, all these look correct to me!
I've split into separate files and repackaged as follows, on
$ ls -l /store/collections/msmarco/*.tar
-rw-r--r-- 1 jimmylin jimmylin 10088448000 Jan 7 12:45 /store/collections/msmarco/msmarco-doc-docTTTTTquery.tar
-rw-r--r-- 1 jimmylin jimmylin 7821808128 Jan 7 10:58 /store/collections/msmarco/msmarco-doc-segmented-docTTTTTquery.tar
-rw-r--r-- 1 jimmylin jimmylin 6222943744 Jan 7 11:31 /store/collections/msmarco/msmarco-doc-segmented.tar
-rw-r--r-- 1 jimmylin jimmylin 8464093803 Jan 7 12:15 /store/collections/msmarco/msmarco-doc.tar
$ md5sum /store/collections/msmarco/*.tar
2f2debe5478cbf034e9c19603003060f /store/collections/msmarco/msmarco-doc-docTTTTTquery.tar
c26e9ff07bf2ad9f377e5373b520fa04 /store/collections/msmarco/msmarco-doc-segmented-docTTTTTquery.tar
d46d4cf3fb47b6dfc50b37463dabe0a2 /store/collections/msmarco/msmarco-doc-segmented.tar
2d36ae5e632a4b75d633dcb5c5a87b82 /store/collections/msmarco/msmarco-doc.tar
Recording the checksums for future reference. I have confirmed that the packaging is correct by uncompressing the collection and then running:
$ cat docs-* | md5
I've confirmed that the checksums of the single-file corpora match above.
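For anyone who wants to repeat these checks without shell tools, here is a minimal Python sketch of both steps: hashing a single tarball, and hashing the concatenation of the unpacked docs-* files. The function names (md5_of_file, md5_of_concatenation, verify) are made up for illustration and are not part of Anserini. The sketch assumes the shell expands docs-* in plain name order, which can vary with locale.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Chunked reads keep memory use flat even for multi-GB tarballs.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_of_concatenation(directory, pattern="docs-*", chunk_size=1 << 20):
    """Return the hex MD5 digest of all matching files, concatenated in
    name order. This approximates `cat docs-* | md5` (assumption: the
    shell glob sorts the same way as Python's string comparison)."""
    digest = hashlib.md5()
    for path in sorted(Path(directory).glob(pattern), key=lambda p: p.name):
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 digest matches the recorded one."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

With the values recorded above, a check would look like verify("/store/collections/msmarco/msmarco-doc.tar", "2d36ae5e632a4b75d633dcb5c5a87b82").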
Okay, final steps, I've confirmed that the regressions run successfully on
In the first block, we're actually building the index. DL19 and DL20 use the same indexes, so for those we're just performing retrieval.
More work on the reproducibility issue described in https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc-doc2query-details.md
There will be a forthcoming PR swapping in new "ground truth" for MS MARCO (V1) doc regressions. The segmentation has been fixed, and expansion now uses JSON-formatted data.
For reference, these are the final source ground truth corpora:
These were generated by @ronakice on orca. @ronakice please confirm this is correct.