Pre-processing code for missing datasets #86

Closed
memray opened this issue Mar 30, 2022 · 6 comments

Comments

@memray
memray commented Mar 30, 2022

Hi there,

I wonder if the preprocessing code for the datasets that are not distributed directly could be released. For example, I downloaded bioasq and signal1m following the instructions, but it's not clear to me how to convert the raw data to corpus.jsonl, queries.jsonl, qrels/{train/dev/test}.tsv the same way you did. I think this is critical for reproducing your results and for fair benchmarking.

Thank you,
Rui
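
For readers following the thread, the target layout named above can be sketched as follows. This is only a minimal sketch of the standard BEIR file layout (`corpus.jsonl`, `queries.jsonl`, `qrels/<split>.tsv`), not the maintainers' preprocessing code: the function name `write_beir_dataset` and the shape of the input dictionaries are assumptions made for illustration, and parsing the raw bioasq or signal1m files into those dictionaries is not covered here.

```python
import csv
import json
from pathlib import Path


def write_beir_dataset(out_dir, docs, queries, qrels, split="test"):
    """Write already-parsed data in the BEIR file layout.

    Hypothetical input shapes (assumptions, not from the thread):
      docs:    {doc_id: {"title": str, "text": str}}
      queries: {query_id: str}
      qrels:   {query_id: {doc_id: int_relevance}}
    """
    out = Path(out_dir)
    (out / "qrels").mkdir(parents=True, exist_ok=True)

    # corpus.jsonl: one JSON object per line with _id, title and text.
    with open(out / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc_id, doc in docs.items():
            row = {"_id": doc_id, "title": doc.get("title", ""), "text": doc["text"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    # queries.jsonl: one JSON object per line with _id and text.
    with open(out / "queries.jsonl", "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}, ensure_ascii=False) + "\n")

    # qrels/<split>.tsv: tab-separated, with a header row.
    with open(out / "qrels" / f"{split}.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        for query_id, judged in qrels.items():
            for doc_id, score in judged.items():
                writer.writerow([query_id, doc_id, score])
```

The dataset-specific work (for example, extracting text from the raw files) would happen before this step and determines whether results match the published ones.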

@thakur-nandan
Member

thakur-nandan commented Mar 30, 2022

Hi @memray,

You can send an email to nandant@gmail.com, and I can send you the datasets privately.
Please note that you are responsible for accepting the licenses for all private datasets.

Kind Regards,
Nandan Thakur

@memray
Author

memray commented Mar 30, 2022

Thanks, @NThakur20! I'll reach out to you via email later.

@memray memray closed this as completed Mar 30, 2022
@gsgoncalves

Hi @NThakur20,
I sent an email as well.
It would be great if you could share the details for the html2text initialization for TREC News, and which Anserini tweet indexing options were used for Signal AI.
Thanks!

@Cyril-JZ

Hi @thakur-nandan,

I also sent an email to kindly request access to the TREC-News dataset TGZ files. I assure you that the dataset will be used solely for academic purposes. I greatly appreciate your assistance and support!

Thanks!
im.jzfeng@gmail.com

@thakur-nandan
Member

Hi @gsgoncalves and @Cyril-JZ, you can find the private BEIR datasets here (all preprocessed): https://drive.google.com/drive/folders/1CgDO-KmQQMpGEGeD3R20ZgTTM008xix9?usp=sharing.

Hope it helps!

Thanks!

@Cyril-JZ

Thanks for your prompt reply! It helps me a lot!
