Add RVL-CDIP dataset #4050
The documentation is not available anymore as the PR was closed or merged.
Thanks! I've left a few comments.
You can use this URL to avoid manual download: https://drive.google.com/uc?export=download&id=0Bz1dfcnrpXM-MUt4cHNzUEFXcmc
Also, since the data files are TAR archives, we need to implement both the streamable (can't use os.path.join) and the non-streamable versions of _generate_examples (see this script for instance). The labels-only data file URL doesn't work for me, so feel free to ask the authors whether they are OK with us hosting the file on the Hub/S3 (to speed up the streamable version). Let me know if you need help implementing this part.
License as other for custom license. Co-authored-by: Mario Šaško <mario@huggingface.co>
Add leaderboard details. Co-authored-by: Mario Šaško <mario@huggingface.co>
Add data collection info. Co-authored-by: Mario Šaško <mario@huggingface.co>
Co-authored-by: Mario Šaško <mario@huggingface.co>
Co-authored-by: Mario Šaško <mario@huggingface.co>
Co-authored-by: Mario Šaško <mario@huggingface.co>
Fix links for the license. Co-authored-by: Mario Šaško <mario@huggingface.co>
Add bibtex tag. Co-authored-by: Mario Šaško <mario@huggingface.co>
…tasets into add_rvl-cdip_dataset
Thanks a lot for the inputs. I'll use the suggested URL and check.
Sure. I will check the reference and try this out, will get back to you if I face any issues.
Just checked. The author (Adam Harley) has responded positively and allowed us to host the file. Should I share the file with you so it can be hosted on the Hub/S3?
Yes, feel free to e-mail me the file. Then I'll create a repo under my namespace and push the file there. We run a GH action on a GH dataset after merging to create its repo on the Hub, so after this PR is merged, I'll push the file to the "official" namespace and update the download link.
For some reason, the direct download doesn't seem to work for me even with this URL.
And this directory does not exist. Am I doing something wrong?
Got it. I've sent you an email with the file. Thank you.
Actually, this URL works for direct download:
I'm working on the streamable version of _generate_examples as well and will update you on this.
Google Drive is a tricky host, and it's easy to exceed daily download quota limits, so if we are allowed to host the |
Just checked, the author has agreed. He mentioned that he had complaints about the GDrive link.
I have added :
Would it be possible for you to upload the train.txt, test.txt, and val.txt files separately to the Hub instead of the labels_only.tar.gz file?
The rvl-cdip.tar.gz archive and txt files with the labels are on the Hub!
Some minor nits, and we can merge then.
Co-authored-by: Mario Šaško <mario@huggingface.co>
add tasks based on new taxonomy. Co-authored-by: Mario Šaško <mario@huggingface.co>
Co-authored-by: Mario Šaško <mario@huggingface.co>
I've removed the extraction of the archive file locally as suggested. Let me know if any other changes are required. :)
I've pushed minor fixes/improvements to the card and the script. It all looks good now. Good job!
Thanks a lot @dnaveenr ! And thanks @mariosasko for updating the URLs
It looks all good to me :)
feel free to merge when the CI is green
The check for Update Hub repositories / update-hub-repositories has failed.
Hi! Thanks for reporting ;) Yes, this CI job has been failing for a few days. I'm working on fixing it, and I'm manually running it on my side in the meantime.
Great. :D Thank you @lhoestq
Resolves #2762
Dataset Request : Add RVL-CDIP dataset #2762
This PR adds the RVL-CDIP dataset.
The dataset is distributed through a Google Drive link and wasn't getting downloaded automatically, so I have provided manual_download_instructions.
I need input on how to run the real data and dummy data tests for datasets with manual download.
Inputs and suggestions for improvement are welcome. Thank you.