Reduction of computational requirements for generating C4_200M #2
Comments
Hi, we cannot provide a downloadable C4 2.2.1 version, but you can try simply using v3.0.1 instead. I could imagine that the difference between the versions is small enough that you would lose no more than a small fraction of the sentences, especially since we're only using a 200M subset. If you do try, please let us know here. Best, Felix
Hi, I tried with C4 v3.0.1, which resulted in 183894319 sentence pairs. Kind regards,
Thanks so much for checking, I've added a note to the readme.
Yes, I can confirm using v3.0.1 works fine.
I guess this is how the other people here did it, but I can confirm v3.0.1 works and the script for data generation If @fstahlberg deems it useful I can provide the modified script, but it almost feels like overkill (the only change is to change the for loop to read a json file...).
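For readers wondering what "change the for loop to read a json file" might look like, here is a minimal sketch. It is not the modified script mentioned above. It assumes the v3.0.1 shards are gzipped JSON-lines files with one record per line and a `text` field, as in the allennlp release; the function name `iter_c4_texts` and the default file pattern are hypothetical and should be adjusted to your local copy.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_c4_texts(shard_dir: str, pattern: str = "c4-train.*.json.gz") -> Iterator[str]:
    """Yield the "text" field of every record in the matching shards.

    Assumption: each shard is a gzipped JSON-lines file whose records
    carry the document under the key "text".
    """
    # Sort the shards so the iteration order is reproducible across runs.
    for shard in sorted(Path(shard_dir).glob(pattern)):
        with gzip.open(shard, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                yield json.loads(line)["text"]
```

The generator can then take the place of whatever iterable the original loop consumed, for example `for text in iter_c4_texts("/path/to/c4/en"): ...`, leaving the rest of the per-document processing unchanged.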
@Giovanni-Alzetta An additional
I would also like to know what changes to the loop are needed to get the target sentences. @Giovanni-Alzetta, @PrithivirajDamodaran, @palasso: could one of you please post the changes to get_target_sentences.py that yield the 183894319 sentence pairs? I would love to use the dataset myself.
I downloaded the TFDS version from allennlp (which costs about $100):
and modified line 28 of
Sorry for the delay, here is the script:
The last line should be:
Obtaining and generating the clean version of the C4 corpus version 2.2.1 seems to be computationally expensive.
Are there any plans for an alternative way to generate the C4_200M dataset?
For example, obtaining the clean C4 version 3.0.1 seems easier, as it is made available by allennlp. Alternatively, providing a downloadable clean C4 version 2.2.1 would also make this easier.
Thank you 🙂