
Reduction of computational requirements for generating C4_200M #2

Closed
palasso opened this issue Jun 14, 2021 · 10 comments

@palasso

palasso commented Jun 14, 2021

Obtaining and generating the clean version of the C4 corpus version 2.2.1 seems to be computationally expensive.

Are there any plans for an alternative way to generate the C4_200M dataset?

For example, obtaining the clean C4 version 3.0.1 seems easier since it is available from allennlp; alternatively, providing a downloadable clean C4 version 2.2.1 would also make this easier.

Thank you 🙂

@fstahlberg
Collaborator

Hi,

we cannot provide a downloadable C4 2.2.1 version, but you can try simply using v3.0.1 instead. I could imagine that the difference between versions is small enough that you lose no more than a small fraction of the sentences, especially since we're only using a 200M subset. If you do try, please let us know here.

Best, Felix

@palasso
Author

palasso commented Jun 26, 2021

Hi,

I tried with C4 v3.0.1, resulting in 183894319 sentence pairs.

Kind regards,
Vassilis

@palasso palasso closed this as completed Jun 26, 2021
@fstahlberg
Collaborator

Thanks so much for checking, I've added a note to the readme.

@PrithivirajDamodaran

Yes, I can confirm that using v3.0.1 works fine.

@Giovanni-Alzetta
Contributor

Giovanni-Alzetta commented Aug 19, 2021

I guess this is how the other people here did it, but I can confirm v3.0.1 works, and the data generation script c4200m_get_target_sentences.py requires only a few changes if you want to use the JSON files directly (allenai provides only those for free).

If @fstahlberg deems it useful I can provide the modified script, but it almost feels like overkill (the only change is modifying the for loop to read a JSON file...).
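For anyone who wants to adapt the loop before a modified script is available, here is a minimal sketch of reading the JSON shards directly. It assumes one JSON object per line with a `"text"` key inside gzip-compressed files, which is the layout of the publicly distributed C4 shards; `iter_c4_texts` and the shard path are illustrative, not part of the repository's code:

```python
import gzip
import json

def iter_c4_texts(shard_path):
    """Yield the "text" field from one C4 .json.gz shard.

    Assumes one JSON object per line with a "text" key, as in the
    shards distributed by allenai (e.g. c4-train.00000-of-01024.json.gz).
    """
    with gzip.open(shard_path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["text"]
```

The rest of the pipeline can then consume this generator in place of the TFDS iterator.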

@fstahlberg
Collaborator

@Giovanni-Alzetta An additional get_target_sentences script that grabs the sentences from allenai is definitely useful. It would be great if you could make a PR - thanks!

@5hogun-Ormerod

I would also like to know what the changes to the loop are in order to get the target sentences. Could @Giovanni-Alzetta, @PrithivirajDamodaran, or @palasso please post the changes to get_target_sentences.py needed to produce the required 183894319 sentence pairs? I would love to use the dataset myself.

@palasso
Author

palasso commented Aug 25, 2021

I downloaded the TFDS version from allennlp (which costs about $100):

```shell
mkdir -p local_datasets_dir/c4/en/3.0.1/
gsutil -u <google-cloud-project-name> -m cp 'gs://allennlp-tensorflow-datasets/c4/en/3.0.1/*' local_datasets_dir/c4/en/3.0.1/
```

and modified line 28 of c4200m_get_target_sentences.py accordingly:

```python
tfds.load("c4/en", download=False, data_dir="local_datasets_dir/c4/en/3.0.1/", split="train")
```

@Giovanni-Alzetta
Contributor

Sorry for the delay, here is the script:
#3
I actually changed some stuff so I'm running it again on an edit file just for a final check, but it should be fine.

@ayaka14732

ayaka14732 commented Nov 12, 2021

@palasso

> I downloaded the TFDS version from allennlp (which costs about $100):
>
> ```shell
> mkdir -p local_datasets_dir/c4/en/3.0.1/
> gsutil -u <google-cloud-project-name> -m cp 'gs://allennlp-tensorflow-datasets/c4/en/3.0.1/*' local_datasets_dir/c4/en/3.0.1/
> ```
>
> and modified line 28 of c4200m_get_target_sentences.py accordingly:
>
> ```python
> tfds.load("c4/en", download=False, data_dir="local_datasets_dir/c4/en/3.0.1/", split="train")
> ```

The last line should be:

```python
tfds.load("c4/en", download=False, data_dir="local_datasets_dir/", split="train")
```
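The distinction matters because tfds treats `data_dir` as the root of its dataset tree and appends the dataset name and version itself. A small sketch of that path assumption (`resolved_dataset_path` is a hypothetical helper for illustration, not a tfds API):

```python
import os

def resolved_dataset_path(data_dir, name="c4/en", version="3.0.1"):
    # tfds looks for the data under <data_dir>/<name>/<version>,
    # so data_dir must be the directory *above* "c4/",
    # not the version folder itself.
    return os.path.join(data_dir, name, version)
```

With `data_dir="local_datasets_dir/"` this resolves to `local_datasets_dir/c4/en/3.0.1`, which matches where the gsutil command placed the files.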
