
Reduction of computational requirements for generating C4_200M #2

Closed
palasso opened this issue Jun 14, 2021 · 10 comments

@palasso

palasso commented Jun 14, 2021

Obtaining and generating the clean version of the C4 corpus version 2.2.1 seems to be computationally expensive.

Are there any plans for an alternative way to generate the C4_200M dataset?

For example, obtaining the clean C4 version 3.0.1 seems easier since it is available from allennlp; alternatively, providing a downloadable clean C4 version 2.2.1 would also make this easier.

Thank you 🙂

@fstahlberg
Collaborator

Hi,

we cannot provide a downloadable C4 2.2.1 version, but you can try simply using v3.0.1 instead. I could imagine that the difference between versions is small enough that you lose no more than a small fraction of the sentences, especially since we're only using a 200M subset. If you do try, please let us know here.

Best, Felix

@palasso
Author

palasso commented Jun 26, 2021

Hi,

I tried with C4 v3.0.1, resulting in 183894319 sentence pairs.

Kind regards,
Vassilis

@palasso palasso closed this as completed Jun 26, 2021
@fstahlberg
Collaborator

Thanks so much for checking, I've added a note to the readme.

@PrithivirajDamodaran

Yes, I can confirm that using v3.0.1 works fine.

@Giovanni-Alzetta
Contributor

Giovanni-Alzetta commented Aug 19, 2021

I guess this is how the other people here did it, but I can confirm v3.0.1 works, and the data generation script c4200m_get_target_sentences.py requires only a few changes if you want to use the JSON files directly (allenai provides only those for free).

If @fstahlberg deems it useful I can provide the modified script, but it almost feels like overkill (the only change is modifying the for loop to read a JSON file...).
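For anyone who wants to adapt the loop before a modified script is available, here is a minimal sketch of reading the JSON shards directly. It assumes one JSON object per line with a `"text"` key inside gzip-compressed files, which is the layout of the publicly distributed C4 shards; `iter_c4_texts` and the shard path are illustrative, not part of the repository's code:

```python
import gzip
import json

def iter_c4_texts(shard_path):
    """Yield the "text" field from one C4 .json.gz shard.

    Assumes one JSON object per line with a "text" key, as in the
    shards distributed by allenai (e.g. c4-train.00000-of-01024.json.gz).
    """
    with gzip.open(shard_path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["text"]
```

The rest of the pipeline can then consume this generator in place of the TFDS iterator.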

@fstahlberg
Collaborator

@Giovanni-Alzetta An additional get_target_sentences script that grabs the sentences from allenai is definitely useful. It would be great if you could make a PR - thanks!

@5hogun-Ormerod

I would also like to know what the changes to the loop are in order to get the target sentences. Could @Giovanni-Alzetta, @PrithivirajDamodaran, or @palasso please post the changes to get_target_sentences.py needed to produce the required 183894319 sentence pairs? I would love to use the dataset myself.

@palasso
Author

palasso commented Aug 25, 2021

I downloaded the TFDS version from allennlp (which costs about $100):

```shell
mkdir -p local_datasets_dir/c4/en/3.0.1/
gsutil -u <google-cloud-project-name> -m cp 'gs://allennlp-tensorflow-datasets/c4/en/3.0.1/*' local_datasets_dir/c4/en/3.0.1/
```

and modified line 28 of c4200m_get_target_sentences.py accordingly:

```python
tfds.load("c4/en", download=False, data_dir="local_datasets_dir/c4/en/3.0.1/", split="train")
```

@Giovanni-Alzetta
Contributor

Sorry for the delay, here is the script:
#3
I actually changed some stuff so I'm running it again on an edit file just for a final check, but it should be fine.

@ayaka14732

ayaka14732 commented Nov 12, 2021

@palasso

> I downloaded the TFDS version from allennlp (which costs about $100):
>
> ```shell
> mkdir -p local_datasets_dir/c4/en/3.0.1/
> gsutil -u <google-cloud-project-name> -m cp 'gs://allennlp-tensorflow-datasets/c4/en/3.0.1/*' local_datasets_dir/c4/en/3.0.1/
> ```
>
> and modified line 28 of c4200m_get_target_sentences.py accordingly:
>
> ```python
> tfds.load("c4/en", download=False, data_dir="local_datasets_dir/c4/en/3.0.1/", split="train")
> ```

The last line should be:

```python
tfds.load("c4/en", download=False, data_dir="local_datasets_dir/", split="train")
```
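The distinction matters because tfds treats `data_dir` as the root of its dataset tree and appends the dataset name and version itself. A small sketch of that path assumption (`resolved_dataset_path` is a hypothetical helper for illustration, not a tfds API):

```python
import os

def resolved_dataset_path(data_dir, name="c4/en", version="3.0.1"):
    # tfds looks for the data under <data_dir>/<name>/<version>,
    # so data_dir must be the directory *above* "c4/",
    # not the version folder itself.
    return os.path.join(data_dir, name, version)
```

With `data_dir="local_datasets_dir/"` this resolves to `local_datasets_dir/c4/en/3.0.1`, which matches where the gsutil command placed the files.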
