data_preparation

This directory provides utilities to create a Cartoonizer dataset for InstructPix2Pix-like training.

Steps

We randomly sampled 5000 images from the train set of ImageNette to serve as the original images. To derive their cartoonized renditions, we used the Whitebox Cartoonizer model. To derive the instructions.txt file, we used ChatGPT with the following prompt:

Provide at least 50 synonymous sentences for the following instruction: "Cartoonize the following image."
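As a rough illustration of the sampling described above, the sketch below picks a random subset of images and pairs each with a randomly chosen instruction. It is not the code in generate_dataset.py: the function name `sample_with_instructions`, the image extensions, the one-instruction-per-line layout of instructions.txt, and the fixed seed are all assumptions made for the example.

```python
import random
from pathlib import Path


def sample_with_instructions(image_dir, instructions_file, max_num_samples=5000, seed=0):
    """Pick up to `max_num_samples` images at random and pair each with a random instruction.

    Hypothetical helper for illustration; not part of this repository.
    """
    rng = random.Random(seed)

    # Collect candidate images (the extension list is an assumption).
    image_paths = sorted(
        p
        for p in Path(image_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg", ".png"}
    )

    # Assume one instruction per non-empty line.
    instructions = [
        line.strip()
        for line in Path(instructions_file).read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    if not instructions:
        raise ValueError("instructions file is empty")

    # Sample without replacement, capped at the number of available images.
    chosen = rng.sample(image_paths, k=min(max_num_samples, len(image_paths)))
    return [(path, rng.choice(instructions)) for path in chosen]
```

Each returned pair holds an original image path and the instruction text that would accompany its cartoonized version.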

Dataset preparation is divided into three steps:

Step 0: Install dependencies

pip install -q -r requirements.txt

Step 1: Obtain the image-cartoon pairs

python generate_dataset.py

If you want to use more than 5000 samples, specify the --max_num_samples option. Once the image-cartoon pairs are generated, you should see a directory called cartoonizer-dataset (unless you specified a different one via --data_root).
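To sanity-check the output of Step 1, a small script along the following lines could count the generated image files per subfolder. The exact layout that generate_dataset.py writes is not documented here, so this sketch assumes only that the directory contains image files; the helper name `summarize_dataset_dir` and the extension list are hypothetical.

```python
from collections import Counter
from pathlib import Path


def summarize_dataset_dir(data_root="cartoonizer-dataset"):
    """Count image files per parent folder (relative to `data_root`).

    Hypothetical helper for illustration; assumes nothing about the
    folder layout beyond "it contains image files".
    """
    root = Path(data_root)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} does not exist; did generate_dataset.py finish?")

    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
            # Key by the containing folder, e.g. "." for files directly under the root.
            counts[path.parent.relative_to(root).as_posix()] += 1
    return dict(counts)
```

If you passed a custom --data_root, pass the same path to the helper.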

Step 2: Export the dataset to 🤗 Hub

For this step, you need to be authorized to access your Hugging Face account. Run the following command to do so:

huggingface-cli login

Then run:

python export_to_hub.py

Warning

Please ensure that an empty DS_NAME dataset was created on the Hub first. Instructions on how to do that are here.
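One possible way to create the empty dataset repository programmatically is `create_repo` from the `huggingface_hub` library, as sketched below. The wrapper `ensure_dataset_repo` and its `create_repo_fn` parameter are hypothetical and exist only to keep the example easy to exercise; the repo id is a placeholder you would replace with your own, and the call assumes you have already logged in.

```python
def ensure_dataset_repo(repo_id, create_repo_fn=None):
    """Create an empty dataset repo on the Hub if it does not already exist.

    Hypothetical wrapper for illustration; not part of this repository.
    """
    if create_repo_fn is None:
        # Imported lazily so the helper can be tried without the library installed.
        from huggingface_hub import create_repo as create_repo_fn

    # repo_type="dataset" targets a dataset repo rather than a model repo;
    # exist_ok=True avoids an error when the repo is already there.
    return create_repo_fn(repo_id, repo_type="dataset", exist_ok=True)
```

For example, `ensure_dataset_repo("your-username/DS_NAME")` would be run once before `python export_to_hub.py`.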

You can find a mini dataset here: