This directory provides utilities to create a Cartoonizer dataset for InstructPix2Pix-like training.
We used 5000 randomly sampled images from the train set of ImageNette as the original images.
To derive their cartoonized renditions, we used the Whitebox Cartoonizer model.
To derive the instructions.txt file, we used ChatGPT. In particular, we used the following prompt:
Provide at least 50 synonymous sentences for the following instruction: "Cartoonize the following image."
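As a minimal sketch of how the resulting instructions.txt might be consumed, the snippet below loads the instructions and picks one at random for an image. It assumes the file holds one instruction per line. The function names `load_instructions` and `pick_instruction` are hypothetical and are not taken from generate_dataset.py.

```python
import random
from pathlib import Path


def load_instructions(path):
    """Read non-empty, stripped lines from an instructions file.

    Assumption: one instruction per line (the actual file layout may differ).
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def pick_instruction(instructions, rng=None):
    """Return one instruction chosen uniformly at random."""
    if not instructions:
        raise ValueError("no instructions available")
    if rng is None:
        rng = random  # fall back to the module-level generator
    return rng.choice(instructions)
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible across runs.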
Dataset preparation is divided into three steps:
pip install -q -r requirements.txt
python generate_dataset.py
If you want to use more than 5000 samples, specify the --max_num_samples
option. Once the image-cartoon pairs are generated, you should see a directory called cartoonizer-dataset
(unless you specified a different one via --data_root).
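The effect of capping the number of samples can be illustrated with a small sketch of reproducible random sampling. This is an assumption about the general approach, not the code in generate_dataset.py. The helper `sample_indices` and its `seed` parameter are hypothetical.

```python
import random


def sample_indices(dataset_size, max_num_samples=5000, seed=0):
    """Pick up to max_num_samples distinct indices, reproducibly.

    Hypothetical helper: the real script may sample differently.
    """
    if dataset_size < 0 or max_num_samples < 0:
        raise ValueError("sizes must be non-negative")
    # Never request more samples than the dataset contains.
    k = min(dataset_size, max_num_samples)
    rng = random.Random(seed)  # local generator, so global state is untouched
    return sorted(rng.sample(range(dataset_size), k))
```

With the same `seed`, repeated calls return the same indices, so the image-cartoon pairs can be regenerated consistently.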
For this step, you need to be authorized to access your Hugging Face account. Run the following command to do so:
huggingface-cli login
Then run:
python export_to_hub.py
Warning
Please ensure that an empty DS_NAME dataset was created on the Hub first. Instructions on how to do that are here.
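One way to create the empty dataset programmatically is through the `huggingface_hub` library's `create_repo` function. The sketch below is not part of export_to_hub.py. The wrapper name `create_empty_dataset_repo` is hypothetical, and it assumes you are already logged in and have network access.

```python
def create_empty_dataset_repo(repo_id, private=False):
    """Create an empty dataset repository on the Hugging Face Hub.

    Assumes a prior `huggingface-cli login` and network access.
    """
    if not repo_id:
        raise ValueError("repo_id must be a non-empty string")

    # Imported lazily so that defining this helper does not need the package.
    from huggingface_hub import create_repo

    # exist_ok=True avoids an error if the repository already exists.
    return create_repo(
        repo_id, repo_type="dataset", private=private, exist_ok=True
    )
```

For example, `create_empty_dataset_repo("your-username/DS_NAME")`, where `your-username` is a placeholder for your own Hub namespace and DS_NAME is the dataset name used by the export step.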
You can find a mini dataset here: