This is the dataset proposed in our paper VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models (NeurIPS 2024).
VidProM is the first dataset featuring 1.67 million unique text-to-video prompts and 6.69 million videos generated from 4 different state-of-the-art diffusion models. It inspires many exciting new research areas, such as Text-to-Video Prompt Engineering, Efficient Video Generation, Fake Video Detection, and Video Copy Detection for Diffusion Models.
You can download VidProM from Hugging Face.
Users in China can download the files faster from Wisemodel, with whom we cooperate.
Install the datasets library first, by:
pip install datasets
Then it can be downloaded automatically with
import numpy as np
from datasets import load_dataset
dataset = load_dataset('WenhaoWang/VidProM')
You can also download each file with wget, for instance:
wget https://huggingface.co/datasets/WenhaoWang/VidProM/resolve/main/VidProM_unique.csv
*DATA_PATH
    *VidProM_unique.csv
    *VidProM_semantic_unique.csv
    *VidProM_embed.hdf5
    *original_files
        *generate_1_ori.html
        *generate_2_ori.html
        ...
    *pika_videos
        *pika_videos_1.tar
        *pika_videos_2.tar
        ...
    *vc2_videos
        *vc2_videos_1.tar
        *vc2_videos_2.tar
        ...
    *t2vz_videos
        *t2vz_videos_1.tar
        *t2vz_videos_2.tar
        ...
    *ms_videos
        *ms_videos_1.tar
        *ms_videos_2.tar
        ...
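The wget pattern above can be extended to the other files. The following is a minimal sketch that only builds download URLs. It assumes that the repository paths mirror the directory layout shown above and that the tar files are numbered from 1 to 30; both are assumptions, and `file_url` and `video_tar_paths` are hypothetical helper names.

```python
# Base URL taken from the wget example above.
BASE_URL = "https://huggingface.co/datasets/WenhaoWang/VidProM/resolve/main"


def file_url(relative_path):
    """Build the download URL for a path inside the dataset repository."""
    return f"{BASE_URL}/{relative_path}"


def video_tar_paths(model_prefix, count=30):
    """List the assumed relative paths of the tar files for one model.

    model_prefix is one of 'pika', 'vc2', 't2vz', 'ms'. The folder/file
    naming and the 1..count numbering are assumptions based on the layout.
    """
    folder = f"{model_prefix}_videos"
    return [f"{folder}/{folder}_{i}.tar" for i in range(1, count + 1)]


if __name__ == "__main__":
    print(file_url("VidProM_unique.csv"))
    for path in video_tar_paths("pika")[:2]:
        print(file_url(path))
```

Each printed URL can be passed to wget as in the example above. If a URL does not resolve, check the actual file layout in the repository.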
We use the example folder to illustrate how to load VidProM using a PyTorch DataLoader and WebDataset.
The example directory is:
*example
    *VidProM_unique_example.csv
    *VidProM_embed_example.hdf5
    *pika_videos_example
        pika-xxx-xxx.mp4
        pika-xxx-xxx.mp4
        ...
    *t2vz_videos_example
        t2vz-xxx-xxx.mp4
        t2vz-xxx-xxx.mp4
        ...
    *vc2_videos_example
        vc2-xxx-xxx.mp4
        vc2-xxx-xxx.mp4
        ...
    *ms_videos_example
        ms-xxx-xxx.mp4
        ms-xxx-xxx.mp4
        ...
We have the following PyTorch Dataloader:
import os
import pandas as pd
import h5py
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision.io import read_video
import numpy as np


class VidProMDataset(Dataset):
    def __init__(self, csv_file, hdf5_file, video_dirs, transform=None):
        self.metadata = pd.read_csv(csv_file)
        self.video_dirs = video_dirs
        self.transform = transform
        self.nsfw_names = ['toxicity', 'obscene', 'identity_attack', 'insult', 'threat', 'sexual_explicit']
        # Load the prompt embeddings and their UUIDs into memory once.
        self.hdf5_file = h5py.File(hdf5_file, 'r')
        self.hdf5_uuid = np.array(self.hdf5_file["uuid"][:], dtype=object).astype(str).tolist()
        self.hdf5_embed = np.array(self.hdf5_file['embeddings'])
        # Map each UUID to its row so that lookups do not scan the whole list.
        self.uuid_to_row = {uuid: row for row, uuid in enumerate(self.hdf5_uuid)}

    def __len__(self):
        return len(self.metadata)

    def __getitem__(self, idx):
        video_info = self.metadata.iloc[idx]
        video_id = video_info['uuid']
        prompt = video_info['prompt']
        time = video_info['time']
        nsfw_scores = torch.tensor(list(video_info[self.nsfw_names]))
        embed = torch.tensor(self.hdf5_embed[self.uuid_to_row[video_id]])

        # Return the first video found for this UUID among the given directories.
        video_path = self._find_video_path(video_id)
        video_frames, _, _ = read_video(video_path, pts_unit='sec')

        if self.transform:
            video_frames = self.transform(video_frames)

        return {
            'video_id': video_id,
            'video_frames': video_frames,
            'embed': embed,
            'prompt': prompt,
            'time': time,
            'nsfw_scores': nsfw_scores
        }

    def _find_video_path(self, video_id):
        for video_dir in self.video_dirs:
            # The file prefix (e.g. 'pika') is the first part of the directory name.
            # Use the base name so that nested paths such as 'example/pika_videos_example' also work.
            prefix = os.path.basename(os.path.normpath(video_dir)).split('_')[0]
            video_file = os.path.join(video_dir, prefix + f"-{video_id}.mp4")
            if os.path.exists(video_file):
                return video_file
        raise FileNotFoundError(f"Video {video_id}.mp4 not found in any of the directories.")

    def __del__(self):
        # Guard against a failed __init__, where the file was never opened.
        if getattr(self, 'hdf5_file', None) is not None:
            self.hdf5_file.close()


csv_file = 'VidProM_unique_example.csv'
hdf5_file = 'VidProM_embed_example.hdf5'
video_dirs = ['t2vz_videos_example', 'pika_videos_example', 'vc2_videos_example', 'ms_videos_example']
dataset = VidProMDataset(csv_file, hdf5_file, video_dirs)
dataloader = DataLoader(dataset, batch_size=16, shuffle=False, num_workers=0)
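When batching with `batch_size=16`, the default collation stacks the `video_frames` tensors, which only works if every video in a batch has the same shape. Videos from different models may differ in frame count or resolution (an assumption, not something stated above). In that case, one option is a custom `collate_fn` that keeps each field as a list. The sketch below uses a hypothetical helper name, `collate_as_lists`, and is written in plain Python so it can be tried without any videos.

```python
def collate_as_lists(batch):
    """Group a list of sample dicts into one dict of lists, without stacking.

    Each value in the result is a list with one entry per sample, so tensors
    of different shapes can share a batch.
    """
    return {key: [sample[key] for sample in batch] for key in batch[0]}


if __name__ == "__main__":
    # Stand-in samples with the same keys a dataset item might have.
    fake_batch = [
        {"video_id": "a", "prompt": "first prompt"},
        {"video_id": "b", "prompt": "second prompt"},
    ]
    print(collate_as_lists(fake_batch))
```

With the dataset defined above, it would be passed as `DataLoader(dataset, batch_size=16, shuffle=False, num_workers=0, collate_fn=collate_as_lists)`. The training loop then decides how to pad or resize the videos.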
We can also load videos directly from the tar files using WebDataset. We assume the following directory:
*example
    *VidProM_unique_example.csv
    *VidProM_embed_example.hdf5
    *pika_videos_example.tar
    *t2vz_videos_example.tar
    *vc2_videos_example.tar
    *ms_videos_example.tar
We have the following:
import io
import av
import pandas as pd
import h5py
import numpy as np
import torchvision.transforms as transforms
import torch
import webdataset as wds

tar_file_path = 't2vz_videos_example.tar'  # we use t2vz_videos_example.tar as an example
csv_file = 'VidProM_unique_example.csv'
hdf5_path = 'VidProM_embed_example.hdf5'

dataset = wds.WebDataset(tar_file_path)
metadata = pd.read_csv(csv_file)
hdf5_file = h5py.File(hdf5_path, 'r')
hdf5_uuid = np.array(hdf5_file["uuid"][:], dtype=object).astype(str).tolist()
hdf5_embed = np.array(hdf5_file['embeddings'])

# Converts a PIL image to a float tensor of shape (C, H, W) in [0, 1].
transform = transforms.ToTensor()

for sample in dataset:
    # obtain the tensor of a video, shape (num_frames, C, H, W)
    binary_data = sample['mp4']
    container = av.open(io.BytesIO(binary_data))
    frames = []
    for frame in container.decode(video=0):
        img = frame.to_image()
        img_tensor = transform(img)
        frames.append(img_tensor)
    video_tensor = torch.stack(frames)

    # obtain the uuid of a video by dropping the model prefix (e.g. 't2vz-') from the key
    uuid = '-'.join(sample['__key__'].split('/')[-1].split('-')[1:])

    # look up the metadata row of this uuid once
    row = metadata[metadata['uuid'] == uuid].iloc[0]

    # obtain the prompt
    prompt = row['prompt']

    # obtain the time
    time = row['time']

    # obtain the nsfw_scores (all columns after uuid, prompt, and time)
    nsfw_scores = list(row.iloc[3:])

    # obtain the prompt embedding
    embed = torch.tensor(hdf5_embed[hdf5_uuid.index(uuid)])
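The UUID extraction in the loop above relies on the file naming shown earlier (`t2vz-xxx-xxx.mp4`), where the part before the first hyphen names the model. As a small illustration, the same parsing can be wrapped in a helper that also returns the model prefix. `split_key` is a hypothetical name, and the sample key below is made up by pairing the `t2vz` prefix with a UUID from the metadata table.

```python
def split_key(key):
    """Split a WebDataset sample key into (model_prefix, uuid).

    The key is expected to end in '<model>-<uuid>', optionally preceded by
    directory components separated by '/'.
    """
    name = key.split("/")[-1]
    model, _, uuid = name.partition("-")
    return model, uuid


if __name__ == "__main__":
    key = "t2vz_videos_example/t2vz-6a83eb92-faa0-572b-9e1f-67dec99b711d"
    print(split_key(key))
```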
VidProM_unique.csv contains the UUID, prompt, time, and 6 NSFW probabilities.
It can easily be read by
import pandas as pd
df = pd.read_csv("VidProM_unique.csv")
Below are three rows from VidProM_unique.csv:
uuid | prompt | time | toxicity | obscene | identity_attack | insult | threat | sexual_explicit |
---|---|---|---|---|---|---|---|---|
6a83eb92-faa0-572b-9e1f-67dec99b711d | Flying among clouds and stars, kitten Max discovered a world full of winged friends. Returning home, he shared his stories and everyone smiled as they imagined flying together in their dreams. | Sun Sep 3 12:27:44 2023 | 0.00129 | 0.00016 | 7e-05 | 0.00064 | 2e-05 | 2e-05 |
3ba1adf3-5254-59fb-a13e-57e6aa161626 | Use a clean and modern font for the text "Relate Reality 101." Add a small, stylized heart icon or a thought bubble above or beside the text to represent emotions and thoughts. Consider using a color scheme that includes warm, inviting colors like deep reds, soft blues, or soothing purples to evoke feelings of connection and intrigue. | Wed Sep 13 18:15:30 2023 | 0.00038 | 0.00013 | 8e-05 | 0.00018 | 3e-05 | 3e-05 |
62e5a2a0-4994-5c75-9976-2416420526f7 | zoomed out, sideview of an Grey Alien sitting at a computer desk | Tue Oct 24 20:24:21 2023 | 0.01777 | 0.00029 | 0.00336 | 0.00256 | 0.00017 | 5e-05 |
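As an illustration of working with these columns, the sketch below parses the time string and computes the largest of the six NSFW probabilities for a row. It uses two of the rows shown above (prompts omitted for brevity) with values kept as strings, as a CSV reader would return them. The threshold of 0.01 is an arbitrary value chosen for the example, not one recommended by the authors, and `parse_time` and `max_nsfw` are hypothetical helper names.

```python
from datetime import datetime

NSFW_COLUMNS = ["toxicity", "obscene", "identity_attack",
                "insult", "threat", "sexual_explicit"]


def parse_time(value):
    """Parse a time string such as 'Sun Sep 3 12:27:44 2023'."""
    return datetime.strptime(value, "%a %b %d %H:%M:%S %Y")


def max_nsfw(row):
    """Return the largest of the six NSFW probabilities in a row."""
    return max(float(row[name]) for name in NSFW_COLUMNS)


rows = [
    {"uuid": "6a83eb92-faa0-572b-9e1f-67dec99b711d",
     "time": "Sun Sep 3 12:27:44 2023",
     "toxicity": "0.00129", "obscene": "0.00016", "identity_attack": "7e-05",
     "insult": "0.00064", "threat": "2e-05", "sexual_explicit": "2e-05"},
    {"uuid": "62e5a2a0-4994-5c75-9976-2416420526f7",
     "time": "Tue Oct 24 20:24:21 2023",
     "toxicity": "0.01777", "obscene": "0.00029", "identity_attack": "0.00336",
     "insult": "0.00256", "threat": "0.00017", "sexual_explicit": "5e-05"},
]

THRESHOLD = 0.01  # arbitrary example value

if __name__ == "__main__":
    for row in rows:
        flagged = max_nsfw(row) >= THRESHOLD
        print(row["uuid"], parse_time(row["time"]).isoformat(), flagged)
```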
VidProM_semantic_unique.csv is a semantically unique version of VidProM_unique.csv.
VidProM_embed.hdf5 contains the 3072-dimensional embeddings of our prompts. They are computed with text-embedding-3-large, the latest OpenAI text embedding model when the dataset was built.
It can easily be read by
import numpy as np
import h5py

def read_descriptors(filename):
    # Read all embeddings and their UUIDs into memory, then close the file.
    with h5py.File(filename, "r") as hh:
        descs = np.array(hh["embeddings"])
        names = np.array(hh["uuid"][:], dtype=object).astype(str).tolist()
    return names, descs

uuid, features = read_descriptors('VidProM_embed.hdf5')
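Once the embeddings are loaded, a common use is nearest-neighbor search over prompts. The following is a minimal sketch of cosine-similarity ranking with NumPy. It is not the method used in the paper, `top_k_similar` is a hypothetical helper, and the tiny 2-dimensional vectors stand in for the real 3072-dimensional features.

```python
import numpy as np


def top_k_similar(query, features, k=5):
    """Return (indices, scores) of the k rows most similar to the query.

    Similarity is the cosine of the angle between the query vector and each
    row of features. Rows are assumed to be non-zero vectors.
    """
    features = np.asarray(features, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    # Normalize to unit length so that a dot product equals cosine similarity.
    features_unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = features_unit @ query_unit
    # Sort by descending score and keep the first k.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


if __name__ == "__main__":
    toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    indices, scores = top_k_similar([1.0, 0.0], toy_features, k=3)
    print(indices, scores)
```

With the real data, `features[i]` would serve as the query and `uuid[j]` would map each returned index back to a prompt. For the full 1.67 million prompts, a dedicated similarity-search library is likely a better fit than a dense matrix product.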
original_files contains the HTML files from the official Pika Discord, collected with DiscordChatExporter. You may use them freely under the CC BY-NC 4.0 license.
pika_videos, vc2_videos, t2vz_videos, and ms_videos contain the videos generated by 4 state-of-the-art text-to-video diffusion models. Each contains 30 tar files.
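To check what a downloaded tar file contains before extracting it, the standard library is enough. This is a small sketch; `list_mp4_members` is a hypothetical helper, and the path passed to it is whatever tar file you have downloaded.

```python
import tarfile


def list_mp4_members(tar_path):
    """Return the names of all .mp4 files stored in a tar archive."""
    with tarfile.open(tar_path, "r") as archive:
        return [member.name for member in archive.getmembers()
                if member.isfile() and member.name.endswith(".mp4")]


if __name__ == "__main__":
    import sys
    # Usage: python list_videos.py path/to/archive.tar
    for name in list_mp4_members(sys.argv[1]):
        print(name)
```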
Open the WizMap (and wait about 5 seconds) for an interactive visualization of our 1.67 million prompts.
Please check our paper for a detailed comparison.
VidProM was created by Wenhao Wang and Professor Yi Yang.
The prompts and videos generated by Pika in VidProM are licensed under the CC BY-NC 4.0 license. Additionally, following their original repositories, the videos from VideoCrafter2, Text2Video-Zero, and ModelScope are released under the Apache license, the CreativeML Open RAIL-M license, and the CC BY-NC 4.0 license, respectively. Our code is released under the CC BY-NC 4.0 license.
@inproceedings{wang2024vidprom,
title={VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models},
author={Wang, Wenhao and Yang, Yi},
booktitle={Thirty-eighth Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=pYNl76onJL}
}
If you have any questions, feel free to contact Wenhao Wang (wangwenhao0716@gmail.com).