Anees Dataset

This dataset was used to fine-tune the GPT-2 model in Anees for multi-turn dialogue generation.

Introduction

The dataset combines four multi-turn dialogue datasets:

DailyDialog: a high-quality multi-turn open-domain English dialogue dataset. Each dialogue has around 8 speaker turns on average, with around 15 tokens per turn.
EmpatheticDialogues: a large-scale multi-turn empathetic dialogue dataset collected on Amazon Mechanical Turk, containing 24,850 one-to-one open-domain conversations.
Persona-Chat: crowd-sourced dialogues in which each participant plays the part of an assigned persona; each persona also has a word-distinct paraphrase.
BlendedSkillTalk: an English-language dataset blending three conversation skills in balanced proportions (demonstrating knowledge, empathy, or ability to talk about oneself).

Dataset	# of training dialogues	# of training utterances	# of validation dialogues	# of validation utterances
DailyDialog	11150	87467	1968	15512
EmpatheticDialogues	19628	84674	3464	14912
Persona-Chat	16046	212873	2832	37788
BlendedSkillTalk	5786	76435	1022	13482
Total	52610	461449	9286	81694
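The counts in the table can be cross-checked with a few lines of Python. This is a minimal sketch that only re-uses the numbers above; the names `COUNTS`, `totals`, and `summarize` are illustrative and are not part of the Anees code.

```python
# Counts copied from the table above, per dataset:
# (training dialogues, training utterances,
#  validation dialogues, validation utterances)
COUNTS = {
    "DailyDialog": (11150, 87467, 1968, 15512),
    "EmpatheticDialogues": (19628, 84674, 3464, 14912),
    "Persona-Chat": (16046, 212873, 2832, 37788),
    "BlendedSkillTalk": (5786, 76435, 1022, 13482),
}


def totals(counts):
    """Sum each of the four columns across all datasets."""
    return tuple(sum(column) for column in zip(*counts.values()))


def summarize(counts):
    """Derive the training share and average dialogue length per dataset."""
    summary = {}
    for name, (train_d, train_u, valid_d, valid_u) in counts.items():
        summary[name] = {
            # Fraction of dialogues that fall in the training split.
            "train_share": train_d / (train_d + valid_d),
            # Average number of utterances per training dialogue.
            "utterances_per_train_dialogue": train_u / train_d,
        }
    return summary


print("Totals:", totals(COUNTS))
for name, stats in summarize(COUNTS).items():
    print(name, stats)
```

The column sums reproduce the Total row, and every dataset comes out at roughly an 85/15 split between training and validation dialogues. DailyDialog averages about 7.8 utterances per training dialogue, which agrees with the "around 8 speaker turns" figure quoted above.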

The English dataset was tokenized using the GPT-2 tokenizer.
The Arabic dataset was tokenized using the AraGPT2 tokenizer.
The English-to-Arabic translation was done using Opus-MT on Colab.
Preprocessing and loading details can be found in the Anees repository.
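
As a rough illustration of the tokenization step, the sketch below flattens one multi-turn dialogue into a single sequence of token ids. It is not the Anees preprocessing code: ending each turn with the end-of-sequence token is an assumption, and `encode_dialogue`, `TokenizerLike`, and `load_gpt2_tokenizer` are hypothetical names. The only library call used is `GPT2TokenizerFast.from_pretrained("gpt2")` from Hugging Face `transformers`, which downloads the vocabulary on first use.

```python
from typing import List, Protocol


class TokenizerLike(Protocol):
    """Anything with an EOS id and an encode() method, e.g. a GPT-2 tokenizer."""

    eos_token_id: int

    def encode(self, text: str) -> List[int]: ...


def encode_dialogue(turns: List[str], tokenizer: TokenizerLike) -> List[int]:
    """Concatenate the token ids of every turn, appending EOS after each turn.

    ASSUMPTION: the EOS-per-turn layout is illustrative only; the actual
    format used by Anees is documented in the Anees repository.
    """
    ids: List[int] = []
    for turn in turns:
        ids.extend(tokenizer.encode(turn))
        ids.append(tokenizer.eos_token_id)
    return ids


def load_gpt2_tokenizer():
    """Load the standard GPT-2 tokenizer (requires `transformers` and network access)."""
    # Imported lazily so encode_dialogue works without transformers installed.
    from transformers import GPT2TokenizerFast

    return GPT2TokenizerFast.from_pretrained("gpt2")
```

With `transformers` installed, `encode_dialogue(["Hi, how are you?", "Fine, thanks."], load_gpt2_tokenizer())` returns one flat list of ids. Under the same assumption, an AraGPT2 tokenizer could be passed in instead for the Arabic data.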
