This repository contains the FREDSum dataset, a comprehensive collection of transcripts and metadata from various political and public debates in France. The dataset aims to provide researchers, linguists, and data scientists with a rich source of debate content for analysis and natural language processing tasks.
Further details are provided in the paper "FREDSum: A Dialogue Summarization Corpus for French Political Debates" (see Acknowledgement below). While we continue to improve the dataset, the version of the transcripts and summaries used in the FREDSum paper can be found in the release v0.1-emnlp-2023.
The dataset can also be found on Hugging Face and Ortolang.
Data from the French National Assembly and French Senate that were used to continue the pretraining of the Barthez language model, as described in the paper "FREDSum: A Dialogue Summarization Corpus for French Political Debates", are available at FREDSum Parliament. (The portion from the National Assembly can also be found on Hugging Face.)
The dataset includes transcripts from a range of French political debates and discussions. Each transcript is provided along with abstractive and extractive summaries.
The dataset is organized as follows:
- `transcripts`: contains the debate transcripts.
  - Each file is named in the format `Speakers--Partie_X_Theme.txt`. The naming structure is consistent across folders for ease of use.
- `summary_extractive`: contains two subfolders, one for each of the two annotators who made the extractive summaries.
- `summary_abstractive`: contains three subfolders, one for each type of abstractive summary:
  - Summaries that aim to preserve the original wording as much as possible (more extractive) and to limit co-reference resolution by using proper names instead of pronouns.
  - Summaries that aim to preserve the original wording as much as possible (more extractive) while allowing for co-reference.
  - Summaries that have been written freely (more abstractive).
- `summary_abstractive_prediction`: contains abstractive summaries generated by three models:
  - Barthez
  - ChatGPT
  - Open Assistant (based on Llama 30b)

  Please note that there are no predicted summaries for the debate 'Destaing_Mitterand_2'.
- `community`: contains abstractive communities for abstractive summaries 1 and 3. In an abstractive community, a single sentence from the abstractive summary is paired with the set of sentences from the corresponding extractive community that support it.
- `FREDSum_test.json`: a list of the names of the test files.
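The naming format and the test list described above can be used together to index the files. The following Python sketch is illustrative only and rests on assumptions the README does not confirm: that speaker names are separated by underscores before the `--`, that `FREDSum_test.json` is a JSON array of strings, and that its entries may or may not carry the `.txt` extension. The helper names (`parse_name`, `split_by_test_list`, `load_test_names`) are hypothetical and not part of the repository.

```python
import json
import re
from pathlib import Path

# Assumed pattern for "Speakers--Partie_X_Theme.txt". Splitting speakers on
# underscores is an assumption, not something the README specifies.
NAME_RE = re.compile(r"^(?P<speakers>.+?)--Partie_(?P<part>[^_]+)_(?P<theme>.+)$")


def parse_name(filename):
    """Split a transcript file name into speakers, part and theme."""
    stem = Path(filename).stem  # drop the ".txt" extension if present
    match = NAME_RE.match(stem)
    if match is None:
        return None  # the name does not follow the assumed pattern
    return {
        "speakers": match["speakers"].split("_"),
        "part": match["part"],
        "theme": match["theme"],
    }


def load_test_names(path="FREDSum_test.json"):
    """Read the test list, assumed to be a JSON array of strings."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def split_by_test_list(filenames, test_names):
    """Partition file names into (train, test) using the test list.

    Names are compared without their extension, because it is not known
    whether the entries in FREDSum_test.json include ".txt".
    """
    test_stems = {Path(name).stem for name in test_names}
    train, test = [], []
    for name in filenames:
        (test if Path(name).stem in test_stems else train).append(name)
    return train, test
```

For example, `split_by_test_list(sorted(p.name for p in Path("transcripts").glob("*.txt")), load_test_names())` would separate the transcript files into non-test and test sets, provided the assumptions above hold for the actual files.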
- v0.1-emnlp-2023: the original transcripts and summaries used in the paper "FREDSum: A Dialogue Summarization Corpus for French Political Debates"
- the current version contains:
  - standardized speaker tags (e.g. "GA" -> "Gabriel Attal")
  - corrected text for abstractive summaries (1-3)
This corpus was created as a part of the SUMM-RE (ANR-20-CE23-0017) and CORTEX2 (Horizon Europe CL4-2021-HUMAN-01-25) research projects.
If you use this dataset, please cite the following article:
- Virgile Rennard, Guokan Shang, Damien Grari, Julie Hunter, and Michalis Vazirgiannis. 2023. FREDSum: A Dialogue Summarization Corpus for French Political Debates. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4241–4253, Singapore. Association for Computational Linguistics.