FREDSum: A Dialogue Summarization Corpus for French Political Debates

Overview

This repository contains the FREDSum dataset, a comprehensive collection of transcripts and metadata from various political and public debates in France. The dataset aims to provide researchers, linguists, and data scientists with a rich source of debate content for analysis and natural language processing tasks.

Further details are provided in the paper "FREDSum: A Dialogue Summarization Corpus for French Political Debates" (see Acknowledgement below). While we continue to improve the dataset, the version of the transcripts and summaries used in the FREDSum paper can be found in the release v0.1-emnlp-2023.

The dataset can also be found on Hugging Face and Ortolang.

Data from the French National Assembly and French Senate that were used to continue the pretraining of the Barthez language model, as described in the paper "FREDSum: A Dialogue Summarization Corpus for French Political Debates", are available at FREDSum Parliament. (The portion from the National Assembly can also be found on Hugging Face.)

Dataset Description

The dataset includes transcripts from a range of French political debates and discussions. Each transcript is provided along with abstractive and extractive summaries.

Structure

The dataset is organized as follows:

transcript: contains the debate transcripts.
- Each file is named in the format Speakers--Partie_X_Theme.txt. The naming scheme is consistent across folders, so matching files can be found by name.
summary_extractive: contains two subfolders, one for each of the two annotators who made the extractive summaries.
summary_abstractive: contains three subfolders corresponding to three different types of abstractive summary as follows:
- 1. Contains summaries that aim to preserve the original wording as much as possible (more extractive) and to limit co-reference resolution by using proper names instead of pronouns.
- 2. Contains summaries that aim to preserve the original wording as much as possible (more extractive) while allowing for co-reference.
- 3. Contains summaries that have been written freely (more abstractive).
summary_abstractive_prediction: contains abstractive summaries generated by three models:
- Barthez
- ChatGPT
- Open Assistant (based on Llama 30b).
Please note that there are no predicted summaries for the debate 'Destaing_Mitterand_2'.
community: contains abstractive communities for abstractive summaries 1 and 3. In an abstractive community, a single sentence from the abstractive summary is paired with a set of sentences from the corresponding extractive community that supports it.
FREDSum_test.json: a list of the names of the test files.
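
The layout above can be navigated with a few lines of Python. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes that FREDSum_test.json is a flat JSON list of file names, that transcripts are .txt files, and that summaries share the same base name as their transcript. The transcript folder name is a parameter and should match the folder as it appears in your checkout.

```python
import json
import re
from pathlib import Path

# Pattern for the documented naming scheme "Speakers--Partie_X_Theme.txt".
# How speakers are separated inside the first field is not documented,
# so that field is kept as a single string.
NAME_PATTERN = re.compile(
    r"^(?P<speakers>.+?)--Partie_(?P<part>[^_]+)_(?P<theme>.+)$"
)


def parse_name(filename):
    """Split a file name into speakers, part and theme (None if no match)."""
    match = NAME_PATTERN.match(Path(filename).stem)
    return match.groupdict() if match else None


def load_test_names(root):
    """Read FREDSum_test.json (assumed: a flat JSON list of file names)."""
    with open(Path(root) / "FREDSum_test.json", encoding="utf-8") as handle:
        names = json.load(handle)
    # Compare on base names so a ".txt" suffix in the list does not matter.
    return {Path(name).stem for name in names}


def split_transcripts(root, transcript_dir="transcript"):
    """Return (train, test) lists of transcript paths under root."""
    test_names = load_test_names(root)
    train, test = [], []
    for path in sorted((Path(root) / transcript_dir).rglob("*.txt")):
        (test if path.stem in test_names else train).append(path)
    return train, test


def find_counterparts(root, name,
                      folders=("summary_extractive", "summary_abstractive")):
    """Find files with the same base name in the summary folders.

    Assumption: summaries reuse the transcript's base name; subfolders
    (annotators, summary types) are searched recursively.
    """
    stem = Path(name).stem
    return {
        folder: sorted(
            p for p in (Path(root) / folder).rglob("*")
            if p.is_file() and p.stem == stem
        )
        for folder in folders
    }
```

For example, `train, test = split_transcripts("path/to/FREDSum")` separates the transcripts using the test list, and `find_counterparts("path/to/FREDSum", test[0].name)` collects the summaries that go with the first test transcript.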

Versions

v0.1-emnlp-2023: the original transcripts and summaries used in the paper "FREDSum: A Dialogue Summarization Corpus for French Political Debates"
the current version contains:
- standardized speaker tags (e.g. "GA" -> "Gabriel Attal")
- corrected text for abstractive summaries (1-3)

Acknowledgement

This corpus was created as a part of the SUMM-RE (ANR-20-CE23-0017) and CORTEX2 (Horizon Europe CL4-2021-HUMAN-01-25) research projects.

If you use this dataset, please cite the following article:

Virgile Rennard, Guokan Shang, Damien Grari, Julie Hunter, and Michalis Vazirgiannis. 2023. FREDSum: A Dialogue Summarization Corpus for French Political Debates. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4241–4253, Singapore. Association for Computational Linguistics.
