This repository contains the code, data, and results for analyzing linguistic trends in speeches made at the United Nations General Assembly (UNGA) from 1946 to 2022. The study focuses on linguistic markers that may reflect changing geopolitical, cultural, and rhetorical patterns in global diplomacy.
This project uses the UN General Debate Corpus to extract time-series features from political speeches, capturing how countries express themselves over time. It quantifies linguistic change along several dimensions, including:
- Self-reference
- Direct audience address
- Modal and degree adverbs
- Negation
- Sentiment
- Readability
- Vocabulary richness
- Use of numbers and swear words
All results are plotted as normalized time series to highlight long-term developments in rhetorical style and content.
The speeches were extracted from the UN General Debate Corpus, a large, annotated collection of all country statements delivered at the UN General Debate since 1970. For years before 1970, archival documents were processed with OCR and matched manually where necessary.
We applied a pipeline of linguistic preprocessing and feature extraction, including:
- Tokenization and lemmatization
- Part-of-speech tagging
- Readability and sentiment analysis
- Calculation of stylistic features (e.g., average sentence length)
- Use of wordlists for topic-specific content (e.g., crisis vocabulary)
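As a rough illustration of the steps above, the sketch below computes two of the simpler features using only the Python standard library. The regex tokenizer, the naive sentence splitter, and the `FIRST_PERSON` set are simplifying assumptions made for this example; they are not the repository's actual implementation, which relies on proper tokenization, lemmatization, and part-of-speech tagging.

```python
import re

# Hypothetical pronoun list for illustration; the project's wordlist may differ.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}

def tokenize(text):
    """Lowercase word tokens via a simple regex (stand-in for a real tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def self_reference_rate(text):
    """Share of tokens that are first-person pronouns."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in FIRST_PERSON for t in tokens) / len(tokens)

def average_sentence_length(text):
    """Mean number of word tokens per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)

sample = "We stand united. I call on every nation to act now!"
print(self_reference_rate(sample))      # 2 of 11 tokens are first-person pronouns
print(average_sentence_length(sample))  # sentences of 3 and 8 words
```

The same counting pattern would apply to any wordlist-based feature, such as a crisis vocabulary, by swapping in a different set of terms.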
Feature counts were normalized by total word count, and the resulting time series were smoothed using moving averages.
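A minimal sketch of normalization and smoothing might look like the following. It assumes that counts are pooled per year before dividing by the word count, and that the moving average is centered with a window that shrinks at the series edges. The function names, the window size, and the toy numbers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def yearly_rates(records):
    """records: iterable of (year, feature_count, word_count), one per speech.
    Returns {year: pooled feature count / pooled word count}."""
    counts = defaultdict(int)
    words = defaultdict(int)
    for year, feature_count, word_count in records:
        counts[year] += feature_count
        words[year] += word_count
    return {y: counts[y] / words[y] for y in sorted(counts) if words[y] > 0}

def moving_average(values, window=3):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Made-up example data: (year, feature occurrences, total words)
records = [(1990, 10, 1000), (1990, 30, 1000), (1991, 40, 2000), (1992, 60, 1000)]
rates = yearly_rates(records)
smoothed = moving_average(list(rates.values()), window=3)
print(rates)
print(smoothed)
```

Pooling by year weights long speeches more heavily than short ones; averaging per-speech rates instead would weight every speech equally, so the choice affects the resulting curves.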
| Feature | Description |
|---|---|
| Self-Reference Rate | Frequency of first-person singular/plural pronouns (e.g., “I”, “we”) |
| Direct Addressee Rate | Use of second-person pronouns or phrases directed at the audience |
| Modal Adverb Rate | Adverbs indicating possibility or necessity (e.g., “probably”, “necessarily”) |
| Degree Adverb Rate | Words modifying intensity (e.g., “very”, “extremely”) |
| Negation Rate | Occurrence of negations (e.g., “not”, “no”) |
| Average Sentence Length | Mean number of words per sentence |
| Moving Type-Token Ratio (TTR) | Type-token ratio averaged over moving windows; a proxy for lexical diversity |
| Flesch-Kincaid Score | Readability index estimating the education (grade) level needed to understand the text |
| Sentiment Polarity | Scale from negative (−1) to positive (+1) sentiment |
| Sentiment Subjectivity | Degree of subjectivity in statements |
| Swear Word Rate | Use of profanities (rare, but relevant in crises) |
| Crisis Word Rate | Frequency of terms linked to conflict, disaster, urgency |
| Number Rate | Use of numerical expressions or statistics |
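To make the moving type-token ratio more concrete: a plain type-token ratio tends to fall as a text gets longer, so averaging it over fixed-size windows keeps speeches of different lengths comparable. The sketch below shows one common way to compute it. The function name `moving_ttr` and the default window of 50 tokens are assumptions for illustration, not values taken from this project.

```python
def moving_ttr(tokens, window=50):
    """Mean type-token ratio over all sliding windows of `window` tokens.
    Falls back to the plain TTR when the text is shorter than one window."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

tokens = "the cat sat on the mat and the dog sat too".split()
print(moving_ttr(tokens, window=4))   # averaged over 8 windows of 4 tokens
print(moving_ttr(tokens, window=50))  # text shorter than window: plain TTR
```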