Analyzes linguistic trends in UN General Assembly speeches (1946–2022) using NLP pipelines on the UN General Debate Corpus. Extracts features like sentiment, self-reference, and readability to reveal evolving geopolitical and rhetorical patterns via normalized, smoothed time-series visualizations.

Pigeon-Effect/UNGD-linguistic-patterns

UNGA Linguistic Patterns (1946–2022)

This repository contains the code, data, and results for analyzing linguistic trends in speeches made at the United Nations General Assembly (UNGA) from 1946 to 2022. The study focuses on linguistic markers that may reflect changing geopolitical, cultural, and rhetorical patterns in global diplomacy.


Overview

This project uses the UN General Debate Corpus to extract time-series features from political speeches, capturing how countries express themselves over time. It quantifies linguistic changes across multiple dimensions including:

  • Self-reference
  • Direct audience address
  • Modal and degree adverbs
  • Negation
  • Sentiment
  • Readability
  • Vocabulary richness
  • Use of numbers and swear words

All results are plotted as normalized time series to highlight long-term developments in rhetorical style and content.


Data Source

The speeches were extracted from the UN General Debate Corpus, a large, annotated collection of all country statements delivered at the UN General Debate since 1970. For years before 1970, archival documents were processed with OCR and matched manually where necessary.


Methodology

We applied a pipeline of linguistic preprocessing and feature extraction, including:

  • Tokenization and lemmatization
  • Part-of-speech tagging
  • Readability and sentiment analysis
  • Calculation of stylistic features (e.g., average sentence length)
  • Use of wordlists for topic-specific content (e.g., crisis vocabulary)
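The wordlist-based and stylistic steps in this list can be sketched with the standard library alone. This is a minimal illustration, not the repository's actual pipeline: the regex tokenizer stands in for a full NLP library, and the names `SELF_REFERENCE`, `NEGATION`, `tokenize`, `wordlist_rate`, and `average_sentence_length` are hypothetical, as are the wordlist contents.

```python
import re

# Hypothetical wordlists for illustration only; the project's real lists may differ.
SELF_REFERENCE = {"i", "me", "my", "mine", "myself",
                  "we", "us", "our", "ours", "ourselves"}
NEGATION = {"not", "no", "never", "none", "nor", "neither"}

# Assumption: a simple regex tokenizer replaces a full tokenizer/lemmatizer.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text and return word tokens (letters, optional apostrophe part)."""
    return TOKEN_RE.findall(text.lower())


def wordlist_rate(text, wordlist):
    """Return the share of tokens that appear in the wordlist (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in wordlist)
    return hits / len(tokens)


def average_sentence_length(text):
    """Return the mean number of tokens per sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


speech = "We do not accept this. I will never agree."
self_reference_rate = wordlist_rate(speech, SELF_REFERENCE)
negation_rate = wordlist_rate(speech, NEGATION)
mean_length = average_sentence_length(speech)
```

Dividing by the token count inside `wordlist_rate` keeps speeches of different lengths comparable, which matches the word-count normalization used for the time series.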

Feature counts were normalized by total word count, and the resulting time series were smoothed using moving averages.
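As a rough sketch of the normalization and smoothing described above, assuming yearly counts stored in plain dictionaries: the function names, the centered window, and the default window size of 5 are illustrative assumptions, since the window actually used in the analysis is not stated here.

```python
def normalize_by_word_count(feature_counts, word_counts):
    """Return {year: feature count / total word count}, skipping years with no words."""
    return {
        year: feature_counts.get(year, 0) / total
        for year, total in sorted(word_counts.items())
        if total > 0
    }


def moving_average(values, window=5):
    """Centered moving average; the window shrinks near the ends of the series.

    Assumes an odd window. An even window is treated as the next odd size.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up counts for illustration, not results from the corpus.
rates = normalize_by_word_count({1946: 5, 1947: 10}, {1946: 100, 1947: 400})
trend = moving_average(list(rates.values()), window=3)
```

Shrinking the window at the edges keeps the smoothed series the same length as the input, at the cost of noisier values in the first and last years.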


Key Features Analyzed

| Feature | Description |
| --- | --- |
| Self-Reference Rate | Frequency of first-person singular/plural pronouns (e.g., “I”, “we”) |
| Direct Addressee Rate | Use of second-person pronouns or phrases directed at the audience |
| Modal Adverb Rate | Words indicating possibility or obligation (e.g., “probably”, “must”) |
| Degree Adverb Rate | Words modifying intensity (e.g., “very”, “extremely”) |
| Negation Rate | Occurrence of negations (e.g., “not”, “no”) |
| Average Sentence Length | Mean number of words per sentence |
| Moving Type-Token Ratio (TTR) | Proxy for lexical diversity over moving windows |
| Flesch-Kincaid Score | A readability index estimating required education level |
| Sentiment Polarity | Scale from negative (−1) to positive (+1) sentiment |
| Sentiment Subjectivity | Degree of subjectivity in statements |
| Swear Word Rate | Use of profanities (rare, but relevant in crises) |
| Crisis Word Rate | Frequency of terms linked to conflict, disaster, urgency |
| Number Rate | Use of numerical expressions or statistics |
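Two of the features above have compact definitions that a short sketch can make concrete: the moving type-token ratio and the Flesch-Kincaid grade level. The code below is an illustration under stated assumptions, not the repository's implementation: the function names and the window of 50 tokens are hypothetical, and `count_syllables` is a crude vowel-group heuristic where a real analysis would more likely use a dedicated readability library. The grade formula itself is the standard one: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59.

```python
import re


def moving_ttr(tokens, window=50):
    """Mean type-token ratio over all windows of `window` consecutive tokens.

    Falls back to the plain TTR when the text is shorter than one window.
    """
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def count_syllables(word):
    """Rough heuristic (assumption): count groups of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level using the heuristic syllable counter above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


diversity = moving_ttr(["peace", "and", "security", "and", "peace"], window=3)
grade = flesch_kincaid_grade("The cat sat.")
```

Averaging the ratio over fixed-size windows avoids the bias of a plain TTR, which falls as a text gets longer simply because common words repeat.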

Results

Stylistic and Semantic Trends (1946–2022)

[Figures: Self-Reference Rate, Direct Addressee Rate, Modal Adverb Rate, Degree Adverb Rate, Negation Rate]


Readability, Sentiment & Lexical Diversity

[Figures: Average Sentence Length, Average Moving TTR, Flesch-Kincaid Readability, Sentiment Polarity, Sentiment Subjectivity]


Topical Trends and Rhetorical Intensity

[Figures: Swear Word Rate, Crisis Word Rate, Number Rate]

