
Built a cross-lingual sentiment analysis model over 8+ languages for monolingual and cross-lingual tasks, achieving Sentiment Graph F1 of 0.55 (cross-lingual) and 0.559 (monolingual) using RoBERTa with CRF for span extraction and polarity classification.


🧠 Structured Sentiment Analysis

IIIT Delhi · Natural Language Processing Project

Python PyTorch HuggingFace License: MIT

A structured approach to opinion mining: extracting who expresses what sentiment toward whom, with a polarity label.



🎯 Problem Statement

Structured Sentiment Analysis (SSA) extends traditional sentiment analysis by extracting complete sentiment graphs from raw text. Instead of simply classifying a text as positive or negative, SSA identifies all the structured opinion tuples present in a sentence.

Formal Definition

Given a text $T = w_1, w_2, \ldots, w_n$ of $n$ tokens, the goal is to predict the set of all opinion tuples:

$$\mathcal{G} = \{(h, b, t, p)\}$$

Where each element is defined as:

| Symbol | Role | Description |
|--------|------|-------------|
| $h$ | Holder | The entity who expresses the opinion |
| $b$ | Expression | The polar expression (the words conveying sentiment) |
| $t$ | Target | The entity toward which the opinion is directed |
| $p$ | Polarity | The sentiment label: positive, negative, or neutral |

Illustrative Example

Sentence: "Even though the price is decent for Paris, I would not recommend this hotel."

```
Opinion Tuple 1:
  Holder     → "I"
  Expression → "would not recommend"
  Target     → "this hotel"
  Polarity   → NEGATIVE

Opinion Tuple 2:
  Holder     → (implicit)
  Expression → "decent"
  Target     → "the price"
  Polarity   → POSITIVE
```

Unlike standard sentiment analysis, SSA captures complex, overlapping, and multi-target opinions within a single sentence, making it far more expressive and practically useful.


πŸ—οΈ Architecture

Our model follows a unified end-to-end architecture built on top of a pretrained BERT-based encoder, enhanced with a CRF decoding layer and a separate polarity head.

```
                    ┌───────────────────────────────────────────┐
                    │              Input Text                   │
                    └───────────────────┬───────────────────────┘
                                        │
                    ┌───────────────────▼───────────────────────┐
                    │   [CLS]   w₁    w₂   ...   wₙ   [SEP]     │
                    │         Token Embedding Layer             │
                    └───────────────────┬───────────────────────┘
                                        │
                    ┌───────────────────▼───────────────────────┐
                    │        Norwegian BERT Encoder             │
                    │      (NbAiLab / nb-bert-base)             │
                    │   Contextual Hidden State Representations │
                    └──────────┬────────────────────┬───────────┘
                               │                    │
               ┌───────────────▼──────┐   ┌─────────▼─────────────┐
               │  CRF for BIO Tagging │   │  Polarity Head        │
               │                      │   │  ([CLS] token)        │
               │  B-Source  I-Source  │   │                       │
               │  B-Target  I-Target  │   │  ┌──────────────┐     │
               │  B-Polar   I-Polar   │   │  │   Positive   │     │
               │  O                   │   │  │   Negative   │     │
               └──────────┬───────────┘   │  │   Neutral    │     │
                          │               │  └──────────────┘     │
                          │               └─────────┬─────────────┘
                          │                         │
               ┌──────────▼─────────────────────────▼────────────┐
               │         Structured Opinion Generation           │
               │                                                 │
               │   Merge CRF spans + polarity predictions        │
               │   Apply offset mapping & span filtering         │
               │                                                 │
               │  Output: (Source, Target, Expression, Polarity) │
               └─────────────────────────────────────────────────┘
```


Model Components at a Glance

| Component | Description | Technology |
|-----------|-------------|------------|
| Encoder | Contextual representation of input tokens | RoBERTa / BERT (multilingual) |
| BIO Tagger | Sequence labeling for span identification | Linear projection |
| CRF Decoder | Structured decoding with label constraints | Conditional Random Field |
| Polarity Head | Classifies sentiment of identified spans | Multi-layer classifier on [CLS] |
| Opinion Builder | Assembles final structured output tuples | Post-processing pipeline |

🔬 Methodology

1. Initial Processing & Tokenization

  • SSA extends basic sentiment analysis by extracting four key components: Holder, Target, Expression, Polarity
  • We use a pre-trained RoBERTa model (NbAiLab/nb-bert-base, [Liu et al., 2019]) as our encoder
  • The encoder transforms raw input text into rich contextual embeddings for each token
  • We benchmarked RoBERTa against bert-base and bert-medium; RoBERTa consistently outperforms both
```python
# Encoder produces a hidden state for every token position
outputs = roberta_encoder(input_ids=input_ids, attention_mask=attention_mask)
hidden_states = outputs.last_hidden_state
# Shape: [batch_size, seq_len, hidden_dim]
```

2. Tag Classification Layer (BIO Tagging)

  • A linear projection layer maps contextual embeddings to BIO tag logits
  • We use the BIO (Beginning-Inside-Outside) tagging scheme to represent the boundaries of each opinion component

BIO Tag Space

```
O           → Outside any opinion component
B-Source    → Beginning of a Holder span
I-Source    → Inside a Holder span
B-Target    → Beginning of a Target span
I-Target    → Inside a Target span
B-Polar     → Beginning of a Polar Expression span
I-Polar     → Inside a Polar Expression span
```

Projection Formula

$$Z_i = W z_i + b$$

where $W \in \mathbb{R}^{t \times H}$ is the weight matrix, $z_i \in \mathbb{R}^{H}$ is the encoder hidden state at position $i$, and $t$ is the number of possible BIO tags. Stacking the logits for all $N$ tokens gives

$$\hat{Y} \in \mathbb{R}^{N \times t}$$
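To make the tag space concrete, here is a small, dependency-free sketch that decodes a predicted BIO tag sequence into labeled spans, which is what the downstream opinion builder consumes. `bio_to_spans` is a hypothetical helper for illustration, not code from this repository:

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (label, start, end) spans.

    `end` is exclusive. A stray I-X without a preceding B-X / I-X of
    the same type is treated as starting a new span (a common repair rule).
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            if label is not None:          # close the span in progress
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag == "O" and label is not None:
            spans.append((label, start, i))
            start, label = None, None
    if label is not None:                  # span running to the end
        spans.append((label, start, len(tags)))
    return spans

tags = ["B-Source", "O", "B-Polar", "I-Polar", "I-Polar", "B-Target", "I-Target"]
print(bio_to_spans(tags))
# [('Source', 0, 1), ('Polar', 2, 5), ('Target', 5, 7)]
```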


3. Conditional Random Field (CRF)

After obtaining token-level hidden states from BERT/RoBERTa, a CRF layer is applied to decode BIO tag sequences in a globally-consistent manner.

  • The CRF learns transition weights between adjacent tags
  • This enforces structural constraints (e.g., I-Source cannot follow B-Target)
  • Enables span detection while explicitly modeling tag dependencies

CRF Training Objective

$$\mathcal{L}_{\mathrm{CRF}} = -\log\frac{\exp\big(\mathrm{score}(y_{\text{gold}})\big)}{\sum_{\hat{y} \in \mathcal{Y}_{\text{valid}}} \exp\big(\mathrm{score}(\hat{y})\big)}$$

Key Insight: CRF outperforms a plain softmax decoder because it captures inter-tag dependencies critical for multi-span structured extraction.
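The constraint idea can be illustrated with a minimal, dependency-free Viterbi decoder over the BIO tag set. This is a sketch under simplifying assumptions: transitions are hard allow/forbid rules rather than the learned transition weights of the repository's actual CRF, and start-of-sentence constraints are omitted:

```python
import math

TAGS = ["O", "B-Source", "I-Source", "B-Target", "I-Target", "B-Polar", "I-Polar"]

def allowed(prev, nxt):
    """An I-X tag may only follow B-X or I-X of the same span type."""
    if nxt.startswith("I-"):
        return prev in ("B-" + nxt[2:], "I-" + nxt[2:])
    return True

def viterbi(emissions):
    """Decode the best tag path under hard transition constraints.

    emissions: one {tag: score} dict per token.
    """
    NEG = -math.inf
    best = {t: emissions[0][t] for t in TAGS}
    backptrs = []
    for em in emissions[1:]:
        ptr, new_best = {}, {}
        for t in TAGS:
            score, prev = max(
                (best[p] if allowed(p, t) else NEG, p) for p in TAGS
            )
            new_best[t] = score + em[t]
            ptr[t] = prev
        best = new_best
        backptrs.append(ptr)
    tag = max(best, key=best.get)
    path = [tag]
    for ptr in reversed(backptrs):   # follow back-pointers to recover the path
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]

emissions = [{t: 0.0 for t in TAGS} for _ in range(2)]
emissions[0]["B-Target"] = 2.0
emissions[1]["I-Source"] = 2.0   # highest raw score, but illegal after B-Target
emissions[1]["I-Target"] = 1.5
print(viterbi(emissions))  # ['B-Target', 'I-Target']
```

Note how the structurally invalid `I-Source` is rejected even though its raw emission score is higher; a learned CRF achieves the same effect with soft transition scores.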


4. Polarity Classification

  • A separate multi-layer classification head takes the [CLS] token representation as input
  • Predicts one of three sentiment labels:
```
Sentiment Labels:
  ✅ Positive   →  Favorable / approving sentiment
  ❌ Negative   →  Critical / disapproving sentiment
  ➖ Neutral    →  Factual / non-opinionated expression
```
  • The model struggles with ambiguous sentences like:

    "The food was great, but the service was terrible."

    where polarity is mixed at the span level, a key challenge that motivates future work on span-level polarity.


5. Structured Opinion Generation

In the final stage, we combine CRF-predicted spans with polarity predictions to produce the complete structured opinion quadruple (holder, expression, target, polarity).

Post-processing pipeline:

  1. Convert span boundaries to character offsets using BERT's offset mapping with adjacent token merging
  2. Filter short spans (< 3 characters) to remove noise
  3. Assemble final output as JSON-formatted opinion tuples

⚠️ This filtering step discards a small fraction (a few percent) of true predictions.
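Steps 1 and 2 of the pipeline above can be sketched as follows. `spans_to_offsets` is an illustrative helper (the repository's actual post-processing may differ); `offset_mapping` is shaped like the per-token `(char_start, char_end)` pairs a HuggingFace fast tokenizer returns with `return_offsets_mapping=True`:

```python
def spans_to_offsets(spans, offset_mapping, min_chars=3):
    """Convert token-index spans to character offsets.

    spans: (label, tok_start, tok_end) with tok_end exclusive.
    offset_mapping: (char_start, char_end) per token.
    Spans shorter than `min_chars` characters are dropped as noise.
    """
    out = []
    for label, ts, te in spans:
        cs = offset_mapping[ts][0]          # start of first token
        ce = offset_mapping[te - 1][1]      # end of last token (merges adjacent tokens)
        if ce - cs >= min_chars:
            out.append((label, cs, ce))
    return out

text = "the price is decent"
offsets = [(0, 3), (4, 9), (10, 12), (13, 19)]  # one (start, end) pair per token
char_spans = spans_to_offsets([("Polar", 3, 4), ("Target", 0, 2)], offsets)
print([(lab, text[s:e]) for lab, s, e in char_spans])
# [('Polar', 'decent'), ('Target', 'the price')]
```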


πŸ” Baseline System

As a comparison point, we implement a Sequence Labeling + Relation Classification pipeline using BiLSTM models.

Pipeline Overview

```
Step 1: Span Extraction
        ├── BiLSTM → Extract Holders
        ├── BiLSTM → Extract Targets
        └── BiLSTM → Extract Polar Expressions

Step 2: Relation Prediction
        └── BiLSTM + Max Pooling
              ├── Full text representation
              ├── Holder / Target representation
              └── Expression representation
                    └── Concatenate → Linear → Sigmoid
                          → Binary: has_relation? (threshold = 0.5)

Step 3: Assemble tuples → (holder, target, expression, polarity)
```

The baseline uses GloVe / FastText embeddings and trains three separate BiLSTMs, one per annotation type, followed by a relation prediction model.
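The tuple-assembly step (Step 3) can be sketched as below. This is a simplified illustration, not the repository's code: `score_relation` stands in for the BiLSTM relation classifier, and polarity is passed in per expression:

```python
def assemble_tuples(holders, targets, expressions, polarity,
                    score_relation, threshold=0.5):
    """Pair each polar expression with holder/target candidates and
    keep the pairings the relation classifier accepts."""
    tuples = []
    for expr in expressions:
        linked_h = [h for h in holders if score_relation(h, expr) > threshold]
        linked_t = [t for t in targets if score_relation(t, expr) > threshold]
        for h in linked_h or [None]:        # None marks an implicit holder
            for t in linked_t or [None]:
                tuples.append((h, t, expr, polarity))
    return tuples

score = lambda a, b: 0.9  # toy scorer: accepts every candidate pair
print(assemble_tuples(["I"], ["this hotel"], ["would not recommend"],
                      "negative", score))
# [('I', 'this hotel', 'would not recommend', 'negative')]
```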


✨ Features

| Feature | Description |
|---------|-------------|
| 🌍 Multilingual Support | Works across Norwegian, English, Spanish, Catalan, and Basque |
| 🏷️ BIO Sequence Labeling | Precise span-level identification using structured tagging |
| 🔗 CRF Decoding | Globally consistent tag sequence prediction with structural constraints |
| 🎭 Polarity Classification | 3-class (Positive / Negative / Neutral) sentiment head |
| 🧩 Quadruple Extraction | Complete (holder, target, expression, polarity) output |
| 📊 Weighted F1 Evaluation | Partial-overlap scoring using token-level Jaccard intersection |
| 🔄 Cross-lingual Transfer | Train on English, evaluate on low-resource target languages |
| 📦 Modular Architecture | Encoder, CRF, and classifier heads are independently configurable |

📦 Datasets

Subtask 1: Monolingual

| Dataset | Language | Domain |
|---------|----------|--------|
| norec | Norwegian | Professional reviews (multi-domain) |
| opener_en | English | Hotel reviews |
| opener_es | Spanish | Hotel reviews |
| multibooked_ca | Catalan | Hotel reviews |
| multibooked_eu | Basque | Hotel reviews |
| darmstadt_unis | English | University reviews (online) |
| mpqa | English | News (opinion annotations) |

Subtask 2: Cross-Lingual

Trained on high-resource English data, evaluated on:

| Test Dataset | Language |
|--------------|----------|
| opener_es | Spanish |
| multibooked_ca | Catalan |
| multibooked_eu | Basque |

Data Format

Each dataset is in JSON format with the following schema:

```json
{
  "sent_id": "opener/en/hotel/english00164-6",
  "text": "Even though the price is decent for Paris, I would not recommend this hotel.",
  "opinions": [
    {
      "Source":           [["I"],                    ["44:45"]],
      "Target":           [["this hotel"],           ["66:76"]],
      "Polar_expression": [["would not recommend"],  ["46:65"]],
      "Polarity":         "negative",
      "Intensity":        "average"
    }
  ]
}
```
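A minimal reader for this schema can be sketched as follows; it recovers span text from the `"start:end"` character offsets. The toy record below is illustrative (its offsets are computed for the shortened sentence, not taken from the dataset):

```python
def read_opinions(record):
    """Yield (holder, target, expression, polarity), recovering span
    text from the "start:end" character offsets in the record."""
    text = record["text"]
    for op in record["opinions"]:
        def first_span(field):
            texts, offsets = op[field]
            if not offsets:
                return None  # implicit holder/target
            start, end = map(int, offsets[0].split(":"))
            return text[start:end]
        yield (first_span("Source"), first_span("Target"),
               first_span("Polar_expression"), op["Polarity"])

record = {
    "sent_id": "toy-0",
    "text": "I would not recommend this hotel.",
    "opinions": [{
        "Source":           [["I"],                   ["0:1"]],
        "Target":           [["this hotel"],          ["22:32"]],
        "Polar_expression": [["would not recommend"], ["2:21"]],
        "Polarity":         "negative",
    }],
}
print(list(read_opinions(record)))
# [('I', 'this hotel', 'would not recommend', 'negative')]
```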

📊 Results

Monolingual Performance

Average across all 7 datasets

| Metric | Score |
|--------|-------|
| SF1 (Sentiment F1) | 0.46 |
| SP (Sentiment Precision) | 0.62 |
| SR (Sentiment Recall) | 0.65 |

Per-dataset Breakdown:

| Dataset | SF1 | SP | SR |
|---------|-----|----|----|
| Opener_en | 0.41 | 0.37 | 0.47 |
| Opener_es | 0.35 | 0.33 | 0.38 |
| NoReC | 0.23 | 0.30 | 0.18 |
| Multibooked_ca | 0.57 | 0.53 | 0.63 |
| Multibooked_eu | 0.53 | 0.40 | 0.71 |
| darmstadt_unis | 0.55 | 0.58 | 0.00 |
| MPQA | 0.52 | 0.55 | 0.00 |

Cross-Lingual Performance

Average across 3 target language datasets

| Metric | Score |
|--------|-------|
| SF1 (Sentiment F1) | 0.35 |
| SP (Sentiment Precision) | 0.85 |
| SR (Sentiment Recall) | 0.63 |

Per-dataset Breakdown:

| Dataset | SF1 | Precision | Recall |
|---------|-----|-----------|--------|
| Opener_es | 0.000 | 0.000 | 0.000 |
| Multibooked_ca | 0.481 | 0.461 | 0.503 |
| Multibooked_eu | 0.671 | 0.618 | 0.733 |

πŸ” Key Observations

  • 🟒 BERT significantly improves span extraction results over CRF baseline alone β€” especially for English and language-similar corpora
  • 🟑 Multibooked_eu performs best in cross-lingual settings β€” likely due to the smaller dataset size and consistent hotel-review characteristics
  • πŸ”΄ Complex/ambiguous expressions (different polarity in different contexts) present a challenge across all datasets
  • πŸ“Œ Character-level BERT representations outperform word-level representations as a comparison baseline

🚀 Installation

Prerequisites

  • Python ≥ 3.8
  • PyTorch ≥ 1.9
  • CUDA (optional, for GPU acceleration)

1. Clone the Repository

```bash
git clone https://github.com/your-username/Structured-Sentiment-Analysis-IIITD-NLP-PROJECT.git
cd Structured-Sentiment-Analysis-IIITD-NLP-PROJECT
```

2. Install Core Dependencies

```bash
pip install torch torchvision transformers
pip install nltk scikit-learn tqdm gensim
```

3. Install Baseline Dependencies

```bash
pip install -r baseline/sequence_labeling/requirements.txt
```

4. Install Data Processing Dependencies

```bash
pip install -r data/requirements.txt
```

5. Prepare External Datasets

MPQA 2.0: Download from the MPQA website and run:

```bash
cd data/mpqa && bash process_mpqa.sh
```

Darmstadt Service Review Corpus: Download from TU Darmstadt and run:

```bash
cd data/darmstadt_unis && bash process_darmstadt.sh
```

πŸ› οΈ Usage

Evaluation

Run the official evaluation script on model predictions:

```bash
python evaluate.py <input_dir> <output_dir>
```

Where:

  • <input_dir>/res/ contains your predictions.json per dataset
  • <input_dir>/ref/data/ contains the gold test files
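The partial-overlap scoring idea behind the Sentiment F1 can be illustrated with a simplified, dependency-free sketch. This is not the logic of the official evaluate.py, just the weighted-overlap intuition applied to a single span field:

```python
def overlap(a, b):
    """Token overlap between two (start, end) spans, end exclusive."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def weighted_prf(preds, golds):
    """Weighted precision/recall/F1 over span lists (simplified).

    Each predicted span earns precision credit overlap/|pred|;
    each gold span earns recall credit overlap/|gold|.
    """
    prec_hits = sum(
        max((overlap(p, g) / (p[1] - p[0]) for g in golds), default=0.0)
        for p in preds)
    rec_hits = sum(
        max((overlap(p, g) / (g[1] - g[0]) for p in preds), default=0.0)
        for g in golds)
    prec = prec_hits / len(preds) if preds else 0.0
    rec = rec_hits / len(golds) if golds else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# One predicted span half-covering one of two gold spans:
print(weighted_prf(preds=[(0, 4)], golds=[(0, 2), (6, 8)]))
# (0.5, 0.5, 0.5)
```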

Baseline: Training

```bash
# Train all BiLSTM baseline models across datasets
cd baseline/sequence_labeling
bash get_baselines.sh
```

Baseline: Inference

```bash
# Run inference on a specific dataset and split
python baseline/sequence_labeling/inference.py \
    --DATADIR opener_en \
    --FILE dev.json
```

Output will be saved to:

```
baseline/sequence_labeling/saved_models/relation_prediction/<DATADIR>/prediction.json
```

Predictions Format

Prediction files must match the gold data format. Each entry should look like:

```json
{
  "sent_id": "unique-sentence-id",
  "text": "Raw input sentence here.",
  "opinions": [
    {
      "Source":           [["holder text"],     ["start:end"]],
      "Target":           [["target text"],     ["start:end"]],
      "Polar_expression": [["expression text"], ["start:end"]],
      "Polarity":         "positive"
    }
  ]
}
```

πŸ“ Project Structure

```
Structured-Sentiment-Analysis-IIITD-NLP-PROJECT/
│
├── 📄 evaluate.py                  ← Official evaluation script (SF1 / SP / SR)
│
├── 📁 baseline/
│   └── sequence_labeling/
│       ├── extraction_module.py           ← BiLSTM span extractor (Holder / Target / Expr)
│       ├── relation_prediction_module.py  ← BiLSTM relation classifier
│       ├── inference.py                   ← End-to-end inference pipeline
│       ├── convert_to_bio.py              ← Convert JSON data to BIO format
│       ├── convert_to_rels.py             ← Convert predictions to relation pairs
│       ├── utils.py                       ← Data loading & vocabulary utilities
│       ├── WordVecs.py                    ← Pretrained word embedding loader
│       ├── get_baselines.sh               ← Script to train all baseline models
│       └── requirements.txt
│
├── 📁 data/
│   ├── norec/                      ← Norwegian multi-domain reviews
│   │   ├── train.json
│   │   ├── dev.json
│   │   └── test.json
│   ├── opener_en/                  ← English hotel reviews
│   ├── opener_es/                  ← Spanish hotel reviews
│   ├── multibooked_ca/             ← Catalan hotel reviews
│   ├── multibooked_eu/             ← Basque hotel reviews
│   ├── mpqa/                       ← MPQA news corpus
│   ├── darmstadt_unis/             ← English university reviews
│   └── README.md                   ← Data format documentation
│
└── 📁 predictions/
    └── norec/
        └── predictions.json        ← Sample model predictions
```

🔭 Further Work

We identify several promising directions to build upon this work:

🕸️ Dependency Graph Parsing

In the future, we would likely move to a dependency graph parsing approach for structured sentiment, augmenting the token-level representations with their heads in a dependency tree (Kurtz et al., 2020). This would allow richer relational reasoning between opinion components.

🌐 Multi-task & Cross-lingual Learning

  • Exploring multi-task learning across languages to better leverage shared structure in multilingual sentiment graphs
  • Joint training on all monolingual datasets with language-specific adapters

🔗 Advanced Graph Parsers

  • Point Network (Samuel & Straka, 2020): a strong graph parser for SSA
  • PERIN: a permutation-invariant structured prediction framework

πŸ“ Span-level Polarity

  • Currently, polarity is predicted globally per sentence via the [CLS] token
  • Moving to span-level polarity prediction would handle cases like "The food was great, but the service was terrible"

🤖 Serialized Large Language Models

  • Explore whether large pre-trained models (e.g., GPT-4, LLaMA) can directly predict structured opinion tuples via in-context learning or fine-tuning, without an explicit CRF layer

📚 References

Barnes, J. et al. (2022). SemEval-2022 Task 10: Structured Sentiment Analysis.
  Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022).

Liu, Y. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
  arXiv:1907.11692.

Kurtz, et al. (2020). Improving Low-Resource NMT through Relevance Based Linguistic Features.
  ACL 2020.

Samuel, D. & Straka, M. (2020). ÚFAL at MRP 2020: Permutation-Invariant Semantic Parsing.
  CoNLL 2020 Shared Task.

Made with ❤️ at IIIT Delhi | NLP Course Project

For questions or contributions, please open an issue or pull request.
