Aspect-Based Sentiment Analysis with TextCNN #218

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged · 1 commit · May 21, 2025
107 changes: 107 additions & 0 deletions research/arxiv_papers/PL_FGSA/README.md
@@ -0,0 +1,107 @@
# 🧠 Aspect-Based Sentiment Analysis with TextCNN (MindSpore)

This project implements a lightweight **sentiment polarity classification system** on the **MindSpore framework**. It supports loading, preprocessing, training, and evaluation for several standard sentiment analysis datasets, and models the task with a **TextCNN** architecture.

## ✅ Highlights

- 📦 **Supported datasets**:
  - SemEval-2014 Task 4 (Laptops)
  - MAMS-ATSA
  - SST-2 (automatic `.parquet` → JSON conversion)

- 🔨 **Unified data preprocessing module**:
  - Multiple input formats (CSV / XML / JSONL / Parquet)
  - All outputs normalized to a common JSON format

- ⚙️ **Model architecture** (see the sketch after this list):
  - Lightweight TextCNN
  - Dropout / weight-decay regularization
  - Trains quickly on CPU

- 📊 **Complete evaluation metrics**:
  - Accuracy / precision / recall / F1
  - A separate log per dataset & a final CSV summary
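A minimal sketch of the TextCNN in question, for orientation only. It follows Kim (2014), and every hyperparameter below (vocabulary size, embedding width, filter counts and widths, class count, dropout rate) is an illustrative assumption rather than the configuration used in `train.py`.

```python
import mindspore.nn as nn
import mindspore.ops as ops

class TextCNN(nn.Cell):
    """Kim-style TextCNN: parallel 1-D convolutions + max-over-time pooling."""

    def __init__(self, vocab_size=20000, embed_dim=128, num_filters=100,
                 kernel_sizes=(3, 4, 5), num_classes=3, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per n-gram width.
        self.convs = nn.CellList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(p=dropout)
        self.fc = nn.Dense(num_filters * len(kernel_sizes), num_classes)

    def construct(self, ids):            # ids: (batch, seq_len) token indices
        x = self.embedding(ids)          # (batch, seq_len, embed_dim)
        x = x.transpose(0, 2, 1)         # (batch, embed_dim, seq_len)
        # ReLU feature maps, max-pooled over time, then concatenated.
        pooled = [ops.relu(conv(x)).max(axis=2) for conv in self.convs]
        return self.fc(self.dropout(ops.cat(pooled, axis=1)))
```

The parallel kernel widths act as n-gram detectors of different sizes, and max-over-time pooling makes the feature vector independent of sentence length, which is what keeps the model small enough to train quickly on CPU.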

## 📁 Project Structure

```
├── data/                   # Raw datasets (multiple formats → JSON)
│   ├── MAMS                # MAMS dataset
│   ├── SemEval_2014_Task_4 # SemEval dataset
│   └── sst-2               # SST-2 dataset
├── processed/              # Processed datasets in JSON format
│   ├── mams                # MAMS dataset
│   ├── semeval             # SemEval dataset
│   └── sst2                # SST-2 dataset
├── train.py                # TextCNN training entry point
├── dataset.py              # Data loaders + vocabulary building
└── README.md
```

## 🚀 Usage

### ✅ 1. Data Preprocessing

Place the raw data in the directories listed below, then run:

```bash
python dataset.py
```

The script automatically converts the datasets found in the following directories to JSON (a sample of the output format follows this list):

- `data/SemEval_2014_Task_4/`
- `data/MAMS/MAMS-ATSA/raw/`
- `data/sst-2/`
- `data/e-CARE/data/` (e-CARE, converted as well but not used by `train.py`)
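Each processed file is a JSON array with one record per (sentence, aspect) pair. The field layout below comes from `dataset.py`; the example values are invented:

```json
[
  {"text": "The battery life is excellent.", "aspect": "battery life", "label": 2},
  {"text": "An unannotated test sentence.", "aspect": null, "label": null, "inference_only": true}
]
```

Labels are unified as negative = 0, neutral = 1, positive = 2; sentences without gold annotations (e.g. the SemEval Phase A test file) are kept but flagged `inference_only`.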

### ✅ 2. Model Training

Run the main training script:

```bash
python train.py
```

The script trains on the following datasets in sequence:

- `semeval`
- `mams`
- `sst2`

For each dataset, the script:

- builds the vocabulary automatically;
- splits the data into train/validation/test sets (8:1:1, as sketched below);
- writes a per-dataset log to `log_*.txt`;
- appends the evaluation metrics to `results_summary.csv`.
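The following is a hedged sketch of that per-dataset setup. The helper names (`build_vocab`, `split_811`), the whitespace tokenizer, and the fixed seed are assumptions made for illustration; the actual implementation lives in `train.py`.

```python
import json
import random
from collections import Counter

def build_vocab(records, min_freq=1):
    """Index every whitespace token at or above the frequency floor."""
    counter = Counter(tok for r in records for tok in r["text"].lower().split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counter.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def split_811(records, seed=42):
    """Shuffle deterministically, then split 8:1:1 into train/val/test."""
    random.Random(seed).shuffle(records)
    n_train = int(0.8 * len(records))
    n_val = int(0.1 * len(records))
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])

with open("processed/sst2/train.json", encoding="utf-8") as f:
    records = [r for r in json.load(f) if r.get("label") is not None]

train_set, val_set, test_set = split_811(records)
vocab = build_vocab(train_set)
```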

## 📊 Sample Evaluation Results (`results_summary.csv`)

| Dataset | Acc  | Prec | Recall | F1   |
| ------- | ---- | ---- | ------ | ---- |
| semeval | 0.71 | 0.69 | 0.68 | 0.69 |
| mams | 0.61 | 0.59 | 0.60 | 0.59 |
| sst2 | 0.86 | 0.85 | 0.86 | 0.85 |

## 🛠️ Requirements

- Python ≥ 3.9
- MindSpore ≥ 2.5.0 (CPU)
- numpy, pandas, scikit-learn, tqdm
- pyarrow (for reading `.parquet` files)

Install the dependencies:

```bash
pip install -r requirements.txt
```

## 📌 Notes

- If you do not need the `sst2` dataset, remove the corresponding entry in `train.py` (see the snippet below);
- By default, SST-2 derives its train/validation split from `train.json`.
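For illustration only, the training entry point presumably iterates over a dataset list along these lines; both names below are hypothetical, not the actual code in `train.py`:

```python
# Hypothetical shape of the per-dataset loop in train.py.
for name in ["semeval", "mams", "sst2"]:  # drop "sst2" here to skip it
    run_training(name)  # assumed training helper, not a documented API
```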

## 📚 References

- Kim, Y. (2014). *Convolutional Neural Networks for Sentence Classification*.
- Socher, R. et al. (2013). *Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank*.
179 changes: 179 additions & 0 deletions research/arxiv_papers/PL_FGSA/dataset.py
@@ -0,0 +1,179 @@
import csv
import json
import xml.etree.ElementTree as ET
from pathlib import Path
from tqdm import tqdm

def unify_label(label: str) -> int:
    """Map a raw polarity label to a unified integer: negative=0, neutral=1, positive=2."""
    label = str(label).lower()
if label in ["positive", "pos", "2"]:
return 2
elif label in ["neutral", "neu", "1"]:
return 1
elif label in ["negative", "neg", "0"]:
return 0
else:
raise ValueError(f"Unknown label: {label}")

def load_semeval_dir(folder_path: str, output_folder: str):
    folder = Path(folder_path)
    output = Path(output_folder) / "laptops"
    output.mkdir(parents=True, exist_ok=True)
for file in tqdm(list(folder.glob("*.csv")), desc="Processing SemEval"):
if file.name not in {
"Laptop_Train_v2.csv",
"Laptops_Test_Data_PhaseA.csv",
"Laptops_Test_Data_PhaseB.csv"
}:
continue

data = []
with open(file, newline='', encoding='utf-8') as f:
reader = csv.DictReader(f)
headers = reader.fieldnames
if headers is None:
continue

            # Phase A test files ship without "Aspect Term"/"polarity" columns;
            # their sentences are kept for inference only.
            is_labeled = "Aspect Term" in headers and "polarity" in headers

for row in reader:
text = row.get("Sentence")
if is_labeled:
aspect = row.get("Aspect Term")
polarity = row.get("polarity")
if text and aspect and polarity:
try:
label = unify_label(polarity)
data.append({"text": text, "aspect": aspect, "label": label})
except ValueError:
continue
else:
if text:
data.append({
"text": text,
"aspect": None,
"label": None,
"inference_only": True
})
save_to_json(data, output / file.with_suffix(".json").name)

def load_ecare_dir(folder_path: str, output_folder: str):
folder = Path(folder_path)
output = Path(output_folder)
output.mkdir(parents=True, exist_ok=True)

for file in tqdm(list(folder.glob("*.jsonl")), desc="Processing e-CARE"):
data = []
with open(file, encoding='utf-8') as f:
for line in f:
try:
entry = json.loads(line)
premise = entry.get("premise")
h1 = entry.get("hypothesis1")
h2 = entry.get("hypothesis2")
label = entry.get("label")
ask_for = entry.get("ask-for", "")

if premise and h1 and h2 and label is not None:
                        # label == 0 means hypothesis1 is the correct one; the
                        # correct hypothesis gets label=1, the incorrect one 0.
data.append({
"text": f"{premise} [SEP] {h1}",
"aspect": ask_for,
"label": 1 if label == 0 else 0
})
data.append({
"text": f"{premise} [SEP] {h2}",
"aspect": ask_for,
"label": 0 if label == 0 else 1
})
except Exception:
continue
save_to_json(data, output / file.with_suffix(".json").name)


def load_mams_dir(folder_path: str, output_folder: str):
folder = Path(folder_path)
output = Path(output_folder)
output.mkdir(parents=True, exist_ok=True)
for file in tqdm(list(folder.glob("*.xml")), desc="Processing MAMS"):
data = []
tree = ET.parse(file)
root = tree.getroot()
        for sentence in root.findall("sentence"):
            text_node = sentence.find("text")
            if text_node is None or text_node.text is None:
                continue  # skip malformed sentences without a <text> element
            text = text_node.text
aspect_terms = sentence.find("aspectTerms")
if aspect_terms is not None and list(aspect_terms):
for aspect in aspect_terms:
try:
term = aspect.attrib["term"]
polarity = unify_label(aspect.attrib["polarity"])
data.append({"text": text, "aspect": term, "label": polarity})
except (KeyError, ValueError):
continue
else:
data.append({
"text": text,
"aspect": None,
"label": None,
"inference_only": True
})
save_to_json(data, output / file.with_suffix(".json").name)


def load_sst2_dir(folder_path: str, output_folder: str):
    import pandas as pd  # deferred import; reading .parquet also needs pyarrow

folder = Path(folder_path)
output = Path(output_folder)
output.mkdir(parents=True, exist_ok=True)

for file in tqdm(list(folder.glob("*.parquet")), desc="Processing SST-2"):
data = []
df = pd.read_parquet(file)
for _, row in df.iterrows():
text = row.get("sentence")
label = row.get("label")
if text is not None and label is not None:
try:
data.append({
"text": text,
"aspect": None,
"label": int(label)
})
except ValueError:
continue
save_to_json(data, output / file.with_suffix(".json").name)



def save_to_json(data, output_path):
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=2)
print(f"Saved {len(data)} samples to {output_path}")

def preprocess_all_datasets(dataset_dirs: dict, output_root: str):
"""
    dataset_dirs: dict with any of the keys 'semeval', 'ecare', 'mams', 'sst2'
output_root: directory to save all processed files, grouped by dataset
"""
output_root = Path(output_root)
if "semeval" in dataset_dirs:
load_semeval_dir(dataset_dirs["semeval"], output_root / "semeval")
if "ecare" in dataset_dirs:
load_ecare_dir(dataset_dirs["ecare"], output_root / "ecare")
if "mams" in dataset_dirs:
load_mams_dir(dataset_dirs["mams"], output_root / "mams")
if "sst2" in dataset_dirs:
load_sst2_dir(dataset_dirs["sst2"], output_root / "sst2")

if __name__ == '__main__':
preprocess_all_datasets(
dataset_dirs={
"semeval": "data/SemEval_2014_Task_4/",
"ecare": "data/e-CARE/data/",
"mams": "data/MAMS/MAMS-ATSA/raw",
"sst2": "data/sst-2/"
},
output_root="processed/"
)
4 changes: 4 additions & 0 deletions research/arxiv_papers/PL_FGSA/requirements.txt
@@ -0,0 +1,4 @@
tqdm
scikit-learn
numpy
pandas
pyarrow