Skip to content

C++ and Python implementation of TopWORDS (Deng, K., 2016)

Notifications You must be signed in to change notification settings

yuhuang-cst/topwords

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TopWORDS

This is an implementation of TopWORDS. TopWORDS is an unigram language model, which assumes words are independent and the probability of a sentence is the cumprod of probablitie of words. We use EM algorithm to optimize parameters.

Reference:

[1] Deng, K. (2016). On the unsupervised analysis of domain-specific Chinese texts.

Installation

We implemented TopWORDS in C++ and also provided a Python wrapper using Cython. gcc-7 compiler (for -std=c++17) and OpenMP parallel library (for -fopenmp) are needed to install this package.

C++

Write a .cpp and includes topwords/topwords_lib.h header. For example, the demo program examples/main.cpp looks like:

#include "../topwords/topwords_lib.h"

using namespace std;

void output_result(const unordered_map<wstring_view, double>& vocab2freq, const unordered_map<wstring_view, double>& vocab2psi, int topk=200, int step=10) {
    // ...
}

void process1() {
    cout << "process1: =========================" << endl;
    string input_txt = "./story_of_stone/corpus.txt";
    vector<wstring> corpus;
    read_wlines(input_txt, corpus);
    unordered_map<wstring_view, double> vocab2freq, vocab2psi;
    topwords_em(corpus, vocab2freq, vocab2psi, 10, 1.0, 4, 1e-5, true, -1);
    output_result(vocab2freq, vocab2psi);
}

void process2() {
    cout << "process2: =========================" << endl;
    string input_txt = "./story_of_stone/corpus.txt";
    string vocab2freq_txt = "./cpp_vocab2freq.txt";
    string vocab2psi_txt = "./cpp_vocab2psi.txt";
    topwords_em(input_txt, vocab2freq_txt, vocab2psi_txt, "|", "", 10, 1.0, 4, 1e-5, true, -1);
}

int main() {
    process1();
    process2();
    return 0;
}

Then run the Makefile to generate executable program:

cd examples
make
./main

Python

If you are not familiar with C++, you can use the Python wrapper. Simply run the setup script to install:

python3 setup.py install

or install from git repo:

pip3 install git+https://github.com/yuhuang-cst/topwords.git

The demo program examples/main.py looks like:

import os
import topwords

def process1():
	print('process1: =========================')
	corpus_path = os.path.join('story_of_stone', 'corpus.txt')
	corpus = open(corpus_path).read().splitlines()
	vocab2freq, vocab2psi = topwords.em(
		corpus, n_iter=10, freq_threshold=1.0, max_len=4, lamb=1e-5, verbose=True, n_jobs=-1, loc='')

	topk, step = 200, 10
	vocab_freq = sorted(vocab2freq.items(), key=lambda item: item[1], reverse=True)[:topk]
	print('Top 200 words (sorted by frequency):')
	for i, (word, freq) in enumerate(vocab_freq):
		print('{} {:.2f}; '.format(word, freq), end='')
		if (i % 10 == 0): print('')
	print('')

	print('Top 200 words (sorted by psi):')
	vocab_psi = sorted(vocab2psi.items(), key=lambda item:item[1], reverse=True)[:topk]
	for i, (word, freq) in enumerate(vocab_psi):
		print('{} {:.2f}; '.format(word, freq), end='')
		if (i % 10 == 0): print('')
	print('')


def process2():
	print('process2: =========================')
	corpus_path = os.path.join('story_of_stone', 'corpus.txt')
	vocab2freq_path = os.path.join('py_vocab2freq.txt')
	vocab2psi_path = os.path.join('py_vocab2psi.txt')
	topwords.file_em(corpus_path, vocab2freq_path, vocab2psi_path, sep='|',
		n_iter=10, freq_threshold=1.0, max_len=4, lamb=1e-5, verbose=True, n_jobs=-1, loc='')


if __name__ == '__main__':
	process1()
	process2()

Demo

The output of the demo looks like: (Full results are located in examples/)

process1: =========================
initializing vocabulary...
vocabulary initilazation done.
Iter = 1 / 10 , vocab size = 799822
Iter = 2 / 10 , vocab size = 22955
Iter = 3 / 10 , vocab size = 21846
Iter = 4 / 10 , vocab size = 20684
Iter = 5 / 10 , vocab size = 20295
Iter = 6 / 10 , vocab size = 20091
Iter = 7 / 10 , vocab size = 19980
Iter = 8 / 10 , vocab size = 19909
Iter = 9 / 10 , vocab size = 19871
Iter = 10 / 10 , vocab size = 19854
Final vocab size = 19839
Top 200 words (sorted by frequency):
的 4230.70; 了 3815.96; 我 2030.22; 你 1941.54; 来 1910.58; 他 1714.86; 宝玉 1443.93; 说 1355.07; 又 1312.92; 也 1229.71; 
去 1190.71; 便 1121.88; 人 986.92; 这 973.97; 那 932.89; 呢 894.50; 是 878.41; 不 767.89; 就 750.02; 都 746.56; 
笑道 694.93; 道 667.20; 说着 648.27; 有 644.85; 王夫人 641.05; 要 610.88; 老太太 604.00; 我们 600.53; 一 590.54; 如今 589.79; 
好 579.68; 因 574.01; 在 559.62; 上 529.44; 凤姐 524.75; 贾母 518.53; 你们 508.19; 只 506.10; 着 505.92; 还 487.39; 
等 470.11; 说道 466.90; 宝钗 463.93; 叫 461.42; 黛玉 453.30; 和 450.92; 去了 446.22; 儿 444.65; 宝玉道 441.40; 袭人 438.81; 
一个 427.11; 姑娘 421.23; 自己 413.54; 出来 409.02; 再 405.93; 见 401.10; 他们 396.15; 将 395.06; 看 386.03; 贾琏 380.17; 
罢 378.53; 来了 377.60; 只见 373.79; 到 371.53; 才 369.74; 贾政 368.79; 倒 367.39; 子 356.25; 起来 351.82; 把 341.09; 
先 333.71; 凤姐儿 332.88; ` 330.14; 什么 329.43; 只是 324.14; 太太 317.73; 出 317.53; 得 317.29; 一面 314.00; 进来 313.81; 
回来 313.26; 平儿 312.83; 问 311.70; 不知 311.10; 怎么 308.06; 薛姨妈 306.61; 大 304.71; 过 304.65; 奶奶 300.50; 所以 300.36; 
大家 299.56; 老爷 299.52; 一时 296.73; 时 293.47; 众人 289.76; 咱们 287.66; 与 282.37; 下 281.72; 不是 281.13; 没有 280.40; 
忙 280.18; 方 275.88; 若 274.12; 那里 272.78; 些 271.14; 这个 270.12; 打 268.36; 只得 266.35; 家 266.17; 事 265.69; 
这里 260.65; 请 257.67; 过来 256.03; 二爷 254.63; 吃 253.51; 听见 253.04; 作 252.45; 别 251.18; 的人 250.35; 正 249.78; 
今日 249.20; 连 248.78; 为 247.09; 的话 244.22; 出去 242.53; 这样 239.47; 同 237.77; 已 237.70; 宝玉笑道 233.60; 可 231.84; 
听了 230.38; 紫鹃 229.53; 自 229.50; 回 228.76; 个 221.49; 往 221.27; 却 216.75; 就是 212.68; 心 211.44; 谁 211.43; 
林黛玉 209.79; 的事 208.84; 邢夫人 207.73; 没 206.58; 给 205.50; 多 205.22; 竟 204.91; 探春 204.20; 也不 203.27; 送 202.88; 
不过 202.46; 向 201.20; 也是 199.17; 且 199.09; 命 198.46; 听 198.03; 鸳鸯 196.40; 拿 195.91; 知道 194.03; 姐姐 193.98; 
刘姥姥 191.65; 从 190.66; 都是 190.38; 晴雯 189.49; 找 188.96; 贾珍 188.46; 贾母道 188.11; 起 188.05; 想 187.14; 头 181.02; 
见了 178.18; 两个 177.95; 么 176.90; 早 176.11; 气 175.82; 如何 175.52; 叫他 174.25; 今儿 172.41; 用 171.25; 小 170.25; 
话 169.34; 在这里 168.31; 袭人道 168.17; 们 168.07; 东西 168.06; 周瑞家的 167.90; 比 167.43; 贾政道 166.17; 并 165.99; 薛蟠 165.91; 
香菱 164.69; 心里 164.09; 宝玉听了 162.54; 也有 161.88; 不好 161.42; 做 160.77; 李纨 159.46; 凤姐道 158.99; 进去 158.62; 只管 158.09; 

Top 200 words (sorted by psi):
桌围 inf; 残菊 inf; 港台 inf; 扇坠 inf; 菩萨 inf; 疯疯癫癫 inf; 科幻 inf; 慧绣 inf; 噗哧 inf; 轰去魂魄 inf; 
豆腐 inf; 咭咭呱呱 inf; 忆菊 inf; 蛇影杯弓 inf; 妆狐媚 inf; 孽障 inf; 琥珀 inf; 呜呜咽咽 inf; 杏帘 inf; 咕咚 inf; 
慧纹 inf; 唐伯虎 inf; 稻香老农 inf; 皱一回眉 inf; 浣葛山庄 inf; 拾头打滚 inf; 桌椅 inf; 潇湘馆 inf; 绛云轩 inf; 狗彘奴欺 inf; 
容俊俏 inf; 颓丧 inf; 冰片 inf; 惹人厌 inf; 鹦鹉 inf; 骢马使弹 inf; 抖衣而颤 inf; 诟谇谣诼 inf; 影影绰绰 inf; 魂不附体 inf; 
咳嗽 inf; 蘅芷清芬 inf; 簪菊 inf; 乞丐 inf; 葫芦 inf; 摇了一摇 inf; 蓼风轩 inf; 射覆 inf; 残忍乖僻 inf; 鸳鸯 inf; 
欺负 inf; 嗳哟 inf; 菊影 inf; 稻香村 inf; 吩咐 inf; 痰盒 inf; 呜呼 inf; 荡悠悠 inf; 蓼汀花溆 inf; —— inf; 
鸦雀无闻 inf; 吱喽 inf; 访菊 inf; 咏菊 inf; 婆婆 inf; 绮霰 inf; 藕香榭 inf; 陈设 inf; 爽利 inf; 钥匙 inf; 
飘飘摇摇 inf; 咯吱 inf; 蜡烛 inf; 曹雪芹 inf; 伴鹤 inf; 耗子精 inf; 荣二宅 inf; 卯正二刻 inf; 硬作保山 inf; 碧痕 inf; 
婶婶 inf; 绣桔 inf; 汉宫春晓 inf; …… inf; 牡丹亭 inf; 逢凶化吉 inf; 锞十锭 inf; 苇叶 inf; 偕鸳 inf; 浩浩荡荡 inf; 
鲸卿 inf; 武侠 inf; 渺茫 inf; 漱盂 inf; 枉费 inf; 杀律 inf; 陈皮 inf; 祭祀 inf; 供菊 inf; 巾帕 inf; 
威赫 inf; 绣鸾 inf; 粗鄙 inf; 冲一冲 inf; 温飞卿 inf; 玻璃 inf; 鸡蛋 inf; 荇叶渚 inf; 守庚申 inf; и着鞋 inf; 
嬷嬷 inf; 椅搭 inf; 蘼芜 inf; 尤氏 inf; 的 18290.63; 了 14816.04; 宝玉 8510.77; 你 7341.34; 来 7077.09; 我 7014.89; 
他 6271.53; 说 5868.91; 又 5565.78; 便 5542.95; 也 4734.43; 王夫人 4610.45; 去 4571.93; 笑道 4163.00; 呢 4113.70; 那 4029.09; 
这 3875.85; 凤姐 3666.55; 如今 3544.68; 都 3443.05; 老太太 3345.54; 说着 3315.36; 道 3310.15; 宝钗 3287.71; 黛玉 3195.22; 就 3162.51; 
人 3109.81; 贾母 3069.25; 是 3007.89; 因 2904.07; 贾琏 2801.28; 姑娘 2725.29; 袭人 2608.52; 薛姨妈 2510.03; 贾政 2451.42; 我们 2443.03; 
自己 2419.29; 好 2291.90; 要 2270.59; 只 2145.63; 不 2124.75; 有 2116.86; 和 2106.30; 宝玉道 2078.45; 你们 2046.33; 等 2031.82; 
叫 1986.99; 在 1972.22; 紫鹃 1963.04; 奶奶 1955.59; 所以 1919.55; 说道 1914.12; ` 1807.24; 再 1803.39; 平儿 1802.89; 还 1800.45; 
刘姥姥 1789.22; 才 1787.87; 将 1771.02; 什么 1753.53; 周瑞家的 1741.98; 罢 1736.18; 上 1730.33; 太太 1714.37; 怎么 1704.66; 咱们 1704.44; 
邢夫人 1701.42; 只见 1665.50; 一 1640.03; 老爷 1629.90; 一面 1589.77; 先 1578.09; 着 1566.62; 见 1540.10; 今日 1532.92; 二爷 1530.59; 
儿 1528.16; 看 1525.06; 探春 1510.26; 晴雯 1501.63; 大家 1500.79; 一个 1491.98; 倒 1487.99; 他们 1477.78; 出来 1473.27; 把 1472.82; 

process2: =========================
initializing vocabulary...
vocabulary initilazation done.
Iter = 1 / 10 , vocab size = 799822
Iter = 2 / 10 , vocab size = 22955
Iter = 3 / 10 , vocab size = 21846
Iter = 4 / 10 , vocab size = 20684
Iter = 5 / 10 , vocab size = 20295
Iter = 6 / 10 , vocab size = 20091
Iter = 7 / 10 , vocab size = 19980
Iter = 8 / 10 , vocab size = 19909
Iter = 9 / 10 , vocab size = 19871
Iter = 10 / 10 , vocab size = 19854
Final vocab size = 19839
Writing vocabulary and frequency...
Writing vocabulary and importance score...

Issue

  • Due to unknown reason (maybe bugs of OpenMP), sometimes the actually running cpu cores will be half of n_jobs set by user. In this case, just double the n_jobs to solve the problem.

About

C++ and Python implementation of TopWORDS (Deng, K., 2016)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published