Commit f9d3ef9

committed
release code
1 parent 79bd301 commit f9d3ef9

25 files changed: +1174 −9 lines changed

.gitignore

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
.Python
env/
bin/
build/
develop-eggs/
dist/
eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.cache
nosetests.xml
coverage.xml

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

# Rope
.ropeproject

# Django stuff:
*.log
*.pot

# Sphinx documentation
docs/_build/
*.db
*.0
*.cPickle
*.json

README.markdown

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
DetDup
======================
Detect duplicated items. A content deduplication framework.

Usage
----------------------
See USAGE.markdown

Slides
-------------
https://speakerdeck.com/mvj3/detdup

Deduplication features
----------------------
1. Return the list of duplicated items.
2. Send an item ID; the server loads the corresponding item bank into memory, finds the entries duplicating that item, and returns their item IDs.

Common traits of duplicated content
----------------------
1. [Length] Roughly similar or equal: the square roots of the two lengths differ by at most 1 (see the sketch below).
2. [Repetition] Extra commas, spaces, stray "s" characters, etc. at arbitrary positions.
3. [Synonymy] Full-width vs. half-width encodings; different separators; am vs. 'm.
4. [Order] Sentences reordered internally, e.g. data extracted from matching questions.

These traits rule out an inverted index based on word segmentation.
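
A minimal reading of the [Length] trait, as a sketch; pre-bucketing candidates by `int(sqrt(len(text)))` is an assumption about how the trait would be used, not code from this commit:

```python
from math import sqrt

def length_compatible(a, b):
    # [Length] trait: the square roots of the lengths differ by at most 1,
    # so candidate pairs could be pre-bucketed by int(sqrt(len(text)))
    return abs(sqrt(len(a)) - sqrt(len(b))) <= 1
```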
Recall and precision
----------------------
Recall: if feature extraction is not accurate enough, some groups will miss one or two items.
Precision: nearly 100%, because matches are scored by raw text similarity.

How DetDup relates to simhash and shingling
----------------------
1. Traits 1 and 2 above play the role of simhash's locality-sensitive hashing, which shortens a text into a 0/1 sequence, together with its blocked fast lookup. See #References# for why simhash fits question-bank deduplication poorly: items of a few dozen characters make up a large share here, whereas simhash suits large web pages, and tuning its hash parameters tends to be tedious and hard to interpret.
2. Trait 3 resembles shingling, except that shingling works on segmented words while DetDup compares all characters directly, which keeps pairs like `_range` and `orange` comparable.

Definition of text similarity
----------------------
Take the characters the two texts share, counted in both texts, and divide by their combined length. For example, the text similarity of of and off is 4 / 5 = 80%.
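
The String.calculate_text_similarity helper used by this commit lives in etl_utils and is not shown here; as a sketch, the definition above coincides with Python's difflib.SequenceMatcher.ratio(), which returns 2·M / T for M matched characters and T the combined length:

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    # 2 * (shared characters) / (combined length), per the definition above
    return SequenceMatcher(None, a, b).ratio()

print text_similarity("of", "off")   # => 0.8, i.e. 4 / 5 = 80%
```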
References
-----------------------
[Computing similarity over massive data: simhash and Hamming distance](http://www.lanceyan.com/tech/arch/simhash_hamming_distance_similarity.html)

```txt
2. Extensive testing shows that simhash works well when comparing larger texts, say 500+ characters: pairs within distance 3 are almost all similar, and the false-positive rate is fairly low. But for microblog posts of at most 140 characters, simhash performs far less well. As the chart shows, distance 3 is a reasonable compromise, while at distance 10 the results are already poor; yet in our tests many short texts that look similar really do sit at distance 10. With a threshold of 3, heavily duplicated short texts are not filtered out; with a threshold of 10, the error rate on long texts is also very high. How do we solve this?
```

README.md

Lines changed: 0 additions & 9 deletions
This file was deleted.

TODO.markdown

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
* Support int type item\_id
* In addition to pymongo, support more ORMs

USAGE.markdown

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
Requirements
----------------------
1. Install the required Python libraries from requirements.txt
2. Storage: 1. the sqlite library, 2. cPickle, 3. redis (optional)

Project pipeline
----------------------
See services/task.py for the concrete code flow.

### extract

* Subclass DetDupDataModel for the data to be processed, supplying the various data features and the cleaned text content; see the docstrings for details, and the sketch after this list.
* Import the extracted features into the feature database.
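A hypothetical sketch of such a subclass; the base-class contract lives in detdup/data_model/base.py, which this commit does not include, so the attribute names below (raw_text in particular) are assumptions:

```python
# -*- coding: utf-8 -*-
import re

from detdup.data_model import DetDupDataModel

class QuestionItem(DetDupDataModel):
    # hypothetical subclass; the base class's exact requirements are assumptions

    def typename(self):
        # DetDupCore.select_feature() dispatches items on this name
        return "question"

    @property
    def item_content(self):
        # cleaned text: drop the separator/whitespace noise listed in README.markdown
        return re.sub(r"[\s,]+", "", self.raw_text)
```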
### train

* Remove entries from the feature database that have no candidates of the same kind
* Detect duplicated entries
* Merge and print the result list
* Measure recall

Usage
----------------------
1. See detdup/services for the service interface
2. See tests for examples

Text similarity: performance numbers
-----------------------
1. At a text similarity threshold of 0.95, deduplication is almost entirely correct: 3199 duplicated items in 1463 groups.
2. At a threshold of 0.90, deduplication makes a few errors: 3297 duplicated items in 1507 groups.

That is 98 more duplicated items and 44 more duplicate groups: between 0.90 and 0.95 the groups grow by 44 / 1463.0 = 3.0%, and duplicated items account for roughly 7.4% overall.
At a threshold of 90%, the false-positive rate is about 19 / 3297.0 = 0.57% for items and 9 / 1507.0 = 0.59% for groups.

Running time grows linearly with the total number of items and the number of duplicates.

900K items
data_extract                  13 min, 8 cores
data_index                    11 min, 1 core
data_index_remove_unneeded     1 min, 1 core
data_detdup                  5.5 min, 8 cores

1.6M items
data_extract               26-32 min, 8 cores (slower than above because the 900K dataset was read from an SSD)
data_index                    25 min, 1 core
data_index_remove_unneeded     1 min, 1 core
data_detdup                 5.75 min, 8 cores

Reading the results: programming interface
-----------------------
```txt
>>> import json
>>> data = json.load(open("detdup.json", "rb"))
>>> data["result"][0:3]
[[a3c67f3da591b518cb535bd7, 76d6aeed4b31b569310db1a6], [e05f6e6da5aff02a81411342, 75a8e395b87ad910e0cef062],
 [75e7db33f06264d80c77b669, 99b6ef2b6a32d2f8317763fc, 770e993816f258edc7f3fe6b],]
```

detdup/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
from .core import DetDupCore

detdup/core.py

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
# -*- coding: utf-8 -*-

from .utils import *

from etl_utils import String, Speed, BufferLogger, ItemsGroupAndIndexes

# TODO check that item.typename() exists

from .features import DefaultFeatures

class DetDupCore(object):
    """
    Detect duplicated items using a decision tree.

    Usage:
    -----------
    """

    similarity_rate = 0.90

    def __init__(self, features_dir, detdup_data_model):
        self.features_dir = features_dir

        self.model = detdup_data_model

        self.features = [DefaultFeatures()]
        self.features_map = dict()

        self.storage_type = ['memory', 'disk'][0]

        self.is_logger = True
        self.is_inspect_detail = False
        self.buffer_logger = BufferLogger(os.path.join(self.features_dir, 'process.log'))

        self.result = ItemsGroupAndIndexes()
        self.count = 0

        self.candidate_dup_count = None

    def select_feature(self, item1):
        f1 = item1.typename
        if not isinstance(f1, str) and not isinstance(f1, unicode): f1 = f1()
        return self.features_map[f1].insert_item(item1)

    def feeded(self):
        for feature1 in self.features:
            # is this feature wired up to a DetDup instance?
            if not feature1.link_to_detdup:
                continue
            # was the database already exported earlier?!
            if os.path.exists(feature1.sqlite3db_path()):
                return True
        return False

    def load_features_from_db(self):
        for feature1 in self.features: feature1.load_features_tree()

    def dump_features_from_memory(self):
        for feature1 in self.features: feature1.dump_features_tree()

    def feed_items(self, obj, persist=True):
        """ Feed items to features """
        # 1. insert them into memory
        [self.select_feature(item1).feed_item() for item1 in process_notifier(obj)]
        # 2. back them up fully into files!
        if persist:
            self.dump_features_from_memory()
        return self

    def plug_features(self, features1):
        """
        1. Plug in features, and bind typename to classify items
        2. Init each features tree, in memory or on disk
        """
        if not isinstance(features1, list): features1 = [features1]
        self.features.extend(features1)
        for f1 in self.features:
            f1.link_to_detdup = self
            f1.build_features_tree()

        for f1 in self.features:
            self.features_map[f1.typename] = f1
        return self

    time_sql = 0
    time_calculate_text_similarity = 0
    time_fetch_content = 0

    def detect_duplicated_items(self, item1):
        feature1 = self.select_feature(item1)
        speed = Speed()

        t1 = datetime.now()
        item_ids = feature1.fetch_matched_item_ids()
        t2 = datetime.now(); self.time_sql += (t2 - t1).total_seconds();

        # check the text similarity of the candidates;
        # it must exceed the baseline (similarity_rate)
        new_ids = list()
        for item_id1 in item_ids:
            # skip the item itself
            if item_id1 == unicode(item1.item_id): continue

            if item_id1 not in self.model:
                # drop inconsistent data; self.model is the source of truth
                feature1.delete_item_ids([item_id1])
                continue

            t11 = datetime.now()
            content1 = self.model[item_id1].item_content
            t12 = datetime.now(); self.time_fetch_content += (t12 - t11).total_seconds();

            t11 = datetime.now()
            res1 = String.calculate_text_similarity(item1.item_content,
                                                    content1,
                                                    inspect=True,
                                                    skip_special_chars=True,
                                                    similar_rate_baseline=self.similarity_rate)
            t12 = datetime.now(); self.time_calculate_text_similarity += (t12 - t11).total_seconds();

            if res1['similarity_rate'] > self.similarity_rate:
                new_ids.append(item_id1)
                self.buffer_logger.append(res1['info'])
                self.buffer_logger.inspect()
        print "text similarity filter: [before]", (len(item_ids) - 1), "items, [after]", len(new_ids), "items"

        item_ids = new_ids

        # (in case items already processed as duplicates should be excluded)
        speed.tick().inspect()

        print "self.time_sql", self.time_sql
        print "self.time_calculate_text_similarity", self.time_calculate_text_similarity
        print "self.time_fetch_content", self.time_fetch_content

        return item_ids

    def detect_duplicated_items_verbose(self, item_id1, verbose=False):
        self.count += 1
        print "\n"*5, "deduplicating item", self.count, "of", self.candidate_dup_count, "candidates:", item_id1

        # return early if the result was already computed
        if self.result.exists(item_id1):
            return self.result.find(item_id1)

        self.buffer_logger.append("-"*80)
        self.buffer_logger.append("record to process")

        item1 = self.model[item_id1]
        if verbose: item1.inspect()

        self.buffer_logger.append("")
        item_ids = self.detect_duplicated_items(item1)
        self.buffer_logger.append("entries suspected to duplicate", item1.item_id, ":", len(item_ids))
        for item_id1 in item_ids:
            if verbose: self.model[item_id1].inspect()
            self.buffer_logger.append("")

        # flush the log
        if (len(item_ids) > 0) and self.is_logger:
            self.buffer_logger.inspect()
        else:
            self.buffer_logger.clear()

        item_ids.append(unicode(item1.item_id))

        # if there are duplicates, store the group
        if len(item_ids) > 1:
            self.result.add([i1 for i1 in item_ids])

        return item_ids
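
Piecing the methods above together, a rough driver sketch might look like the following; MyFeature, items, and candidate_ids are hypothetical stand-ins, since this commit shows neither detdup/features nor services/task.py:

```python
# Hypothetical driver, inferred only from DetDupCore's methods above;
# MyFeature, items, and candidate_ids are placeholders, not real API.
from detdup import DetDupCore

core = DetDupCore("./features_dir", model)        # model: {item_id: DetDupDataModel item}
core.plug_features([MyFeature()])                 # bind features and build their trees
if not core.feeded():
    core.feed_items(items)                        # extract features, then persist to sqlite

core.candidate_dup_count = len(candidate_ids)
for item_id in candidate_ids:
    core.detect_duplicated_items_verbose(item_id)

print core.result                                 # grouped duplicate ids (ItemsGroupAndIndexes)
```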

detdup/data_model/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# -*- coding: utf-8 -*-

from .base import DetDupDataModel
