Commit f9d3ef9

committed
release code
1 parent 79bd301 commit f9d3ef9

25 files changed: +1174 −9 lines changed

.gitignore

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
.Python
env/
bin/
build/
develop-eggs/
dist/
eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.cache
nosetests.xml
coverage.xml

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

# Rope
.ropeproject

# Django stuff:
*.log
*.pot

# Sphinx documentation
docs/_build/
*.db
*.0
*.cPickle
*.json

README.markdown

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
DetDup
======================
Detect duplicated items. A content deduplication framework.

Usage
----------------------
See USAGE.markdown

Slides
-------------
https://speakerdeck.com/mvj3/detdup

Deduplication features
----------------------
1. Return the list of duplicated items.
2. Send an item ID; the server loads the corresponding item bank into memory, finds the entries duplicating that item, and returns their item IDs.

Common traits of duplicated content
----------------------
1. [Length] Roughly similar or equal: the square roots of the two lengths differ by at most 1 (see the sketch below).
2. [Repetition] Extra commas, spaces, stray "s" characters, etc. at arbitrary positions.
3. [Synonymy] Full-width vs. half-width encodings; different separators; am vs. 'm.
4. [Order] Sentences reordered internally, e.g. data extracted from matching questions.

These traits rule out an inverted index based on word segmentation.
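
A minimal reading of the [Length] trait, as a sketch; pre-bucketing candidates by `int(sqrt(len(text)))` is an assumption about how the trait would be used, not code from this commit:

```python
from math import sqrt

def length_compatible(a, b):
    # [Length] trait: the square roots of the lengths differ by at most 1,
    # so candidate pairs could be pre-bucketed by int(sqrt(len(text)))
    return abs(sqrt(len(a)) - sqrt(len(b))) <= 1
```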
Recall and precision
----------------------
Recall: if feature extraction is not accurate enough, some groups will miss one or two items.
Precision: nearly 100%, because matches are scored by raw text similarity.

How DetDup relates to simhash and shingling
----------------------
1. Traits 1 and 2 above play the role of simhash's locality-sensitive hashing, which shortens a text into a 0/1 sequence, together with its blocked fast lookup. See #References# for why simhash fits question-bank deduplication poorly: items of a few dozen characters make up a large share here, whereas simhash suits large web pages, and tuning its hash parameters tends to be tedious and hard to interpret.
2. Trait 3 resembles shingling, except that shingling works on segmented words while DetDup compares all characters directly, which keeps pairs like `_range` and `orange` comparable.

Definition of text similarity
----------------------
Take the characters the two texts share, counted in both texts, and divide by their combined length. For example, the text similarity of of and off is 4 / 5 = 80%.
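
The String.calculate_text_similarity helper used by this commit lives in etl_utils and is not shown here; as a sketch, the definition above coincides with Python's difflib.SequenceMatcher.ratio(), which returns 2·M / T for M matched characters and T the combined length:

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    # 2 * (shared characters) / (combined length), per the definition above
    return SequenceMatcher(None, a, b).ratio()

print text_similarity("of", "off")   # => 0.8, i.e. 4 / 5 = 80%
```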
References
-----------------------
[Computing similarity over massive data: simhash and Hamming distance](http://www.lanceyan.com/tech/arch/simhash_hamming_distance_similarity.html)

```txt
2. Extensive testing shows that simhash works well when comparing larger texts, say 500+ characters: pairs within distance 3 are almost all similar, and the false-positive rate is fairly low. But for microblog posts of at most 140 characters, simhash performs far less well. As the chart shows, distance 3 is a reasonable compromise, while at distance 10 the results are already poor; yet in our tests many short texts that look similar really do sit at distance 10. With a threshold of 3, heavily duplicated short texts are not filtered out; with a threshold of 10, the error rate on long texts is also very high. How do we solve this?
```

README.md

Lines changed: 0 additions & 9 deletions
This file was deleted.

TODO.markdown

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
* Support int type item\_id
* In addition to pymongo, support more ORMs

USAGE.markdown

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
Requirements
----------------------
1. Install the required Python libraries from requirements.txt
2. Storage: 1. the sqlite library, 2. cPickle, 3. redis (optional)

Project pipeline
----------------------
See services/task.py for the concrete code flow.

### extract

* Subclass DetDupDataModel for the data to be processed, supplying the various data features and the cleaned text content; see the docstrings for details, and the sketch after this list.
* Import the extracted features into the feature database.
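A hypothetical sketch of such a subclass; the base-class contract lives in detdup/data_model/base.py, which this commit does not include, so the attribute names below (raw_text in particular) are assumptions:

```python
# -*- coding: utf-8 -*-
import re

from detdup.data_model import DetDupDataModel

class QuestionItem(DetDupDataModel):
    # hypothetical subclass; the base class's exact requirements are assumptions

    def typename(self):
        # DetDupCore.select_feature() dispatches items on this name
        return "question"

    @property
    def item_content(self):
        # cleaned text: drop the separator/whitespace noise listed in README.markdown
        return re.sub(r"[\s,]+", "", self.raw_text)
```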
### train

* Remove entries from the feature database that have no candidates of the same kind
* Detect duplicated entries
* Merge and print the result list
* Measure recall

Usage
----------------------
1. See detdup/services for the service interface
2. See tests for examples

Text similarity: performance numbers
-----------------------
1. At a text similarity threshold of 0.95, deduplication is almost entirely correct: 3199 duplicated items in 1463 groups.
2. At a threshold of 0.90, deduplication makes a few errors: 3297 duplicated items in 1507 groups.

That is 98 more duplicated items and 44 more duplicate groups: between 0.90 and 0.95 the groups grow by 44 / 1463.0 = 3.0%, and duplicated items account for roughly 7.4% overall.
At a threshold of 90%, the false-positive rate is about 19 / 3297.0 = 0.57% for items and 9 / 1507.0 = 0.59% for groups.

Running time grows linearly with the total number of items and the number of duplicates.

900K items
data_extract                  13 min, 8 cores
data_index                    11 min, 1 core
data_index_remove_unneeded     1 min, 1 core
data_detdup                  5.5 min, 8 cores

1.6M items
data_extract               26-32 min, 8 cores (slower than above because the 900K dataset was read from an SSD)
data_index                    25 min, 1 core
data_index_remove_unneeded     1 min, 1 core
data_detdup                 5.75 min, 8 cores

Reading the results: programming interface
-----------------------
```txt
>>> import json
>>> data = json.load(open("detdup.json", "rb"))
>>> data["result"][0:3]
[[a3c67f3da591b518cb535bd7, 76d6aeed4b31b569310db1a6], [e05f6e6da5aff02a81411342, 75a8e395b87ad910e0cef062],
 [75e7db33f06264d80c77b669, 99b6ef2b6a32d2f8317763fc, 770e993816f258edc7f3fe6b],]
```

detdup/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
from .core import DetDupCore

detdup/core.py

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
# -*- coding: utf-8 -*-

from .utils import *

from etl_utils import String, Speed, BufferLogger, ItemsGroupAndIndexes

# TODO check that item.typename() exists

from .features import DefaultFeatures

class DetDupCore(object):
    """
    Detect duplicated items using a decision tree.

    Usage:
    -----------
    """

    similarity_rate = 0.90

    def __init__(self, features_dir, detdup_data_model):
        self.features_dir = features_dir

        self.model = detdup_data_model

        self.features = [DefaultFeatures()]
        self.features_map = dict()

        self.storage_type = ['memory', 'disk'][0]

        self.is_logger = True
        self.is_inspect_detail = False
        self.buffer_logger = BufferLogger(os.path.join(self.features_dir, 'process.log'))

        self.result = ItemsGroupAndIndexes()
        self.count = 0

        self.candidate_dup_count = None

    def select_feature(self, item1):
        f1 = item1.typename
        if not isinstance(f1, str) and not isinstance(f1, unicode): f1 = f1()
        return self.features_map[f1].insert_item(item1)

    def feeded(self):
        for feature1 in self.features:
            # is this feature wired up to a DetDup instance?
            if not feature1.link_to_detdup:
                continue
            # was the database already exported earlier?!
            if os.path.exists(feature1.sqlite3db_path()):
                return True
        return False

    def load_features_from_db(self):
        for feature1 in self.features: feature1.load_features_tree()

    def dump_features_from_memory(self):
        for feature1 in self.features: feature1.dump_features_tree()

    def feed_items(self, obj, persist=True):
        """ Feed items to features """
        # 1. insert them into memory
        [self.select_feature(item1).feed_item() for item1 in process_notifier(obj)]
        # 2. back them up fully into files!
        if persist:
            self.dump_features_from_memory()
        return self

    def plug_features(self, features1):
        """
        1. Plug in features, and bind typename to classify items
        2. Init each features tree, in memory or on disk
        """
        if not isinstance(features1, list): features1 = [features1]
        self.features.extend(features1)
        for f1 in self.features:
            f1.link_to_detdup = self
            f1.build_features_tree()

        for f1 in self.features:
            self.features_map[f1.typename] = f1
        return self

    time_sql = 0
    time_calculate_text_similarity = 0
    time_fetch_content = 0

    def detect_duplicated_items(self, item1):
        feature1 = self.select_feature(item1)
        speed = Speed()

        t1 = datetime.now()
        item_ids = feature1.fetch_matched_item_ids()
        t2 = datetime.now(); self.time_sql += (t2 - t1).total_seconds();

        # check the text similarity of the candidates;
        # it must exceed the baseline (similarity_rate)
        new_ids = list()
        for item_id1 in item_ids:
            # skip the item itself
            if item_id1 == unicode(item1.item_id): continue

            if item_id1 not in self.model:
                # drop inconsistent data; self.model is the source of truth
                feature1.delete_item_ids([item_id1])
                continue

            t11 = datetime.now()
            content1 = self.model[item_id1].item_content
            t12 = datetime.now(); self.time_fetch_content += (t12 - t11).total_seconds();

            t11 = datetime.now()
            res1 = String.calculate_text_similarity(item1.item_content,
                                                    content1,
                                                    inspect=True,
                                                    skip_special_chars=True,
                                                    similar_rate_baseline=self.similarity_rate)
            t12 = datetime.now(); self.time_calculate_text_similarity += (t12 - t11).total_seconds();

            if res1['similarity_rate'] > self.similarity_rate:
                new_ids.append(item_id1)
                self.buffer_logger.append(res1['info'])
                self.buffer_logger.inspect()
        print "text similarity filter: [before]", (len(item_ids) - 1), "items, [after]", len(new_ids), "items"

        item_ids = new_ids

        # (in case items already processed as duplicates should be excluded)
        speed.tick().inspect()

        print "self.time_sql", self.time_sql
        print "self.time_calculate_text_similarity", self.time_calculate_text_similarity
        print "self.time_fetch_content", self.time_fetch_content

        return item_ids

    def detect_duplicated_items_verbose(self, item_id1, verbose=False):
        self.count += 1
        print "\n"*5, "deduplicating item", self.count, "of", self.candidate_dup_count, "candidates:", item_id1

        # return early if the result was already computed
        if self.result.exists(item_id1):
            return self.result.find(item_id1)

        self.buffer_logger.append("-"*80)
        self.buffer_logger.append("record to process")

        item1 = self.model[item_id1]
        if verbose: item1.inspect()

        self.buffer_logger.append("")
        item_ids = self.detect_duplicated_items(item1)
        self.buffer_logger.append("entries suspected to duplicate", item1.item_id, ":", len(item_ids))
        for item_id1 in item_ids:
            if verbose: self.model[item_id1].inspect()
            self.buffer_logger.append("")

        # flush the log
        if (len(item_ids) > 0) and self.is_logger:
            self.buffer_logger.inspect()
        else:
            self.buffer_logger.clear()

        item_ids.append(unicode(item1.item_id))

        # if there are duplicates, store the group
        if len(item_ids) > 1:
            self.result.add([i1 for i1 in item_ids])

        return item_ids
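
Piecing the methods above together, a rough driver sketch might look like the following; MyFeature, items, and candidate_ids are hypothetical stand-ins, since this commit shows neither detdup/features nor services/task.py:

```python
# Hypothetical driver, inferred only from DetDupCore's methods above;
# MyFeature, items, and candidate_ids are placeholders, not real API.
from detdup import DetDupCore

core = DetDupCore("./features_dir", model)        # model: {item_id: DetDupDataModel item}
core.plug_features([MyFeature()])                 # bind features and build their trees
if not core.feeded():
    core.feed_items(items)                        # extract features, then persist to sqlite

core.candidate_dup_count = len(candidate_ids)
for item_id in candidate_ids:
    core.detect_duplicated_items_verbose(item_id)

print core.result                                 # grouped duplicate ids (ItemsGroupAndIndexes)
```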

detdup/data_model/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# -*- coding: utf-8 -*-

from .base import DetDupDataModel
