Add neural search application codes #1463

w5688414 · 2021-12-14T12:46:44Z

PR types

New features

PR changes

Description

Add Neural Search Code

…develop

Updata

tianxin1860

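In the results table, wouldn't it be easier for users to understand our approach if we stated the corresponding conclusions directly?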

tianxin1860 · 2021-12-14T14:41:59Z

application/neural_search/README.md

+
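+Retrieval systems are part of many products we use every day, such as product search and academic literature search. This solution provides a complete implementation of a retrieval system. The target scenario: a user enters a Query and quickly finds similar documents in a massive collection.
+
+Semantic retrieval (also called vector-based retrieval) means the retrieval system is no longer bound to the literal wording of the user's Query; it captures the real intent behind the Query and searches on that basis, returning the best-matching results more accurately. It uses state-of-the-art language models to obtain vector representations of text, indexes them in a high-dimensional vector space, and measures the similarity between the query vector and the indexed documents, which addresses the shortcomings of keyword-based indexing.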


The vector representations of text come from a semantic indexing model, not from a language model.
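
To make "measure the similarity between the query vector and the indexed documents" concrete, here is a minimal sketch in plain Python. The toy 3-dimensional vectors and the helper names `cosine` and `top_k` are illustrative assumptions; in the real pipeline the vectors would come from the trained semantic indexing model, and the search would go through an ANN engine rather than a full scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_index, score) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings" (hypothetical values, for illustration only).
doc_vecs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.1, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for idx, score in results:
    print(idx, round(score, 3))
```

A full scan like this is exact but linear in the corpus size, which is why the README pairs the embedding model with an ANN engine for large corpora.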

tianxin1860 · 2021-12-14T14:43:55Z

application/neural_search/README.md

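+ Low barrier to entry
+    + Step-by-step guidance for building a retrieval system
+    + A retrieval system can be built even without labeled data
+    + One-stop capabilities for training, prediction, and the ANN engine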


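There is an extra space.

To keep the rendering clean and readable, a space was added before and after every English term (no space is added when the term is adjacent to punctuation).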

tianxin1860 · 2021-12-14T14:45:47Z

application/neural_search/README.md

+    + Further optimization: Domain-adaptive Pretraining
+ High performance
+    + Fast vector extraction based on Paddle Inference
+    + Fast index building and fast ANN queries


Fast queries and high-performance index building based on Milvus

tianxin1860 · 2021-12-14T14:48:55Z

application/neural_search/README.md

+
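+#### 2.2.2 Recall Module
+
+The recall module needs to quickly recall candidates from massive data at the scale of hundreds of billions or trillions of records. It first extracts the Embedding of each text in the corpus, then uses a vector search engine to perform efficient ANN search, which yields the candidate set.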


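Hundreds of billions or trillions is an exaggeration. Delete it, or say tens of millions? So far we have only validated index building at the tens-of-millions scale.

Fixed.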

tianxin1860 · 2021-12-14T14:49:55Z

application/neural_search/README.md

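+|  None |  Plenty | InBatchNegative|
+|  Available | Available  | SimCSE+ InBatchNegative |
+
+In the most basic case there is only unsupervised data, and we recommend unsupervised training with SimCSE. In the other case there is only supervised data, and we recommend supervised training with the In-batch Negative method.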


In-batch Negatives: please use the term consistently.
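
For readers unfamiliar with the term, the general idea of In-batch Negatives can be sketched as follows: within a batch of (query, title) pairs, pair i is the positive for query i and every other title in the batch serves as a negative, so the loss is a softmax cross-entropy over the query-title similarity matrix with the diagonal as the target. This is a plain-Python sketch of that general technique, not the code in this PR; the `scale` parameter and the toy vectors are assumptions, and the actual training script may add details such as a margin.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def in_batch_negatives_loss(query_vecs, title_vecs, scale=1.0):
    """Mean cross-entropy where the correct title for query i is title i."""
    n = len(query_vecs)
    total = 0.0
    for i in range(n):
        # Similarity of query i to every title in the batch.
        logits = [scale * dot(query_vecs[i], title_vecs[j]) for j in range(n)]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
        # Negative log-probability of the positive (diagonal) entry.
        total += log_sum_exp - logits[i]
    return total / n

# Toy batch of two pairs (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_titles = [[1.0, 0.0], [0.0, 1.0]]
shuffled_titles = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = in_batch_negatives_loss(queries, aligned_titles)
shuffled_loss = in_batch_negatives_loss(queries, shuffled_titles)
print(aligned_loss < shuffled_loss)
```

The loss is lower when each query is closest to its own title, which is the behavior training pushes toward.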

tianxin1860 · 2021-12-14T15:01:50Z

application/neural_search/README.md

+
+Step 1: unsupervised training with Domain-adaptive Pretraining
+
+Training took 16hour55min; see [ERNIE 1.0](./recall/domain_adaptive_pretraining/)


A space is needed between digits and Chinese characters.

tianxin1860 · 2021-12-14T15:02:06Z

application/neural_search/README.md

+
+Step 2: unsupervised training with SimCSE
+
+Training took 16hour53min; see [SimCSE](./recall/simcse/)


tianxin1860 · 2021-12-14T15:03:34Z

application/neural_search/README.md

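+Training finishes within a few minutes; see [In-batch Negatives](./recall/in_batch_negative/)
+
+
+In addition, we ran several groups of experiments to compare the effectiveness of the approaches in the recall stage: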


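Should the conclusions drawn from the table below be stated directly in the README? Users looking at the table alone may not quickly grasp what we mean.

Fixed.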

tianxin1860 · 2021-12-14T15:04:44Z

application/neural_search/README.md

+To demonstrate the system, we use the following input text:
+
+```
+{0:'中西方语言与文化的差异'}


中西方语言与文化的差异 ("Differences between Chinese and Western languages and cultures")

tianxin1860 · 2021-12-14T15:05:10Z

application/neural_search/README.md

+{0:'中西方语言与文化的差异'}
+
+```
+Below are some of the recalled results: the first field is the recalled title, and the second is the computed similarity distance


Add spaces between Chinese and English text.

chenxiaozeng · 2021-12-15T01:45:59Z

application/neural_search/recall/in_batch_negative/README.md

+You can also use the following bash script:
+
+```
+bash scripts/evaluate.sh


bash or sh: keep it consistent.

chenxiaozeng · 2021-12-15T01:46:18Z

application/neural_search/recall/in_batch_negative/README.md

+```
+python inference.py
+```
+The prediction result is a 256-dimensional vector:


chenxiaozeng · 2021-12-15T01:46:50Z

application/neural_search/recall/in_batch_negative/README.md

+
+### 7.1 Feature 1: Extracting Semantic Vectors of Text
+
+In inference.py, modify the input text id2corpus and the model path; params_path:


Extraneous punctuation.

chenxiaozeng · 2021-12-15T01:49:12Z

application/neural_search/recall/in_batch_negative/README.md

+
+<a name="部署"></a>
+
+## 8. Deployment


After this is submitted for testing, remember to add the script for text pair similarity computation.

chenxiaozeng · 2021-12-15T01:50:20Z

application/neural_search/recall/milvus/README.md

+ **Table of Contents**
+
+* [Background](#背景介绍)
+* [MilVus Recall](#MilVus召回)


The standard spelling is Milvus, with a lowercase v; the same applies elsewhere.

chenxiaozeng · 2021-12-15T01:51:03Z

application/neural_search/recall/milvus/README.md

+
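+### Technical Approach
+
+Use milvus to build the recall system, then use the trained semantic indexing model to extract vectors, insert them into milvus, and run retrieval.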


Please make this consistent: milvus or Milvus.

chenxiaozeng · 2021-12-15T01:51:20Z

application/neural_search/recall/milvus/README.md

+* python >= 3.x
+* paddlepaddle >= 2.1.3
+* paddlenlp >= 2.2
+* milvus >=1.1.1


Add a space before the version number.

chenxiaozeng · 2021-12-15T01:52:19Z

application/neural_search/recall/milvus/README.md

+├── embedding_recall.py # retrieval
+├── inference.py # vector extraction script for the dynamic-graph model
+├── feature_extract.py # batch vector extraction script
+├── milvus_insert.py # insert vectors


embedding_insert.py and milvus_insert.py both insert vectors. What is the difference?

milvus_insert.py is a utility class; embedding_insert.py is the batch insertion script.
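
The split described in this reply (a utility class that wraps insertion, plus a script that feeds it in batches) might look roughly like the sketch below. `InMemoryVectorStore` and `batch_insert` are hypothetical stand-ins written for illustration; they are not the Milvus client API and not the code in milvus_insert.py or embedding_insert.py.

```python
class InMemoryVectorStore:
    """Hypothetical stand-in for a vector-engine wrapper (the utility-class role)."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def insert(self, ids, vectors):
        """Store one batch of vectors together with their ids."""
        if len(ids) != len(vectors):
            raise ValueError("ids and vectors must have the same length")
        self.ids.extend(ids)
        self.vectors.extend(vectors)


def batch_insert(store, vectors, batch_size=2):
    """Insert vectors in fixed-size chunks (the batch-script role).

    Returns the number of batches sent to the store.
    """
    batches = 0
    for start in range(0, len(vectors), batch_size):
        chunk = vectors[start:start + batch_size]
        # Use the position in the corpus as the vector id.
        ids = list(range(start, start + len(chunk)))
        store.insert(ids, chunk)
        batches += 1
    return batches


store = InMemoryVectorStore()
vectors = [[0.1 * i, 0.2 * i] for i in range(5)]  # toy embeddings
num_batches = batch_insert(store, vectors, batch_size=2)
print(num_batches, len(store.ids))
```

Chunking keeps each request small, which matters when the corpus holds millions of vectors.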

chenxiaozeng · 2021-12-15T01:53:06Z

application/neural_search/recall/milvus/README.md

+├── config.py  # milvus configuration file
+├── data.py # data processing functions
+├── embedding_insert.py # insert vectors
+├── embedding_recall.py # retrieval


Be as specific as possible: retrieval -> retrieve the topK similar results / ANN

Fixed.

chenxiaozeng · 2021-12-15T01:58:13Z

application/neural_search/recall/simcse/README.md

+
+<a name="部署"></a>
+
+## 8. Deployment


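Same as In-batch Negatives: after this is submitted for testing, consider adding a static-graph inference version of the text pair similarity computation.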

ZeyuChen · 2021-12-15T03:35:41Z

application/neural_search/recall/domain_adaptive_pretraining/README.md

+### Training Environment
+
+```
+NVIDIA Driver Version: 440.64.00 


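The CUDA and cuDNN versions are also important.
Also, non-code content like this does not need to be wrapped in a code block.

Just use `-` bullets to list the items.

Fixed.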

ZeyuChen · 2021-12-15T03:59:50Z

application/neural_search/README.md

+a. Software environment:
+
+```
+python >= 3.x


Do not wrap non-code content like this in a code block.

Use `-` bullets instead and describe each item on its own line.

For python: paddlenlp requires python >= 3.6.

Fixed.

ZeyuChen · 2021-12-15T04:00:35Z

application/neural_search/README.md

+paddlenlp >= 2.2.1        
+paddlepaddle-gpu >=2.2
+CUDA Version: 10.2
+NVIDIA Driver Version: 440.64.00 


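Do you mean that only CUDA 10.2 is supported, or that your results were reproduced on CUDA 10.2?

If it is the former, there is no need to mention CUDA or the NVIDIA Driver.

This is only meant to describe the environment the experiments were run in.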

ZeyuChen · 2021-12-15T04:01:29Z

application/neural_search/README.md

+
+```
+NVIDIA Tesla V100 16GB x4 cards
+Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz


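Same question again: are you telling people this only runs on V100? If not, this information is unnecessary. Otherwise, say that you are running a performance benchmark and that the numbers were reproduced in this environment.

We just want to tell users about our experimental environment, including hardware and software details. It does not mean this only runs on V100.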

ZeyuChen · 2021-12-15T04:02:12Z

application/neural_search/README.md

+
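+## 4. Hands-on Practice: Building Your Own Retrieval System
+
+This section presents complete code that runs end to end. By following along with your own business data, you can build a small retrieval system that returns the topK relevant documents for a given Query. You can refer to the effectiveness and performance numbers we provide to check whether your own run is correct.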


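"check 自己的运行过程" ("check your own run").
This colloquial mix of Chinese and English is not suitable for documentation.

Either use the Chinese word for "check" (检查) or write the whole sentence in English.

Fixed.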

ZeyuChen · 2021-12-15T04:03:53Z

application/neural_search/README.md

+
+The ranking stage uses the ERNIE-Gram model and took 20h; see:
+
+[ernie_matching](./sort/ernie_matching/)


sort -> ranking.
Relevance, Recall, and Ranking
are the standard English terms for 相关性, 召回, and 排序.

Fixed.

…develop

tianxin1860

LGTM

tianxin1860

LGTM

w5688414 and others added 30 commits November 11, 2021 11:55

add semantic indexing files

58a87d0

update indexing code

4b48274

update milvus config

75c416e

update readme

0d02be3

update readme

86ff3dc

update readme

b00aa90

update readme

b50fdeb

Merge remote-tracking branch 'upstream/develop' into develop

d6887c5

Merge remote-tracking branch 'upstream/develop' into develop

ce7e1f7

Merge branch 'develop' of https://github.com/w5688414/PaddleNLP into …

e4c2855

…develop

add code

d978f7a

add export model

f5af100

add base model

3c0b331

add inference code

5463ec0

fix the inference bug

8e7846c

rename dir

dbd7a91

add ernie matching code

1816fca

update ernie matching code

711f0d7

update readme

2ab1fc6

Merge remote-tracking branch 'upstream/develop' into develop

f016c61

update readme

48b34f0

update readme

61bcf97

rename dir

35f2a22

update ernie 1.0

bb8415a

update ernie readme

0ce373b

add simcse

84358f7

update simcse readme

2fca40a

add inbach negative code

00adc76

update readme

9722344

add batch neg train code

302b8bc

w5688414 and others added 2 commits December 14, 2021 13:14

update

78a5443

Tiny Fix

3410402

Updata

tianxin1860 reviewed Dec 14, 2021


tianxin1860 requested a review from ZeyuChen December 14, 2021 15:08

chenxiaozeng reviewed Dec 15, 2021


ZeyuChen reviewed Dec 15, 2021


w5688414 and others added 16 commits December 15, 2021 07:31

update readme

4152c07

Merge branch 'develop' of https://github.com/w5688414/PaddleNLP into …

cbed6fa

…develop

Merge branch 'develop' into develop

33f997c

update readme

811880a

Merge branch 'develop' of https://github.com/w5688414/PaddleNLP into …

b2fae6b

…develop

update readme

562f9c9

update readme

33675ba

update

511ad09

adjust the readme format

1d6fd20

update readme

c5961e7

update readme

6f0c0eb

add RocketQA

a05428b

update readme

d8f4d10

Update

5daba92

update readme

7288d07

update data sample

e5588ea

tianxin1860 previously approved these changes Dec 17, 2021


Update README.md

33a4caa

chenxiaozeng dismissed tianxin1860’s stale review via 33a4caa December 17, 2021 07:12

Merge branch 'develop' into develop

806cea3

tianxin1860 approved these changes Dec 17, 2021


tianxin1860 merged commit 374d31b into PaddlePaddle:develop Dec 17, 2021

tianxin1860 mentioned this pull request Dec 17, 2021

PaddleNLP 2.2.1 Release Note Candidate #1467

Closed

ZHUI mentioned this pull request Dec 22, 2021

[BUGFIX] Fix ln link of files. #1500

Merged

