Add neural search application codes #1463

Merged
merged 145 commits into from
Dec 17, 2021

Conversation

w5688414
Contributor

PR types

New features

PR changes

Description

Add Neural Search Code

w5688414 and others added 2 commits December 14, 2021 13:14
@tianxin1860 tianxin1860 left a comment

  1. In the results table section, how about stating the corresponding conclusions directly? That would make our approach easier for users to understand.


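Retrieval systems are part of many products we use every day, such as product search systems and academic literature retrieval systems. This solution provides a complete implementation of a retrieval system. The target scenario is a user entering a search Query to quickly find similar documents in a massive dataset.

Semantic retrieval (also called vector-based retrieval) means that the retrieval system no longer sticks to the literal wording of the user's Query, but instead accurately captures the real intent behind the Query and searches on that basis, so that it returns the best-matching results more accurately. It uses state-of-the-art language models to obtain vector representations of text, indexes them in a high-dimensional vector space, and measures the similarity between the query vector and the indexed documents, thereby addressing the shortcomings of keyword indexing.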

The vector representation of text is obtained from a semantic index model, not from a language model.

Contributor Author

Fixed.
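
To illustrate the vector-based retrieval described in the README passage quoted above, here is a minimal Python sketch. It assumes cosine similarity as the similarity measure; the 3-dimensional vectors, the document IDs, and the helper names `cosine_similarity` and `top_k` are hypothetical and not taken from the PR, where the vectors come from the semantic index model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Score every document against the query and return the k best (doc_id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; a real model would output many more dimensions.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 4))
```

This exhaustive scan compares the query with every document, which is the exact computation that an ANN engine approximates in order to stay fast on large corpora.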

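+ Low barrier to entry
+ Step-by-step guidance for building a retrieval system
+ A retrieval system can be built even without labeled data
+ Provides one-stop capabilities: training, prediction, and an ANN engine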

There is an extra space.

Contributor

To keep the rendered text clear and tidy, a space was added before and after every English term (no space was added where a punctuation mark is adjacent).

Contributor Author

Fixed.

+ Further optimization: Domain-adaptive Pretraining
+ Fast performance
+ Fast vector extraction based on Paddle Inference
+ Fast index building and fast ANN queries

Fast queries and high-performance index building based on Milvus.

Contributor Author

Fixed.


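#### 2.2.2 Recall module

The recall module needs to quickly recall candidate data from massive datasets on the scale of hundreds of billions or trillions of records. First, the Embeddings of the texts in the corpus are extracted; then a vector search engine is used to perform efficient ANN, which yields the recalled candidate set.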

Hundreds of billions or trillions is too much of an exaggeration. Delete it, or say tens of millions? So far we have only validated index building at the tens-of-millions scale.

Contributor Author

Fixed.

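| None | Plenty | InBatchNegative|
| Some | Some | SimCSE+ InBatchNegative |

The most basic case is having only unsupervised data, for which we recommend unsupervised training with SimCSE; another case is having only supervised data, for which we recommend supervised training with the In-batch Negative method.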

In-batch Negatives: let's use one consistent term.

Contributor Author

Fixed.
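
The In-batch Negatives objective recommended in the passage above can be sketched as follows. The sketch assumes that each query in a batch is paired with one positive document at the same index, and that the other documents in the same batch act as its negatives. The function name, the toy 2-dimensional vectors, and the `scale` factor are hypothetical and not taken from the PR.

```python
import math


def in_batch_negatives_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i and
    every other doc in the batch is treated as a negative."""
    losses = []
    for i, query in enumerate(query_vecs):
        # Scaled dot-product similarity between this query and every doc in the batch.
        logits = [scale * sum(q * d for q, d in zip(query, doc)) for doc in doc_vecs]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Negative log-probability assigned to the positive (diagonal) pair.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]

# Positives line up with their queries: the loss is close to zero.
aligned = in_batch_negatives_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
# Positives are swapped: each query is closest to a negative, so the loss is large.
shuffled = in_batch_negatives_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is small only when each query is more similar to its own document than to the rest of the batch, which is why no separately sampled negatives are needed.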


Step 1: unsupervised training with Domain-adaptive Pretraining

Training took 16hour55min; see: [ERNIE 1.0](./recall/domain_adaptive_pretraining/)

A space is needed between digits and Chinese characters.

Contributor Author

Fixed.


Step 2: unsupervised training with SimCSE

Training took 16hour53min; see: [SimCSE](./recall/simcse/)

Same as above.

Contributor Author

Fixed.

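Training completes within a few minutes; see [In-batch Negatives](./recall/in_batch_negative/)


In addition, we ran several sets of experiments to compare the effectiveness of the different approaches in the recall stage: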

Should the conclusions drawn from the table data below be stated directly in the README? Users who only look at the table may not quickly grasp what we mean.

Contributor Author

Fixed.

Here is a demonstration of the system. The input text is as follows:

```
{0:'中西方语言与文化的差异'}

中西方语言与文化的差异

Contributor Author

Fixed.

{0:'中西方语言与文化的差异'}

```
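Below are some of the recalled results; the first item is the recalled title, and the second, a number, is the computed similarity distance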

Add spaces between Chinese and English text.

Contributor Author

Fixed.

You can also use the following bash script:

```
bash scripts/evaluate.sh
Contributor

bash or sh: let's keep it consistent.

Contributor Author

Fixed.

```
python inference.py
```
预测结果位256维的向量: (The prediction result is a 256-dimensional vector:)
Contributor

位 -> 为

Contributor Author

Fixed.


### 7.1 Feature 1: extracting semantic vectors of text

In the inference.py file, modify the input text id2corpus and the model path; params_path:
Contributor

There is a superfluous punctuation mark.

Contributor Author

Fixed.


<a name="部署"></a>

## 8. Deployment
Contributor

After this is submitted for testing, remember to add the script for text pair similarity computation here.

Contributor Author

OK

**Contents**

* [Background](#背景介绍)
* [MilVus Recall](#MilVus召回)
Contributor

The standard spelling is Milvus, with a lowercase v. The same applies elsewhere.

Contributor Author

Fixed.


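### Technical approach

Use milvus to build the recall system, then use the trained semantic index model to extract vectors, insert them into milvus, and then run retrieval.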
Contributor

Please make this consistent: milvus or Milvus.

Contributor Author

Fixed.

* python >= 3.x
* paddlepaddle >= 2.1.3
* paddlenlp >= 2.2
* milvus >=1.1.1
Contributor

Add a space before the version number.

Contributor Author

Fixed.

├── embedding_recall.py # retrieval
├── inference.py # vector extraction script for the dynamic-graph model
├── feature_extract.py # batch vector extraction script
├── milvus_insert.py # insert vectors
Contributor

embedding_insert.py and milvus_insert.py are both described as "insert vectors". What is the difference?

Contributor Author

milvus_insert.py is a utility class; embedding_insert.py is the batch insertion script.

├── config.py # milvus configuration file
├── data.py # data processing functions
├── embedding_insert.py # insert vectors
├── embedding_recall.py # retrieval
Contributor

Please be as specific as possible: retrieval -> retrieve the topK similar results / ANN

Contributor Author

Fixed.


<a name="部署"></a>

## 8. Deployment
Contributor

Same as for In-batch Negatives: after this is submitted for testing, please add a static-graph inference version of the text pair similarity computation.

Contributor Author

OK

### Training environment

```
NVIDIA Driver Version: 440.64.00
Member

The CUDA and cuDNN versions are also important.
Also, non-code content like this does not need to be wrapped in a code block.


- Listing the key points as bullets is enough.

Contributor Author

Fixed.

a. Software environment:

```
python >= 3.x
Member

Don't wrap non-code content like this in a code block.

Use
- and then describe each item on its own line.

For Python, paddlenlp requires python >= 3.6.

Contributor Author

Fixed.

paddlenlp >= 2.2.1
paddlepaddle-gpu >=2.2
CUDA Version: 10.2
NVIDIA Driver Version: 440.64.00
Member

Are you saying that only CUDA 10.2 is supported, or that your results were reproduced on CUDA 10.2?

If it is the former, there is no need to mention CUDA or the NVIDIA Driver.

Contributor Author

I just wanted to describe the environment the experiments were run in.


```
NVIDIA Tesla V100 16GB x4 cards
Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Member

Same question again: are you telling people that this can only run on a V100? If not, this information is not needed. Otherwise, say that you are running a performance benchmark and that these numbers were reproduced in this environment.

Contributor Author

I just want to tell users what our experimental environment was, including the hardware and software details. It does not mean this can only run on a V100.


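## 4. Hands-on: build your own retrieval system

This section presents complete code that runs from start to finish. By following along with your own business data, you can build a small retrieval system that, given a Query, returns the topK relevant documents. You can use the effectiveness and performance figures we provide to check whether your own run is correct.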
Member

"check 自己运行过程" ("check your own run"):
this colloquial mix of Chinese and English is not suitable for documentation.

Either use the Chinese word for "check", or write the whole thing in English.

Contributor Author

Fixed.


The ranking stage uses the ERNIE-Gram model and took 20h; see:

[ernie_matching](./sort/ernie_matching/)
Member

sort -> ranking.
Relevance, Recall, and Ranking
are the standard English terms for 相关性, 召回, and 排序.

Contributor Author

Fixed.
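
The two stages discussed in this thread, recall followed by ranking, can be sketched in Python as below. Both scoring functions are deliberately simple stand-ins chosen so the example runs on its own: word overlap takes the place of the semantic index model plus ANN search, and Jaccard similarity takes the place of the ERNIE-Gram ranking model. The function names and the toy corpus are hypothetical.

```python
def recall(query, corpus, k):
    """Stage 1 (recall): cheaply select k candidates.
    Stand-in score: number of words shared with the query."""
    query_words = set(query.split())
    by_overlap = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.split())),
        reverse=True,
    )
    return by_overlap[:k]


def rank(query, candidates):
    """Stage 2 (ranking): re-score only the recalled candidates.
    Stand-in score: Jaccard similarity between word sets."""
    query_words = set(query.split())

    def jaccard(doc):
        doc_words = set(doc.split())
        return len(query_words & doc_words) / len(query_words | doc_words)

    return sorted(candidates, key=jaccard, reverse=True)


corpus = [
    "language and culture differences between east and west in translation studies",
    "differences in language and culture",
    "deep learning for image search",
]
query = "language and culture differences"

candidates = recall(query, corpus, k=2)  # drops the unrelated document
ranked = rank(query, candidates)         # reorders the remaining candidates
print(ranked)
```

The recall stage only has to discard clearly irrelevant documents, so it can use a cheap score over the whole corpus; the ranking stage can afford a more expensive score because it sees only the k recalled candidates.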

tianxin1860
tianxin1860 previously approved these changes Dec 17, 2021
@tianxin1860 tianxin1860 left a comment

LGTM

@tianxin1860 tianxin1860 left a comment

LGTM

@tianxin1860 tianxin1860 merged commit 374d31b into PaddlePaddle:develop Dec 17, 2021