Add neural search application codes #1463
Conversation
- For the results table, wouldn't users understand our solution more easily if the corresponding conclusions were written out directly?
application/neural_search/README.md
Outdated
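Retrieval systems are part of many products we use every day, such as product search and academic literature search. This solution provides a complete implementation of a retrieval system. The target scenario is a user entering a search Query to quickly find similar documents in a massive corpus.

Semantic retrieval (also called vector-based retrieval) means the retrieval system is no longer bound to the literal wording of the user's Query. Instead, it captures the real intent behind the Query and searches on that basis, so it returns the best-matching results more accurately. It uses a state-of-the-art language model to obtain vector representations of texts, indexes them in a high-dimensional vector space, and measures the similarity between the query vector and the indexed documents, which addresses the shortcomings of keyword indexing.

The vector representations of text come from a semantic indexing model, not a language model.

Fixed.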
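As an illustration of the semantic retrieval idea in the README excerpt above (represent texts as vectors, then score documents by their similarity to the query vector), here is a minimal sketch. The three-dimensional vectors and the names `cosine_similarity` and `rank_documents` are made up for the example and are not taken from the PR; a real system would get its vectors from the trained semantic indexing model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real ones would come from the model.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
query_vec = [1.0, 0.0, 0.0]

ranking = rank_documents(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(doc_id, round(score, 4))
```

Documents are ordered by how closely their direction matches the query vector, regardless of whether they share any literal keywords with it.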
+ Low barrier to entry
+ Step-by-step guidance for building a retrieval system
+ A retrieval system can be built even without labeled data
+ Provides one-stop training, prediction, and ANN engine capabilities (original: 提供 训练、预测、ANN 引擎一站式能力)

There is an extra space.

To keep the text clear and readable, I added a space before and after every English term (no space where a punctuation mark is adjacent).

Fixed.
application/neural_search/README.md
Outdated
+ Further optimization: Domain-adaptive Pretraining
+ Fast
+ Fast vector extraction with Paddle Inference
+ Fast index building and fast ANN queries

Fast queries and high-performance index building based on Milvus

Fixed.
application/neural_search/README.md
Outdated
#### 2.2.2 Recall module

The recall module must quickly recall candidates from massive data on the order of hundreds of billions or trillions of items. It first extracts the Embedding of each text in the corpus, then relies on a vector search engine for efficient ANN to recall the candidate set.

Hundreds of billions or trillions is too much of an exaggeration. Delete it, or say tens of millions? So far we have only validated index building at the tens-of-millions scale.

Fixed.
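To make the recall step in the excerpt above concrete (embed the corpus, then let a vector search engine answer approximate nearest neighbor queries), the sketch below implements a toy inverted-file index: vectors are bucketed by nearest centroid, and a query only scans the `nprobe` closest buckets. This is one common ANN strategy, shown as an assumption about how such an engine can work, not as the implementation used in the PR; `TinyIVFIndex`, the hand-picked centroids, and the two-dimensional vectors are all hypothetical.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TinyIVFIndex:
    """Toy inverted-file index: each vector lives in its nearest centroid's bucket."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = [[] for _ in centroids]

    def _nearest_centroids(self, vec, n):
        # Indices of the n centroids closest to vec.
        order = sorted(
            range(len(self.centroids)),
            key=lambda i: squared_l2(vec, self.centroids[i]),
        )
        return order[:n]

    def insert(self, doc_id, vec):
        bucket = self._nearest_centroids(vec, 1)[0]
        self.buckets[bucket].append((doc_id, vec))

    def search(self, query, top_k, nprobe=1):
        # Only scan the nprobe closest buckets instead of the whole corpus.
        candidates = []
        for bucket in self._nearest_centroids(query, nprobe):
            candidates.extend(self.buckets[bucket])
        scored = ((squared_l2(query, vec), doc_id) for doc_id, vec in candidates)
        return heapq.nsmallest(top_k, scored)


# Hypothetical centroids and corpus embeddings.
index = TinyIVFIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
corpus = {
    "a": [0.5, 0.5],
    "b": [1.0, 0.0],
    "c": [9.0, 10.0],
    "d": [10.0, 11.0],
}
for doc_id, vec in corpus.items():
    index.insert(doc_id, vec)

hits = index.search([0.6, 0.4], top_k=2, nprobe=1)
print(hits)  # list of (squared distance, doc_id), nearest first
```

With `nprobe=1` the far bucket is never scanned, which is where both the speedup and the approximation come from; raising `nprobe` trades speed for recall.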
application/neural_search/README.md
Outdated
| None | Plenty | InBatchNegative|
| Available | Available | SimCSE+ InBatchNegative |

The most basic case is having only unsupervised data, for which we recommend unsupervised training with SimCSE. The other case is having only supervised data, for which we recommend supervised training with the In-batch Negative method.

In-batch Negatives: let's unify the terminology.

Fixed.
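For readers unfamiliar with the In-batch Negatives method mentioned in the excerpt above: within a batch of (query, title) pairs, each query treats its own title as the positive and every other title in the batch as a negative. A minimal pure-Python sketch of that loss follows. The function name, the `scale` parameter, and the toy vectors are assumptions for illustration, and the PR's training code may differ in its details.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def in_batch_negatives_loss(query_vecs, title_vecs, scale=1.0):
    """Mean softmax cross-entropy where row i's positive is column i.

    `scale` is a hypothetical temperature-like multiplier on the logits.
    """
    assert len(query_vecs) == len(title_vecs)
    losses = []
    for i, q in enumerate(query_vecs):
        # Similarity of query i against every title in the batch.
        logits = [scale * dot(q, t) for t in title_vecs]
        # Numerically stable log-sum-exp.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits)
        )
        # Cross-entropy with the matching title (index i) as the label.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch: each query points in the same direction as its own title.
queries = [[1.0, 0.0], [0.0, 1.0]]
titles = [[1.0, 0.0], [0.0, 1.0]]

aligned = in_batch_negatives_loss(queries, titles)
shuffled = in_batch_negatives_loss(queries, list(reversed(titles)))
print(aligned, shuffled)
```

The loss is lower when each query is most similar to its own title, so minimizing it pulls matching pairs together and pushes the other titles in the batch away without needing explicitly labeled negatives.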
application/neural_search/README.md
Outdated
Step 1: unsupervised training, Domain-adaptive Pretraining

Training took 16hour55min (original: 训练用时16hour55min); see [ERNIE 1.0](./recall/domain_adaptive_pretraining/)

A space is needed between digits and Chinese characters.

Fixed.
application/neural_search/README.md
Outdated
Step 2: unsupervised training, SimCSE

Training took 16hour53min (original: 训练用时16hour53min); see [SimCSE](./recall/simcse/)

Same as above.

Fixed.
Training completes within a few minutes; see [In-batch Negatives](./recall/in_batch_negative/)

In addition, we ran several sets of experiments to compare the effectiveness of each approach in the recall stage:

Should the conclusions drawn from the table below be stated directly in the README? Users looking at the table alone may not quickly grasp what we intend to show.

Fixed.
application/neural_search/README.md
Outdated
Here is a demonstration of the system. The input text is:

```
{0:'中西方语言与文化的差异'}
```

中西方语言与文化的差异

Fixed.
```
{0:'中西方语言与文化的差异'}

```
Below are some of the recalled results: the first item is the recalled title, and the second number is the computed similarity distance (original: 第一个是召回的title)

Add spaces between Chinese and English text.

Fixed.
You can also use the following bash script:

```
bash scripts/evaluate.sh
```

bash or sh: let's keep it consistent.

Fixed.
```
python inference.py
```
The prediction result is a 256-dimensional vector (original: 预测结果位256维的向量):

Typo: 位 -> 为

Fixed.
### 7.1 Feature 1: extract semantic vectors of text

In inference.py, modify the input text id2corpus and the model path;params_path:

Redundant punctuation mark.

Fixed.
<a name="部署"></a>

## 8. Deployment

After this is submitted for testing, remember to add a script for text pair similarity computation.

OK
**Table of contents**

* [Background](#背景介绍)
* [MilVus recall](#MilVus召回)

The standard spelling is Milvus, with a lowercase v; the same applies elsewhere.

Fixed.
### Technical approach

Build the recall system with milvus, then use the trained semantic indexing model to extract vectors, insert them into milvus, and run retrieval.

Suggest unifying the spelling: milvus or Milvus.

Fixed.
* python >= 3.x
* paddlepaddle >= 2.1.3
* paddlenlp >= 2.2
* milvus >=1.1.1

Add a space before the version number.

Fixed.
├── embedding_recall.py # retrieval
├── inference.py # vector extraction script for the dynamic-graph model
├── feature_extract.py # batch vector extraction script
├── milvus_insert.py # insert vectors

embedding_insert.py and milvus_insert.py are both described as inserting vectors; what is the difference?

milvus_insert.py is a utility class; embedding_insert.py is the batch insertion script.
├── config.py # milvus configuration file
├── data.py # data processing functions
├── embedding_insert.py # insert vectors
├── embedding_recall.py # retrieval

Try to be more specific: retrieval -> retrieve the topK similar results / ANN

Fixed.
<a name="部署"></a>

## 8. Deployment

Same as for In-batch Negatives: after submitting for testing, I suggest adding a static-graph inference version of text pair similarity computation.

OK
### Training environment

```
NVIDIA Driver Version: 440.64.00
```

The CUDA and cuDNN versions are also important. Also, non-code content like this does not need to be wrapped in a code block; just use `-` to list the points.

Fixed.
application/neural_search/README.md
Outdated
a. Software environment:

```
python >= 3.x
```

Don't wrap non-code content like this in a code block; use `-` and describe it line by line. As for python, paddlenlp requires python >= 3.6.

Fixed.
application/neural_search/README.md
Outdated
paddlenlp >= 2.2.1
paddlepaddle-gpu >=2.2
CUDA Version: 10.2
NVIDIA Driver Version: 440.64.00

Are you trying to say that you only support CUDA 10.2, or that your results were reproduced on CUDA 10.2? If the former, there is no need to mention CUDA and the NVIDIA Driver.

I just wanted to describe the environment the experiments were run in.
application/neural_search/README.md
Outdated
```
NVIDIA Tesla V100 16GB x4 cards
Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
```

Same question again: are you telling people this only runs on V100? If not, this information is not needed. Otherwise, say that you are doing a performance benchmark and that these numbers were reproduced in this environment.

I just want to tell users about our experimental environment, including hardware and software information. It does not mean this only runs on V100.
application/neural_search/README.md
Outdated
## 4. Hands-on: build your own retrieval system

This section shows complete code that runs end to end. By following along with your own business data, you can build a small retrieval system that returns the topK relevant documents for a given Query. You can refer to the accuracy and performance figures we provide to check whether your own run is correct (original: check 自己的运行过程是否正确).

"check 自己运行过程": this colloquial mix of Chinese and English is not suitable for documentation. Either write 检查, or write the whole sentence in English.

Fixed.
application/neural_search/README.md
Outdated
The ranking stage uses the ERNIE-Gram model and took 20h; see:

[ernie_matching](./sort/ernie_matching/)

sort -> ranking. Relevance, Recall, and Ranking are the standard English terms for 相关性, 召回, and 排序.

Fixed.

LGTM

LGTM
PR types
New features
PR changes
Description
Add Neural Search Code