Add text semantic matching for taskflow #3003

Merged: 14 commits, Aug 19, 2022
28 changes: 20 additions & 8 deletions docs/model_zoo/taskflow.md
@@ -35,15 +35,14 @@ PaddleNLP provides **out-of-the-box** industrial-grade preset NLP task capabilities: no training required, one-click prediction.
| [Information Extraction](#信息抽取) | `Taskflow("information_extraction")`| ✅ | ✅ | ✅ | ✅ | ✅ | Open-domain universal information extraction tool adapted to many scenarios |
| [『解语』Knowledge Mining](#解语知识标注) | `Taskflow("knowledge_mining")` | ✅ | ✅ | ✅ | ✅ | ✅ | Knowledge annotation tool covering the full Chinese vocabulary |
| [Text Correction](#文本纠错) | `Taskflow("text_correction")` | ✅ | ✅ | ✅ | ✅ | ✅ | End-to-end text correction model ERNIE-CSC with fused pinyin features |
- | [Text Similarity](#文本相似度) | `Taskflow("text_similarity")` | ✅ | ✅ | ✅ | | | Trained on 22 million similar-sentence pairs from Baidu Zhidao |
+ | [Text Similarity](#文本相似度) | `Taskflow("text_similarity")` | ✅ | ✅ | ✅ | | | RocketQA trained on the million-scale DuReader Retrieval dataset for state-of-the-art text-similarity results |
| [Sentiment Analysis](#情感倾向分析) | `Taskflow("sentiment_analysis")` | ✅ | ✅ | ✅ | | ✅ | Industry SOTA via the sentiment-knowledge-enhanced pretrained model SKEP |
| [Generative QA](#生成式问答) | `Taskflow("question_answering")` | ✅ | ✅ | ✅ | | | Question answering with CPM, the largest open-source Chinese model |
| [Poetry Generation](#智能写诗) | `Taskflow("poetry_generation")` | ✅ | ✅ | ✅ | | | Poem writing with CPM, the largest open-source Chinese model |
| [Open-Domain Dialogue](#开放域对话) | `Taskflow("dialogue")` | ✅ | ✅ | ✅ | | | PLATO-Mini chit-chat model trained on billion-scale corpora; supports multi-turn dialogue |
| [Code Generation](#代码生成) | `Taskflow("code_generation")` | ✅ | ✅ | ✅ | | | Large model for code generation |
| [Text-to-Image Generation](#文图生成) | `Taskflow("text2image_generation")` | ✅ | ✅ | ✅ | | | Large model for text-to-image generation |


## QuickStart

**Environment requirements**
@@ -1156,31 +1155,44 @@ from paddlenlp import Taskflow
</div></details>

### Text Similarity
- <details><summary>&emsp;SimBERT, trained on 22 million similar-sentence pairs from Baidu Zhidao, reaches state-of-the-art text-similarity results</summary><div>
+ <details><summary>&emsp;RocketQA, trained on the million-scale DuReader Retrieval dataset, reaches state-of-the-art text-similarity results</summary><div>

#### Single input

```python
>>> from paddlenlp import Taskflow
>>> similarity = Taskflow("text_similarity")
>>> similarity([["春天适合种什么花?", "春天适合种什么菜?"]])
- [{'text1': '春天适合种什么花?', 'text2': '春天适合种什么菜?', 'similarity': 0.8340253}]
+ [{'text1': '春天适合种什么花?', 'text2': '春天适合种什么菜?', 'similarity': 0.0048632388934493065}]
```

#### Batch input (faster on average)

```python
>>> from paddlenlp import Taskflow
- >>> similarity([["光眼睛大就好看吗", "眼睛好看吗?"], ["小蝌蚪找妈妈怎么样", "小蝌蚪找妈妈是谁画的"]])
- [{'text1': '光眼睛大就好看吗', 'text2': '眼睛好看吗?', 'similarity': 0.74502707}, {'text1': '小蝌蚪找妈妈怎么样', 'text2': '小蝌蚪找妈妈是谁画的', 'similarity': 0.8192149}]
+ >>> similarity([['春天适合种什么花?','春天适合种什么菜?'],['谁有狂三这张高清的','这张高清图,谁有']])
+ [{'text1': '春天适合种什么花?', 'text2': '春天适合种什么菜?', 'similarity': 0.0048632388934493065}, {'text1': '谁有狂三这张高清的', 'text2': '这张高清图,谁有', 'similarity': 0.7050786018371582}]
```

#### Configurable parameters
* `batch_size`: batch size; adjust to your machine. Default: 1.
- * `max_seq_len`: maximum sequence length. Default: 128.
+ * `max_seq_len`: maximum sequence length. Default: 384.
* `task_path`: custom task path. Default: None.
</div></details>

#### Model selection

- Multiple models to balance accuracy and speed

| Model | Architecture | Language |
| :---: | :--------: | :--------: |
| `rocketqa-zh-dureader-cross-encoder` (default) | 12-layers, 768-hidden, 12-heads | Chinese |
| `simbert-base-chinese` | 12-layers, 768-hidden, 12-heads | Chinese |
| `rocketqa-base-cross-encoder` | 12-layers, 768-hidden, 12-heads | Chinese |
| `rocketqa-medium-cross-encoder` | 6-layers, 768-hidden, 12-heads | Chinese |
| `rocketqa-mini-cross-encoder` | 6-layers, 384-hidden, 12-heads | Chinese |
| `rocketqa-micro-cross-encoder` | 4-layers, 384-hidden, 12-heads | Chinese |
| `rocketqa-nano-cross-encoder` | 4-layers, 312-hidden, 12-heads | Chinese |

### Sentiment Analysis
<details><summary>&emsp;Industry SOTA via the sentiment-knowledge-enhanced pretrained model SKEP</summary><div>

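A note on the score scale change visible in this doc diff: the old SimBERT bi-encoder encoded each sentence separately and reported the cosine similarity of the two pooled vectors (scores like 0.83), while the RocketQA cross-encoder encodes the pair jointly and reports a binary classifier's probability of the "similar" class (0.0049 for the flowers/vegetables pair), so the two scales are not comparable. A dependency-free sketch of the old cosine scoring, with toy vectors standing in for pooled model outputs:

```python
import math

def cosine_similarity(u, v):
    # Normalize each pooled sentence vector, then take the dot product;
    # this mirrors the SimBERT branch of TextSimilarityTask._run_model.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values, not real model outputs).
print(round(cosine_similarity([0.2, 0.9, 0.4], [0.1, 0.8, 0.5]), 4))  # → 0.9859
```

Because nearly all dense sentence vectors point in broadly similar directions, cosine scores cluster high even for unrelated pairs, which is why the cross-encoder's calibrated probabilities can look much lower for the same inputs.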
27 changes: 26 additions & 1 deletion paddlenlp/taskflow/taskflow.py
@@ -192,9 +192,34 @@
"task_class": TextSimilarityTask,
"task_flag": "text_similarity-simbert-base-chinese"
},
"rocketqa-zh-dureader-cross-encoder": {
"task_class": TextSimilarityTask,
"task_flag":
'text_similarity-rocketqa-zh-dureader-cross-encoder',
},
"rocketqa-base-cross-encoder": {
"task_class": TextSimilarityTask,
"task_flag": 'text_similarity-rocketqa-base-cross-encoder',
},
"rocketqa-medium-cross-encoder": {
"task_class": TextSimilarityTask,
"task_flag": 'text_similarity-rocketqa-medium-cross-encoder',
},
"rocketqa-mini-cross-encoder": {
"task_class": TextSimilarityTask,
"task_flag": 'text_similarity-rocketqa-mini-cross-encoder',
},
"rocketqa-micro-cross-encoder": {
"task_class": TextSimilarityTask,
"task_flag": 'text_similarity-rocketqa-micro-cross-encoder',
},
"rocketqa-nano-cross-encoder": {
"task_class": TextSimilarityTask,
"task_flag": 'text_similarity-rocketqa-nano-cross-encoder',
},
},
"default": {
-            "model": "simbert-base-chinese"
+            "model": "rocketqa-zh-dureader-cross-encoder"
}
},
"word_segmentation": {
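The taskflow.py hunk above only extends a nested registry: each model name maps to a task class and a task flag, and the `default` entry selects the model when the caller passes none. A stripped-down sketch of that dispatch pattern (hypothetical structure, not the actual PaddleNLP internals):

```python
# Minimal registry mirroring the shape of the TASKS dict edited above.
TASKS = {
    "text_similarity": {
        "models": {
            "simbert-base-chinese": {
                "task_flag": "text_similarity-simbert-base-chinese",
            },
            "rocketqa-zh-dureader-cross-encoder": {
                "task_flag": "text_similarity-rocketqa-zh-dureader-cross-encoder",
            },
        },
        "default": {"model": "rocketqa-zh-dureader-cross-encoder"},
    },
}

def resolve(task, model=None):
    # Fall back to the registered default model, as Taskflow does.
    entry = TASKS[task]
    model = model or entry["default"]["model"]
    return model, entry["models"][model]["task_flag"]

print(resolve("text_similarity")[0])  # → rocketqa-zh-dureader-cross-encoder
```

Changing the `default` entry, as this PR does, silently changes behavior for every caller who writes `Taskflow("text_similarity")` without a `model` argument; the old SimBERT path remains reachable by name.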
199 changes: 147 additions & 52 deletions paddlenlp/taskflow/text_similarity.py
@@ -14,6 +14,7 @@

import paddle
from paddlenlp.transformers import BertModel, BertTokenizer
from ..transformers import ErnieCrossEncoder, ErnieTokenizer

from ..data import Pad, Tuple
from .utils import static_mode_guard
@@ -59,17 +60,83 @@ class TextSimilarityTask(Task):
"https://bj.bcebos.com/paddlenlp/taskflow/text_similarity/simbert-base-chinese/model_config.json",
"1254bbd7598457a9dad0afcb2e24b70c"
],
-        }
+        },
"rocketqa-zh-dureader-cross-encoder": {
"model_state": [
"https://paddlenlp.bj.bcebos.com/taskflow/text_similarity/rocketqa-zh-dureader-cross-encoder/model_state.pdparams",
"88bc3e1a64992a1bdfe4044ecba13bc7"
],
"model_config": [
"https://paddlenlp.bj.bcebos.com/taskflow/text_similarity/rocketqa-zh-dureader-cross-encoder/model_config.json",
"b69083c2895e8f68e1a10467b384daab"
],
},
"rocketqa-base-cross-encoder": {
"model_state": [
"https://paddlenlp.bj.bcebos.com/taskflow/text_similarity/rocketqa-base-cross-encoder/model_state.pdparams",
"6d845a492a2695e62f2be79f8017be92"
],
"model_config": [
"https://paddlenlp.bj.bcebos.com/taskflow/text_similarity/rocketqa-base-cross-encoder/model_config.json",
"18ce260ede18bc3cb28dcb2e7df23b1a"
],
},
"rocketqa-medium-cross-encoder": {
"model_state": [
"https://paddlenlp.bj.bcebos.com/taskflow/text_similarity/rocketqa-medium-cross-encoder/model_state.pdparams",
"4b929f4fc11a1df8f59fdf2784e23fa7"
],
"model_config": [
"https://paddlenlp.bj.bcebos.com/taskflow/text_similarity/rocketqa-medium-cross-encoder/model_config.json",
"10997db96bc86e29cd113e1bf58989d7"
],
},
"rocketqa-mini-cross-encoder": {
"model_state": [
"https://paddlenlp.bj.bcebos.com/taskflow/text_similarity/rocketqa-mini-cross-encoder/model_state.pdparams",
"c411111df990132fb88c070d8b8cf3f7"
],
"model_config": [
"https://paddlenlp.bj.bcebos.com/taskflow/text_similarity/rocketqa-mini-cross-encoder/model_config.json",
"271e6d779acbe8e8acdd596b1c835546"
],
},
"rocketqa-micro-cross-encoder": {
"model_state": [
"https://paddlenlp.bj.bcebos.com/taskflow/text_similarity/rocketqa-micro-cross-encoder/model_state.pdparams",
"3d643ff7d6029c8ceab5653680167dc0"
],
"model_config": [
"https://paddlenlp.bj.bcebos.com/taskflow/text_similarity/rocketqa-micro-cross-encoder/model_config.json",
"b32d1a932d8c367fab2a6216459dd0a7"
],
},
"rocketqa-nano-cross-encoder": {
"model_state": [
"https://paddlenlp.bj.bcebos.com/taskflow/text_similarity/rocketqa-nano-cross-encoder/model_state.pdparams",
"4c1d36e5e94f5af09f665fc7ad0be140"
],
"model_config": [
"https://paddlenlp.bj.bcebos.com/taskflow/text_similarity/rocketqa-nano-cross-encoder/model_config.json",
"dcff14cd671e1064be2c5d63734098bb"
],
},
}

-    def __init__(self, task, model, batch_size=1, max_seq_len=128, **kwargs):
+    def __init__(self, task, model, batch_size=1, max_seq_len=384, **kwargs):
super().__init__(task=task, model=model, **kwargs)
self._static_mode = True
self._check_task_files()
-        self._construct_tokenizer(model)
-        self._get_inference_model()
+        if self._static_mode:
+            self._get_inference_model()
+        else:
+            self._construct_model(model)
+        self._construct_tokenizer(model)
         self._batch_size = batch_size
         self._max_seq_len = max_seq_len
         self._usage = usage
+        self.model_name = model

def _construct_input_spec(self):
"""
@@ -79,7 +146,7 @@ def _construct_input_spec(self):
paddle.static.InputSpec(shape=[None, None],
dtype="int64",
name='input_ids'),
-            paddle.static.InputSpec(shape=[None],
+            paddle.static.InputSpec(shape=[None, None],
dtype="int64",
name='token_type_ids'),
]
@@ -88,15 +155,21 @@ def _construct_model(self, model):
"""
Construct the inference model for the predictor.
"""
-        self._model = BertModel.from_pretrained(self._task_path,
-                                                pool_act='linear')
+        if "rocketqa" in model:
+            self._model = ErnieCrossEncoder(model)
+        else:
+            self._model = BertModel.from_pretrained(self._task_path,
+                                                    pool_act='linear')
self._model.eval()

def _construct_tokenizer(self, model):
"""
Construct the tokenizer for the predictor.
"""
-        self._tokenizer = BertTokenizer.from_pretrained(model)
+        if "rocketqa" in model:
+            self._tokenizer = ErnieTokenizer.from_pretrained(model)
+        else:
+            self._tokenizer = BertTokenizer.from_pretrained(model)

def _check_input_text(self, inputs):
inputs = inputs[0]
@@ -118,40 +191,52 @@ def _preprocess(self, inputs):
'lazy_load'] if 'lazy_load' in self.kwargs else False

examples = []

for data in inputs:
text1, text2 = data[0], data[1]
-            text1_encoded_inputs = self._tokenizer(
-                text=text1, max_seq_len=self._max_seq_len)
-            text1_input_ids = text1_encoded_inputs["input_ids"]
-            text1_token_type_ids = text1_encoded_inputs["token_type_ids"]
-
-            text2_encoded_inputs = self._tokenizer(
-                text=text2, max_seq_len=self._max_seq_len)
-            text2_input_ids = text2_encoded_inputs["input_ids"]
-            text2_token_type_ids = text2_encoded_inputs["token_type_ids"]
-
-            examples.append((text1_input_ids, text1_token_type_ids,
-                             text2_input_ids, text2_token_type_ids))
+            if "rocketqa" in self.model_name:
+                encoded_inputs = self._tokenizer(text=text1,
+                                                 text_pair=text2,
+                                                 max_seq_len=self._max_seq_len)
+                ids = encoded_inputs["input_ids"]
+                segment_ids = encoded_inputs["token_type_ids"]
+                examples.append((ids, segment_ids))
+            else:
+                text1_encoded_inputs = self._tokenizer(
+                    text=text1, max_seq_len=self._max_seq_len)
+                text1_input_ids = text1_encoded_inputs["input_ids"]
+                text1_token_type_ids = text1_encoded_inputs["token_type_ids"]
+
+                text2_encoded_inputs = self._tokenizer(
+                    text=text2, max_seq_len=self._max_seq_len)
+                text2_input_ids = text2_encoded_inputs["input_ids"]
+                text2_token_type_ids = text2_encoded_inputs["token_type_ids"]
+
+                examples.append((text1_input_ids, text1_token_type_ids,
+                                 text2_input_ids, text2_token_type_ids))

batches = [
examples[idx:idx + self._batch_size]
for idx in range(0, len(examples), self._batch_size)
]

-        batchify_fn = lambda samples, fn=Tuple(
-            Pad(axis=0, pad_val=self._tokenizer.pad_token_id, dtype='int64'
-                ),  # text1_input_ids
-            Pad(axis=0,
-                pad_val=self._tokenizer.pad_token_type_id,
-                dtype='int64'),  # text1_token_type_ids
-            Pad(axis=0, pad_val=self._tokenizer.pad_token_id, dtype='int64'
-                ),  # text2_input_ids
-            Pad(axis=0,
-                pad_val=self._tokenizer.pad_token_type_id,
-                dtype='int64'),  # text2_token_type_ids
-        ): [data for data in fn(samples)]
+        if "rocketqa" in self.model_name:
+            batchify_fn = lambda samples, fn=Tuple(
+                Pad(axis=0, pad_val=self._tokenizer.pad_token_id),  # input_ids
+                Pad(axis=0, pad_val=self._tokenizer.pad_token_type_id
+                    ),  # token_type_ids
+            ): [data for data in fn(samples)]
+        else:
+            batchify_fn = lambda samples, fn=Tuple(
+                Pad(axis=0, pad_val=self._tokenizer.pad_token_id, dtype='int64'
+                    ),  # text1_input_ids
+                Pad(axis=0,
+                    pad_val=self._tokenizer.pad_token_type_id,
+                    dtype='int64'),  # text1_token_type_ids
+                Pad(axis=0, pad_val=self._tokenizer.pad_token_id, dtype='int64'
+                    ),  # text2_input_ids
+                Pad(axis=0,
+                    pad_val=self._tokenizer.pad_token_type_id,
+                    dtype='int64'),  # text2_token_type_ids
+            ): [data for data in fn(samples)]

outputs = {}
outputs['data_loader'] = batches
@@ -164,26 +249,36 @@ def _run_model(self, inputs):
Run the task model from the outputs of the `_tokenize` function.
"""
results = []
-        with static_mode_guard():
-            for batch in inputs['data_loader']:
-                text1_ids, text1_segment_ids, text2_ids, text2_segment_ids = self._batchify_fn(
-                    batch)
-                self.input_handles[0].copy_from_cpu(text1_ids)
-                self.input_handles[1].copy_from_cpu(text1_segment_ids)
-                self.predictor.run()
-                vecs_text1 = self.output_handle[1].copy_to_cpu()
-
-                self.input_handles[0].copy_from_cpu(text2_ids)
-                self.input_handles[1].copy_from_cpu(text2_segment_ids)
-                self.predictor.run()
-                vecs_text2 = self.output_handle[1].copy_to_cpu()
-
-                vecs_text1 = vecs_text1 / (vecs_text1**2).sum(
-                    axis=1, keepdims=True)**0.5
-                vecs_text2 = vecs_text2 / (vecs_text2**2).sum(
-                    axis=1, keepdims=True)**0.5
-                similarity = (vecs_text1 * vecs_text2).sum(axis=1)
-                results.extend(similarity)
+        if "rocketqa" in self.model_name:
+            with static_mode_guard():
+                for batch in inputs['data_loader']:
+                    input_ids, segment_ids = self._batchify_fn(batch)
+                    self.input_handles[0].copy_from_cpu(input_ids)
+                    self.input_handles[1].copy_from_cpu(segment_ids)
+                    self.predictor.run()
+                    scores = self.output_handle[0].copy_to_cpu().tolist()
+                    results.extend(scores)
+        else:
+            with static_mode_guard():
+                for batch in inputs['data_loader']:
+                    text1_ids, text1_segment_ids, text2_ids, text2_segment_ids = self._batchify_fn(
+                        batch)
+                    self.input_handles[0].copy_from_cpu(text1_ids)
+                    self.input_handles[1].copy_from_cpu(text1_segment_ids)
+                    self.predictor.run()
+                    vecs_text1 = self.output_handle[1].copy_to_cpu()
+
+                    self.input_handles[0].copy_from_cpu(text2_ids)
+                    self.input_handles[1].copy_from_cpu(text2_segment_ids)
+                    self.predictor.run()
+                    vecs_text2 = self.output_handle[1].copy_to_cpu()
+
+                    vecs_text1 = vecs_text1 / (vecs_text1**2).sum(
+                        axis=1, keepdims=True)**0.5
+                    vecs_text2 = vecs_text2 / (vecs_text2**2).sum(
+                        axis=1, keepdims=True)**0.5
+                    similarity = (vecs_text1 * vecs_text2).sum(axis=1)
+                    results.extend(similarity)
inputs['result'] = results
return inputs

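Setting the model-specific branches aside, `_preprocess` in the file above does two generic things: slice the tokenized examples into batches of `batch_size`, then pad each batch to its longest sequence. A dependency-free sketch of both steps, with plain-list stand-ins for the `Pad`/`Tuple` helpers:

```python
def batchify(examples, batch_size):
    # Same slicing _preprocess uses to form batches.
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]

def pad_batch(batch, pad_val=0):
    # Pad every sequence to the longest length in the batch,
    # approximating what paddlenlp.data.Pad does along axis 0.
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_val] * (max_len - len(seq)) for seq in batch]

ids = [[101, 7, 8, 102], [101, 9, 102]]
print(batchify(ids, 1))  # two batches of one example each
print(pad_batch(ids))    # → [[101, 7, 8, 102], [101, 9, 102, 0]]
```

Padding per batch rather than to a global maximum keeps tensors small when `batch_size` groups sequences of similar length, which is also why batch input is faster on average.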
13 changes: 7 additions & 6 deletions paddlenlp/transformers/semantic_search/modeling.py
@@ -268,9 +268,10 @@ def forward(self,
position_ids=position_ids,
attention_mask=attention_mask,
return_prob_distributation=True)
-        accuracy = paddle.metric.accuracy(input=probs, label=labels)
-        loss = F.cross_entropy(input=logits, label=labels)
-
-        outputs = {"loss": loss, "accuracy": accuracy}
-
-        return outputs
+        if labels is not None:
+            accuracy = paddle.metric.accuracy(input=probs, label=labels)
+            loss = F.cross_entropy(input=probs, label=labels)
+            outputs = {"loss": loss, "accuracy": accuracy}
+            return outputs
+        else:
+            return probs[:, 1]
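With the fix above, the cross-encoder's forward now guards the loss and accuracy computation behind `labels is not None` and, at inference time, returns `probs[:, 1]`: the softmax probability of the "similar" class for each pair, which is the score surfaced by `Taskflow("text_similarity")`. In plain Python that last step looks roughly like:

```python
import math

def positive_class_probs(logits_batch):
    # Softmax over the two classes per example, keeping class 1,
    # analogous to the `probs[:, 1]` returned when labels is None.
    out = []
    for z0, z1 in logits_batch:
        e0, e1 = math.exp(z0), math.exp(z1)
        out.append(e1 / (e0 + e1))
    return out

print([round(p, 3) for p in positive_class_probs([(0.0, 0.0), (-2.0, 2.0)])])  # → [0.5, 0.982]
```

Returning a single probability per pair keeps the inference contract simple (a flat list of floats in [0, 1]) while the training path still gets the full loss/accuracy dict.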