Add FasterTokenizer model in experiment #1220

Steffy-zxf · 2021-10-22T07:50:32Z

PR types

New features

PR changes

APIs

Description

1. Add FasterTokenizer model in experiment
2. Add demo with FasterTokenizer usage in experiment

ZeyuChen

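Rename to_string_tensor to to_tensor and move it into experimental.
Dynamic-to-static conversion needs to be built into the FasterModelForXXXX classes; the upper-level dynamic-to-static interface should hide the STRINGS object from users.
For dynamic-to-static export, I suggest exporting both probs and the argmax result, which makes the inference results easier to use.
Do not make users write their own softmax operator implementation.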
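
To illustrate the suggestion about exporting probs together with the argmax result, here is a minimal pure-Python sketch of the post-processing that would otherwise be left to the user. The helper names `softmax` and `postprocess` are hypothetical and are not part of PaddleNLP; in an exported model this step would presumably be done by graph operators rather than Python code.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(batch_logits: List[List[float]]) -> Tuple[List[List[float]], List[int]]:
    """Return (probs, predictions) for a batch of logits.

    Hypothetical helper: it mirrors what an exported model could
    return directly so that callers need no softmax of their own.
    """
    probs = [softmax(row) for row in batch_logits]
    # argmax: index of the largest probability in each row
    predictions = [max(range(len(row)), key=row.__getitem__) for row in probs]
    return probs, predictions


probs, predictions = postprocess([[2.0, 0.5], [-1.0, 3.0]])
```

If the exported model returns both values, the caller only has to read `predictions` for the label index and `probs` for the confidence.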

ZeyuChen · 2021-10-26T11:48:20Z

examples/experimental/faster_ernie/cpp_deploy/compile.sh

+rm -rf *
+
+# same with the demo.cc
+DEMO_NAME=demo


Don't call it DEMO; this is not a demo.

Changed to text_cls_infer.

ZeyuChen · 2021-10-26T11:48:38Z

examples/experimental/faster_ernie/cpp_deploy/demo.cc

@@ -0,0 +1,64 @@
+


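Remove these inexplicable blank lines.

Don't name the file demo; rename it to infer.cc.

Or ernie_infer. Later we should also distinguish between sentence classification and sequence labeling tasks.

Done, changed to text_cls_infer.

seq_cls_infer/token_cls_infer would probably be more consistent with the class names.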

ZeyuChen · 2021-10-26T11:50:00Z

examples/experimental/faster_ernie/cpp_deploy/demo.cc

+      "办理入住手续，节省时间。"};
+
+  std::vector<float> probs;
+  Run(predictor.get(), &data, &probs);


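The printed result needs to be shown.

The output should be a const reference; keep data as the input.

If this demo is only for classification, state clearly that it is for classification and keep it separate from sequence labeling.

seq_cls_infer/token_cls_infer would probably be more consistent with the class names.

Done, changed to seq_cls_infer.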

ZeyuChen · 2021-10-26T11:51:21Z

examples/experimental/faster_ernie/cpp_deploy/demo.cc

+}
+
+void Run(Predictor* predictor,
+         std::vector<std::string>* input_data,


The input should be a const reference; only the output should be a pointer.

Changed to:

void Run(Predictor* predictor, const std::vector<std::string>& input_data, std::vector<float>* logits, std::vector<int64_t>* predictions)

ZeyuChen · 2021-10-26T11:52:47Z

examples/experimental/faster_ernie/export_model.py

+
+import paddle
+import paddlenlp
+from paddlenlp.experimental import FastSequenceClassificationModel


Faster. We use Faster uniformly as the technical codename throughout.

ZeyuChen · 2021-10-27T03:29:46Z

paddlenlp/experimental/model.py

+        return logits
+
+
+class FastSequenceClassificationModel(object):


FasterModelForSequenceClassification

ZeyuChen · 2021-10-27T03:30:54Z

paddlenlp/ops/strings.py

 import paddle.fluid.core as core

+__all__ = ['to_string_tensor', 'to_vocab_tensor']


Move the whole module into paddlenlp.experimental.

ZeyuChen · 2021-10-27T03:32:00Z

paddlenlp/experimental/model.py

+            raise ValueError("Unknown name %s. Now %s surports  %s" %
+                             (pretrained_model_name_or_path, cls.__name__,
+                              list(name_model.keys())))
+


Add a to_static interface based on this class so that the STRINGS type is not exposed externally.

ZeyuChen · 2021-10-27T03:32:05Z

paddlenlp/experimental/model.py

+        else:
+            raise ValueError("Unknown name %s. Now %s surports  %s" %
+                             (pretrained_model_name_or_path, cls.__name__,
+                              list(name_model.keys())))


Add a to_static interface based on this class so that the STRINGS type is not exposed externally.

ZeyuChen · 2021-10-27T03:32:23Z

paddlenlp/ops/strings.py

@@ -12,10 +12,13 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import paddle


Move this into paddlenlp/experimental/.

ZeyuChen · 2021-10-28T01:57:30Z

examples/experimental/faster_ernie/text_cls/cpp_deploy/CMakeLists.txt

+    set(CUDA_LIB "/usr/local/cuda/lib64/" CACHE STRING "CUDA Library")
+  else()
+    if(CUDA_LIB STREQUAL "")
+      set(CUDA_LIB "C:\\Program\ Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\lib\\x64")


The hardcoded path in this command may not be correct; we need to test and verify it on Windows later.

ZeyuChen · 2021-10-28T14:30:53Z

examples/experimental/faster_ernie/text_cls/train.py

+        losses.append(loss.numpy())
+        correct = metric.compute(logits, labels)
+        metric.update(correct)
+        accu = metric.accumulate()


Should the accumulate call here be outside the loop or inside it?
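
The question above can be made concrete with a small sketch. It assumes the metric keeps running totals across update calls, which is how a stateful accuracy metric typically behaves; `RunningAccuracy` is a hypothetical pure-Python stand-in and is not the metric class used in train.py.

```python
from typing import List


class RunningAccuracy:
    """Hypothetical stand-in for a stateful accuracy metric."""

    def __init__(self) -> None:
        self.correct = 0
        self.total = 0

    def update(self, correct_flags: List[bool]) -> None:
        # Add this batch's counts to the running totals.
        self.correct += sum(correct_flags)
        self.total += len(correct_flags)

    def accumulate(self) -> float:
        # Accuracy over everything seen since construction.
        return self.correct / self.total if self.total else 0.0


# Three batches of per-sample correctness flags (made-up data).
batches = [[True, False], [True, True], [False, False]]

metric = RunningAccuracy()
per_step = []
for flags in batches:
    metric.update(flags)
    per_step.append(metric.accumulate())  # inside the loop: running value

final = metric.accumulate()  # outside the loop: value over all batches
```

Under that assumption the final value is the same either way, so calling accumulate inside the loop is only needed when the running accuracy is logged per step.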

ZeyuChen · 2021-10-28T14:31:02Z

examples/experimental/faster_ernie/text_cls/train.py

+
+
+def create_dataloader(dataset, mode='train', batch_size=1):
+


Remove the unnecessary blank lines.

ZeyuChen · 2021-10-28T14:32:28Z

examples/experimental/faster_tokenizer/demo.py

+text = '小说是文学的一种样式，一般描写人物故事，塑造多种多样的人物形象，但亦有例外。它是拥有不完整布局、发展及主题的文学作品。而对话是不是具有鲜明的个性，每个人物说的没有独特的语言风格，是衡量小说水准的一个重要标准。与其他文学样式相比，小说的容量较大，它可以细致的展现人物性格和命运，可以表现错综复杂的矛盾冲突，同时还可以描述人物所处的社会生活环境。小说一词，最早见于《庄子·外物》：“饰小说以干县令，其于大达亦远矣。”这里所说的小说，是指琐碎的言谈、小的道理，与现时所说的小说相差甚远。文学中，小说通常指长篇小说、中篇、短篇小说和诗的形式。小说是文学的一种样式，一般描写人物故事，塑造多种多样的人物形象，但亦有例外。它是拥有不完整布局、发展及主题的文学作品。而对话是不是具有鲜明的个性，每个人物说的没有独特的语言风格，是衡量小说水准的一个重要标准。与其他文学样式相比，小说的容量较大，它可以细致的展现人物性格和命运，可以表现错综复杂的矛盾冲突，同时还可以描述人物所处的社会生活环境。小说一词，最早见于《庄子·外物》：“饰小说以干县令，其于大达亦远矣。”这里所说的小说，是指琐碎的言谈、小的道理，与现时所说的小说相差甚远。文学中'
+data = [text[:max_seq_length]] * 100
+
+pp_tokenizer = FasterTokenizer(vocab, do_lower_case=False)


Should the interface here match the API experience of XXXTokenizer.from_pretrained? Also, do we need

ZeyuChen

LGTM

Steffy-zxf added 5 commits October 22, 2021 06:12

1. add FasterTokenizer model in experiment

d8eacc9

2. add demo with FasterTokenizer usage in experiment

add cpp deploy demo

18587f9

add test time script

9b74b1b

add example

f6ef389

mv python_deploy location

6e32df5

ZeyuChen self-assigned this Oct 26, 2021

Steffy-zxf added 3 commits October 26, 2021 22:13

optimize cpp demo

771af1b

add faster model

3775622

add faster tokenizer export model

8eec9ca

ZeyuChen reviewed Oct 27, 2021

Steffy-zxf added 6 commits October 27, 2021 17:00

1. rename to_string_tensor to to_tensor

d5ff0ee

2. rename to_vocab_tensor to to_vocab_buff 3. add to_static() api to FasterModel

1. add text cls example with FasterModelForSequenceClassification

84959f6

2. remove some redundant code

update cpp deploy

f96c478

1. rename text_cls with seq_cls

d6c77a3

2. support from_pretrained() given a local directory

add experimental.model_utils.py

bcb5d76

update seq_cls cpp demo, drop to print the label string

734b93e

ZeyuChen reviewed Oct 30, 2021

Steffy-zxf added 2 commits November 1, 2021 10:29

add FasterTokenizer.from_pretrained() api

046d9ce

use FasterTokenizer.from_pretrained() api

de328db

ZeyuChen approved these changes Nov 1, 2021

Merge branch 'develop' into exp-faster-tokenizer

fda2ec2

ZeyuChen merged commit 9e198d0 into PaddlePaddle:develop Nov 1, 2021
