Merge branch 'develop' into update_codegen_doc

wj-Mcat authored Sep 21, 2022
2 parents 358acb8 + 2080001 commit 2ba7b97
Showing 35 changed files with 716 additions and 326 deletions.
28 changes: 23 additions & 5 deletions README_en.md
@@ -29,11 +29,29 @@
**PaddleNLP** is an *easy-to-use* and *powerful* NLP library with an **Awesome** pre-trained model zoo, supporting a wide range of NLP tasks from research to industrial applications.

## News 📢
* 📝 2022.8.1 **PaddleNLP v2.3.5** Released!
* Release the code generation model [**CodeGen**](./examples/code_generation/codegen), which can be easily used via [Taskflow](./docs/model_zoo/taskflow.md).
* Release [**UIE en**](./model_zoo/uie), which supports multiple tasks in **open-domain** information extraction.
* Release [**RGL**](./examples/few_shot/RGL), a prompt-based tuning approach for few-shot learning; the paper was accepted by NAACL 2022.
* 🍭 2022.6.29 **PaddleNLP v2.3.4** Released! The whole series of Chinese pretrained models [**ERNIE Tiny**](./model_zoo/ernie-3.0) is released to quickly improve deployment efficiency. We also provide the smaller and faster [**UIE Tiny**](./model_zoo/uie) models for universal information extraction.
* 🔥 **2022.9.6 [PaddleNLP v2.4](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.4.0) Released!**

* 💎 NLP Tool: **[Pipelines](./pipelines)** released. Supports fast construction of search engines and question answering systems, and can be extended to all kinds of NLP systems. Building end-to-end NLP pipelines is like playing with Lego!

* 💢 Industrial application: Release the **[Complete Solution of Text Classification](./applications/text_classification)**, covering various text classification scenarios: multi-class, multi-label and hierarchical; it also supports **few-shot learning** and training and optimization with **TrustAI**. Upgrade [**Universal Information Extraction**](./model_zoo/uie) and release **UIE-M**, which supports both Chinese and English information extraction in a single model; release a data distillation solution for UIE to break the inference-time bottleneck.

* 🍭 AIGC: Release the code generation SOTA model [**CodeGen**](./examples/code_generation/codegen), which supports code generation in multiple programming languages. Integrate the [**Text-to-Image models**](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/model_zoo/taskflow.md#%E6%96%87%E5%9B%BE%E7%94%9F%E6%88%90) DALL·E Mini, Disco Diffusion and Stable Diffusion; let's play and have some fun! Release the [**Chinese Text Summarization Application**](./applications/text_summarization): the first release of a Chinese text summarization model pretrained on a large-scale corpus, which can be used via the Taskflow API and supports finetuning on your own data.

* 💪 Framework upgrade: Release the [**Auto Model Compression API**](./docs/compression.md), which supports automatic pruning and quantization and lowers the barrier to model compression; release [**Few-shot Prompt**](./applications/text_classification/multi_class/few-shot), including algorithms such as PET, P-Tuning and RGL.


* 👀 **2022.9.6 PaddlePaddle Intelligent Finance Industry Live Course Series**

* Centering on industrial practice and development trends of deep learning technology in the financial industry, industry experts are invited to share their hands-on experience and discuss the future of intelligent finance.

* Release practical industrial examples: financial document information extraction based on UIE, and an FAQ question answering system based on Pipelines.

* **Live broadcast at 19:00 on Tuesdays and Thursdays starting September 6th.** Scan the QR code to join the WeChat group, get the live link for free, and discuss with the experts:

<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/188596360-264415d4-5462-43ad-8517-5b7e690061ce.jpg" width="150" height="150" />
</div>

* 🔥 2022.5.16 PaddleNLP [v2.3](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.3.0) Released!🎉
* 💎 Release the [**UIE** (Universal Information Extraction)](./model_zoo/uie) technique: a single model supports multiple **open-domain** IE tasks, and is super easy to use and finetune with only a few examples via [Taskflow](./docs/model_zoo/taskflow.md).
* 😊 Release [**ERNIE 3.0**](./model_zoo/ernie-3.0) light-weight models, which achieve better results than ERNIE 2.0 on [CLUE](https://www.cluebenchmarks.com/), together with **🗜️lossless model compression** and **⚙️end-to-end deployment**.
@@ -126,7 +126,8 @@ python sparse.py \
* `max_seq_length`: Maximum sequence length used by the tokenizer; for ERNIE models it must not exceed 2048. Choose it according to your text length (128, 256 and 512 are common choices), and lower it if you run out of GPU memory. Defaults to 128.
* `batch_size`: Batch size; adjust it to the available GPU memory and lower it if memory runs out. Defaults to 32.
* `seed`: Random seed. Defaults to 3.
* `rationale_num`: Number of supporting training rationales used when computing example confidence. Defaults to 3.
* `rationale_num_sparse`: Number of supporting training rationales used when computing example confidence during sparse-data selection. Defaults to 3.
* `rationale_num_support`: Number of supporting training rationales used when computing example confidence during support-data selection; increase it if not enough support data is selected. Defaults to 6.
* `sparse_num`: Number of sparse examples to select; 10%–20% of the dev set is recommended. Defaults to 100.
* `support_num`: Number of support examples used for data augmentation; 10%–20% of the training set is recommended. Defaults to 100.
* `support_threshold`: Threshold for selecting support data; only examples whose support-evidence score exceeds the threshold are kept. Defaults to 0.7.
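For reference, the flags above can be declared with a minimal `argparse` sketch. This is an illustrative fragment only, with defaults taken from the list above (the full script defines more than this):

```python
import argparse

# Illustrative fragment: flag names and defaults mirror the documented list
# above; this is not the complete sparse.py argument set.
parser = argparse.ArgumentParser()
parser.add_argument("--max_seq_length", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--seed", type=int, default=3)
parser.add_argument("--rationale_num_sparse", type=int, default=3)
parser.add_argument("--rationale_num_support", type=int, default=6)
parser.add_argument("--sparse_num", type=int, default=100)
parser.add_argument("--support_num", type=int, default=100)
parser.add_argument("--support_threshold", type=float, default=0.7)

args = parser.parse_args([])  # empty argv -> every flag falls back to its default
print(args.sparse_num, args.support_threshold)  # -> 100 0.7
```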
@@ -185,7 +186,8 @@ python sparse.py \
* `max_seq_length`: Maximum sequence length used by the tokenizer; for ERNIE models it must not exceed 2048. Choose it according to your text length (128, 256 and 512 are common choices), and lower it if you run out of GPU memory. Defaults to 128.
* `batch_size`: Batch size; adjust it to the available GPU memory and lower it if memory runs out. Defaults to 32.
* `seed`: Random seed. Defaults to 3.
* `rationale_num`: Number of supporting training rationales used when computing example confidence. Defaults to 3.
* `rationale_num_sparse`: Number of supporting training rationales used when computing example confidence during sparse-data selection. Defaults to 3.
* `rationale_num_support`: Number of supporting training rationales used when computing example confidence during support-data selection; increase it if not enough support data is selected. Defaults to 6.
* `sparse_num`: Number of sparse examples to select; 10%–20% of the dev set is recommended. Defaults to 100.
* `support_num`: Number of support examples used for data augmentation; 10%–20% of the training set is recommended. Defaults to 100.
* `support_threshold`: Threshold for selecting support data; only examples whose support-evidence score exceeds the threshold are kept. Defaults to 0.7.
@@ -42,7 +42,8 @@
parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.")
parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument("--seed", type=int, default=3, help="random seed for initialization")
parser.add_argument("--rationale_num", type=int, default=3, help="Number of rationales per example.")
parser.add_argument("--rationale_num_sparse", type=int, default=3, help="Number of rationales per example for sparse data.")
parser.add_argument("--rationale_num_support", type=int, default=6, help="Number of rationales per example for support data.")
parser.add_argument("--sparse_num", type=int, default=100, help="Number of sparse data.")
parser.add_argument("--support_threshold", type=float, default=0.7, help="The threshold to select support data.")
parser.add_argument("--support_num", type=int, default=100, help="Number of support data.")
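A side note on writing `default="0.7"` for a `type=float` flag: argparse parses a string default as if it had come from the command line, applying `type`, so the string happens to yield a `float` anyway; writing the default as a number is nevertheless clearer and avoids relying on that special case. A quick check:

```python
import argparse

parser = argparse.ArgumentParser()
# A *string* default is parsed as if it came from the command line, so
# `type` is applied and both flags end up holding the same float.
parser.add_argument("--threshold_str", type=float, default="0.7")
parser.add_argument("--threshold_num", type=float, default=0.7)

args = parser.parse_args([])
print(type(args.threshold_str).__name__)         # -> float
print(args.threshold_str == args.threshold_num)  # -> True
```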
@@ -180,7 +181,8 @@ def find_sparse_data():
# Feature similarity analysis & select sparse data
analysis_result = []
for batch in dev_data_loader:
analysis_result += feature_sim(batch, sample_num=args.rationale_num)
analysis_result += feature_sim(batch,
sample_num=args.rationale_num_sparse)
sparse_indexs, sparse_scores, preds = get_sparse_data(
analysis_result, args.sparse_num)
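`get_sparse_data` itself is not shown in this diff. A hypothetical sketch of the idea, assuming each analysis entry carries `(example_index, confidence_score, prediction)`: rank dev-set examples by confidence and keep the `sparse_num` lowest-scoring ones, i.e. the examples least supported by training evidence.

```python
# Hypothetical sketch only; the real get_sparse_data is not in this diff,
# and the (index, score, prediction) entry layout is an assumption.
def get_sparse_data(analysis_result, sparse_num):
    # Ascending sort by confidence: low-confidence examples are the "sparse"
    # ones that lack supporting evidence in the training set.
    ranked = sorted(analysis_result, key=lambda entry: entry[1])[:sparse_num]
    indexes = [idx for idx, _, _ in ranked]
    scores = [score for _, score, _ in ranked]
    preds = [pred for _, _, pred in ranked]
    return indexes, scores, preds

demo = [(0, 0.9, "A"), (1, 0.2, "B"), (2, 0.5, "A")]
print(get_sparse_data(demo, 2))  # -> ([1, 2], [0.2, 0.5], ['B', 'A'])
```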

@@ -285,7 +287,8 @@ def find_support_data():
# Feature similarity analysis
analysis_result = []
for batch in sparse_data_loader:
analysis_result += feature_sim(batch, sample_num=-1)
analysis_result += feature_sim(batch,
sample_num=args.rationale_num_support)

support_indexs, support_scores = get_support_data(analysis_result,
args.support_num,
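The call above hands the analysis to `get_support_data`, which is also not shown in this diff. A hypothetical sketch, assuming each entry carries `(example_index, support_score)`: drop examples below `support_threshold`, then keep the top `support_num` by score.

```python
# Hypothetical sketch only; the real get_support_data is not in this diff,
# and the (index, score) entry layout is an assumption.
def get_support_data(analysis_result, support_num, support_threshold):
    # Keep only examples whose support-evidence score clears the threshold,
    # then take the support_num highest-scoring of them.
    kept = [entry for entry in analysis_result if entry[1] > support_threshold]
    kept.sort(key=lambda entry: entry[1], reverse=True)
    kept = kept[:support_num]
    indexes = [idx for idx, _ in kept]
    scores = [score for _, score in kept]
    return indexes, scores

demo = [(0, 0.95), (1, 0.4), (2, 0.8)]
print(get_support_data(demo, 2, 0.7))  # -> ([0, 2], [0.95, 0.8])
```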
@@ -124,7 +124,8 @@ python sparse.py \
* `max_seq_length`: Maximum sequence length used by the tokenizer; for ERNIE models it must not exceed 2048. Choose it according to your text length (128, 256 and 512 are common choices), and lower it if you run out of GPU memory. Defaults to 128.
* `batch_size`: Batch size; adjust it to the available GPU memory and lower it if memory runs out. Defaults to 32.
* `seed`: Random seed. Defaults to 3.
* `rationale_num`: Number of supporting training rationales used when computing example confidence. Defaults to 3.
* `rationale_num_sparse`: Number of supporting training rationales used when computing example confidence during sparse-data selection. Defaults to 3.
* `rationale_num_support`: Number of supporting training rationales used when computing example confidence during support-data selection; increase it if not enough support data is selected. Defaults to 6.
* `sparse_num`: Number of sparse examples to select; 10%–20% of the dev set is recommended. Defaults to 100.
* `support_num`: Number of support examples used for data augmentation; 10%–20% of the training set is recommended. Defaults to 100.
* `support_threshold`: Threshold for selecting support data; only examples whose support-evidence score exceeds the threshold are kept. Defaults to 0.7.
@@ -182,7 +183,8 @@ python sparse.py \
* `max_seq_length`: Maximum sequence length used by the tokenizer; for ERNIE models it must not exceed 2048. Choose it according to your text length (128, 256 and 512 are common choices), and lower it if you run out of GPU memory. Defaults to 128.
* `batch_size`: Batch size; adjust it to the available GPU memory and lower it if memory runs out. Defaults to 32.
* `seed`: Random seed. Defaults to 3.
* `rationale_num`: Number of supporting training rationales used when computing example confidence. Defaults to 3.
* `rationale_num_sparse`: Number of supporting training rationales used when computing example confidence during sparse-data selection. Defaults to 3.
* `rationale_num_support`: Number of supporting training rationales used when computing example confidence during support-data selection; increase it if not enough support data is selected. Defaults to 6.
* `sparse_num`: Number of sparse examples to select; 10%–20% of the dev set is recommended. Defaults to 100.
* `support_num`: Number of support examples used for data augmentation; 10%–20% of the training set is recommended. Defaults to 100.
* `support_threshold`: Threshold for selecting support data; only examples whose support-evidence score exceeds the threshold are kept. Defaults to 0.7.
@@ -42,7 +42,8 @@
parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.")
parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument("--seed", type=int, default=3, help="random seed for initialization")
parser.add_argument("--rationale_num", type=int, default=3, help="Number of rationales per example.")
parser.add_argument("--rationale_num_sparse", type=int, default=3, help="Number of rationales per example for sparse data.")
parser.add_argument("--rationale_num_support", type=int, default=6, help="Number of rationales per example for support data.")
parser.add_argument("--sparse_num", type=int, default=100, help="Number of sparse data.")
parser.add_argument("--support_threshold", type=float, default=0.7, help="The threshold to select support data.")
parser.add_argument("--support_num", type=int, default=100, help="Number of support data.")
@@ -180,7 +181,8 @@ def find_sparse_data():
# Feature similarity analysis & select sparse data
analysis_result = []
for batch in dev_data_loader:
analysis_result += feature_sim(batch, sample_num=args.rationale_num)
analysis_result += feature_sim(batch,
sample_num=args.rationale_num_sparse)
sparse_indexs, sparse_scores, preds = get_sparse_data(
analysis_result, args.sparse_num)

@@ -290,7 +292,8 @@ def find_support_data():
# Feature similarity analysis
analysis_result = []
for batch in sparse_data_loader:
analysis_result += feature_sim(batch, sample_num=-1)
analysis_result += feature_sim(batch,
sample_num=args.rationale_num_support)

support_indexs, support_scores = get_support_data(analysis_result,
args.support_num,
@@ -124,7 +124,8 @@ python sparse.py \
* `max_seq_length`: Maximum sequence length used by the tokenizer; for ERNIE models it must not exceed 2048. Choose it according to your text length (128, 256 and 512 are common choices), and lower it if you run out of GPU memory. Defaults to 128.
* `batch_size`: Batch size; adjust it to the available GPU memory and lower it if memory runs out. Defaults to 32.
* `seed`: Random seed. Defaults to 3.
* `rationale_num`: Number of supporting training rationales used when computing example confidence. Defaults to 3.
* `rationale_num_sparse`: Number of supporting training rationales used when computing example confidence during sparse-data selection. Defaults to 3.
* `rationale_num_support`: Number of supporting training rationales used when computing example confidence during support-data selection; increase it if not enough support data is selected. Defaults to 6.
* `sparse_num`: Number of sparse examples to select; 10%–20% of the dev set is recommended. Defaults to 100.
* `support_num`: Number of support examples used for data augmentation; 10%–20% of the training set is recommended. Defaults to 100.
* `support_threshold`: Threshold for selecting support data; only examples whose support-evidence score exceeds the threshold are kept. Defaults to 0.7.
@@ -183,7 +184,8 @@ python sparse.py \
* `max_seq_length`: Maximum sequence length used by the tokenizer; for ERNIE models it must not exceed 2048. Choose it according to your text length (128, 256 and 512 are common choices), and lower it if you run out of GPU memory. Defaults to 128.
* `batch_size`: Batch size; adjust it to the available GPU memory and lower it if memory runs out. Defaults to 32.
* `seed`: Random seed. Defaults to 3.
* `rationale_num`: Number of supporting training rationales used when computing example confidence. Defaults to 3.
* `rationale_num_sparse`: Number of supporting training rationales used when computing example confidence during sparse-data selection. Defaults to 3.
* `rationale_num_support`: Number of supporting training rationales used when computing example confidence during support-data selection; increase it if not enough support data is selected. Defaults to 6.
* `sparse_num`: Number of sparse examples to select; 10%–20% of the dev set is recommended. Defaults to 100.
* `support_num`: Number of support examples used for data augmentation; 10%–20% of the training set is recommended. Defaults to 100.
* `support_threshold`: Threshold for selecting support data; only examples whose support-evidence score exceeds the threshold are kept. Defaults to 0.7.
@@ -42,7 +42,8 @@
parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.")
parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument("--seed", type=int, default=3, help="random seed for initialization")
parser.add_argument("--rationale_num", type=int, default=3, help="Number of rationales per example.")
parser.add_argument("--rationale_num_sparse", type=int, default=3, help="Number of rationales per example for sparse data.")
parser.add_argument("--rationale_num_support", type=int, default=6, help="Number of rationales per example for support data.")
parser.add_argument("--sparse_num", type=int, default=100, help="Number of sparse data.")
parser.add_argument("--support_threshold", type=float, default=0.7, help="The threshold to select support data.")
parser.add_argument("--support_num", type=int, default=100, help="Number of support data.")
@@ -180,7 +181,8 @@ def find_sparse_data():
# Feature similarity analysis & select sparse data
analysis_result = []
for batch in dev_data_loader:
analysis_result += feature_sim(batch, sample_num=args.rationale_num)
analysis_result += feature_sim(batch,
sample_num=args.rationale_num_sparse)
sparse_indexs, sparse_scores, preds = get_sparse_data(
analysis_result, args.sparse_num)

@@ -280,7 +282,8 @@ def find_support_data():
# Feature similarity analysis
analysis_result = []
for batch in sparse_data_loader:
analysis_result += feature_sim(batch, sample_num=-1)
analysis_result += feature_sim(batch,
sample_num=args.rationale_num_support)

support_indexs, support_scores = get_support_data(analysis_result,
args.support_num,