Merge branch 'master' into docs

Newsouljulio · Oct 28, 2020 · aa969f6
2 parents f75e71a + 68f8315

Showing 14 changed files with 712 additions and 21 deletions.
diff --git a/README.md b/README.md
@@ -68,6 +68,7 @@
     |   Rank   |                    [FGCNN](models/rank/fgcnn/model.py)                    |    ✓    |    ✓    |     ✓     |     ✓     | [WWW 2019][Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1904.04447.pdf)                                                                      |
     |   Rank   |                  [Fibinet](models/rank/fibinet/model.py)                  |    ✓    |    ✓    |     ✓     |     ✓     | [RecSys19][FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction]( https://arxiv.org/pdf/1905.09433.pdf)                                                 |
     |   Rank   |                     [Flen](models/rank/flen/model.py)                     |    ✓    |    ✓    |     ✓     |     ✓     | [2019][FLEN: Leveraging Field for Scalable CTR Prediction]( https://arxiv.org/pdf/1911.04690.pdf)                                                                                                           |
+    |Multi-task|                  [PLE](models/multitask/ple/model.py)                   |    ✓    |    ✓    |     ✓     |     ✓     | [RecSys 2020][Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations](https://dl.acm.org/doi/abs/10.1145/3383313.3412236)                                                              |
     |Multi-task|                  [ESMM](models/multitask/esmm/model.py)                   |    ✓    |    ✓    |     ✓     |     ✓     | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931)                                                              |
     |Multi-task|                  [MMOE](models/multitask/mmoe/model.py)                   |    ✓    |    ✓    |     ✓     |     ✓     | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007)                                                       |
     |Multi-task|           [ShareBottom](models/multitask/share-bottom/model.py)           |    ✓    |    ✓    |     ✓     |     ✓     | [1998][Multitask learning](http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf)                                                                                                               |
@@ -81,7 +82,7 @@
 
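 ### Environment requirements
 * Python 2.7/ 3.5 / 3.6 / 3.7
-* PaddlePaddle  >= 1.7.2
+* PaddlePaddle  >= 1.7.2, <= 1.8.5
 * Operating system: Windows/Mac/Linux
 
   > On Windows, PaddleRec currently supports only single-machine training; Linux is recommended for distributed training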
@@ -99,10 +100,10 @@
 
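 - Installation method 2: **build and install from source**
 
-  - Install PaddlePaddle  **Note: PaddlePaddle version == 1.7.2 must be installed**
+  - Install PaddlePaddle  **Note: PaddlePaddle version == 1.8.5 must be installed**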
 
     ```shell
-    python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple
+    python -m pip install paddlepaddle==1.8.5 -i https://mirror.baidu.com/pypi/simple
     ```
 
   - Install PaddleRec from source
@@ -175,6 +176,7 @@ python -m paddlerec.run -m models/rank/dnn/config.yaml
 <p>
 
 ### Version history
+- 2020.10.12 - PaddleRec v1.8.5
 - 2020.06.17 - PaddleRec v0.1.0
 - 2020.06.03 - PaddleRec v0.0.2
 - 2020.05.14 - PaddleRec v0.0.1

diff --git a/README_EN.md b/README_EN.md
@@ -76,7 +76,7 @@
 
 ### Environmental requirements
 * Python 2.7/ 3.5 / 3.6 / 3.7
-* PaddlePaddle  >= 1.7.2
+* PaddlePaddle  >= 1.7.2, <= 1.8.5
 * operating system: Windows/Mac/Linux
 
   > Linux is recommended for distributed training
@@ -97,7 +97,7 @@
   - Install PaddlePaddle  
 
     ```shell
-    python -m pip install paddlepaddle==1.7.2 -i https://mirror.baidu.com/pypi/simple
+    python -m pip install paddlepaddle==1.8.5 -i https://mirror.baidu.com/pypi/simple
     ```
 
   - Install PaddleRec by source code
@@ -169,6 +169,7 @@ python -m paddlerec.run -m models/rank/dnn/config.yaml
 <p>
 
 ### Version history
+- 2020.10.12 - PaddleRec v1.8.5
 - 2020.06.17 - PaddleRec v0.1.0
 - 2020.06.03 - PaddleRec v0.0.2
 - 2020.05.14 - PaddleRec v0.0.1
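
Review note: both READMEs now bound the supported PaddlePaddle version to the range 1.7.2 through 1.8.5. A minimal sketch of checking a version string against that range follows. The helper names `parse_version` and `is_supported` are hypothetical and not part of PaddleRec, and the sketch assumes plain `major.minor.patch` strings (for example, the value of `paddle.__version__`) with no suffixes such as `rc1`.

```python
MIN_PADDLE = (1, 7, 2)  # lower bound stated in the README
MAX_PADDLE = (1, 8, 5)  # upper bound stated in the README


def parse_version(text):
    """Turn '1.8.5' into (1, 8, 5); suffixes such as 'rc1' are not handled."""
    return tuple(int(part) for part in text.split("."))


def is_supported(version_text):
    """Return True when the version falls inside the documented range."""
    return MIN_PADDLE <= parse_version(version_text) <= MAX_PADDLE


for candidate in ("1.7.1", "1.7.2", "1.8.5", "2.0.0"):
    print(candidate, is_supported(candidate))
```

Comparing integer tuples avoids the pitfall of string comparison, where `"1.10.0"` would sort before `"1.8.5"`.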

diff --git a/models/demo/movie_recommand/rank/model.py b/models/demo/movie_recommand/rank/model.py
@@ -83,10 +83,6 @@ def embedding_layer(input):
 
         predict = fluid.layers.scale(sim, scale=5)
         self.predict = predict
-        # auc, batch_auc, _ = fluid.layers.auc(input=self.predict,
-        #                                     label=self.label_input,
-        #                                     num_thresholds=10000,
-        #                                     slide_steps=20)
 
         if is_infer:
             self._infer_results["uid"] = self._sparse_data_var[2]
@@ -95,10 +91,6 @@ def embedding_layer(input):
             self._infer_results["predict"] = self.predict
             return
 
-        #self._metrics["AUC"] = auc
-        #self._metrics["BATCH_AUC"] = batch_auc
-        # cost = fluid.layers.cross_entropy(
-        #    input=self.predict, label=self.label_input)
         cost = fluid.layers.square_error_cost(
             self.predict,
             fluid.layers.cast(

diff --git a/models/multitask/ple/README.md b/models/multitask/ple/README.md
@@ -0,0 +1,107 @@
+# PLE
+
+The directory structure of this example is outlined below:
+
+```
+├── data # data directory
+	├── train # training data
+		├── train_data.txt
+	├── test  # test data
+		├── test_data.txt
+	├── run.sh
+	├── data_preparation.py
+├── __init__.py 
+├── config.yaml # configuration file
+├── census_reader.py # data reader
+├── model.py # model definition
+```
+
+Note: before reading this example, we recommend familiarizing yourself with the following:
+
+[PaddleRec getting started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
+
+## Contents
+
+- [Model overview](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/ple#模型简介)
+- [Data preparation](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/ple#数据准备)
+- [Runtime environment](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/ple#运行环境)
+- [Quick start](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/ple#快速开始)
+- [Paper reproduction](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/ple#论文复现)
+- [Advanced usage](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/ple#进阶使用)
+- [FAQ](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/ple#FAQ)
+
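+## Model overview
+
+Multi-task models learn the connections and differences between tasks, which can improve the learning efficiency and quality of each task. In multi-task settings, however, a seesaw phenomenon often appears: some tasks perform well while others get worse. The paper [Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations](https://dl.acm.org/doi/abs/10.1145/3383313.3412236) proposes Progressive Layered Extraction (PLE) to address this seesaw phenomenon.
+
+We define the PLE network structure in PaddlePaddle and validate the model on the open-source Census-income Data dataset.
+
+For accuracy verification, see the [Paper reproduction](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/ple#论文复现) section.
+
+Features supported by this project:
+
+- Training: single-machine CPU, single-machine single-GPU, single-machine multi-GPU, local simulated parameter-server training, and incremental training. For configuration, see [Start training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md).
+- Inference: single-machine CPU and single-machine single-GPU. For configuration, see [PaddleRec offline inference](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md).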
+
+## Data preparation
+
+Data source: [Census-income Data](https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz)
+
+
+The generated data is comma-separated:
+
+```
+0,0,73,0,0,0,0,1700.09,0,0
+```
+
+For the full dataset, see the paper reproduction section.
+
+## Runtime environment
+
+PaddlePaddle>=1.7.2
+
+Python 2.7/3.5/3.6/3.7
+
+PaddleRec >=0.1
+
+OS: Windows/Linux/macOS
+
+## Quick start
+
+### Single-machine training
+
+CPU environment
+
+Set the device, epochs, and other options in config.yaml.
+
+```
+dataset:
+- name: dataset_train
+  batch_size: 5
+  type: QueueDataset
+  data_path: "{workspace}/data/train"
+  data_converter: "{workspace}/census_reader.py"
+- name: dataset_infer
+  batch_size: 5
+  type: QueueDataset
+  data_path: "{workspace}/data/train"
+  data_converter: "{workspace}/census_reader.py"
+```
+
+### Single-machine inference
+
+CPU environment
+
+Set epochs, device, and other parameters in config.yaml.
+```
+- name: infer_runner
+  class: infer
+  init_model_path: "increment/0"
+  device: cpu
+```
+
+## Paper reproduction
+
+## Advanced usage
+
+## FAQ
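
Review note: the README describes PLE only at a high level, and `model.py` is not shown in this part of the diff. The sketch below illustrates one extraction level in the spirit of the paper: each task owns a few experts, all tasks share a few more, and a per-task softmax gate mixes the two groups. It is plain Python rather than Paddle code, every function name is hypothetical, and the layer shapes are an assumption based on the paper's description rather than on this PR's implementation. Sizes reuse `task_num`, `shared_num`, `exp_per_task`, and `expert_size` from the PR's `config.yaml`; the input width is shrunk for the demo.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, relu=False):
    """Bias-free linear layer; `weights` holds one row per output unit."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [max(0.0, o) for o in out] if relu else out


def init_weights(rng, n_out, n_in):
    """Small random weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def cgc_forward(x, task_experts, shared_experts, task_gates):
    """One extraction level: each task mixes its own experts with the shared ones."""
    shared_out = [dense(x, w, relu=True) for w in shared_experts]
    task_outputs, gate_weights = [], []
    for experts, gate in zip(task_experts, task_gates):
        candidates = [dense(x, w, relu=True) for w in experts] + shared_out
        weights = softmax(dense(x, gate))  # one weight per candidate expert
        mixed = [sum(g * c[i] for g, c in zip(weights, candidates))
                 for i in range(len(candidates[0]))]
        task_outputs.append(mixed)
        gate_weights.append(weights)
    return task_outputs, gate_weights


# Sizes taken from hyper_parameters in config.yaml.
task_num, shared_num, exp_per_task, expert_size = 2, 2, 3, 16
feature_size = 8  # assumption for the demo only; config.yaml uses 499

rng = random.Random(0)
x = [rng.random() for _ in range(feature_size)]
task_experts = [[init_weights(rng, expert_size, feature_size)
                 for _ in range(exp_per_task)] for _ in range(task_num)]
shared_experts = [init_weights(rng, expert_size, feature_size)
                  for _ in range(shared_num)]
task_gates = [init_weights(rng, exp_per_task + shared_num, feature_size)
              for _ in range(task_num)]

task_outputs, gate_weights = cgc_forward(x, task_experts, shared_experts, task_gates)
print(len(task_outputs), len(task_outputs[0]), len(gate_weights[0]))
```

Under these assumptions each task gate spans `exp_per_task + shared_num` experts, so a task never reads another task's private experts. That separation is the mechanism the paper credits for reducing the seesaw effect.
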
diff --git a/models/multitask/ple/__init__.py b/models/multitask/ple/__init__.py
@@ -0,0 +1,13 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/models/multitask/ple/census_reader.py b/models/multitask/ple/census_reader.py
@@ -0,0 +1,52 @@
+#   Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+
+from paddlerec.core.reader import ReaderBase
+
+
+class Reader(ReaderBase):
+    def init(self):
+        pass
+
+    def generate_sample(self, line):
+        """
+        Read the data line by line and process it as a dictionary
+
+        """
+
+        def reader():
+            """
+            This function needs to be implemented by the user, based on data format
+            """
+            # Parse one comma-separated row into floats.
+            l = line.strip().split(',')
+            l = list(map(float, l))
+            label_income = []
+            label_marital = []
+            # Columns 2 onward are the input features.
+            data = l[2:]
+            # Column 1 is the income label, one-hot encoded over two classes.
+            if int(l[1]) == 0:
+                label_income = [1, 0]
+            elif int(l[1]) == 1:
+                label_income = [0, 1]
+            # Column 0 is the marital-status label, encoded the same way.
+            if int(l[0]) == 0:
+                label_marital = [1, 0]
+            elif int(l[0]) == 1:
+                label_marital = [0, 1]
+            feature_name = ["input", "label_income", "label_marital"]
+            yield zip(feature_name, [data] + [label_income] + [label_marital])
+
+        return reader
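
Review note: to make the reader's column layout concrete, here is a standalone sketch that applies the same split to the sample row from the PLE README. It does not import `paddlerec`. The function name `parse_census_line` is hypothetical, and unlike the reader above it raises `KeyError` for a label other than 0 or 1 instead of emitting an empty list.

```python
ONE_HOT = {0: [1, 0], 1: [0, 1]}  # two-class one-hot encoding used by the reader


def parse_census_line(line):
    """Split one CSV row into (features, label_income, label_marital)."""
    values = [float(v) for v in line.strip().split(",")]
    label_marital = ONE_HOT[int(values[0])]  # column 0
    label_income = ONE_HOT[int(values[1])]   # column 1
    features = values[2:]                    # remaining columns
    return features, label_income, label_marital


features, label_income, label_marital = parse_census_line(
    "0,0,73,0,0,0,0,1700.09,0,0")
print(features)
print(label_income, label_marital)
```

The README sample row carries only 8 feature columns, while `config.yaml` sets `feature_size: 499`, so the sample appears to be abbreviated relative to the real training rows.
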
diff --git a/models/multitask/ple/config.yaml b/models/multitask/ple/config.yaml
@@ -0,0 +1,70 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+workspace: "models/multitask/ple"
+
+dataset:
+- name: dataset_train
+  batch_size: 5 # use 32 for the full dataset
+  type: DataLoader # or QueueDataset
+  data_path: "{workspace}/data/train"
+  data_converter: "{workspace}/census_reader.py"
+- name: dataset_infer
+  batch_size: 5 # use 32 for the full dataset
+  type: DataLoader # or QueueDataset
+  data_path: "{workspace}/data/train"
+  data_converter: "{workspace}/census_reader.py"
+
+hyper_parameters:
+  feature_size: 499
+  task_num: 2
+  shared_num: 2
+  exp_per_task: 3
+  level_number: 1
+  expert_size: 16
+  tower_size: 8
+  optimizer: 
+    class: adam
+    learning_rate: 0.001
+    strategy: async
+
+mode: [train_runner, infer_runner]
+
+runner:
+- name: train_runner
+  class: train
+  device: cpu # or gpu
+  selected_gpus: "0"
+  epochs: 10
+  save_checkpoint_interval: 1
+  save_inference_interval: 4
+  save_checkpoint_path: "increment_ple"
+  save_inference_path: "inference"
+  print_interval: 1 # use 10 for the full dataset
+  phases: [train]
+- name: infer_runner
+  class: infer
+  init_model_path: "increment_ple/1"
+  device: cpu # or gpu
+  phases: [infer]
+
+phase:
+- name: train
+  model: "{workspace}/model.py"
+  dataset_name: dataset_train
+  thread_num: 1
+- name: infer
+  model: "{workspace}/model.py"
+  dataset_name: dataset_infer
+  thread_num: 1
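
Review note: `config.yaml` links its sections by name. `mode` lists runners, each runner lists phases, and each phase names a dataset. A minimal sketch of checking those cross-references is below. The dictionary is copied by hand from the file above rather than parsed from YAML, and `find_dangling_references` is a hypothetical helper, not a PaddleRec API.

```python
# Names mirrored by hand from models/multitask/ple/config.yaml.
config = {
    "dataset": ["dataset_train", "dataset_infer"],
    "mode": ["train_runner", "infer_runner"],
    "runner": {"train_runner": ["train"], "infer_runner": ["infer"]},
    "phase": {"train": "dataset_train", "infer": "dataset_infer"},
}


def find_dangling_references(cfg):
    """Return messages for names that are used but never defined."""
    problems = []
    for runner_name in cfg["mode"]:
        if runner_name not in cfg["runner"]:
            problems.append("mode lists unknown runner: " + runner_name)
    for runner_name, phases in cfg["runner"].items():
        for phase_name in phases:
            if phase_name not in cfg["phase"]:
                problems.append(runner_name + " uses unknown phase: " + phase_name)
    for phase_name, dataset_name in cfg["phase"].items():
        if dataset_name not in cfg["dataset"]:
            problems.append(phase_name + " uses unknown dataset: " + dataset_name)
    return problems


problems = find_dangling_references(config)
print(problems)
```

A check like this catches renamed runners or datasets before a training run starts. It does not validate paths such as `data_path` or `init_model_path`.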