Merge pull request PaddlePaddle#788 from simuler/metaheac

Add Metaheac model
shironano · Jun 15, 2022 · 08c367f
2 parents d10fd27 + a8db48b
commit 08c367f
Showing 18 changed files with 1,251 additions and 0 deletions.
diff --git a/README_CN.md b/README_CN.md
@@ -182,6 +182,7 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml #  静态图训
   |  多任务  |                                 [Maml](models/multitask/maml/)([文档](https://paddlerec.readthedocs.io/en/latest/models/multitask/maml.html))                                 |  [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238412)  |      x      |     x     | >=2.1.0 | [PMLR 2017][Model-agnostic meta-learning for fast adaptation of deep networks](https://arxiv.org/pdf/1703.03400.pdf)                                                                                                               |
   |  多任务  |                         [DSelect_K](models/multitask/dselect_k/)([文档](https://paddlerec.readthedocs.io/en/latest/models/multitask/dselect_k.html))                          |  -  |      x      |     x     | >=2.1.0 | [NeurIPS 2021][DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning](https://arxiv.org/pdf/2106.03760v3.pdf)                                                                                                               |
   |  多任务  |                         [ESCM2](models/multitask/escm2/)                          |  -  |      x      |     x     | >=2.1.0 | [SIGIR 2022][ESCM2: Entire Space Counterfactual Multi-Task Model for Post-Click Conversion Rate Estimation](https://arxiv.org/pdf/2204.05125.pdf)                                                                                                               |
+  |  多任务  |                         [MetaHeac](models/multitask/metaheac/)                          |  -  |      x      |     x     | >=2.1.0 | [KDD 2021][Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising](https://arxiv.org/pdf/2105.14688.pdf)                                                                                                               |
   |  重排序  |                                      [Listwise](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/rerank/listwise/)                                       |  -  |       ✓     |     x     | [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf)                                                                           |
 
 

diff --git a/README_EN.md b/README_EN.md
@@ -173,6 +173,7 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml #  Training wit
   |      Multi-Task       |           [Maml](models/multitask/maml/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/maml.html))           |  [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238412)  |    x      |     x     | >=2.1.0 | [PMLR 2017][Model-agnostic meta-learning for fast adaptation of deep networks](https://arxiv.org/pdf/1703.03400.pdf)                                                                                                               |
   |  Multi-Task  |           [DSelect_K](models/multitask/dselect_k/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/dselect_k.html))           |  -  |      x      |     x     | >=2.1.0 | [NeurIPS 2021][DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning](https://arxiv.org/pdf/2106.03760v3.pdf)                                                                                                               |
   |  Multi-Task  |                         [ESCM2](models/multitask/escm2/)                          |  -  |      x      |     x     | >=2.1.0 | [SIGIR 2022][ESCM2: Entire Space Counterfactual Multi-Task Model for Post-Click Conversion Rate Estimation](https://arxiv.org/pdf/2204.05125.pdf)                                                                                                               |
+  |  Multi-Task  |                         [MetaHeac](models/multitask/metaheac/)                          |  -  |      x      |     x     | >=2.1.0 | [KDD 2021][Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising](https://arxiv.org/pdf/2105.14688.pdf)                                                                                                               |
   |        Re-Rank        |                [Listwise](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/rerank/listwise/)                |  -  |         ✓         |     x     |  [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf)                                                                           |
 
 <h2 align="center">Community</h2>

diff --git a/contributor.md b/contributor.md
@@ -21,6 +21,7 @@
   |                     [MHCN](models/recall/mhcn/)                     |  [Andy1314Chen](https://github.com/Andy1314Chen)  |    https://github.com/PaddlePaddle/PaddleRec/pull/679   | 论文复现赛第五期 |
   |                     [DCN_V2](models/rank/dcn_v2/)                     |  [LinJayan](https://github.com/LinJayan)  |    https://github.com/PaddlePaddle/PaddleRec/pull/677   | 论文复现赛第五期 |
   |                     [SIGN](models/rank/sign/)                     |  [BamLubi](https://github.com/BamLubi)  |    https://github.com/PaddlePaddle/PaddleRec/pull/748   | 论文复现赛第六期 |
+  |                     [MetaHeac](models/multitask/metaheac/)                     |  [simuler](https://github.com/simuler)  |    https://github.com/PaddlePaddle/PaddleRec/pull/788   | 论文复现赛第六期 |
   |                     [FGCNN](models/rank/fgcnn/)                     |  [yoreG123 chenjiyan2001](https://github.com/yoreG123)  |    https://github.com/PaddlePaddle/PaddleRec/pull/784   | 论文复现赛第六期 |
 
 </div> 
diff --git a/datasets/Lookalike/run.sh b/datasets/Lookalike/run.sh
@@ -0,0 +1,15 @@
+
+wget https://paddlerec.bj.bcebos.com/datasets/lookalike/Lookalike_data.rar
+rar e Lookalike_data.rar
+
+mkdir train_data
+mkdir test_cold_data
+mkdir test_hot_data
+
+mv train_stage1.pkl train_data
+mv test_hot_stage1.pkl test_hot_data
+mv test_hot_stage2.pkl test_hot_data
+mv test_cold_stage1.pkl test_cold_data
+mv test_cold_stage2.pkl test_cold_data
+
+rm -rf Lookalike_data.rar
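The script above ends with three data directories holding five `.pkl` files. A minimal Python sketch for checking that layout after `run.sh` finishes is shown here; `EXPECTED` and `missing_files` are hypothetical names used only for illustration, and the file list is copied from the `mv` commands in the script.

```python
from pathlib import Path

# Expected layout after run.sh, taken from the mv commands in the script.
EXPECTED = {
    "train_data": ["train_stage1.pkl"],
    "test_hot_data": ["test_hot_stage1.pkl", "test_hot_stage2.pkl"],
    "test_cold_data": ["test_cold_stage1.pkl", "test_cold_stage2.pkl"],
}


def missing_files(root):
    """Return the expected files that are absent under root, as sorted relative paths."""
    root = Path(root)
    return sorted(
        f"{directory}/{name}"
        for directory, names in EXPECTED.items()
        for name in names
        if not (root / directory / name).is_file()
    )


if __name__ == "__main__":
    # Run from datasets/Lookalike after sh run.sh.
    missing = missing_files(".")
    print("all files present" if not missing else f"missing: {missing}")
```

An empty result means the archive was extracted and sorted as the script intends; any listed path usually points to a failed download or a missing `rar` binary.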
diff --git a/doc/imgs/metaheac.png b/doc/imgs/metaheac.png
diff --git a/doc/source/index.rst b/doc/source/index.rst
@@ -92,6 +92,7 @@
    models/multitask/ple.md
    models/multitask/share_bottom.md
    models/multitask/dselect_k.md
+   models/multitask/metaheac.md
 
 .. toctree::
    :maxdepth: 1

diff --git a/doc/source/models/multitask/metaheac.md b/doc/source/models/multitask/metaheac.md
@@ -0,0 +1,108 @@
+# MetaHeac
+
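+The directory structure of this example, with a brief description of each entry:
+
+```
+├── data # sample data
+    ├── train # training data
+        ├── train_stage1.pkl
+    ├── test # test data
+        ├── test_stage1.pkl
+        ├── test_stage2.pkl
+├── net.py # core model network
+├── config.yaml # config for the sample data
+├── config_big.yaml # config for the full dataset
+├── dygraph_model.py # builds the dynamic graph
+├── reader_train.py # reader for the training data
+├── reader_test.py # reader for the infer data
+├── readme.md # documentation
+```
+
+Note: before reading this example, we recommend getting familiar with the following:
+
+[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)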
+
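+## Contents
+
+- [Model introduction](#model-introduction)
+- [Data preparation](#data-preparation)
+- [Environment](#environment)
+- [Quick start](#quick-start)
+- [Model architecture](#model-architecture)
+- [Reproducing the results](#reproducing-the-results)
+- [Notes on infer](#notes-on-infer)
+- [Advanced usage](#advanced-usage)
+- [FAQ](#faq)
+
+## Model introduction
+On recommender systems and advertising platforms, marketers want to promote goods, content, or ads to potential users through media channels such as video or social networks. Audience expansion (look-alike modeling) is an effective solution, but it typically faces two challenges: (1) a company may run hundreds of marketing campaigns per day to promote content of entirely different categories; (2) the seed set of a campaign covers only a limited number of users, so a customized model trained on those few seed users tends to overfit severely. To address these challenges, the paper "Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising" proposes a new two-stage framework, Meta Hybrid Experts and Critics (MetaHeac), which uses meta-learning to train a generalized initialization model that can adapt quickly to promotion tasks for new content categories.
+
+## Data preparation
+This example uses the Tencent Look-alike Dataset, which contains several hundred seed audiences, the user features of a very large candidate population, and the ad features associated with each seed audience. For business data security, all data has been anonymized. This reproduction uses a processed version of the dataset, which can be downloaded directly: [preprocessed data](https://drive.google.com/file/d/11gXgf_yFLnbazjx24ZNb_Ry41MI5Ud1g/view?usp=sharing). The metaheac/data/ directory holds a small subset drawn from the full dataset, used for aligning the model.
+
+## Environment
+PaddlePaddle>=2.0
+
+python 2.7/3.5/3.6/3.7
+
+os : windows/linux/macos
+
+## Quick start
+Sample data is provided so you can try the model quickly. The commands can be run from any directory as long as the relative paths are adjusted; from the MetaHeac model directory they are:
+```bash
+# Enter the model directory
+# cd PaddleRec/models/multitask/metaheac/ # the relative paths below assume this directory
+# Dynamic-graph training
+python -u ../../../tools/trainer.py -m config.yaml # use config_big.yaml for the full dataset
+# Dynamic-graph inference
+python -u ./infer_meta.py -m config.yaml
+```
+
+## Model architecture
+MetaHeac was published at KDD 2021 in the paper [Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising](https://arxiv.org/pdf/2105.14688). The paper proposes a new two-stage framework, Meta Hybrid Experts and Critics (MetaHeac), which addresses two key problems: in real-world scenarios it is hard to build one generalized model that expands a high-quality audience candidate set across all content domains, and a customized model based on limited seed users easily overfits severely. The main network structure of the model is as follows:
+[MetaHeac](https://arxiv.org/pdf/2105.14688):
+<p align="center">
+<img align="center" src="../../../doc/imgs/metaheac.png">
+</p>
+
+## Reproducing the results
+To let users run every model quickly, sample data is provided under each model. To reproduce the results reported in this readme, follow the steps below in order.
+On the full dataset the model reaches the following metrics (with paddle.seed = 2021 set in train.py):
+
+| Model   | AUC    | batch_size | epoch_num | Time per epoch |
+|:------|:-------| :------ | :------| :------ |
+| MetaHeac | 0.7112 | 1024 | 1 | about 3 hours |
+
+1. Make sure your current directory is PaddleRec/models/multitask/metaheac  
+2. Go to the PaddleRec/datasets/Lookalike directory and run the script there. It downloads our preprocessed full Lookalike dataset from a server in mainland China and extracts it into the target folders.
+``` bash
+cd ../../../datasets/Lookalike
+sh run.sh
+```
+3. Switch back to the model directory and run on the full dataset
+```bash
+cd ../../models/multitask/metaheac/ # back to the model directory
+# Dynamic-graph training
+# step1: train
+python -u ../../../tools/trainer.py -m config_big.yaml
+# Dynamic-graph inference
+# step2: infer; at this point the test dataset is hot
+python -u ./infer_meta.py -m config_big.yaml
+# step3: change test_data_dir in config_big.yaml to the cold path, then run infer again
+# python -u ./infer_meta.py -m config_big.yaml
+```
+
+## Notes on infer
+### Dataset
+To test the model on content-targeted promotion tasks of different sizes, the dataset is split by the size of the candidate set given for each task into a part larger than T and a part smaller than T. For the Look-alike dataset of the Tencent Ads competition 2018, T is set to 4000: the hot dataset contains tasks whose candidate set is larger than T, and the cold dataset contains tasks whose candidate set is smaller than T.
+### infer_meta.py
+infer_meta.py is a tool for running infer on meta-learning models. Keep the following points in mind when using it:
+1. When running infer (the same also works for train), you can comment out runner.infer_batch_size. This disables the DataLoader's automatic batching, so a custom batching scheme can be used.
+2. Because meta-learning must first train on a small amount of task-specific data at infer time, the infer_dataloader in infer_meta.py receives the full infer dataset of a single sub-task (both training and test data) on each iteration.
+3. The actual batching happens in infer.py: once the data of a single sub-task is available, the batch_size parameter is read from the config, the training and test data are batched, and infer_train_forward and infer_forward in dygraph_model.py are called for training and testing respectively.
+4. Unlike ordinary infer, each sub-task needs a small amount of train and test, so every sub-task loads the generalized model produced in the train stage.
+5. During infer of a single sub-task, a local paddle.metric.Auc("ROC") is created so that the AUC of each sub-task can be inspected; the global metric maintains the AUC over all sub-tasks.
+
+## Advanced usage
+
+## FAQ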
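The infer notes in metaheac.md describe a hot/cold split by candidate-set size and a per-task loop that first adapts the generalized model and then evaluates it. A minimal Python sketch of that control flow follows, without any Paddle dependency. All names in it (`split_hot_cold`, `iter_batches`, `infer_one_task`, `load_model`, `train_step`, `test_step`) are hypothetical stand-ins, not the functions in `infer_meta.py` or `dygraph_model.py`; the default of two adaptation epochs is assumed to correspond to `infer_train_epoch: 2` in the config.

```python
def split_hot_cold(candidate_sizes, threshold=4000):
    """Split task ids into hot (size > threshold) and cold (size < threshold).

    The doc does not say where a size equal to the threshold goes,
    so this sketch leaves such a task in neither list.
    """
    hot = [task for task, size in candidate_sizes.items() if size > threshold]
    cold = [task for task, size in candidate_sizes.items() if size < threshold]
    return hot, cold


def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples (custom batching)."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def infer_one_task(load_model, train_samples, test_samples, batch_size,
                   train_step, test_step, train_epochs=2):
    """Adapt a fresh copy of the generalized model to one sub-task, then test it."""
    model = load_model()  # every sub-task starts from the train-stage model
    for _ in range(train_epochs):
        for batch in iter_batches(train_samples, batch_size):
            train_step(model, batch)  # stands in for infer_train_forward
    # stands in for infer_forward; one result per test batch
    return [test_step(model, batch) for batch in iter_batches(test_samples, batch_size)]
```

Reloading the model inside `infer_one_task` keeps the adaptation of one sub-task from leaking into the next, which is the behavior point 4 of the notes describes.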
diff --git a/models/multitask/metaheac/config.yaml b/models/multitask/metaheac/config.yaml
@@ -0,0 +1,49 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+runner:
+  train_data_dir: "./data/train"
+  train_reader_path: "reader_train" # importlib format
+  use_gpu: False
+  use_auc: True
+#   train_batch_size: 32
+  epochs: 1
+  print_interval: 1
+  model_save_path: "output_model_metaheac"
+  test_data_dir: "./data/test"
+#   infer_batch_size: 32
+  infer_reader_path: "reader_infer" # importlib format
+  infer_load_path: "output_model_metaheac"
+  infer_start_epoch: 0
+  infer_end_epoch: 1
+  #use inference save model
+  use_inference: False
+  infer_train_epoch: 2
+
+hyper_parameters:
+  max_idxs: [[3, 2, 855, 5, 7, 2, 1], [124, 82, 12, 263312, 49780, 10002, 9984], [78, 137, 14, 39,32,3]]
+  embed_dim: 64
+  mlp_dims: [64, 64]
+  local_lr: 0.0002
+  num_expert: 8
+  num_output: 5
+  task_count: 5
+  batch_size: 32
+
+  optimizer:
+    class: adam
+    global_learning_rate: 0.001
+    local_test_learning_rate: 0.001
+    strategy: async
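The `# importlib format` comments in this config suggest that `train_reader_path` and `infer_reader_path` are module names rather than file paths. A minimal sketch of how such a name could be resolved is given here, assuming the model directory is placed on `sys.path`; `load_reader_module` is a hypothetical helper and not PaddleRec's actual loader. Note that `infer_reader_path` names `reader_infer`, while the directory listing in metaheac.md mentions `reader_test.py`; the module name in the config has to match a file that actually exists in the model directory.

```python
import importlib
import sys


def load_reader_module(module_name, model_dir):
    """Import module_name (for example "reader_train") from model_dir."""
    if model_dir not in sys.path:
        sys.path.insert(0, model_dir)
    # Make sure files created after interpreter start are visible to the import system.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Usage (assumes the current directory is the model directory):
# reader = load_reader_module("reader_train", ".")
```

If the name in the config does not match a file, `importlib.import_module` raises `ModuleNotFoundError`, which makes a mismatch between config and file name easy to spot.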
diff --git a/models/multitask/metaheac/config_big.yaml b/models/multitask/metaheac/config_big.yaml
@@ -0,0 +1,50 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+runner:
+  train_data_dir: "../../../datasets/Lookalike/train_data"
+  train_reader_path: "reader_train" # importlib format
+  use_gpu: True
+  use_auc: True
+#   train_batch_size: 32
+  epochs: 1
+  print_interval: 100
+  model_save_path: "output_model_metaheac_all"
+  test_data_dir: "../../../datasets/Lookalike/test_hot_data"
+#   test_data_dir: "../../../datasets/Lookalike/test_cold_data"
+#   infer_batch_size: 32
+  infer_reader_path: "reader_infer" # importlib format
+  infer_load_path: "output_model_metaheac_all"
+  infer_start_epoch: 0
+  infer_end_epoch: 1
+  #use inference save model
+  use_inference: False
+  infer_train_epoch: 2
+
+hyper_parameters:
+  max_idxs: [[3, 2, 855, 5, 7, 2, 1], [124, 82, 12, 263312, 49780, 10002, 9984], [78, 137, 14, 39,32,3]]
+  embed_dim: 64
+  mlp_dims: [64, 64]
+  local_lr: 0.0002
+  num_expert: 8
+  num_output: 5
+  task_count: 5
+  batch_size: 1024
+
+  optimizer:
+    class: adam
+    global_learning_rate: 0.001
+    local_test_learning_rate: 0.001
+    strategy: async
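config_big.yaml differs from config.yaml in only a few keys, and switching the evaluation from hot to cold means changing `test_data_dir`. The sketch below makes both points explicit using plain dictionaries whose values are copied from the two files in this PR. The dictionaries are flattened and cover only a subset of keys, and `changed_keys` and `with_cold_test_set` are hypothetical helpers for illustration.

```python
# Values copied from config.yaml (sample data) in this PR; flattened subset of keys.
SAMPLE = {
    "train_data_dir": "./data/train",
    "test_data_dir": "./data/test",
    "use_gpu": False,
    "epochs": 1,
    "print_interval": 1,
    "model_save_path": "output_model_metaheac",
    "infer_load_path": "output_model_metaheac",
    "infer_train_epoch": 2,
    "embed_dim": 64,
    "batch_size": 32,
}

# Values copied from config_big.yaml (full dataset) in this PR.
BIG = {
    "train_data_dir": "../../../datasets/Lookalike/train_data",
    "test_data_dir": "../../../datasets/Lookalike/test_hot_data",
    "use_gpu": True,
    "epochs": 1,
    "print_interval": 100,
    "model_save_path": "output_model_metaheac_all",
    "infer_load_path": "output_model_metaheac_all",
    "infer_train_epoch": 2,
    "embed_dim": 64,
    "batch_size": 1024,
}


def changed_keys(a, b):
    """Map each key present in both configs with different values to the (a, b) pair."""
    return {key: (a[key], b[key]) for key in a if key in b and a[key] != b[key]}


def with_cold_test_set(config):
    """Return a copy of config that evaluates on the cold test set instead of hot."""
    updated = dict(config)
    updated["test_data_dir"] = "../../../datasets/Lookalike/test_cold_data"
    return updated


if __name__ == "__main__":
    for key, (small, big) in sorted(changed_keys(SAMPLE, BIG).items()):
        print(f"{key}: {small!r} -> {big!r}")
```

Returning a copy from `with_cold_test_set` keeps the hot configuration intact, so both evaluations can be run from the same base settings.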
diff --git a/models/multitask/metaheac/data/test/test_stage1.pkl b/models/multitask/metaheac/data/test/test_stage1.pkl
diff --git a/models/multitask/metaheac/data/test/test_stage2.pkl b/models/multitask/metaheac/data/test/test_stage2.pkl
diff --git a/models/multitask/metaheac/data/train/train_stage1.pkl b/models/multitask/metaheac/data/train/train_stage1.pkl