diff --git a/README_CN.md b/README_CN.md index 168a81b96..e74fdb8f4 100644 --- a/README_CN.md +++ b/README_CN.md @@ -182,6 +182,7 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # 静态图训 | 多任务 | [Maml](models/multitask/maml/)([文档](https://paddlerec.readthedocs.io/en/latest/models/multitask/maml.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238412) | x | x | >=2.1.0 | [PMLR 2017][Model-agnostic meta-learning for fast adaptation of deep networks](https://arxiv.org/pdf/1703.03400.pdf) | | 多任务 | [DSelect_K](models/multitask/dselect_k/)([文档](https://paddlerec.readthedocs.io/en/latest/models/multitask/dselect_k.html)) | - | x | x | >=2.1.0 | [NeurIPS 2021][DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning](https://arxiv.org/pdf/2106.03760v3.pdf) | | 多任务 | [ESCM2](models/multitask/escm2/) | - | x | x | >=2.1.0 | [SIGIR 2022][ESCM2: Entire Space Counterfactual Multi-Task Model for Post-Click Conversion Rate Estimation](https://arxiv.org/pdf/2204.05125.pdf) | + | 多任务 | [MetaHeac](models/multitask/metaheac/) | - | x | x | >=2.1.0 | [KDD 2021][Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising](https://arxiv.org/pdf/2105.14688.pdf) | | 重排序 | [Listwise](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/rerank/listwise/) | - | ✓ | x | [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf) | diff --git a/README_EN.md b/README_EN.md index 5cc8b4b1e..543f452b3 100644 --- a/README_EN.md +++ b/README_EN.md @@ -173,6 +173,7 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # Training wit | Multi-Task | [Maml](models/multitask/maml/)
([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/maml.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238412) | x | x | >=2.1.0 | [PMLR 2017][Model-agnostic meta-learning for fast adaptation of deep networks](https://arxiv.org/pdf/1703.03400.pdf) | | Multi-Task | [DSelect_K](models/multitask/dselect_k/)
([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/dselect_k.html)) | - | x | x | >=2.1.0 | [NeurIPS 2021][DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning](https://arxiv.org/pdf/2106.03760v3.pdf) | | Multi-Task | [ESCM2](models/multitask/escm2/) | - | x | x | >=2.1.0 | [SIGIR 2022][ESCM2: Entire Space Counterfactual Multi-Task Model for Post-Click Conversion Rate Estimation](https://arxiv.org/pdf/2204.05125.pdf) | + | Multi-Task | [MetaHeac](models/multitask/metaheac/) | - | x | x | >=2.1.0 | [KDD 2021][Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising](https://arxiv.org/pdf/2105.14688.pdf) | | Re-Rank | [Listwise](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/rerank/listwise/) | - | ✓ | x | [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf) |

Community

diff --git a/contributor.md b/contributor.md index 12634cf25..287c0c7e4 100644 --- a/contributor.md +++ b/contributor.md @@ -21,6 +21,7 @@ | [MHCN](models/recall/mhcn/) | [Andy1314Chen](https://github.com/Andy1314Chen) | https://github.com/PaddlePaddle/PaddleRec/pull/679 | 论文复现赛第五期 | | [DCN_V2](models/rank/dcn_v2/) | [LinJayan](https://github.com/LinJayan) | https://github.com/PaddlePaddle/PaddleRec/pull/677 | 论文复现赛第五期 | | [SIGN](models/rank/sign/) | [BamLubi](https://github.com/BamLubi) | https://github.com/PaddlePaddle/PaddleRec/pull/748 | 论文复现赛第六期 | + | [MetaHeac](models/multitask/metaheac/) | [simuler](https://github.com/simuler) | https://github.com/PaddlePaddle/PaddleRec/pull/788 | 论文复现赛第六期 | | [FGCNN](models/rank/fgcnn/) | [yoreG123 chenjiyan2001](https://github.com/yoreG123) | https://github.com/PaddlePaddle/PaddleRec/pull/784 | 论文复现赛第六期 | diff --git a/datasets/Lookalike/run.sh b/datasets/Lookalike/run.sh new file mode 100644 index 000000000..2b9533949 --- /dev/null +++ b/datasets/Lookalike/run.sh @@ -0,0 +1,15 @@ + +wget https://paddlerec.bj.bcebos.com/datasets/lookalike/Lookalike_data.rar +rar e Lookalike_data.rar + +mkdir train_data +mkdir test_cold_data +mkdir test_hot_data + +mv train_stage1.pkl train_data +mv test_hot_stage1.pkl test_hot_data +mv test_hot_stage2.pkl test_hot_data +mv test_cold_stage1.pkl test_cold_data +mv test_cold_stage2.pkl test_cold_data + +rm -rf Lookalike_data.rar diff --git a/doc/imgs/metaheac.png b/doc/imgs/metaheac.png new file mode 100644 index 000000000..fd9eaf3ec Binary files /dev/null and b/doc/imgs/metaheac.png differ diff --git a/doc/source/index.rst b/doc/source/index.rst index 6fe9e83c0..250a5cee9 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -92,6 +92,7 @@ models/multitask/ple.md models/multitask/share_bottom.md models/multitask/dselect_k.md + models/multitask/metaheac.md .. 
toctree:: :maxdepth: 1 diff --git a/doc/source/models/multitask/metaheac.md b/doc/source/models/multitask/metaheac.md new file mode 100644 index 000000000..e433c95db --- /dev/null +++ b/doc/source/models/multitask/metaheac.md @@ -0,0 +1,108 @@ +# MetaHeac + +以下是本例的简要目录结构及说明: + +``` +├── data #样例数据 + ├── train #训练数据 + ├── train_stage1.pkl + ├── test #测试数据 + ├── test_stage1.pkl + ├── test_stage2.pkl +├── net.py # 核心模型组网 +├── config.yaml # sample数据配置 +├── config_big.yaml # 全量数据配置 +├── dygraph_model.py # 构建动态图 +├── reader_train.py # 训练数据读取程序 +├── reader_infer.py # infer数据读取程序 +├── readme.md #文档 +``` + +注:在阅读该示例前,建议您先了解以下内容: + +[paddlerec入门教程](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md) + +## 内容 + +- [模型简介](#模型简介) +- [数据准备](#数据准备) +- [运行环境](#运行环境) +- [快速开始](#快速开始) +- [模型组网](#模型组网) +- [效果复现](#效果复现) +- [infer说明](#infer说明) +- [进阶使用](#进阶使用) +- [FAQ](#FAQ) + +## 模型简介 +在推荐系统和广告平台上,营销人员总是希望通过视频或者社交等媒体渠道向潜在用户推广商品、内容或者广告。扩充候选集技术(Look-alike建模)是一种很有效的解决方案,但look-alike建模通常面临两个挑战:(1)一家公司每天可以开展数百场营销活动,以推广完全不同类别的各种内容;(2)某项活动的种子集只能覆盖有限的用户,因此一个基于有限种子用户的定制化模型往往会产生严重的过拟合。为了解决以上的挑战,论文《Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising》提出了一种新的两阶段框架Meta Hybrid Experts and Critics (MetaHeac),采用元学习的方法训练一个泛化初始化模型,从而能够快速适应新类别内容推广任务。 + +## 数据准备 +使用Tencent Look-alike Dataset,该数据集包含几百个种子人群、海量候选人群对应的用户特征,以及种子人群对应的广告特征。出于业务数据安全保证的考虑,所有数据均为脱敏处理后的数据。本次复现使用处理过的数据集,可直接下载[preprocessed data](https://drive.google.com/file/d/11gXgf_yFLnbazjx24ZNb_Ry41MI5Ud1g/view?usp=sharing);metaheac/data/目录下存放了从全量数据集获取的少量数据集,用于对齐模型。 + +## 运行环境 +PaddlePaddle>=2.0 + +python 2.7/3.5/3.6/3.7 + +os : windows/linux/macos + +## 快速开始 +本文提供了样例数据可以供您快速体验,在任意目录下均可执行。在MetaHeac模型目录的快速执行命令如下: +```bash +# 进入模型目录 +# cd PaddleRec/models/multitask/metaheac/ # 在任意目录均可运行 +# 动态图训练 +python -u ../../../tools/trainer.py -m config.yaml # 全量数据运行config_big.yaml +# 动态图预测 +python -u ./infer_meta.py -m config.yaml +``` + +## 模型组网 +MetaHeac是发表在 KDD 2021 的论文[《Learning to Expand Audience
via Meta Hybrid Experts and Critics for Recommendation and Advertising》](https://arxiv.org/pdf/2105.14688)。文章提出一种新的两阶段框架Meta Hybrid Experts and Critics (MetaHeac),有效解决了真实场景中的两个关键问题:一是难以构建能在所有内容领域中扩充高质量受众候选集的泛化模型,二是基于有限种子用户的定制化模型容易产生严重过拟合。模型的主要组网结构如下: +[MetaHeac](https://arxiv.org/pdf/2105.14688): +

+(模型结构图见本仓库 doc/imgs/metaheac.png)

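train 阶段(后文 dygraph_model.py 的 train_forward)采用的是 MAML 式的两层更新:每个子任务先在 support 集上用 local_lr 得到临时的 fast weight,再用各任务 query 集损失的平均梯度更新全局初始化参数。下面用 numpy 在"线性模型 + MSE"的玩具设定上给出一阶近似的最小示意(函数名与设定均为示例假设,仅说明流程,并非本仓库实现):

```python
import numpy as np

def meta_train_step(w, tasks, local_lr=0.01, global_lr=0.1):
    # 一步元训练的示意(一阶近似): tasks 中每项为 (x_spt, y_spt, x_qry, y_qry)
    grad_sum = np.zeros_like(w)
    query_losses = []
    for x_spt, y_spt, x_qry, y_qry in tasks:
        # local update: 在 support 集上为该任务计算临时的 fast weight
        grad_spt = 2.0 * x_spt.T @ (x_spt @ w - y_spt) / len(y_spt)
        w_fast = w - local_lr * grad_spt
        # 用 fast weight 在 query 集上计算损失
        err_qry = x_qry @ w_fast - y_qry
        query_losses.append(float(np.mean(err_qry ** 2)))
        # 一阶近似: 将 query 损失对 w_fast 的梯度视作对全局 w 的梯度
        grad_sum += 2.0 * x_qry.T @ err_qry / len(y_qry)
    # global update: 聚合所有子任务的 query 梯度, 更新全局初始化参数
    w = w - global_lr * grad_sum / len(tasks)
    return w, float(np.mean(query_losses))
```

真实实现中,Paddle 版本通过给每个参数挂 weight.fast 字段实现 fast weight,损失为 BCELoss,全局更新由优化器完成。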
+ +## 效果复现 +为了方便使用者能够快速地跑通每一个模型,我们在每个模型下都提供了样例数据。如果需要复现readme中的效果,请按如下步骤依次操作即可。 +在全量数据下模型的指标如下(train.py文件内 paddle.seed = 2021下效果): + +| 模型 | auc | batch_size | epoch_num | Time of each epoch | +|:------|:-------| :------ | :------| :------ | +| MetaHeac | 0.7112 | 1024 | 1 | 3个小时左右 | + +1. 确认您当前所在目录为PaddleRec/models/multitask/metaheac +2. 进入paddlerec/datasets/目录下,执行该脚本,会从国内源的服务器上下载我们预处理完成的Lookalike全量数据集,并解压到指定文件夹。 +```bash +cd ../../../datasets/Lookalike +sh run.sh +``` +3. 切回模型目录,执行命令运行全量数据 +```bash +cd ../../models/multitask/metaheac/ # 切回模型目录 +# 动态图训练 +# step1: train +python -u ../../../tools/trainer.py -m config_big.yaml +# 动态图预测 +# step2: infer 此时test数据集为hot +python -u ./infer_meta.py -m config_big.yaml +# step3: 修改config_big.yaml文件中test_data_dir的路径为cold后再次执行 +# python -u ./infer_meta.py -m config_big.yaml +``` + +## infer说明 +### 数据集说明 +为了测试模型在不同规模的内容定向推广任务上的表现,将数据集根据内容定向推广任务给定的候选集大小进行了划分,分为大于T和小于T两部分。将腾讯广告大赛2018的Look-alike数据集中的T设置为4000,其中hot数据集中候选集大于T,cold数据集中候选集小于T。 +### infer_meta.py说明 +infer_meta.py是用于元学习模型infer的tool,在使用中主要有以下几点需要注意: +1. 在对模型进行infer时(train时也可使用这样的操作),可以将runner.infer_batch_size注释掉,这样将禁用DataLoader的自动组batch功能,进而可以使用自定义的组batch方式。 +2. 由于元学习在infer时需要先对特定任务的少量数据集进行训练,因此在infer_meta.py的infer_dataloader中每次接收单个子任务的全量infer数据集(包括训练数据和测试数据)。 +3. 实际组batch在infer_meta.py中进行,在获取到单个子任务的数据后,获取config中的batch_size参数,对训练数据和测试数据进行组batch,并分别调用dygraph_model.py中的infer_train_forward和infer_forward进行训练和测试。 +4. 和普通infer不同,由于需要对单个子任务进行少量数据的train和test,对于每个子任务来说加载的都是train阶段训练好的泛化模型。 +5. 在对单个子任务infer时,创建了局部的paddle.metric.Auc("ROC"),可以查看每个子任务的AUC指标,在全局metric中维护包含所有子任务的AUC指标。 + +## 进阶使用 + +## FAQ diff --git a/models/multitask/metaheac/config.yaml b/models/multitask/metaheac/config.yaml new file mode 100644 index 000000000..a6c2eda52 --- /dev/null +++ b/models/multitask/metaheac/config.yaml @@ -0,0 +1,49 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +runner: + train_data_dir: "./data/train" + train_reader_path: "reader_train" # importlib format + use_gpu: False + use_auc: True +# train_batch_size: 32 + epochs: 1 + print_interval: 1 + model_save_path: "output_model_metaheac" + test_data_dir: "./data/test" +# infer_batch_size: 32 + infer_reader_path: "reader_infer" # importlib format + infer_load_path: "output_model_metaheac" + infer_start_epoch: 0 + infer_end_epoch: 1 + #use inference save model + use_inference: False + infer_train_epoch: 2 + +hyper_parameters: + max_idxs: [[3, 2, 855, 5, 7, 2, 1], [124, 82, 12, 263312, 49780, 10002, 9984], [78, 137, 14, 39,32,3]] + embed_dim: 64 + mlp_dims: [64, 64] + local_lr: 0.0002 + num_expert: 8 + num_output: 5 + task_count: 5 + batch_size: 32 + + optimizer: + class: adam + global_learning_rate: 0.001 + local_test_learning_rate: 0.001 + strategy: async diff --git a/models/multitask/metaheac/config_big.yaml b/models/multitask/metaheac/config_big.yaml new file mode 100644 index 000000000..6cd845445 --- /dev/null +++ b/models/multitask/metaheac/config_big.yaml @@ -0,0 +1,50 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +runner: + train_data_dir: "../../../datasets/Lookalike/train_data" + train_reader_path: "reader_train" # importlib format + use_gpu: True + use_auc: True +# train_batch_size: 32 + epochs: 1 + print_interval: 100 + model_save_path: "output_model_metaheac_all" + test_data_dir: "../../../datasets/Lookalike/test_hot_data" +# test_data_dir: "../../../datasets/Lookalike/test_cold_data" +# infer_batch_size: 32 + infer_reader_path: "reader_infer" # importlib format + infer_load_path: "output_model_metaheac_all" + infer_start_epoch: 0 + infer_end_epoch: 1 + #use inference save model + use_inference: False + infer_train_epoch: 2 + +hyper_parameters: + max_idxs: [[3, 2, 855, 5, 7, 2, 1], [124, 82, 12, 263312, 49780, 10002, 9984], [78, 137, 14, 39,32,3]] + embed_dim: 64 + mlp_dims: [64, 64] + local_lr: 0.0002 + num_expert: 8 + num_output: 5 + task_count: 5 + batch_size: 1024 + + optimizer: + class: adam + global_learning_rate: 0.001 + local_test_learning_rate: 0.001 + strategy: async diff --git a/models/multitask/metaheac/data/test/test_stage1.pkl b/models/multitask/metaheac/data/test/test_stage1.pkl new file mode 100644 index 000000000..4c272ee33 Binary files /dev/null and b/models/multitask/metaheac/data/test/test_stage1.pkl differ diff --git a/models/multitask/metaheac/data/test/test_stage2.pkl b/models/multitask/metaheac/data/test/test_stage2.pkl new file mode 100644 index 000000000..4c272ee33 Binary files /dev/null and b/models/multitask/metaheac/data/test/test_stage2.pkl differ diff --git a/models/multitask/metaheac/data/train/train_stage1.pkl 
b/models/multitask/metaheac/data/train/train_stage1.pkl new file mode 100644 index 000000000..4c272ee33 Binary files /dev/null and b/models/multitask/metaheac/data/train/train_stage1.pkl differ diff --git a/models/multitask/metaheac/dygraph_model.py b/models/multitask/metaheac/dygraph_model.py new file mode 100644 index 000000000..9d705fec4 --- /dev/null +++ b/models/multitask/metaheac/dygraph_model.py @@ -0,0 +1,151 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import math +import numpy as np +import pickle +import net + + +class DygraphModel(): + # define model + def create_model(self, config): + max_idxs = config.get("hyper_parameters.max_idxs") + embed_dim = config.get("hyper_parameters.embed_dim") + mlp_dims = config.get("hyper_parameters.mlp_dims") + + num_expert = config.get("hyper_parameters.num_expert") + num_output = config.get("hyper_parameters.num_output") + + meta_model = net.WideAndDeepModel(max_idxs, embed_dim, mlp_dims, + num_expert, num_output) + + return meta_model + + # define feeds which convert numpy of batch data to paddle.tensor + def create_feeds(self, batch_data, config): + x_spt = batch_data[0] + y_spt = batch_data[1] + + x_qry = batch_data[2] + y_qry = batch_data[3] + return x_spt, y_spt, x_qry, y_qry + + # define loss function by predicts and label + def create_loss(self, pred, y_label): + + loss_ctr = paddle.nn.functional.log_loss( + input=pred, label=paddle.cast( + y_label, dtype="float32")) + return loss_ctr + + # define optimizer + def create_optimizer(self, dy_model, config, mode="train"): + if mode == "train": + lr = config.get("hyper_parameters.optimizer.global_learning_rate", + 0.001) + optimizer = paddle.optimizer.Adam( + learning_rate=lr, parameters=dy_model.parameters()) + else: + lr = config.get( + "hyper_parameters.optimizer.local_test_learning_rate", 0.001) + optimizer = paddle.optimizer.Adam( + learning_rate=lr, parameters=dy_model.parameters()) + return optimizer + + # define metrics such as auc/acc + # multi-task need to define multi metric + def create_metrics(self): + metrics_list_name = ["AUC"] + auc_ctr_metric = paddle.metric.Auc("ROC") + metrics_list = [auc_ctr_metric] + return metrics_list, metrics_list_name + + # construct train forward phase + def train_forward(self, dy_model, metric_list, batch, config): + # x_spt.shape = x_qry.shape = [task_count,batchsize,7+50+7+6] + # y_spt.shape = 
y_qry.shape = [task_count,batchsize,1] + x_spt, y_spt, x_qry, y_qry = self.create_feeds(batch, config) + + task_count = config.get("hyper_parameters.task_count", 5) + local_lr = config.get("hyper_parameters.local_lr", 0.0002) + criterion = paddle.nn.BCELoss() + + losses_q = [] + dy_model.clear_gradients() + for i in range(task_count): + ## local update -------------- + fast_parameters = list(dy_model.parameters()) + for weight in fast_parameters: + weight.fast = None + + support_set_y_pred = dy_model(x_spt[i]) + label = paddle.squeeze(y_spt[i].astype('float32')) + + loss = criterion(support_set_y_pred, label) + dy_model.clear_gradients() + loss.backward() + + fast_parameters = list(dy_model.parameters()) + for weight in fast_parameters: + if weight.grad is None: + continue + if weight.fast is None: + weight.fast = weight - local_lr * weight.grad # create weight.fast + else: + weight.fast = weight.fast - local_lr * weight.grad + dy_model.clear_gradients() + ## local update -------------- + + query_set_y_pred = dy_model(x_qry[i]) + label = paddle.squeeze(y_qry[i].astype('float32')) + loss_q = criterion(query_set_y_pred, label) + losses_q.append(loss_q) # Save the loss on the subtask dataset + + pred = paddle.unsqueeze(query_set_y_pred, 1) + pred = paddle.concat([1 - pred, pred], 1) + metric_list[0].update(preds=pred.numpy(), labels=label.numpy()) + + loss_average = paddle.stack(losses_q).mean(0) + print_dict = {'loss': loss_average} + + return loss_average, metric_list, print_dict + + def infer_train_forward(self, dy_model, batch, config): + batch_x, batch_y = batch[0], batch[1] + criterion = paddle.nn.BCELoss() + + pred = dy_model.forward(batch_x) + + label = paddle.squeeze(batch_y.astype('float32')) + loss_q = criterion(pred, label) + + return loss_q + + def infer_forward(self, dy_model, metric_list, metric_list_local, batch, + config): + batch_x, batch_y = batch[0], batch[1] + pred = dy_model.forward(batch_x) + label = paddle.squeeze(batch_y.astype('float32')) + + 
pred = paddle.unsqueeze(pred, 1) + pred = paddle.concat([1 - pred, pred], 1) + + metric_list[0].update(preds=pred.numpy(), labels=label.numpy()) + metric_list_local[0].update(preds=pred.numpy(), labels=label.numpy()) + + return metric_list, metric_list_local diff --git a/models/multitask/metaheac/infer_meta.py b/models/multitask/metaheac/infer_meta.py new file mode 100644 index 000000000..ad865842c --- /dev/null +++ b/models/multitask/metaheac/infer_meta.py @@ -0,0 +1,243 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import paddle +import os +import time +import logging +import sys +import numpy as np + +__dir__ = os.path.dirname(os.path.abspath(__file__)) +print(os.path.abspath('/'.join(__dir__.split('/')[:-3]))) +sys.path.append(os.path.abspath(os.path.join(__dir__, '..'))) +sys.path.append(os.path.abspath('/'.join(__dir__.split('/')[:-3]))) + +from tools.utils.utils_single import load_yaml, load_dy_model_class, get_abs_model, create_data_loader +from tools.utils.save_load import save_model, load_model +from paddle.io import DataLoader +import argparse + +logging.basicConfig( + format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser(description='paddle-rec run') + parser.add_argument("-m", "--config_yaml", type=str) + parser.add_argument("-o", "--opt", nargs='*', type=str) + args = parser.parse_args() + args.abs_dir = os.path.dirname(os.path.abspath(args.config_yaml)) + args.config_yaml = get_abs_model(args.config_yaml) + return args + + +def main(args): + paddle.seed(2021) + # load config + config = load_yaml(args.config_yaml) + dy_model_class = load_dy_model_class(args.abs_dir) + config["config_abs_dir"] = args.abs_dir + # modify config from command + if args.opt: + for parameter in args.opt: + parameter = parameter.strip() + key, value = parameter.split("=") + if type(config.get(key)) is int: + value = int(value) + if type(config.get(key)) is float: + value = float(value) + if type(config.get(key)) is bool: + value = (True if value.lower() == "true" else False) + config[key] = value + + # tools.vars + use_gpu = config.get("runner.use_gpu", True) + use_xpu = config.get("runner.use_xpu", False) + use_npu = config.get("runner.use_npu", False) + use_visual = config.get("runner.use_visual", False) + test_data_dir = config.get("runner.test_data_dir", None) + print_interval = config.get("runner.print_interval", None) + infer_batch_size = 
config.get("runner.infer_batch_size", None) + model_load_path = config.get("runner.infer_load_path", "model_output") + start_epoch = config.get("runner.infer_start_epoch", 0) + end_epoch = config.get("runner.infer_end_epoch", 10) + infer_train_epoch = config.get("runner.infer_train_epoch", 2) + batchsize = config.get("hyper_parameters.batch_size", 32) + + logger.info("**************common.configs**********") + logger.info( + "use_gpu: {}, use_xpu: {}, use_npu: {}, use_visual: {}, infer_batch_size: {}, test_data_dir: {}, start_epoch: {}, end_epoch: {}, print_interval: {}, model_load_path: {}". + format(use_gpu, use_xpu, use_npu, use_visual, infer_batch_size, + test_data_dir, start_epoch, end_epoch, print_interval, + model_load_path)) + logger.info("**************common.configs**********") + + if use_xpu: + xpu_device = 'xpu:{0}'.format(os.getenv('FLAGS_selected_xpus', 0)) + place = paddle.set_device(xpu_device) + elif use_npu: + npu_device = 'npu:{0}'.format(os.getenv('FLAGS_selected_npus', 0)) + place = paddle.set_device(npu_device) + else: + place = paddle.set_device('gpu' if use_gpu else 'cpu') + + dy_model = dy_model_class.create_model(config) + + # Create a log_visual object and store the data in the path + if use_visual: + from visualdl import LogWriter + log_visual = LogWriter(args.abs_dir + "/visualDL_log/infer") + + # to do : add optimizer function + #optimizer = dy_model_class.create_optimizer(dy_model, config) + + logger.info("read data") + infer_dataloader = create_data_loader( + config=config, place=place, mode="test") + + epoch_begin = time.time() + interval_begin = time.time() + + metric_list, metric_list_name = dy_model_class.create_metrics() + step_num = 0 + print_interval = 1 + + for epoch_id in range(start_epoch, end_epoch): + logger.info("load model epoch {}".format(epoch_id)) + model_path = os.path.join(model_load_path, str(epoch_id)) + + infer_reader_cost = 0.0 + infer_run_cost = 0.0 + reader_start = time.time() + + assert any(infer_dataloader( 
+ )), "test_dataloader is null, please ensure batch size < dataset size!" + + aid_flag = -1 + + for batch_id, batch in enumerate(infer_dataloader()): + infer_reader_cost += time.time() - reader_start + infer_start = time.time() + + aid_flag = batch[0][0].item() + x_spt, y_spt, x_qry, y_qry = batch[1], batch[2], batch[3], batch[4] + + load_model(model_path, dy_model) + # 对每个子任务进行训练 + optimizer = dy_model_class.create_optimizer(dy_model, config, + "infer") + dy_model.train() + + for i in range(infer_train_epoch): + n_samples = y_spt.shape[0] + n_batch = int(np.ceil(n_samples / batchsize)) + optimizer.clear_grad() + + for i_batch in range(n_batch): + batch_input = list() + batch_x = [] + batch_x.append(x_spt[0][i_batch * batchsize:(i_batch + 1) * + batchsize]) + batch_x.append(x_spt[1][i_batch * batchsize:(i_batch + 1) * + batchsize]) + batch_x.append(x_spt[2][i_batch * batchsize:(i_batch + 1) * + batchsize]) + batch_x.append(x_spt[3][i_batch * batchsize:(i_batch + 1) * + batchsize]) + + batch_y = y_spt[i_batch * batchsize:(i_batch + 1) * + batchsize] + + batch_input.append(batch_x) + batch_input.append(batch_y) + + loss = dy_model_class.infer_train_forward( + dy_model, batch_input, config) + + dy_model.clear_gradients() + loss.backward() + optimizer.step() + # 对每个子任务进行测试 + dy_model.eval() + metric_list_local, metric_list_local_name = dy_model_class.create_metrics( + ) + with paddle.no_grad(): + n_samples = y_qry.shape[0] + n_batch = int(np.ceil(n_samples / batchsize)) + + for i_batch in range(n_batch): + batch_input = list() + batch_x = [] + batch_x.append(x_qry[0][i_batch * batchsize:(i_batch + 1) * + batchsize]) + batch_x.append(x_qry[1][i_batch * batchsize:(i_batch + 1) * + batchsize]) + batch_x.append(x_qry[2][i_batch * batchsize:(i_batch + 1) * + batchsize]) + batch_x.append(x_qry[3][i_batch * batchsize:(i_batch + 1) * + batchsize]) + + batch_y = y_qry[i_batch * batchsize:(i_batch + 1) * + batchsize] + + batch_input.append(batch_x) + batch_input.append(batch_y) 
+ + metric_list, metric_list_local = dy_model_class.infer_forward( + dy_model, metric_list, metric_list_local, batch_input, + config) + + infer_run_cost += time.time() - infer_start + + metric_str_local = "" + for metric_id in range(len(metric_list_local_name)): + metric_str_local += ( + metric_list_local_name[metric_id] + ": {:.6f},".format( + metric_list_local[metric_id].accumulate())) + if use_visual: + log_visual.add_scalar( + tag="infer/" + metric_list_local_name[metric_id], + step=step_num, + value=metric_list_local[metric_id].accumulate()) + logger.info( + "epoch: {}, batch_id: {}, aid: {} ".format( + epoch_id, batch_id, aid_flag) + metric_str_local + + " avg_reader_cost: {:.5f} sec, avg_batch_cost: {:.5f} sec, avg_samples: {:.5f}, ips: {:.2f} ins/s". + format(infer_reader_cost / print_interval, ( + infer_reader_cost + infer_run_cost) / print_interval, + batchsize, print_interval * batchsize / (time.time( + ) - interval_begin))) + + interval_begin = time.time() + infer_reader_cost = 0.0 + infer_run_cost = 0.0 + step_num = step_num + 1 + reader_start = time.time() + + metric_str = "" + for metric_id in range(len(metric_list_name)): + metric_str += ( + metric_list_name[metric_id] + + ": {:.6f},".format(metric_list[metric_id].accumulate())) + + logger.info("epoch: {} done, ".format(epoch_id) + metric_str + + " epoch time: {:.2f} s".format(time.time() - epoch_begin)) + epoch_begin = time.time() + + +if __name__ == '__main__': + args = parse_args() + main(args) diff --git a/models/multitask/metaheac/net.py b/models/multitask/metaheac/net.py new file mode 100644 index 000000000..10260b943 --- /dev/null +++ b/models/multitask/metaheac/net.py @@ -0,0 +1,221 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class Meta_Linear(nn.Linear): + #used in MAML to forward input with fast weight + def __init__(self, in_features, out_features): + super(Meta_Linear, self).__init__(in_features, out_features) + self.weight.fast = None + self.bias.fast = None + + def forward(self, x): + if self.weight.fast is not None and self.bias.fast is not None: + #weight.fast (fast weight) is the temporaily adapted weight + out = F.linear(x, self.weight.fast, self.bias.fast) + else: + out = super(Meta_Linear, self).forward(x) + return out + + +class Meta_Embedding(nn.Embedding): + #used in MAML to forward input with fast weight + def __init__(self, num_embedding, embedding_dim): + super(Meta_Embedding, self).__init__(num_embedding, embedding_dim) + self.weight.fast = None + + def forward(self, x): + if self.weight.fast is not None: + out = F.embedding( + x.astype('int64'), self.weight.fast, self._padding_idx, + self._sparse) + else: + out = F.embedding( + x.astype('int64'), self.weight, self._padding_idx, + self._sparse) + return out + + +class Emb(nn.Layer): + def __init__(self, max_idxs, embedding_size=4): + super(Emb, self).__init__() + self.static_emb = StEmb(max_idxs[0], embedding_size) + self.ad_emb = StEmb(max_idxs[2], embedding_size) + self.dynamic_emb = DyEmb(max_idxs[1], embedding_size) + + def forward(self, x): + static_emb = self.static_emb(x[0]) + dynamic_emb = self.dynamic_emb(x[1], x[2]) + concat_embeddings = paddle.concat([static_emb, dynamic_emb], 1) + ad_emb = self.ad_emb(x[3]) + + return 
concat_embeddings, ad_emb + + +class DyEmb(nn.Layer): + def __init__(self, max_idxs, embedding_size=4): + super(DyEmb, self).__init__() + self.max_idxs = max_idxs + self.embedding_size = embedding_size + + self.embeddings = nn.LayerList([ + Meta_Embedding(max_idxs + 1, self.embedding_size) + for max_idxs in self.max_idxs + ]) + + def masked_fill(self, x, mask, value): + y = paddle.full(x.shape, value, x.dtype) + return paddle.where(mask, y, x) + + def forward(self, dynamic_ids, dynamic_lengths): + concat_embeddings = [] + batch_size = dynamic_lengths.shape[0] + + dynamic_list = list() + dynamic_list.append(dynamic_ids[:, 0:10]) + dynamic_list.append(dynamic_ids[:, 10:20]) + dynamic_list.append(dynamic_ids[:, 20:30]) + dynamic_list.append(dynamic_ids[:, 30:35]) + dynamic_list.append(dynamic_ids[:, 35:40]) + dynamic_list.append(dynamic_ids[:, 40:45]) + dynamic_list.append(dynamic_ids[:, 45:50]) + + for i in range(len(self.max_idxs)): + # B*M + dynamic_ids_tensor = dynamic_list[i] + dynamic_lengths_tensor = dynamic_lengths[:, i].astype(float) + # embedding layer B*M*E + dynamic_embeddings_tensor = self.embeddings[i](dynamic_ids_tensor) + # average B*M*E --AVG--> B*E + dynamic_lengths_tensor = dynamic_lengths_tensor.unsqueeze(1) + mask = (paddle.arange( + paddle.shape(dynamic_embeddings_tensor).item(1)).unsqueeze(0) + .astype(float) < dynamic_lengths_tensor.unsqueeze(1)) + mask = mask.squeeze(1).unsqueeze(2) + + dynamic_embedding = self.masked_fill(dynamic_embeddings_tensor, + mask == 0, 0) + + dynamic_lengths_tensor[dynamic_lengths_tensor == 0] = 1 + + dynamic_embedding = ( + dynamic_embedding.sum(axis=1) / + dynamic_lengths_tensor.astype('float32')).unsqueeze(1) + concat_embeddings.append( + paddle.reshape(dynamic_embedding, + [batch_size, 1, self.embedding_size])) + # B*F*E + concat_embeddings = paddle.concat(concat_embeddings, 1) + return concat_embeddings + + +class StEmb(nn.Layer): + def __init__(self, max_idxs, embedding_size=4): + super(StEmb, self).__init__() 
self.max_idxs = max_idxs + self.embedding_size = embedding_size + self.embeddings = nn.LayerList([ + Meta_Embedding(max_idx + 1, self.embedding_size) + for max_idx in self.max_idxs + ]) + + def forward(self, static_ids): + concat_embeddings = [] + batch_size = static_ids.shape[0] + # batch * feature_size + feature_size = static_ids.shape[1] + + for i in range(feature_size): + # B*1 + static_ids_tensor = static_ids[:, i] + static_embeddings_tensor = self.embeddings[i]( + static_ids_tensor.astype('int64')) + + concat_embeddings.append( + paddle.reshape(static_embeddings_tensor, + [batch_size, 1, self.embedding_size])) + # B*F*E + concat_embeddings = paddle.concat(concat_embeddings, 1) + return concat_embeddings + + +class MultiLayerPerceptron(nn.Layer): + def __init__(self, input_dim, embed_dims): + super().__init__() + layers = [] + for embed_dim in embed_dims: + layers.append(Meta_Linear(input_dim, embed_dim)) + layers.append(nn.ReLU()) + input_dim = embed_dim + self.mlp = nn.LayerList(layers) + + def forward(self, x): + # apply every Linear/ReLU pair so all entries of mlp_dims take effect + for layer in self.mlp: + x = layer(x) + return x + + +class WideAndDeepModel(nn.Layer): + def __init__(self, max_ids, embed_dim, mlp_dims, num_expert, num_output): + super().__init__() + self.embedding = Emb(max_ids, embed_dim) + self.embed_output_dim = (len(max_ids[0]) + len(max_ids[1])) * embed_dim + self.ad_embed_dim = (len(max_ids[2]) + 1) * embed_dim + expert = [] + for i in range(num_expert): + expert.append( + MultiLayerPerceptron(self.embed_output_dim, mlp_dims)) + self.mlp = nn.LayerList(expert) + output_layer = [] + for i in range(num_output): + output_layer.append(Meta_Linear(mlp_dims[-1], 1)) + self.output_layer = nn.LayerList(output_layer) + + self.attention_layer = nn.Sequential( + Meta_Linear(self.ad_embed_dim, mlp_dims[-1]), + nn.ReLU(), + Meta_Linear(mlp_dims[-1], num_expert), + nn.Softmax(axis=1)) + self.output_attention_layer = nn.Sequential( + Meta_Linear(self.ad_embed_dim, mlp_dims[-1]), + nn.ReLU(), +
+            Meta_Linear(mlp_dims[-1], num_output),
+            nn.Softmax(axis=1))
+
+    def forward(self, x):
+        emb, ad_emb = self.embedding(x)
+        # prepend the mean of all field embeddings to the ad embedding
+        ad_emb = paddle.concat(
+            [paddle.mean(
+                emb, axis=1, keepdim=True), ad_emb], 1)
+
+        # attention-weighted mixture of expert outputs
+        fea = 0
+        att = self.attention_layer(
+            paddle.reshape(ad_emb, [-1, self.ad_embed_dim]))
+        for i in range(len(self.mlp)):
+            fea += (
+                att[:, i].unsqueeze(1) *
+                self.mlp[i](paddle.reshape(emb, [-1, self.embed_output_dim])))
+
+        # attention-weighted mixture of the output heads' predictions
+        result = 0
+        att2 = self.output_attention_layer(
+            paddle.reshape(ad_emb, [-1, self.ad_embed_dim]))
+        for i in range(len(self.output_layer)):
+            result += (att2[:, i].unsqueeze(1) *
+                       F.sigmoid(self.output_layer[i](fea)))
+
+        return result.squeeze(1)
diff --git a/models/multitask/metaheac/reader_infer.py b/models/multitask/metaheac/reader_infer.py
new file mode 100644
index 000000000..d96c0d543
--- /dev/null
+++ b/models/multitask/metaheac/reader_infer.py
@@ -0,0 +1,141 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import numpy as np
+import pickle
+from paddle.io import IterableDataset
+
+
+class RecDataset(IterableDataset):
+    def __init__(self, file_list, config):
+        super(RecDataset, self).__init__()
+        self.file_list = file_list
+        self.config = config
+        self.task_count = config.get("hyper_parameters.task_count")
+        self.batchsize = config.get("hyper_parameters.batch_size")
+
+        self.static_context_col = [
+            'carrier',
+            'consumptionAbility',
+            'LBS',
+            'age',
+            'education',
+            'gender',
+            'house',
+        ]
+        self.dynamic_context_col = [
+            'interest1',
+            'interest2',
+            'interest3',
+            'kw1',
+            'kw2',
+            'topic1',
+            'topic2',
+        ]
+        self.ad_col = [
+            'advertiserId',
+            'campaignId',
+            'creativeSize',
+            'adCategoryId',
+            'productId',
+            'productType',
+        ]
+        self.col_length_name = [
+            x + '_length' for x in self.dynamic_context_col
+        ]
+        self.label_col = 'label'
+
+        self.train_col = self.static_context_col + self.dynamic_context_col + self.col_length_name + self.ad_col
+
+        self.all_col = [
+            self.label_col, 'aid'
+        ] + self.static_context_col + self.dynamic_context_col + self.col_length_name + self.ad_col
+
+    def __iter__(self):
+        np.random.seed(2021)
+        self.file_list.sort()
+        print(self.file_list)
+        with open(self.file_list[0], "rb") as data_test_stage1:
+            with open(self.file_list[1], "rb") as data_test_stage2:
+                data_test_stage1 = pickle.load(data_test_stage1)[self.all_col]
+                data_test_stage2 = pickle.load(data_test_stage2)[self.all_col]
+
+                aid_set = set(data_test_stage1.aid)
+                for aid in aid_set:
+                    # all samples belonging to one sub-task (one ad id)
+                    task_test_stage1 = data_test_stage1[data_test_stage1.aid ==
+                                                        aid]
+                    task_test_stage2 = data_test_stage2[data_test_stage2.aid ==
+                                                        aid]
+
+                    data_train = task_test_stage1.sample(frac=1)
+                    data_test = task_test_stage2
+
+                    output_list = list()
+
+                    batch_sup_x = []
+                    batch_sup_x.append(
+                        np.array(data_train[self.static_context_col])
+                        [:])  # shape = [*, 7]
+
+                    # dynamic features of the stage-1 data
+                    temp_list = list()
+                    for k in range(len(self.dynamic_context_col)):
+                        dy_np = np.array(data_train[self.dynamic_context_col[
+                            k]])[:]
+                        dy_np = np.vstack(dy_np)
+                        temp_list.append(dy_np)
+                    temp_np = np.concatenate(temp_list, axis=1)
+                    batch_sup_x.append(temp_np)  # shape = [*, 50]
+
+                    batch_sup_x.append(
+                        np.array(data_train[self.col_length_name])
+                        [:])  # shape = [*, 7]
+                    batch_sup_x.append(
+                        np.array(data_train[self.ad_col])[:])  # shape = [*, 6]
+
+                    batch_sup_y = np.array(data_train[self.label_col]
+                                           .values)[:]  # shape = [*, 1]
+
+                    batch_qry_x = []
+                    batch_qry_x.append(
+                        np.array(data_test[self.static_context_col])
+                        [:])  # shape = [*, 7]
+
+                    # dynamic features of the stage-2 data
+                    temp_list = list()
+                    for k in range(len(self.dynamic_context_col)):
+                        dy_np = np.array(data_test[self.dynamic_context_col[
+                            k]])[:]
+                        dy_np = np.vstack(dy_np)
+                        temp_list.append(dy_np)
+                    temp_np = np.concatenate(temp_list, axis=1)
+                    batch_qry_x.append(temp_np)  # shape = [*, 50]
+
+                    batch_qry_x.append(
+                        np.array(data_test[self.col_length_name])
+                        [:])  # shape = [*, 7]
+                    batch_qry_x.append(
+                        np.array(data_test[self.ad_col])[:])  # shape = [*, 6]
+
+                    batch_qry_y = np.array(data_test[self.label_col]
+                                           .values)[:]  # shape = [*, 1]
+
+                    output_list.append(np.array([aid]))  # aid of this sub-task
+                    output_list.append(batch_sup_x)  # shape = [*, 7+50+7+6]
+                    output_list.append(batch_sup_y)  # shape = [*, 1]
+                    output_list.append(batch_qry_x)  # shape = [*, 7+50+7+6]
+                    output_list.append(batch_qry_y)  # shape = [*, 1]
+
+                    yield output_list
diff --git a/models/multitask/metaheac/reader_train.py b/models/multitask/metaheac/reader_train.py
new file mode 100644
index 000000000..751288baf
--- /dev/null
+++ b/models/multitask/metaheac/reader_train.py
@@ -0,0 +1,161 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import numpy as np
+import pickle
+from paddle.io import IterableDataset
+
+
+class RecDataset(IterableDataset):
+    def __init__(self, file_list, config):
+        super(RecDataset, self).__init__()
+        self.file_list = file_list
+        self.config = config
+        self.task_count = config.get("hyper_parameters.task_count")
+        self.batchsize = config.get("hyper_parameters.batch_size")
+
+        self.static_context_col = [
+            'carrier',
+            'consumptionAbility',
+            'LBS',
+            'age',
+            'education',
+            'gender',
+            'house',
+        ]
+        self.dynamic_context_col = [
+            'interest1',
+            'interest2',
+            'interest3',
+            'kw1',
+            'kw2',
+            'topic1',
+            'topic2',
+        ]
+        self.ad_col = [
+            'advertiserId',
+            'campaignId',
+            'creativeSize',
+            'adCategoryId',
+            'productId',
+            'productType',
+        ]
+        self.col_length_name = [
+            x + '_length' for x in self.dynamic_context_col
+        ]
+        self.label_col = 'label'
+
+        self.train_col = self.static_context_col + self.dynamic_context_col + self.col_length_name + self.ad_col
+        self.all_col = [
+            self.label_col, 'aid'
+        ] + self.static_context_col + self.dynamic_context_col + self.col_length_name + self.ad_col
+
+    def __iter__(self):
+        np.random.seed(2021)
+        # for file in self.file_list:
+        file = self.file_list[0]
+        print(file)
+        with open(file, "rb") as rf:
+            data_train_stage1 = pickle.load(rf)[self.all_col]
+            n_samples = data_train_stage1.shape[0]
+
+            aid_set = list(set(data_train_stage1.aid))
+            data_train = data_train_stage1
+
+            # number of batches: total sample count divided by batch size
+            n_batch = int(np.ceil(n_samples / self.batchsize))
+            # weight each sub-task by its number of samples
+            list_prob = []
+            for aid in aid_set:
+                list_prob.append(data_train_stage1[data_train_stage1.aid ==
+                                                   aid].shape[0])
+            # normalize the sample counts into sampling probabilities
+            list_prob_sum = sum(list_prob)
+            for i in range(len(list_prob)):
+                list_prob[i] = list_prob[i] / list_prob_sum
+
+            for i_batch in range(n_batch):
+                # draw task_count sub-tasks, weighted by task size
+                batch_aid_set = np.random.choice(
+                    aid_set, size=self.task_count, replace=True, p=list_prob)
+
+                list_sup_x, list_sup_y, list_qry_x, list_qry_y = list(), list(
+                ), list(), list()
+
+                for aid in batch_aid_set:
+                    # sample a support batch and a query batch per sub-task
+                    batch_sup = data_train[data_train.aid ==
+                                           aid].sample(self.batchsize)
+                    batch_qry = data_train[data_train.aid ==
+                                           aid].sample(self.batchsize)
+
+                    batch_sup_x = []
+                    batch_sup_x.append(
+                        np.array(batch_sup[self.static_context_col])
+                        [:])  # [batchsize, 7]
+
+                    # dynamic features of the support set
+                    temp_list = list()
+                    for k in range(len(self.dynamic_context_col)):
+                        dy_np = np.array(batch_sup[self.dynamic_context_col[
+                            k]])[:]
+                        dy_np = np.vstack(dy_np)
+                        temp_list.append(dy_np)
+                    temp_np = np.concatenate(temp_list, axis=1)
+
+                    batch_sup_x.append(temp_np)  # [batchsize, 50]
+                    batch_sup_x.append(
+                        np.array(batch_sup[self.col_length_name])
+                        [:])  # [batchsize, 7]
+                    batch_sup_x.append(
+                        np.array(batch_sup[self.ad_col])[:])  # [batchsize, 6]
+
+                    batch_sup_y = np.array(batch_sup[self.label_col]
+                                           .values)[:]  # [batchsize, 1]
+
+                    batch_qry_x = []
+                    batch_qry_x.append(
+                        np.array(batch_qry[self.static_context_col])
+                        [:])  # [batchsize, 7]
+
+                    # dynamic features of the query set
+                    temp_list = list()
+                    for k in range(len(self.dynamic_context_col)):
+                        dy_np = np.array(batch_qry[self.dynamic_context_col[
+                            k]])[:]
+                        dy_np = np.vstack(dy_np)
+                        temp_list.append(dy_np)
+                    temp_np = np.concatenate(temp_list, axis=1)
+
+                    batch_qry_x.append(temp_np)  # [batchsize, 50]
+                    batch_qry_x.append(
+                        np.array(batch_qry[self.col_length_name])
+                        [:])  # [batchsize, 7]
+                    batch_qry_x.append(
+                        np.array(batch_qry[self.ad_col])[:])  # [batchsize, 6]
+
+                    batch_qry_y = np.array(batch_qry[self.label_col]
+                                           .values)[:]  # [batchsize, 1]
+
+                    list_sup_x.append(
+                        batch_sup_x)  # shape = [task_count, batchsize, 7+50+7+6]
+                    list_sup_y.append(
+                        batch_sup_y)  # shape = [task_count, batchsize, 1]
+                    list_qry_x.append(
+                        batch_qry_x)  # shape = [task_count, batchsize, 7+50+7+6]
+                    list_qry_y.append(
+                        batch_qry_y)  # shape = [task_count, batchsize, 1]
+
+                output_list = []
+                output_list.append(list_sup_x)
+                output_list.append(list_sup_y)
+                output_list.append(list_qry_x)
+                output_list.append(list_qry_y)
+
+                yield output_list
diff --git a/models/multitask/metaheac/readme.md b/models/multitask/metaheac/readme.md
new file mode 100644
index 000000000..e433c95db
--- /dev/null
+++ b/models/multitask/metaheac/readme.md
@@ -0,0 +1,108 @@
+# MetaHeac
+
+Below is a brief directory structure and description of this example:
+
+```
+├── data                 # sample data
+    ├── train            # training data
+        ├── train_stage1.pkl
+    ├── test             # test data
+        ├── test_stage1.pkl
+        ├── test_stage2.pkl
+├── net.py               # core network definition
+├── config.yaml          # config for the sample data
+├── config_big.yaml      # config for the full data
+├── dygraph_model.py     # dynamic-graph model wrapper
+├── reader_train.py      # training data reader
+├── reader_infer.py      # inference data reader
+├── infer_meta.py        # meta-learning inference tool
+├── readme.md            # this document
+```
+
+Note: before reading this example, you are advised to first go through:
+
+[PaddleRec beginner tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
+
+## Contents
+
+- [Model introduction](#model-introduction)
+- [Data preparation](#data-preparation)
+- [Environment](#environment)
+- [Quick start](#quick-start)
+- [Network structure](#network-structure)
+- [Reproducing results](#reproducing-results)
+- [Inference notes](#inference-notes)
+- [Advanced usage](#advanced-usage)
+- [FAQ](#faq)
+
+## Model introduction
+On recommendation and advertising platforms, marketers want to promote goods, content, or ads to potential users through media channels such as video or social networks. Audience expansion (look-alike modeling) is an effective solution, but it usually faces two challenges: (1) a company can run hundreds of marketing campaigns per day to promote content of completely different categories; (2) the seed set of a given campaign covers only a limited number of users, so a model customized to those limited seed users tends to overfit severely. To address these challenges, the paper "Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising" proposes a new two-stage framework, Meta Hybrid Experts and Critics (MetaHeac), which uses meta-learning to train a generalizable initialization model that can quickly adapt to promotion tasks for new content categories.
+
+## Data preparation
+This example uses the Tencent Look-alike Dataset, which contains user features for hundreds of seed audiences and a massive candidate audience, together with the ad features of each seed audience. For business data security, all data are desensitized. This reproduction uses a preprocessed dataset; download it directly from [preprocessed data](https://drive.google.com/file/d/11gXgf_yFLnbazjx24ZNb_Ry41MI5Ud1g/view?usp=sharing). The metaheac/data/ directory holds a small subset taken from the full dataset, used for aligning the model.
+
+## Environment
+PaddlePaddle>=2.0
+
+python 2.7/3.5/3.6/3.7
+
+os : windows/linux/macos
+
+## Quick start
+This example ships with sample data for a quick first run, and the commands work from any directory. Quick-start commands in the MetaHeac model directory:
+```bash
+# enter the model directory
+# cd PaddleRec/models/multitask/metaheac/ # works from any directory
+# dynamic-graph training
+python -u ../../../tools/trainer.py -m config.yaml # use config_big.yaml for the full data
+# dynamic-graph inference
+python -u ./infer_meta.py -m config.yaml
+```
+
+## Network structure
+MetaHeac was proposed in the KDD 2021 paper [Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising](https://arxiv.org/pdf/2105.14688). It presents a new two-stage framework, Meta Hybrid Experts and Critics (MetaHeac), which tackles two key problems in real-world scenarios: building a single model that generalizes while expanding high-quality audience candidates across all content domains, and the severe overfitting of models customized to a limited set of seed users. The main network structure is shown below:
+[MetaHeac](https://arxiv.org/pdf/2105.14688):

+ +

+ +## 效果复现 +为了方便使用者能够快速的跑通每一个模型,我们在每个模型下都提供了样例数据。如果需要复现readme中的效果,请按如下步骤依次操作即可。 +在全量数据下模型的指标如下(train.py文件内 paddle.seed = 2021下效果): + +| 模型 | auc | batch_size | epoch_num| Time of each epoch | +|:------|:-------| :------ | :------| :------ | +| MetaHeac | 0.7112 | 1024 | 1 | 3个小时左右 | + +1. 确认您当前所在目录为PaddleRec/models/multitask/metaheac +2. 进入paddlerec/datasets/目录下,执行该脚本,会从国内源的服务器上下载我们预处理完成的Lookalike全量数据集,并解压到指定文件夹。 +``` bash +cd ../../../datasets/Lookalike +sh run.sh +``` +3. 切回模型目录,执行命令运行全量数据 +```bash +cd ../../models/multitask/metaheac/ # 切回模型目录 +# 动态图训练 +# step1: train +python -u ../../../tools/trainer.py -m config_big.yaml +# 动态图预测 +# step2: infer 此时test数据集为hot +python -u ./infer_meta.py -m config_big.yaml +# step3:修改config_big.yaml文件中test_data_dir的路径为cold +# python -u ./infer_meta.py -m config.yaml +``` + +## infer说明 +### 数据集说明 +为了测试模型在不同规模的内容定向推广任务上的表现,将数据集根据内容定向推广任务给定的候选集大小进行了划分,分为大于T和小于T两部分。将腾讯广告大赛2018的Look-alike数据集中的T设置为4000,其中hot数据集中候选集大于T,cold数据集中候选集小于T. +### infer_meta.py说明 +infer_meta.py是用于元学习模型infer的tool,在使用中主要有以下几点需要注意: +1. 在对模型进行infer时(train时也可使用这样的操作),可以将runner.infer_batch_size注释掉,这样将禁用DataLoader的自动组batch功能,进而可以使用自定义的组batch方式. +2. 由于元学习在infer时需要先对特定任务的少量数据集进行训练,因此在infer_meta.py的infer_dataloader中每次接收单个子任务的全量infer数据集(包括训练数据和测试数据). +3. 实际组batch在infer.py中进行,在获取到单个子任务的数据后,获取config中的batch_size参数,对训练数据和测试数据进行组batch,并分别调用dygraph_model.py中的infer_train_forward和infer_forward进行训练和测试. +4. 和普通infer不同,由于需要对单个子任务进行少量数据的train和test,对于每个子任务来说加载的都是train阶段训练好的泛化模型. +5. 在对单个子任务infer时,创建了局部的paddle.metric.Auc("ROC"),可以查看每个子任务的AUC指标,在全局metric中维护包含所有子任务的AUC指标. + +## 进阶使用 + +## FAQ