Installation • Documentation • Paper • Citation
EasyRL4Rec is a comprehensive and easy-to-use library designed specifically for Reinforcement Learning (RL)-based Recommender Systems (RSs). The library provides lightweight and diverse RL environments based on five public datasets and includes core modules with rich options, simplifying model development. It offers unified evaluation standards focusing on long-term outcomes, together with tailored designs for state modeling and action representation in recommendation scenarios. The main contributions and key features of this library can be summarized as follows:
- An Easy-to-use Framework.
- Unified Evaluation Standards.
- Tailored Designs for Recommendation Scenarios: customizable modules for state modeling and action representation.
- Insightful Experiments for RL-based RSs.
We hope EasyRL4Rec can facilitate model development and experiments in the domain of RL-based RSs. More details are available in this paper.
If this work helps you, please kindly cite our paper:
@misc{yu2024easyrl4rec,
title={EasyRL4Rec: A User-Friendly Code Library for Reinforcement Learning Based Recommender Systems},
author={Yuanqing Yu and Chongming Gao and Jiawei Chen and Heng Tang and Yuefeng Sun and Qian Chen and Weizhi Ma and Min Zhang},
year={2024},
eprint={2402.15164},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
- Lightweight Environment.
  - Built on five public datasets: Coat, MovieLens, Yahoo, KuaiRec, KuaiRand.
- StateTracker with rich options.
  - Encompassing popular methods in sequential modeling: Average, GRU, Caser, SASRec, NextItNet.
- Comprehensive RL Policies.
  - Extend RL policies in Tianshou.
  - Include a mechanism to convert continuous actions to discrete items.
- Two Training Paradigms.
  - Learning directly from offline logs.
  - Learning with a user model.
- Unified Evaluation.
  - Offline Evaluation focusing on long-term outcomes.
  - Three modes (their termination logic is sketched after this list):
    - FreeB: allow repeated recommendations; interactions are terminated by the quit mechanism.
    - NX_0: prohibit repeated recommendations; interactions are terminated by the quit mechanism.
    - NX_X: prohibit repeated recommendations; interactions are fixed at X rounds without a quit mechanism.
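The three modes above differ only in whether previously recommended items are masked and in how an interaction terminates. Below is a minimal sketch of that logic based on the descriptions above; the function and variable names are illustrative placeholders, not the library's API.

```python
def allowed_items(mode, all_items, recommended):
    """Items the policy may recommend at the current turn (illustrative)."""
    if mode == "FreeB":
        return list(all_items)                              # repetition is allowed
    return [i for i in all_items if i not in recommended]   # NX_0 / NX_X: mask recommended items

def episode_done(mode, turn, user_quits, max_turn):
    """Whether the interaction ends at this turn (illustrative)."""
    if mode in ("FreeB", "NX_0"):
        return user_quits                                   # ended by the quit mechanism
    return turn >= max_turn                                 # NX_X: fixed X rounds, no quit mechanism
```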
- Clone this git repository and change directory into it:
  git clone https://github.com/chongminggao/EasyRL4Rec.git
  cd EasyRL4Rec/
- A new conda environment is suggested:
  conda create --name easyrl4rec python=3.11 -y
- Activate the newly created environment:
  conda activate easyrl4rec
- Install the required modules from pip:
  sh install.sh
  Then install the tianshou package from my forked version:
  cd src
  git clone https://github.com/yuyq18/tianshou.git
  cd ..
- Download the compressed dataset:
  wget https://nas.chongminggao.top:4430/easyrl4rec/data.tar.gz
  Or you can manually download it from this website: https://rec.ustc.edu.cn/share/a3bdc320-d48e-11ee-8c50-4b1c32c31e9c
- Uncompress the downloaded data.tar.gz. The following command will directly extract data.tar.gz into the data/ directory and merge it with the existing files under data/:
  tar -zxvf data.tar.gz
Please note that the decompressed data is as large as 8.1 GB. This is due to the large space occupied by the ground-truth user-item interaction matrix.
If things go well, you can now run the following examples! Or you can directly reproduce the results in the paper.
All running commands are saved in script files under script/. Some running examples are presented below.
The argument env of all experiments can be set to one of the five environments: CoatEnv-v0, Yahoo-v0, MovieLensEnv-v0, KuaiEnv-v0, KuaiRand-v0. The first two datasets (Coat and Yahoo) are small, so their models run very quickly.
python examples/usermodel/run_DeepFM_ensemble.py --env KuaiEnv-v0 --seed 2023 --cuda 0 --epoch 5 --loss "pointneg" --message "pointneg"
python examples/usermodel/run_DeepFM_IPS.py --env KuaiEnv-v0 --seed 2023 --cuda 1 --epoch 5 --loss "pointneg" --message "DeepFM-IPS"
python examples/usermodel/run_Egreedy.py --env KuaiEnv-v0 --num_leave_compute 4 --leave_threshold 0 --epoch 5 --seed 2023 --cuda 2 --loss "pointneg" --message "epsilon-greedy"
python examples/usermodel/run_LinUCB.py --env KuaiEnv-v0 --num_leave_compute 4 --leave_threshold 0 --epoch 5 --seed 2023 --cuda 3 --loss "pointneg" --message "UCB"
python examples/policy/run_SQN.py --env KuaiEnv-v0 --seed 2023 --cuda 0 --epoch 10 --num_leave_compute 1 --leave_threshold 0 --which_tracker avg --reward_handle "cat" --window_size 3 --read_message "pointneg" --message "SQN"
python examples/policy/run_CRR.py --env KuaiEnv-v0 --seed 2023 --cuda 0 --epoch 10 --num_leave_compute 1 --leave_threshold 0 --which_tracker avg --reward_handle "cat" --window_size 3 --read_message "pointneg" --message "CRR"
python examples/policy/run_CQL.py --env KuaiEnv-v0 --seed 2023 --cuda 1 --epoch 10 --num_leave_compute 1 --leave_threshold 0 --which_tracker avg --reward_handle "cat" --num-quantiles 20 --min-q-weight 10 --window_size 3 --read_message "pointneg" --message "CQL"
python examples/policy/run_BCQ.py --env KuaiEnv-v0 --seed 2023 --cuda 1 --epoch 10 --num_leave_compute 1 --leave_threshold 0 --which_tracker avg --reward_handle "cat" --unlikely-action-threshold 0.6 --window_size 3 --read_message "pointneg" --message "BCQ"
python examples/policy/run_A2C_IPS.py --env KuaiEnv-v0 --seed 2023 --cuda 1 --epoch 10 --num_leave_compute 1 --leave_threshold 0 --which_tracker avg --reward_handle "cat" --window_size 3 --read_message "DeepFM-IPS" --message "IPS"
python examples/policy/run_A2C.py --env KuaiEnv-v0 --seed 2023 --cuda 1 --epoch 10 --num_leave_compute 1 --leave_threshold 0 --which_tracker gru --reward_handle "cat" --window_size 3 --read_message "pointneg" --message "A2C"
python examples/policy/run_DQN.py --env KuaiEnv-v0 --seed 2023 --cuda 0 --epoch 10 --num_leave_compute 1 --leave_threshold 0 --which_tracker avg --reward_handle "cat" --window_size 3 --is_random_init --read_message "pointneg" --message "DQN-test"
python examples/advance/run_MOPO.py --env KuaiEnv-v0 --seed 2023 --cuda 3 --epoch 10 --num_leave_compute 1 --leave_threshold 0 --which_tracker avg --reward_handle "cat" --lambda_variance 0.05 --window_size 3 --read_message "pointneg" --message "MOPO"
python examples/advance/run_DORL.py --env KuaiEnv-v0 --seed 2023 --cuda 0 --epoch 10 --num_leave_compute 1 --leave_threshold 0 --which_tracker avg --reward_handle "cat" --lambda_entropy 5 --window_size 3 --read_message "pointneg" --message "DORL"
python examples/advance/run_Intrinsic.py --env KuaiEnv-v0 --seed 2023 --cuda 0 --epoch 10 --num_leave_compute 1 --leave_threshold 0 --which_tracker avg --reward_handle "cat" --lambda_diversity 0.1 --lambda_novelty 0.1 --window_size 3 --read_message "pointneg" --message "Intrinsic"
The Collector module serves as a crucial link between the Environment and the Policy, and is responsible for collecting interaction trajectories into the Buffer.
Considering a complete interaction over time, the following figure visualizes the data/trajectories stored in the Buffer, which supports simultaneous interactions in multiple environments:
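For orientation, here is a minimal sketch of how such a Collector gathers trajectories from several parallel environments into one buffer, following Tianshou's standard API (EasyRL4Rec builds on a forked Tianshou, so constructor details in the library may differ). Here make_env and policy are placeholders for the library's own environment factory and policy.

```python
from tianshou.data import Collector, VectorReplayBuffer
from tianshou.env import DummyVectorEnv

# `make_env` must be a callable returning an environment; `policy` is a Tianshou policy.
# Both are placeholders in this sketch.
train_envs = DummyVectorEnv([make_env for _ in range(4)])    # 4 parallel environments
buffer = VectorReplayBuffer(total_size=20000, buffer_num=4)  # one sub-buffer per environment
collector = Collector(policy, train_envs, buffer)
collector.collect(n_step=1000)  # roll out the policy and store trajectories into the buffer
```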
EasyRL4Rec offers two training settings:
This setting is similar to ChatGPT's RLHF learning paradigm, in which a reward model is learned in advance to capture users' preferences and is then used to guide the learning of any RL policy.
Its learning pipeline is illustrated in the following figure: we first learn a user model from the offline logs, which then provides rewards for training the RL policy.
The implementation of this paradigm in this package is as follows:
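As a rough, self-contained illustration of this pipeline (a toy stand-in, not the library's code: in the repository the user model is fitted by the scripts under examples/usermodel/ and then loaded by the policy scripts via --read_message), consider the following sketch.

```python
# Toy sketch of "learning with a user model": fit a user model on offline logs, wrap it as a
# simulated environment with a quit mechanism, then improve a policy by planning in it.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 100, 50
logs = rng.integers(0, 2, size=(n_users, n_items)).astype(float)   # stand-in offline logs

# 1. "User model": here simply the per-item mean feedback (the library fits e.g. DeepFM).
user_model = logs.mean(axis=0)

# 2. Simulated environment: the user model supplies rewards for recommended items, and a
#    quit mechanism ends the episode after several consecutive low-reward recommendations.
def run_episode(policy, max_turn=30, threshold=0.4, patience=3):
    recommended, total, low_streak = set(), 0.0, 0
    for _ in range(max_turn):
        item = policy(recommended)
        reward = float(user_model[item])
        total += reward
        recommended.add(item)
        low_streak = low_streak + 1 if reward < threshold else 0
        if low_streak >= patience:      # the simulated user quits
            break
    return total

# 3. "Policy learning": a greedy stand-in for a real RL algorithm (e.g. A2C, DQN) that would
#    be trained by repeatedly interacting with the simulated environment.
def greedy_policy(recommended):
    scores = user_model.copy()
    for i in recommended:               # do not repeat items already shown
        scores[i] = -np.inf
    return int(np.argmax(scores))

print("cumulative simulated reward:", run_episode(greedy_policy))
```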
This setting assumes all data are users' behavior logs (instead of ratings). The policy learns directly from the offline logs, which have been collected into the Buffer in advance. Hence, classic offline RL methods such as BCQ, CQL, and CRR can be trained directly on such data.
In EasyRL4Rec, we implement three buffer construction methods (sketched after this list):
- Sequential: logs are split in chronological order.
- Convolution: logs are augmented through convolution.
- Counterfactual: logs are randomly shuffled over time.
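Below is one plausible reading of these three strategies as a toy illustration; the helper functions are hypothetical, not the library's implementation.

```python
# Turn one user's chronological item log into fixed-length trajectories for the buffer.
import random

log = list(range(10))   # one user's interactions in chronological order
T = 4                   # trajectory length

def sequential(log, T):
    # split the log into consecutive, non-overlapping chunks
    return [log[i:i + T] for i in range(0, len(log), T)]

def convolution(log, T):
    # augment with overlapping (sliding-window) chunks, so items appear in several trajectories
    return [log[i:i + T] for i in range(len(log) - T + 1)]

def counterfactual(log, T, seed=0):
    # shuffle the log over time before slicing, producing counterfactual orderings
    shuffled = log[:]
    random.Random(seed).shuffle(shuffled)
    return sequential(shuffled, T)

print(sequential(log, T), convolution(log, T), counterfactual(log, T), sep="\n")
```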
Note that, compared with the first setting, this setting has no planning stage during training. Its implementation is as follows:
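The library's actual entry points for this setting are the scripts under examples/policy/ shown above (e.g. run_BCQ.py, run_CQL.py, run_CRR.py). As a framework-free toy illustration of the idea, the sketch below updates a plain tabular Q-function purely from a fixed buffer of logged transitions, with no environment interaction during training; it stands in for, but is not, an offline RL policy such as BCQ/CQL/CRR.

```python
import random

random.seed(0)
n_states, n_items, gamma, lr = 5, 10, 0.9, 0.1

# Stand-in offline logs: (state, recommended item, observed reward, next state) tuples.
buffer = [(random.randrange(n_states), random.randrange(n_items),
           random.random(), random.randrange(n_states)) for _ in range(2000)]

Q = [[0.0] * n_items for _ in range(n_states)]
for _ in range(5000):                          # updates use only sampled logged transitions;
    s, a, r, s_next = random.choice(buffer)    # no new data is collected (no planning stage)
    target = r + gamma * max(Q[s_next])
    Q[s][a] += lr * (target - Q[s][a])

best_item_in_state_0 = max(range(n_items), key=lambda a: Q[0][a])
print("recommended item in state 0:", best_item_in_state_0)
```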
In offline evaluation, we cannot obtain users' real-time feedback on the recommended items. There are two options for constructing the test environment:
- Option 1: Use the offline test data to evaluate the policy directly through off-policy evaluation methods.
- Option 2: Create a simulated environment using a learned model. For example, use an MF model to predict the missing values in the user-item matrix and define a certain quit mechanism for ending the interaction, as in KuaiEnv.
The implementation is as follows:
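As a toy illustration of Option 2 (all names below are placeholders, and a random matrix stands in for the MF-completed user-item matrix), the sketch simulates interactions with a quit mechanism and reports long-term metrics such as the average trajectory length and cumulative reward.

```python
import numpy as np

rng = np.random.default_rng(0)
reward_matrix = rng.random((50, 200))       # stand-in for the MF-completed user-item matrix

def evaluate(policy, max_turn=30, quit_threshold=0.2, patience=3):
    lengths, returns = [], []
    for u in range(reward_matrix.shape[0]):
        recommended, total, low = set(), 0.0, 0
        for turn in range(max_turn):
            item = policy(u, recommended)
            r = float(reward_matrix[u, item])
            total += r
            recommended.add(item)
            low = low + 1 if r < quit_threshold else 0
            if low >= patience:             # quit mechanism: the simulated user leaves
                break
        lengths.append(turn + 1)
        returns.append(total)
    # long-term metrics: average interaction length and cumulative reward per trajectory
    return float(np.mean(lengths)), float(np.mean(returns))

random_policy = lambda u, history: int(rng.integers(0, reward_matrix.shape[1]))
print("avg length, avg cumulative reward:", evaluate(random_policy))
```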
Documentation of this project was generated by RepoAgent