Reginx is short for Recommendation Engine X. I plan to build most parts of a modern recommendation engine from scratch.
The initial plan includes:
- Popular machine learning models like CF, FM, XGBoost, TwoTower, W&D, DeepFM, DCN, MaskNet, SASRec, Bert4Rec, Transformer, etc.
- Online inference service written in Go, including candidate generation, ranking, and re-ranking layers
- Feature engineering and preprocessing, including both online and offline parts
- Diversity approaches, like MMR and DPP (see the MMR sketch after this list)
- Deduplication approaches, like LSH or Bloom filters
- Training data pipeline
- Model registry, monitoring and versioning
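For example, MMR (maximal marginal relevance) re-ranks a candidate list by greedily trading off relevance against similarity to items already picked. Below is a minimal sketch of that idea; it is illustrative only, not the repo's implementation, and the function name `mmr_rerank` is hypothetical.

```python
import numpy as np

def mmr_rerank(scores, item_embs, k, lam=0.7):
    """Greedy MMR: balance relevance (scores) against similarity to already-picked items."""
    # Normalize embeddings so dot products behave like cosine similarities.
    item_embs = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < k:
        if not selected:
            # First pick is purely relevance-driven.
            best = max(candidates, key=lambda i: scores[i])
        else:
            # Penalize each candidate by its max similarity to the selected set.
            sim_to_selected = item_embs[candidates] @ item_embs[selected].T  # [n_cand, n_sel]
            max_sim = sim_to_selected.max(axis=1)
            mmr = lam * np.asarray(scores)[candidates] - (1 - lam) * max_sim
            best = candidates[int(np.argmax(mmr))]
        selected.append(best)
        candidates.remove(best)
    return selected
```

Calling `mmr_rerank(ranking_scores, embeddings, k=50)` would return the indices of a diversified top-50 list; `lam` controls the relevance/diversity trade-off.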
TensorFlow 2 and Google Cloud are used for model training and performance tracking. The conda environment config is here.
I have a personal blog on Substack explaining the models, and I put the corresponding links in the table below.
Here is an example of training a two-tower model on a local machine.
Set up your conda environment using the conda config here:
conda env create -f environment.yml
conda activate tf
Set your PYTHONPATH to the root folder of this project, or add it to your .bashrc:
export PYTHONPATH=/your_project_folder/reginx
You can run this script to generate the metadata and training data in your local directory.
By default, it uses the MovieLens 1M dataset from TensorFlow Datasets.
Then copy the dataset files to your local /tmp/train, /tmp/test, and /tmp/item folders. Note that the TwoTower model implementation requires three kinds of files: train files for training, test files for evaluation, and item files for mixing in global negative samples.
If you want to use a dataset other than MovieLens, prepare your own dataset and save it to your local directory.
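For a rough idea of what that data-generation step looks like, here is a hedged sketch that pulls MovieLens 1M from TensorFlow Datasets, splits the ratings into train/test, and saves the three datasets to the expected folders. The 90/10 split and the tf.data native save format are assumptions for illustration; the repo's generator script may split and serialize differently.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# MovieLens 1M interactions and item catalog from TensorFlow Datasets.
ratings = tfds.load("movielens/1m-ratings", split="train")
movies = tfds.load("movielens/1m-movies", split="train")

# Assumed 90/10 train/test split over shuffled ratings.
num_ratings = int(ratings.cardinality().numpy())
train_size = int(num_ratings * 0.9)
ratings = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

# Assumed tf.data native format; the actual script may use TFRecords instead.
tf.data.experimental.save(ratings.take(train_size), "/tmp/train")
tf.data.experimental.save(ratings.skip(train_size), "/tmp/test")
tf.data.experimental.save(movies, "/tmp/item")
```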
Here is an example config file for candidate-retriever training.
If you use a dataset other than MovieLens, you also need to prepare your own query and candidate embedding classes.
model:
  temperature: 0.05
  # specify training model under models folder
  base_model: TwoTower
  # specify query embedding model under models/features folder
  query_emb: MovieLensQueryEmb
  # specify candidate embedding model under models/features folder
  candidate_emb: MovieLensCandidateEmb
  # specify the unique key for candidates
  item_id_key: movie_id
train:
  # specify task under tasks folder
  task_name: CandidateRetrieverTrain
  epochs: 1
  batch_size: 256
  mixed_negative_batch_size: 128
  learning_rate: 0.05
  train_data: movielens/data/ratings_train
  test_data: movielens/data/ratings_test
  candidate_data: movielens/data/movies
  meta_data: trainer/meta/movie_lens.json
  model_dir: trainer/saved_models/movielens_cr
  log_dir: logs
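For context on the temperature and mixed_negative_batch_size settings: a two-tower retriever is typically trained with a softmax over in-batch negatives, optionally mixed with extra globally sampled item negatives, and the logits are scaled by the temperature. The sketch below shows that general pattern; it is illustrative, not the repo's exact loss code.

```python
import tensorflow as tf

def mixed_negative_loss(query_emb, pos_item_emb, global_neg_emb, temperature=0.05):
    """Softmax loss over in-batch negatives plus globally sampled (mixed) negatives.

    query_emb:      [B, d] embeddings from the query tower
    pos_item_emb:   [B, d] embeddings of the positive items for each query
    global_neg_emb: [M, d] embeddings of globally sampled items
    """
    # Each row's positive sits on the diagonal; the other in-batch items and
    # the global negatives act as negatives.
    candidates = tf.concat([pos_item_emb, global_neg_emb], axis=0)   # [B + M, d]
    logits = tf.matmul(query_emb, candidates, transpose_b=True)      # [B, B + M]
    logits /= temperature
    labels = tf.range(tf.shape(query_emb)[0])                        # diagonal positives
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    )
```

Under this sketch's assumptions, batch_size: 256 and mixed_negative_batch_size: 128 mean each positive is contrasted against 255 in-batch negatives plus 128 global negatives, with logits sharpened by the 0.05 temperature.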
Simply run the script below, specifying your config file, in your activated conda environment.
python trainer/local_train.py -c movielens_candidate_retriever
By default, training metrics are reported once every 1000 training steps for faster training. You can change this by tuning the steps_per_execution argument when compiling the model.
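steps_per_execution is a standard Keras compile() argument; a minimal sketch of changing it is below. The `model` variable and the optimizer choice are placeholders for illustration, not the repo's exact training code.

```python
import tensorflow as tf

# `model` is whichever Keras model you are training (hypothetical here).
# With steps_per_execution=100, Keras runs 100 batches per tf.function call,
# so progress-bar metrics update every 100 steps instead of every 1000.
model.compile(
    optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.05),  # optimizer is illustrative
    steps_per_execution=100,
)
```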
After training, evaluation runs on the test dataset. You should see metrics like:
391/391 [==============================] - 50s 129ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0036 - factorized_top_k/top_5_categorical_accuracy: 0.0181 - factorized_top_k/top_10_categorical_accuracy: 0.0349 - factorized_top_k/top_50_categorical_accuracy: 0.1428 - factorized_top_k/top_100_categorical_accuracy: 0.2409 - loss: 1406.8086 - regularization_loss: 7.9244 - total_loss: 1414.7329
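The factorized_top_k metric names above are the ones TensorFlow Recommenders reports for retrieval tasks. Here is a sketch of how such a metric is typically wired up, assuming a TFRS-style setup; the dummy embeddings stand in for the real item dataset mapped through the trained candidate tower.

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

# Dummy candidate embeddings; in a real pipeline these come from mapping the
# item dataset through the candidate tower.
candidate_embeddings = tf.data.Dataset.from_tensor_slices(
    tf.random.normal([1000, 32])
).batch(128)

metric = tfrs.metrics.FactorizedTopK(
    candidates=candidate_embeddings,
    ks=(1, 5, 10, 50, 100),  # matches the top-k cutoffs in the log above
)
retrieval_task = tfrs.tasks.Retrieval(metrics=metric)
```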