pyemma/Argo
Argo


Argo, the ship that carried Jason and the Argonauts on their quest for the Golden Fleece

This is a playground for re-implementing model architectures from industry and academic papers in PyTorch. The primary goal is educational, and the target audience is people who would like to start their journey in machine learning and ML infrastructure. The implementation is optimized for readability and extensibility rather than peak performance.

Repo structure

  • data: functions for dataset management, such as downloading public datasets and cache management
  • embedding: scripts for generating embeddings
  • feature: functions for feature engineering; currently these mostly read data from benchmarks and use Pandas for feature engineering
  • get-started: useful notebooks to help you get familiar with common techniques and concepts in machine learning and recommendation systems
  • model: model implementations
  • trainer: a simple wrapper around the train/val/eval loop
  • server: a simple inference stack for recommendation systems, including the retrieval engine, feature server, model manager, and inference engine
  • scripts: scripts used to set up the system, such as DB ingestion
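To give a feel for what the trainer wrapper does, here is a minimal numpy sketch of a train/val loop fitting a linear model by gradient descent. The `Trainer` class and `fit` method here are illustrative names, not the repo's actual API:

```python
import numpy as np

class Trainer:
    """Minimal train/val loop sketch: illustrative only, not the repo's actual API."""
    def __init__(self, lr=0.1, epochs=50):
        self.lr, self.epochs = lr, epochs
        self.w = None

    def fit(self, X_train, y_train, X_val, y_val):
        n, d = X_train.shape
        self.w = np.zeros(d)
        for epoch in range(self.epochs):
            pred = X_train @ self.w                        # forward pass
            grad = 2 * X_train.T @ (pred - y_train) / n    # MSE gradient
            self.w -= self.lr * grad                       # gradient step
            val_loss = np.mean((X_val @ self.w - y_val) ** 2)
        return val_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                                             # noiseless labels
trainer = Trainer()
val_loss = trainer.fit(X[:80], y[:80], X[80:], y[80:])
print(round(val_loss, 4))                                  # near zero on this noiseless data
```

The real trainer wraps a PyTorch model, optimizer, and data loaders, but the epoch/step/validation shape of the loop is the same.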

Preparation Steps

Embedding Based Retrieval Setup

  1. Run python movie_len_embedding.py to generate the embeddings (only collaborative embeddings are supported for now)
  2. Run python movie_len_index.py to generate the FAISS index
  3. Run python scripts/vector_db.py to ingest the embeddings into DuckDB
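For intuition, the FAISS index built in step 2 serves top-k inner-product search over the item embeddings. A brute-force numpy equivalent of exact search (what `faiss.IndexFlatIP` computes, minus the optimizations) looks like this; the sizes and the `search` helper are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
item_emb = rng.normal(size=(1000, 64)).astype(np.float32)   # pretend movie embeddings
# normalize rows so inner product equals cosine similarity
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

def search(query, k=5):
    """Exact inner-product top-k over all items (what faiss.IndexFlatIP does)."""
    scores = item_emb @ query            # similarity of query to every item
    top = np.argsort(-scores)[:k]        # indices of the k highest scores
    return top, scores[top]

query = item_emb[7]                      # query with a known item's own embedding
ids, scores = search(query)
print(ids[0])                            # the item itself ranks first
```

FAISS keeps the same interface (add vectors, query top-k) but adds SIMD batching and optional approximate index structures on top.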

How to run locally

Using uv (Recommended)

  1. Install uv if you haven't already: curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Install dependencies and the package: uv sync (this creates a virtual environment and installs everything)
  3. Activate the virtual environment: source .venv/bin/activate (or prefix commands with uv run instead)
  4. Run python main.py to train the model with the current env config.
  5. Run python server/ebr_server.py to start the gRPC server for embedding-based retrieval; it listens on port 50051 by default. If you use DuckDB, this step can be skipped.
  6. Run python server/inference_engine.py to start the inference server; it listens on port 8000.
  7. Run bash scripts/server_request.sh to send a dummy request (there is one for DIN and one for TransAct as of now; the request will be parameterized in the future).

Using pip (Legacy)

  1. Install the dependencies: pip install -r requirements.txt, then pip install -e .
  2. Run python main.py to train the model with the current env config.
  3. Run python server/ebr_server.py to start the gRPC server for embedding-based retrieval; it listens on port 50051 by default. If you use DuckDB, this step can be skipped.
  4. Run python server/inference_engine.py to start the inference server; it listens on port 8000.
  5. Run bash scripts/server_request.sh to send a dummy request (there is one for DIN and one for TransAct as of now; the request will be parameterized in the future).

Papers

Road Map

Modeling

  • ✅ Deep Interest Network E2E training & inference example, MovieLens Small
  • ✅ TransAct training & inference example, MovieLens Large
  • ✅ MovieLens item embedding generation: collaborative filtering, two-tower, LLM (Qwen3-Embedding is out)
  • 🚧 HSTU training & inference example, MovieLens Small
  • ✅ RQ-VAE
  • Generative retrieval via various strategies: NTP, MTP with semantic IDs, token representation with ANN
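The core idea in Deep Interest Network, for example, is to pool a user's behavior sequence with attention weights conditioned on the candidate item. A stripped-down numpy sketch, using plain dot-product scores in place of the paper's learned activation-unit MLP:

```python
import numpy as np

def din_attention_pool(behaviors, candidate):
    """Weight each behavior embedding by its relevance to the candidate, then sum.
    A simplified stand-in for DIN's learned activation unit (which uses an MLP)."""
    scores = behaviors @ candidate             # (seq_len,) relevance per behavior
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the sequence
    return weights @ behaviors                 # (emb_dim,) candidate-aware interest vector

rng = np.random.default_rng(0)
behaviors = rng.normal(size=(8, 16))           # 8 past interactions, 16-dim embeddings
candidate = rng.normal(size=16)                # item being scored
interest = din_attention_pool(behaviors, candidate)
print(interest.shape)                          # (16,)
```

The pooled vector is then concatenated with the candidate and other features and fed to the ranking MLP; see the model directory for the full implementation.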

Data & Feature Engineering

  • ✅ Kuaishou Dataset: https://kuairand.com/
  • Ray integration (DPP reader + trainer arch)
  • Daft and Polars exploration

Infra

  • ✅ Embedding Based Retrieval (EBR): DuckDB, FAISS
  • Nearline item embedding update
  • Feature store integration: FEAST
  • Feature logging & training data generation pipeline
  • PyTorch Lightning integration
  • Reinforcement learning training infrastructure for recommendation tasks

GPU

  • GPU training & inference enablement
  • Integrate profiling, benchmarking, tuning, and monitoring for accelerator optimization
  • Optimize representative models with auto-tuning, kernel fusion, quantization, dynamic batching, etc
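As a taste of the quantization item above, symmetric per-tensor int8 weight quantization can be sketched in a few lines of numpy. This is a generic illustration, not the repo's tooling:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map [-max|w|, max|w|] to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)   # pretend weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())           # bounded by ~scale / 2
print(q.dtype, max_err < scale)
```

Real deployments add per-channel scales, activation quantization, and calibration, but the round-and-rescale core is the same.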

Reference
