CodeContext Explorer: An LLM+RAG-based Pipeline for Explaining Domain-Specific Usage of CatBoost Code

Intro

CodeContext Explorer is an exploratory Retrieval-Augmented Generation (RAG) system that explains code by combining RAG techniques with domain-specific code context. The domain chosen here is housing data, specifically CatBoost code snippets applied to housing datasets (e.g. Zillow). Users can query contextualized code examples and their descriptions, which supports better understanding and use of CatBoost in real estate modeling tasks.

The system relies on synthetically generated CatBoost code samples from the housing-data domain, each paired with a descriptive annotation. Using vector embeddings and a FAISS index, it retrieves the code-context pairs most relevant to a user query. The retrieved contexts are then passed to the Mistral-7B-Instruct language model with custom prompts to generate explanations.
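The retrieval step described above can be illustrated with a small, dependency-free sketch. It is not the project's implementation: a bag-of-words vector stands in for the sentence embeddings, and a brute-force cosine-similarity scan stands in for the FAISS index. The sample entries and the names `CORPUS`, `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical code-context pairs; the real dataset pairs each CatBoost
# snippet with a short description.
CORPUS = [
    {"description": "train a catboost regressor on housing prices",
     "code": "model = CatBoostRegressor(); model.fit(X_train, y_train)"},
    {"description": "evaluate rmse of a catboost model on a test split",
     "code": "rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5"},
    {"description": "save a trained catboost model to disk",
     "code": "model.save_model('housing.cbm')"},
]


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Return the k entries whose descriptions are most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda e: cosine(q, embed(e["description"])), reverse=True)
    return ranked[:k]


top = retrieve("how do I save a catboost model", CORPUS, k=1)
print(top[0]["code"])
```

In the actual pipeline, the retrieved code and description would be placed into the prompt sent to the language model; the sketch stops at retrieval.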

Results are stored in JSON format and presented through an interactive Flask web interface for browsing questions, related code snippets, and explanations, which facilitates learning and exploration for data scientists and ML practitioners.

See here for an example.

Quick Start / Using the Model

Query the Model on Server

Run the retrieval-augmented generation pipeline with Mistral:

python rag_mistral_single_ip.py

Query the Model from Client Using FastAPI

Additional prerequisites for the Dynamic RAG Viewer:

pip install fastapi uvicorn pydantic

Run the live viewer on the server (port 8000 in this example):

cd viewer && uvicorn live_rag_qa_viewer:app --host 0.0.0.0 --port 8000

You should see output like the following, which indicates that the FastAPI application is running:

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 55.04it/s]
Device set to use cuda:0
INFO:     Started server process [2657441]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Then connect your client machine to the server port where the app is running. We use an SSH tunnel here:

ssh -L 8000:127.0.0.1:8000 username@server_address

Then, in another terminal window on your client machine, run the client app provided in the viewer folder:

python3 rag_client.py

You are now able to remotely send queries to the model, and receive its responses.
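For readers who want to send a query without the bundled client, the following is a minimal sketch using only the standard library. The route name `/query` and the `{"question": ...}` payload are assumptions made for illustration; check `live_rag_qa_viewer.py` and `rag_client.py` for the route and schema the application actually uses.

```python
import json
import urllib.request

# Assumption: the SSH tunnel forwards local port 8000 to the server.
BASE_URL = "http://127.0.0.1:8000"


def build_query_request(question, base_url=BASE_URL):
    """Build a JSON POST request for a hypothetical /query endpoint."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",  # hypothetical route name
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, timeout=120):
    """Send the question through the tunnel and return the decoded JSON reply."""
    with urllib.request.urlopen(build_query_request(question), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network; ask() does.
req = build_query_request("How do I handle categorical features in CatBoost?")
print(req.get_method(), req.full_url)
```

Calling `ask(...)` only works while both the server and the tunnel are running.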

View Saved Results from Client Machine Using Flask

Launch the Flask web app to browse the saved QA pairs:

python viewer/rag_qa_viewer.py

The web app provides an interactive interface to navigate through your question-answer samples, displaying the context, question, and model-generated answers in a clean format.

Example

Accessing the Web App Remotely

From your local machine, create an SSH tunnel to securely access the app running on your server:

ssh -L 5000:localhost:5000 user@server_address

Then open http://localhost:5000 in your browser to interact with the app.

Running the RAG Reward Model & PPO Pipeline

This section describes the full workflow for building a reward model and then fine-tuning a language model with reinforcement learning guided by that reward model.

Step 1: Build Pairwise Preference Dataset

python3 RL_build_pairwise_prefs.py

Generates a dataset of pairwise preferences from model outputs. These pairwise comparisons are used to teach the reward model which outputs are better or preferred.
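As a rough illustration of what a pairwise preference record can look like, the sketch below turns scored candidate answers for one prompt into chosen/rejected pairs. How `RL_build_pairwise_prefs.py` actually ranks outputs is not described here, so the scores, the field names (`prompt`, `chosen`, `rejected`), and the function name are hypothetical.

```python
from itertools import combinations


def build_pairwise_prefs(prompt, scored_outputs):
    """Turn (text, score) candidates for one prompt into chosen/rejected pairs.

    Ties are skipped because they carry no preference signal.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(scored_outputs, 2):
        if score_a == score_b:
            continue
        chosen, rejected = (text_a, text_b) if score_a > score_b else (text_b, text_a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Hypothetical candidates with made-up quality scores.
candidates = [
    ("Explains cat_features and the fit call.", 0.9),
    ("Restates the code without explanation.", 0.4),
    ("Off-topic answer.", 0.4),
]
prefs = build_pairwise_prefs("Explain this CatBoost snippet.", candidates)
print(len(prefs))
```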

Step 2: Train the Reward Model

python3 RL_train_reward_model_pairwise.py

Trains a reward model using the pairwise preference dataset created in Step 1. This reward model will later provide feedback during reinforcement learning to guide the main model’s behavior.
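A common objective for training on pairwise preferences is the Bradley-Terry loss, which pushes the reward of the preferred output above the reward of the rejected one. The sketch below computes that loss on plain floats; whether the training script uses exactly this objective is an assumption, and the reward values are made up.

```python
import math


def pairwise_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)) for one pair."""
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(reward_pairs):
    """Average the loss over (reward_chosen, reward_rejected) tuples."""
    return sum(pairwise_loss(c, r) for c, r in reward_pairs) / len(reward_pairs)


# Hypothetical scalar rewards for three (chosen, rejected) pairs.
batch = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
print(round(mean_pairwise_loss(batch), 4))
```

The loss is small when the chosen output already scores well above the rejected one and grows roughly linearly when the ordering is wrong.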

Step 3: Train the Language Model with PPO

wandb login
python3 RL_ppo_train_with_reward.py

Uses Proximal Policy Optimization (PPO) to fine-tune the language model, guided by the trained reward model. Running wandb login first ensures that training logs and metrics are recorded on Weights & Biases. Once training starts, the W&B dashboard shows live metrics, reward scores, and model performance.
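The core of PPO is its clipped surrogate objective, which limits how far one update can move the policy away from the policy that generated the samples. The sketch below evaluates that objective for a single sample. It is a conceptual illustration only: the clipping range of 0.2 is an assumed value, and how the training script derives advantages from the reward model is not covered.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sample (to be maximized).

    logp_new / logp_old are log-probabilities of the sampled output under
    the current policy and the policy that produced the sample.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound.
    return min(ratio * advantage, clipped_ratio * advantage)


# Probability ratio of 1.5 with a positive advantage: the gain is capped.
capped = ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0)
# Same ratio with a negative advantage: the penalty is not capped.
uncapped = ppo_clipped_objective(math.log(1.5), 0.0, advantage=-1.0)
print(capped, uncapped)
```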

✅ Following these three steps in order will allow you to:

Generate a preference dataset,
Train a reward model, and
Fine-tune your main model with reinforcement learning while monitoring progress live.

About the Dataset

We name our dataset CatBoostCH 1.0 (CatBoost Code for Housing Data). It consists of 1,000 synthetically generated Python code snippets demonstrating diverse uses of the CatBoost library applied to housing datasets similar to Zillow. The samples are created from multiple customizable templates that cover a variety of common data processing, model training, evaluation, and deployment scenarios. Each code snippet is paired with a concise description explaining its purpose and context. Designed as a prototype, the dataset offers a scalable foundation that can be expanded with additional templates and real-world data for broader applicability in AI/ML research. We have provided the scripts used to generate this dataset.
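Template-based generation of this kind can be sketched as follows. The templates, parameter values, and function name are hypothetical and far simpler than the provided generation scripts; the point is only to show how one template yields many code-description pairs.

```python
import random
from string import Template

# Hypothetical templates: each pairs a description with a code snippet.
TEMPLATES = [
    {
        "description": Template("Train a CatBoostRegressor to predict $target with $iterations iterations."),
        "code": Template(
            "from catboost import CatBoostRegressor\n"
            "model = CatBoostRegressor(iterations=$iterations, verbose=0)\n"
            "model.fit(df[features], df['$target'])\n"
        ),
    },
    {
        "description": Template("Report feature importances after fitting on $target."),
        "code": Template(
            "from catboost import CatBoostRegressor\n"
            "model = CatBoostRegressor(iterations=$iterations, verbose=0)\n"
            "model.fit(df[features], df['$target'])\n"
            "print(model.get_feature_importance())\n"
        ),
    },
]
TARGETS = ["sale_price", "rent_estimate"]  # hypothetical column names
ITERATIONS = [200, 500]


def generate_samples(n, seed=0):
    """Fill randomly chosen templates with randomly chosen parameters."""
    rng = random.Random(seed)  # seeded for reproducibility
    samples = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        params = {"target": rng.choice(TARGETS), "iterations": rng.choice(ITERATIONS)}
        samples.append({
            "description": template["description"].substitute(params),
            "code": template["code"].substitute(params),
        })
    return samples


samples = generate_samples(5)
print(samples[0]["description"])
```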

Dependencies

Make sure you have the following packages installed:

langchain
langchain-community
langchain-huggingface
transformers
faiss-cpu or faiss-gpu
flask
sentence-transformers
torch (required by transformers models)
python-dotenv (if you're using environment variables)

You can install them via pip:

pip install langchain langchain-community langchain-huggingface \
            transformers sentence-transformers faiss-cpu flask torch

About CatBoost

CatBoost is a gradient boosting library developed by Yandex [1] that is widely used for machine learning tasks such as classification, regression, and ranking. It is particularly known for handling categorical features efficiently without extensive preprocessing, which makes it convenient for real-world datasets across domains including healthcare, housing, and environmental science [2-7].

References

[1] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, “Catboost: unbiased boosting with categorical features,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18. Red Hook, NY, USA: Curran Associates Inc., 2018, p. 6639–6649.

[2] S. Shao, B. Zhao, X. Cui, Y. Dai, and B. Bao, “Housing rental information management and prediction system based on catboost algorithm - a case study of halifax region,” in Rough Sets: International Joint Conference, IJCRS 2024, Halifax, NS, Canada, May 17–20, 2024, Proceedings, Part II. Berlin, Heidelberg: Springer-Verlag, 2024, p. 230–246. [Online]. Available: https://doi.org/10.1007/978-3-031-65668-2_16

[3] C. Zou, “The house price prediction using machine learning algorithm: The case of jinan, china,” Highlights in Science, Engineering and Technology, vol. 39, pp. 327–333, 04 2023.

[4] J. T. Hancock and T. M. Khoshgoftaar, “Catboost for big data: an interdisciplinary review,” Journal of Big Data, vol. 7, no. 1, p. 94, 2020. [Online]. Available: https://doi.org/10.1186/s40537-020-00369-8

[5] X. Jin, W. Sun, Y. Li, Y. Su, L. Xu, and X. Zhu, “Use of catboost algorithm to identify the need for surgery in infants with necrotizing enterocolitis,” Frontiers in Pediatrics, vol. 13, p. 1465278, Feb. 2025. [Online]. Available: https://doi.org/10.3389/fped.2025.1465278

[6] M. Hamid, F. Hajjej, A. S. Alluhaidan, and N. W. bin Mannie, “Fine tuned catboost machine learning approach for early detection of cardiovascular disease through predictive modeling,” Scientific Reports, vol. 15, no. 1, p. 31199, Aug. 2025. [Online]. Available: https://doi.org/10.1038/s41598-025-13790-x

[7] Z. Guo, X. Wang, and L. Ge, “Classification prediction model of indoor pm2.5 concentration using catboost algorithm,” Frontiers in Built Environment, vol. 9, p. 1207193, 2023. [Online]. Available: https://doi.org/10.3389/fbuil.2023.1207193

