TokenSmith is a Retrieval-Augmented Generation (RAG) application that enables intelligent document search and question answering using local LLMs. Built with llama.cpp for efficient inference and FAISS for high-performance vector search, TokenSmith allows you to index PDF documents and chat with them using natural language queries.
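At its core this is the standard retrieve-then-generate loop. The sketch below illustrates that loop with `sentence-transformers` and FAISS directly; the chunk texts, prompt format, and variable names are illustrative, not TokenSmith's actual API:

```python
# Minimal retrieve-then-generate sketch (illustrative, not TokenSmith's API):
# embed chunks, search FAISS for the closest one, prompt a local model with it.
import faiss
from sentence_transformers import SentenceTransformer

chunks = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "llama.cpp runs GGUF-quantized language models locally on CPU or GPU.",
]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vectors = embedder.encode(chunks).astype("float32")

index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search over chunk vectors
index.add(vectors)

question = "How do I run models locally?"
query = embedder.encode([question]).astype("float32")
_, ids = index.search(query, 1)              # top-1 nearest chunk
context = chunks[ids[0][0]]

# The retrieved context is prepended to the question and handed to the local
# llama.cpp model; the exact prompt template depends on the model in use.
prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
```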
- 📚 PDF Document Processing: Extract and index content from PDF documents
- 🔍 Intelligent Retrieval: Fast semantic search using FAISS vector database
- 🤖 Local LLM Integration: Powered by llama.cpp for privacy-focused inference
- ⚡ Hardware Acceleration: Supports Metal (Apple Silicon), CUDA (NVIDIA), and CPU inference
- 🎯 Flexible Chunking: Token-based or character-based document segmentation
- 📊 Visualization Support: Optional indexing progress visualization
- 🛠️ Production-Ready: Conda-based environment management with automated builds
- 🔧 Configurable: YAML-based configuration system
- Python: 3.9+
- Conda/Miniconda: For environment management
- System Requirements:
  - macOS: Xcode Command Line Tools
  - Linux: GCC, make, cmake
  - Windows: Visual Studio Build Tools (for compilation)
```bash
git clone https://github.com/georgia-tech-db/TokenSmith.git
cd TokenSmith
make build
```
This will:

- Create a conda environment named `tokensmith`
- Install all Python dependencies
- Detect or build llama.cpp with platform-specific optimizations
- Install TokenSmith in development mode
```bash
conda activate tokensmith

# Place your PDF files in the data directory
mkdir -p data/chapters
cp your-documents.pdf data/chapters/
```
```bash
# Index with default settings
make run-index

# Or with custom parameters, e.g.:
make run-index ARGS="--pdf_range 1-10 --chunk_mode chars --visualize"
```
```bash
# Activate environment first (required for interactive mode)
conda activate tokensmith
python -m src.main chat
```
If you get an error about a missing model, download `qwen2.5-0.5b-instruct-q5_k_m.gguf` into your `llama.cpp/models` directory.
```bash
conda deactivate
```
TokenSmith uses YAML configuration files, resolved in the following priority order:

1. Command-line `--config` argument
2. User config (`~/.config/tokensmith/config.yaml`)
3. Default config (`config/config.yaml`)
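A minimal sketch of that lookup order follows; `resolve_config` is a hypothetical helper, and the actual loader may merge files rather than pick the first match:

```python
# Hypothetical config resolution following the priority order above.
from pathlib import Path
from typing import Optional

import yaml

def resolve_config(cli_config: Optional[str] = None) -> dict:
    candidates = [
        Path(cli_config) if cli_config else None,         # 1. --config argument
        Path.home() / ".config/tokensmith/config.yaml",   # 2. user config
        Path("config/config.yaml"),                       # 3. default config
    ]
    for path in candidates:
        if path is not None and path.exists():
            with open(path) as f:
                return yaml.safe_load(f)
    raise FileNotFoundError("no TokenSmith configuration file found")
```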
```yaml
# config/config.yaml
embed_model: "sentence-transformers/all-MiniLM-L6-v2"
top_k: 5
max_gen_tokens: 400
halo_mode: "none"
seg_filter: null

# Model settings
model_path: "models/qwen2.5-0.5b-instruct-q5_k_m.gguf"

# Indexing settings
chunk_mode: "tokens"  # or "chars"
chunk_tokens: 500
chunk_size_char: 20000
```
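The two `chunk_mode` values correspond to the two segmentation strategies. A rough illustration of the difference, using whitespace-separated words as a stand-in for model tokens (the real splitter counts actual tokens):

```python
# Rough illustration of the two chunking modes; whitespace words stand in for
# model tokens here, whereas the real splitter counts actual tokens.
from typing import List

def chunk_by_tokens(text: str, chunk_tokens: int = 500) -> List[str]:
    words = text.split()
    return [" ".join(words[i:i + chunk_tokens])
            for i in range(0, len(words), chunk_tokens)]

def chunk_by_chars(text: str, chunk_size_char: int = 20000) -> List[str]:
    return [text[i:i + chunk_size_char]
            for i in range(0, len(text), chunk_size_char)]
```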
```bash
make run-index
make run-index ARGS="--pdf_range <start_page_number>-<end_page_number> --chunk_mode <tokens_or_chars>"
make run-index ARGS="--keep_tables --visualize --chunk_tokens <number_of_chunk_tokens>"
make run-index ARGS="--pdf_dir <path_to_pdf> --index_prefix book_index --config <path_to_yaml_config_file>"
```

```bash
python -m src.main chat --config <path_to_yaml_config_file> --model_path <path_to_llm_model>
```
```bash
export LLAMA_CPP_BINARY=/usr/local/bin/llama-cli
make build
```
```bash
make update-env
make export-env
make show-deps
```
- `mode`: Operation mode (`index` or `chat`)
- `--config`: Configuration file path
- `--pdf_dir`: Directory containing PDF files
- `--index_prefix`: Prefix for index files
- `--model_path`: Path to GGUF model file
- `--pdf_range`: Process specific page range (e.g., "1-10")
- `--chunk_mode`: Chunking strategy (`tokens` or `chars`)
- `--chunk_tokens`: Tokens per chunk (default: 500)
- `--chunk_size_char`: Characters per chunk (default: 20000)
- `--keep_tables`: Preserve table formatting
- `--visualize`: Show indexing progress visualization
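For orientation, here is an argparse layout consistent with the flags above; it is illustrative only, as the real parser lives in `src/main.py` and may differ:

```python
# Hypothetical argparse wiring matching the documented flags.
import argparse

parser = argparse.ArgumentParser(prog="tokensmith")
parser.add_argument("mode", choices=["index", "chat"], help="operation mode")
parser.add_argument("--config", help="configuration file path")
parser.add_argument("--pdf_dir", help="directory containing PDF files")
parser.add_argument("--index_prefix", help="prefix for index files")
parser.add_argument("--model_path", help="path to GGUF model file")
parser.add_argument("--pdf_range", help='page range to process, e.g. "1-10"')
parser.add_argument("--chunk_mode", choices=["tokens", "chars"], default="tokens")
parser.add_argument("--chunk_tokens", type=int, default=500)
parser.add_argument("--chunk_size_char", type=int, default=20000)
parser.add_argument("--keep_tables", action="store_true")
parser.add_argument("--visualize", action="store_true")

args = parser.parse_args(["index", "--pdf_range", "1-10"])  # example invocation
```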
```bash
make help          # Show all available commands
make env           # Create conda environment
make build-llama   # Build llama.cpp from source
make install       # Install package in development mode
make build         # Full build process
make test          # Run tests
make clean         # Clean build artifacts
make show-deps     # Show installed packages
make update-env    # Update environment
make export-env    # Export environment with exact versions
```
```bash
# Add a new conda package
conda activate tokensmith
conda install new-package
```

To persist the change across rebuilds, add the package to `environment.yml`, then:

```bash
make update-env
```
TokenSmith includes a comprehensive benchmark testing framework that evaluates answer quality across multiple similarity metrics. The framework uses pytest with YAML-defined test cases for easy management and execution.
Test cases are defined in `tests/benchmarks.yaml`. Each benchmark includes a question, expected answer, keywords, and similarity threshold:
```yaml
# tests/benchmarks.yaml
benchmarks:
  - id: "unique_test_id"
    question: "Your question here?"
    expected_answer: "The expected answer that should be generated."
    keywords: ["key", "terms", "to", "match"]
    similarity_threshold: 0.65  # Minimum score to pass (0.0-1.0)

  - id: "ml_basics"
    question: "What is machine learning?"
    expected_answer: "Machine learning is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed."
    keywords: ["machine learning", "artificial intelligence", "data", "learn", "algorithm"]
    similarity_threshold: 0.6
```
Required fields:

- `id`: Unique identifier for the test case
- `question`: The question to ask TokenSmith
- `expected_answer`: Reference answer for comparison
- `keywords`: List of important terms to check for
- `similarity_threshold`: Minimum similarity score (0.6-0.8 recommended)
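A sketch of how such YAML cases can drive pytest parametrization; this is illustrative only (the real harness is `tests/test_benchmarks.py`, and `ask_tokensmith` is a hypothetical stand-in for the chat pipeline):

```python
# Illustrative pytest parametrization over the YAML benchmark cases.
import pytest
import yaml

with open("tests/benchmarks.yaml") as f:
    CASES = yaml.safe_load(f)["benchmarks"]

def ask_tokensmith(question: str) -> str:
    """Hypothetical stand-in for running the chat pipeline on one question."""
    raise NotImplementedError

@pytest.mark.parametrize("case", CASES, ids=[c["id"] for c in CASES])
def test_benchmark(case):
    answer = ask_tokensmith(case["question"])
    score = similarity_score(answer, case)  # weighted score, sketched below
    assert score >= case["similarity_threshold"]
```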
Scoring weights:
- Text similarity: 30%
- Semantic similarity: 50%
- Keyword matching: 20%
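A composite score with those weights might look like the following; this is hypothetical, and the real metrics (in particular the embedding-based semantic similarity) are more involved:

```python
# Hypothetical weighted score: 30% text overlap, 50% semantic, 20% keywords.
from difflib import SequenceMatcher

def similarity_score(answer: str, case: dict, semantic: float = 0.0) -> float:
    # Text similarity via difflib ratio (a cheap stand-in for the real metric).
    text_sim = SequenceMatcher(None, answer.lower(),
                               case["expected_answer"].lower()).ratio()
    # Fraction of expected keywords that appear in the answer.
    hits = sum(kw.lower() in answer.lower() for kw in case["keywords"])
    keyword_sim = hits / len(case["keywords"])
    # `semantic` would come from embedding cosine similarity in practice.
    return 0.3 * text_sim + 0.5 * semantic + 0.2 * keyword_sim
```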
```bash
make test-benchmarks
make test-benchmarks ARGS="--index-prefix my_test_index --timeout 600 --model_path models/custom-model.gguf"
make test-quick
```

Run a single benchmark by id:

```bash
conda activate tokensmith
pytest tests/test_benchmarks.py::test_tokensmith_benchmark -k "ml_basics" -v
```
Test results are automatically generated in:

- `tests/results/benchmark_results.json` - Detailed JSON data
- `tests/results/benchmark_summary.html` - Visual HTML report
- `tests/results/failed_tests.log` - Failed test details
Open the HTML report:

```bash
make show-test-results
```

Clear out previous results:

```bash
make clean-test-results
```