TokenSmith

TokenSmith is a Retrieval-Augmented Generation (RAG) application that enables intelligent document search and question answering using local LLMs. Built with llama.cpp for efficient inference and FAISS for high-performance vector search, TokenSmith allows you to index PDF documents and chat with them using natural language queries.
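Conceptually, the retrieve-then-generate loop works like this (a minimal sketch with toy vectors; TokenSmith's actual pipeline uses sentence-transformers embeddings and a FAISS index rather than hand-written cosine similarity):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings" standing in for sentence-transformer vectors.
chunks = {
    "FAISS enables fast vector search.": [0.9, 0.1, 0.0],
    "llama.cpp runs LLMs locally.":      [0.1, 0.9, 0.0],
    "PDFs are split into chunks.":       [0.0, 0.2, 0.9],
}

def retrieve(query_vec, top_k=1):
    """Return the top_k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]), reverse=True)
    return ranked[:top_k]
```

The retrieved chunks are then prepended to the user's question as context for the local LLM, which generates the final answer.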

🚀 Features

  • 📚 PDF Document Processing: Extract and index content from PDF documents
  • 🔍 Intelligent Retrieval: Fast semantic search using FAISS vector database
  • 🤖 Local LLM Integration: Powered by llama.cpp for privacy-focused inference
  • ⚡ Hardware Acceleration: Supports Metal (Apple Silicon), CUDA (NVIDIA), and CPU inference
  • 🎯 Flexible Chunking: Token-based or character-based document segmentation
  • 📊 Visualization Support: Optional indexing progress visualization
  • 🛠️ Production-Ready: Conda-based environment management with automated builds
  • 🔧 Configurable: YAML-based configuration system

📋 Requirements

  • Python: 3.9+
  • Conda/Miniconda: For environment management
  • System Requirements:
    • macOS: Xcode Command Line Tools
    • Linux: GCC, make, cmake
    • Windows: Visual Studio Build Tools (for compilation)

🚀 Quick Start

1. Clone the Repository

git clone https://github.com/georgia-tech-db/TokenSmith.git
cd TokenSmith

2. Build the Project

One-command setup: creates the conda environment, builds llama.cpp, and installs dependencies

make build

This will:

  • Create a conda environment named tokensmith
  • Install all Python dependencies
  • Detect or build llama.cpp with platform-specific optimizations
  • Install TokenSmith in development mode

3. Activate the Environment

conda activate tokensmith

4. Prepare Your Documents

Place your PDF files in the data directory

mkdir -p data/chapters
cp your-documents.pdf data/chapters/

5. Index Your Documents

Index with default settings

make run-index

Or with custom parameters, e.g.:

make run-index ARGS="--pdf_range 1-10 --chunk_mode chars --visualize"

6. Start Chatting

Activate environment first (required for interactive mode)

conda activate tokensmith
python -m src.main chat

If you get an error about a missing model, download qwen2.5-0.5b-instruct-q5_k_m.gguf into your llama.cpp/models directory.

7. Deactivate the Environment

conda deactivate

⚙️ Configuration

TokenSmith uses YAML configuration files with the following priority order:

  1. Command-line --config argument
  2. User config (~/.config/tokensmith/config.yaml)
  3. Default config (config/config.yaml)
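The lookup order above amounts to a first-match-wins path search. A minimal sketch of that logic (illustrative only; the actual resolution code lives in TokenSmith's source):

```python
from pathlib import Path
from typing import Optional

def resolve_config(cli_config: Optional[str] = None) -> Optional[Path]:
    """Return the first existing config file, in priority order."""
    candidates = []
    if cli_config:  # 1. --config command-line argument
        candidates.append(Path(cli_config))
    # 2. per-user config
    candidates.append(Path.home() / ".config/tokensmith/config.yaml")
    # 3. repository default
    candidates.append(Path("config/config.yaml"))
    for path in candidates:
        if path.is_file():
            return path
    return None
```

Because the CLI argument is checked first, an explicit `--config` always overrides both the user and default configs.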

Sample Configuration

# config/config.yaml

embed_model: "sentence-transformers/all-MiniLM-L6-v2"
top_k: 5
max_gen_tokens: 400
halo_mode: "none"
seg_filter: null
# Model settings
model_path: "models/qwen2.5-0.5b-instruct-q5_k_m.gguf"
# Indexing settings
chunk_mode: "tokens" # or "chars"
chunk_tokens: 500
chunk_size_char: 20000

🎮 Usage

Basic indexing

make run-index

Index specific PDF range

make run-index ARGS="--pdf_range <start_page_number>-<end_page_number> --chunk_mode <tokens_or_chars>"

Index with visualization and table preservation

make run-index ARGS="--keep_tables --visualize --chunk_tokens <number_of_chunk_tokens>"

Custom paths and settings

make run-index ARGS="--pdf_dir <path_to_pdf> --index_prefix book_index --config <path_to_yaml_config_file>"

Chat with custom settings

python -m src.main chat --config <path_to_yaml_config_file> --model_path <path_to_llm_model>

Build with existing llama.cpp installation

export LLAMA_CPP_BINARY=/usr/local/bin/llama-cli
make build

Update environment with new dependencies

make update-env

Export environment for sharing

make export-env

Show installed packages

make show-deps

📊 Command Line Arguments

Core Arguments

  • mode: Operation mode (index or chat)
  • --config: Configuration file path
  • --pdf_dir: Directory containing PDF files
  • --index_prefix: Prefix for index files
  • --model_path: Path to GGUF model file

Indexing Arguments

  • --pdf_range: Process specific page range (e.g., "1-10")
  • --chunk_mode: Chunking strategy (tokens or chars)
  • --chunk_tokens: Tokens per chunk (default: 500)
  • --chunk_size_char: Characters per chunk (default: 20000)
  • --keep_tables: Preserve table formatting
  • --visualize: Show indexing progress visualization
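The two chunking strategies can be sketched roughly as follows (a simplified illustration using whitespace tokens and the documented defaults; the real indexer tokenizes with the embedding model's tokenizer):

```python
def chunk_by_chars(text: str, chunk_size_char: int = 20000):
    """--chunk_mode chars: split text into fixed-size character windows."""
    return [text[i:i + chunk_size_char]
            for i in range(0, len(text), chunk_size_char)]

def chunk_by_tokens(text: str, chunk_tokens: int = 500):
    """--chunk_mode tokens: split text into fixed-size token windows
    (whitespace tokens here, stand-ins for real tokenizer output)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_tokens])
            for i in range(0, len(tokens), chunk_tokens)]
```

Token-based chunking keeps chunks aligned with what the embedding model actually sees, while character-based chunking is simpler and tokenizer-independent.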

🔨 Development

Available Make Targets

make help          # Show all available commands
make env           # Create conda environment
make build-llama   # Build llama.cpp from source
make install       # Install package in development mode
make build         # Full build process
make test          # Run tests
make clean         # Clean build artifacts
make show-deps     # Show installed packages
make update-env    # Update environment
make export-env    # Export environment with exact versions

Adding Dependencies

# Add new conda package
conda activate tokensmith
conda install new-package

To persist the dependency, add it to environment.yml, then:

make update-env

📊 Benchmark Testing Framework

TokenSmith includes a comprehensive benchmark testing framework that evaluates answer quality across multiple similarity metrics. The framework uses pytest with YAML-defined test cases for easy management and execution.

Adding New Test Cases

Test cases are defined in tests/benchmarks.yaml. Each benchmark includes a question, expected answer, keywords, and similarity threshold:

# tests/benchmarks.yaml
benchmarks:

  - id: "unique_test_id"
    question: "Your question here?"
    expected_answer: "The expected answer that should be generated."
    keywords: ["key", "terms", "to", "match"]
    similarity_threshold: 0.65 # Minimum score to pass (0.0-1.0)

  - id: "ml_basics"
    question: "What is machine learning?"
    expected_answer: "Machine learning is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed."
    keywords: ["machine learning", "artificial intelligence", "data", "learn", "algorithm"]
    similarity_threshold: 0.6
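Before running the suite, an entry like the one above can be sanity-checked for the required fields. The helper below is a hypothetical validator written for illustration, not part of TokenSmith itself:

```python
REQUIRED_FIELDS = {"id", "question", "expected_answer",
                   "keywords", "similarity_threshold"}

def validate_benchmark(case: dict) -> list:
    """Return a list of problems with a benchmark entry (empty if valid)."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - case.keys())]
    threshold = case.get("similarity_threshold")
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        problems.append("similarity_threshold must be in [0.0, 1.0]")
    return problems
```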

Test Case Configuration

Required fields:

  • id: Unique identifier for the test case
  • question: The question to ask TokenSmith
  • expected_answer: Reference answer for comparison
  • keywords: List of important terms to check for
  • similarity_threshold: Minimum similarity score (0.6-0.8 recommended)

Scoring weights:

  • Text similarity: 30%
  • Semantic similarity: 50%
  • Keyword matching: 20%
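The three weights above blend into a single pass/fail score. A minimal sketch (the metric implementations themselves live in the test framework; only the weighting is shown here):

```python
def combined_score(text_sim: float, semantic_sim: float,
                   keyword_match: float) -> float:
    """Blend the three metrics using the documented weights."""
    return 0.3 * text_sim + 0.5 * semantic_sim + 0.2 * keyword_match

def passes(score: float, threshold: float) -> bool:
    """A benchmark passes when its blended score meets the threshold."""
    return score >= threshold
```

Because semantic similarity carries half the weight, an answer phrased very differently from the reference can still pass if its meaning and keywords match.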

Run all benchmarks

make test-benchmarks

Run with custom parameters

make test-benchmarks ARGS="--index-prefix my_test_index --timeout 600 --model_path models/custom-model.gguf"

Skip slow tests (good for CI)

make test-quick

Run specific test by ID

conda activate tokensmith
pytest tests/test_benchmarks.py::test_tokensmith_benchmark -k "ml_basics" -v

Viewing Results

Test results are automatically generated in:

  • tests/results/benchmark_results.json - Detailed JSON data
  • tests/results/benchmark_summary.html - Visual HTML report
  • tests/results/failed_tests.log - Failed test details

Open HTML report:

make show-test-results

Clean previous results

make clean-test-results
