Skip to content

Fast BM25 full-text search for code repositories with MCP integration for AI coding agents.

License

Notifications You must be signed in to change notification settings

rhobimd-oss/shebe

Shebe

pipeline status coverage report Latest Release

Fast Code Search via BM25

Shebe is a fast and simple local code-search tool powered by BM25. No embeddings, No GPU, No cloud.

Research shows 70-85% of developer code search value comes from keyword-based queries. Developers search with exact terms they know: function names, API calls, error messages. BM25 excels at this.

Trade-offs:

  • Repositories must be cloned locally before indexing (no remote URL support)
  • No semantic similarity: "login" does not match "authenticate". However, BM25 supports multi-term queries without performance degradation - agents quickly learn to include synonyms (e.g., login OR authenticate OR sign-in). For true semantic search, pair with vector tools. See detailed analysis.

Capabilities:

  • 2ms query latency
  • 2k-12k files/sec indexing (6k files in 0.5s)
  • 200-700 tokens/query
  • Full UTF-8 support (emoji, CJK, special characters)
  • 14 MCP tools for coding agents (claude, codex etc) (reference)

Size:

  • ~10k lines of Rust source code (and another ~10k LoC test code).
  • 2 binaries (cli and mcp) each at ~8MB.

Positioning: Complements structural tools (Serena MCP) with content search. Coding agents learn tool selection quickly:

  • grep/ripgrep - Exact regex patterns, exhaustive matches, small codebases
  • Shebe - Ranked results, large codebases (1k+ files), polyglot search, boolean queries
  • Serena - Symbol refactoring, AST-aware edits, type-safe renaming

Alternatives: Cloud solutions like turbopuffer and nia come at a premium. Shebe is a free, local-only alternative. See WHY_SHEBE.md for benchmarks.

Table of Contents


Quick Start

1. Install

# Download the latest release
export SHEBE_VERSION=0.5.6-rc3
curl -LO "https://gitlab.com/api/v4/projects/75748935/packages/generic/shebe/${SHEBE_VERSION}/shebe-v${SHEBE_VERSION}-linux-x86_64.tar.gz"
curl -LO "https://gitlab.com/api/v4/projects/75748935/packages/generic/shebe/${SHEBE_VERSION}/shebe-v${SHEBE_VERSION}-linux-x86_64.tar.gz.sha256"

# Verify checksum
sha256sum -c shebe-v${SHEBE_VERSION}-linux-x86_64.tar.gz.sha256

# Extract and install
tar -xzf shebe-v${SHEBE_VERSION}-linux-x86_64.tar.gz
sudo mv shebe shebe-mcp /usr/local/bin/

# check version
shebe --version

2. Index a Repository

# Clone a test repository
git clone --depth 1 https://github.com/envoyproxy/envoy.git ~/envoy

# Index it (creates session "envoy-v1")
shebe index-repository ~/envoy envoy-v1
# Output: Indexed 8,234 files (12,847 chunks) in 2.1s

3. Search Code

# Search for access log formatting
shebe search-code envoy-v1 "accesslog format"
Results for "accesslog format" in envoy-v1 (top 10):

1. source/extensions/access_loggers/common/access_log_base.h [0.847]
   class AccessLogBase : public AccessLog::Instance {
     void formatAccessLog(...);

2. source/common/formatter/substitution_formatter.cc [0.823]
   SubstitutionFormatter::format(const StreamInfo& info) {

4. Find References

# Find all references to SubstitutionFormatter
shebe find-references envoy-v1 SubstitutionFormatter --symbol-type type
References to "SubstitutionFormatter" (type) - 23 found:

HIGH CONFIDENCE (18):
  source/common/formatter/substitution_formatter.h:45
    class SubstitutionFormatter : public Formatter {

  source/extensions/access_loggers/file/file_access_log.cc:28
    std::unique_ptr<SubstitutionFormatter> formatter_;
  ...

For detailed setup, see INSTALLATION.md.


Common Tasks

Quick links to accomplish specific goals:

Task Tool Guide
Rename a symbol safely find_references Reference
Search polyglot codebase search_code Reference
Explore unfamiliar repo index_repository + search_code Quick Start
Find files by pattern find_file Reference
View file with context read_file or preview_chunk Reference
Update stale index reindex_session Reference

Configuration

Quick Reference

Variable Default Description
SHEBE_INDEX_DIR ~/.local/state/shebe Session storage location
SHEBE_CHUNK_SIZE 512 Characters per chunk (100-2000)
SHEBE_OVERLAP 64 Overlap between chunks
SHEBE_DEFAULT_K 10 Default search results count
SHEBE_MAX_K 100 Maximum search results allowed

Configuration File

Create shebe.toml in your working directory or ~/.config/shebe/shebe.toml:

[indexing]
chunk_size = 512
overlap = 64
max_file_size = 10485760  # 10MB

[search]
default_k = 10
max_k = 100

See CONFIGURATION.md for complete reference.


Documentation

Getting Started

Reference

Development


Performance

Validated on Istio (5,605 files, Go-heavy) and OpenEMR (6,364 files, PHP polyglot):

Metric Result
Query latency 2ms (consistent across all query types)
Indexing (Istio) 11,210 files/sec (0.5s for 5,605 files)
Indexing (OpenEMR) 1,928 files/sec (3.3s for 6,364 files)
Token usage 210-650 tokens/query
Polyglot coverage 11 file types in single query

See docs/Performance.md for detailed benchmarks.


Architecture

See ARCHITECTURE.md for developer guide.


Troubleshooting

Issue Cause Solution
"Session not found" Session doesn't exist or typo Run list_sessions to see available sessions
"Schema version mismatch" Session from older Shebe version Run upgrade_session to migrate
Slow indexing Disk I/O or large files Exclude node_modules/, target/, check disk
No search results Empty session or wrong query Verify with get_session_info, check query syntax
"File not found" in read_file File deleted since indexing Run reindex_session to update
High token usage Too many results Reduce k parameter (default: 10)

For detailed troubleshooting, see docs/guides/mcp-setup-guide.md.


Project Status

Version: v0.5.X
Status: Release Candidate
Testing: 76% coverage
Next: Pagination for list_dir and read_file when more than 500 files match a search term

See CHANGELOG.md for version history.


License

See LICENSE.


Contributing

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.

About

Fast BM25 full-text search for code repositories with MCP integration for AI coding agents.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

No packages published