MCP Lucene Server

A Model Context Protocol (MCP) server that exposes Apache Lucene full-text search capabilities with automatic document crawling and indexing. It lets AI assistants (like Claude) search, index, and manage document collections through a conversational interface, without requiring technical knowledge of Lucene or search engines. The server uses STDIO transport for communication and can be integrated with Claude Desktop or any other MCP-compatible client.

Features

✨ Automatic Document Crawling

  • Automatically indexes PDFs, Microsoft Office, and OpenOffice documents
  • Multi-threaded crawling for fast indexing
  • Real-time directory monitoring for automatic updates
  • Incremental indexing with full reconciliation (skips unchanged files, removes orphans)

🔍 Powerful Search

  • Full Lucene query syntax support (wildcards, boolean operators, phrase queries)
  • Field-specific filtering (by author, language, file type, etc.)
  • Structured passages with quality metadata for LLM consumption
  • Paginated results with filter suggestions

📄 Rich Metadata Extraction

  • Automatic language detection
  • Author, title, creation date extraction
  • File type and size information
  • SHA-256 content hashing for change detection

⚡ Performance Optimized

  • Batch processing for efficient indexing
  • NRT (Near Real-Time) search with dynamic optimization
  • Configurable thread pools for parallel processing
  • Progress notifications during bulk operations

🔧 Easy Integration

  • STDIO transport for seamless MCP client integration
  • Comprehensive MCP tools for search and crawler control
  • Flexible configuration via YAML
  • Cross-platform notifications (macOS Notification Center, Windows Toast, Linux notify-send)

Quick Start

Get up and running with MCP Lucene Server in three steps.

Prerequisites

  • Java 21 or later - Required to run the server
  • Maven 3.9+ (only if building from source)

Step 1: Get the Server

Option A: Download Pre-built JAR (Recommended)

  1. Go to the Actions tab
  2. Click on the most recent successful workflow run
  3. Scroll down to "Artifacts" and download luceneserver-X.X.X-SNAPSHOT
  4. Extract the ZIP file to get the JAR

For tagged releases, you can also download from the Releases page.

Option B: Build from Source

./mvnw clean package -DskipTests

This creates an executable JAR at target/luceneserver-0.0.1-SNAPSHOT.jar.

Step 2: Configure Claude Desktop

Locate your Claude Desktop configuration file:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json

Add the Lucene MCP server to the mcpServers section:

{
  "mcpServers": {
    "lucene-search": {
      "command": "java",
      "args": [
        "--enable-native-access=ALL-UNNAMED",
        "-Xmx2g",
        "-Dspring.profiles.active=deployed",
        "-jar",
        "/absolute/path/to/luceneserver-0.0.1-SNAPSHOT.jar"
      ]
    }
  }
}

Important: Replace /absolute/path/to/luceneserver-0.0.1-SNAPSHOT.jar with the actual absolute path to your JAR file.

The -Dspring.profiles.active=deployed flag is required for clean STDIO communication (disables console logging, web server, and startup banner).

Step 3: Start Using It

  1. Restart Claude Desktop to load the new configuration
  2. Verify the server is running in Claude Desktop's developer settings
  3. Tell Claude to add your documents:
"Add /Users/yourname/Documents as a crawlable directory and start crawling"

That's it! The configuration is saved to ~/.mcplucene/config.yaml and persists across restarts. You can now search your documents through Claude.

Example searches:

  • "Search for machine learning papers"
  • "Find all PDFs by John Doe"
  • "What documents mention quarterly reports?"

Configuration Options

Note: The Quick Start above uses zero-configuration. This section covers advanced customization options.

The server can be configured via environment variables and application.yaml:

Logging Profiles

The server supports two logging profiles, selected via the same spring.profiles.active system property that Spring Boot uses (kept for backwards compatibility):

  • default -- Development in IDE; console logging enabled
  • deployed -- Production/Claude Desktop; file logging only

Default profile (no profile specified):

  • Full logging enabled to console
  • Suitable for debugging and development

Deployed profile (-Dspring.profiles.active=deployed):

  • Console logging disabled (required for STDIO transport)
  • File logging enabled (~/.mcplucene/log/mcplucene.log)
  • Used when running under Claude Desktop or other MCP clients

Environment Variables

  • LUCENE_INDEX_PATH (default: ${user.home}/.mcplucene/luceneindex) -- Path to the Lucene index directory
  • LUCENE_CRAWLER_DIRECTORIES (default: none) -- Comma-separated list of directories to crawl (overrides config file)
  • SPRING_PROFILES_ACTIVE (default: none) -- Set to deployed for production use

Note on LUCENE_CRAWLER_DIRECTORIES: When this environment variable is set, it takes precedence over ~/.mcplucene/config.yaml and application.yaml. The MCP configuration tools (addCrawlableDirectory, removeCrawlableDirectory) will refuse to modify configuration while this override is active. To use runtime configuration, remove this environment variable.
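
For example, when launching the server manually you can set these variables on the command line (paths are illustrative):

# Use a custom index location and a fixed set of crawl directories
export LUCENE_INDEX_PATH=/data/search/lucene-index
export LUCENE_CRAWLER_DIRECTORIES="/data/docs,/data/archive"
java --enable-native-access=ALL-UNNAMED -Xmx2g -Dspring.profiles.active=deployed -jar luceneserver-0.0.1-SNAPSHOT.jar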

Document Crawler Configuration

The crawler directories can be configured in three ways, with the following priority (highest to lowest):

  1. Environment Variable: LUCENE_CRAWLER_DIRECTORIES (comma-separated paths)
  2. Runtime Configuration: ~/.mcplucene/config.yaml (managed via MCP tools)
  3. Application Default: src/main/resources/application.yaml

Runtime Configuration via MCP Tools (Recommended)

The server provides MCP tools to manage crawlable directories at runtime without editing configuration files:

listCrawlableDirectories - List all configured directories

Ask Claude: "What directories are being crawled?"

addCrawlableDirectory - Add a new directory to crawl

Ask Claude: "Add /Users/yourname/Documents as a crawlable directory"
Ask Claude: "Add /path/to/folder and start crawling it immediately"

removeCrawlableDirectory - Remove a directory from crawling

Ask Claude: "Stop crawling /Users/yourname/Downloads"

Benefits of Runtime Configuration:

  • No need to rebuild the JAR or restart the server
  • Configuration persists across restarts in ~/.mcplucene/config.yaml
  • Easy to distribute pre-built JARs
  • Conversational interface via Claude

Configuration File Location:

~/.mcplucene/config.yaml

Example config.yaml:

lucene:
  crawler:
    directories:
      - /Users/yourname/Documents
      - /Users/yourname/Downloads

Static Configuration via application.yaml

Configure the document crawler in src/main/resources/application.yaml:

lucene:
  index:
    path: ${LUCENE_INDEX_PATH:./lucene-index}
  crawler:
    # Directories to crawl and index
    directories:
      - "/path/to/your/documents"
      - "/another/path/to/index"

    # File patterns to include
    include-patterns:
      - "*.pdf"
      - "*.doc"
      - "*.docx"
      - "*.odt"
      - "*.ppt"
      - "*.pptx"
      - "*.xls"
      - "*.xlsx"
      - "*.ods"
      - "*.txt"
      - "*.eml"
      - "*.msg"
      - "*.md"
      - "*.rst"
      - "*.html"
      - "*.htm"
      - "*.rtf"
      - "*.epub" 

    # File patterns to exclude
    exclude-patterns:
      - "**/node_modules/**"
      - "**/.git/**"
      - "**/target/**"
      - "**/build/**"

    # Performance settings
    thread-pool-size: 4                    # Parallel crawling threads
    batch-size: 100                        # Documents per batch
    batch-timeout-ms: 5000                 # Batch processing timeout

    # Directory watching
    watch-enabled: true                    # Monitor directories for changes
    watch-poll-interval-ms: 2000          # Watch polling interval

    # NRT optimization
    bulk-index-threshold: 1000            # Files before NRT slowdown
    slow-nrt-refresh-interval-ms: 5000    # NRT interval during bulk indexing

    # Content extraction
    max-content-length: -1                 # -1 = unlimited, or max characters
    extract-metadata: true                 # Extract author, title, etc.
    detect-language: true                  # Auto-detect document language

    # Auto-crawl
    crawl-on-startup: true                 # Start crawling on server startup

    # Progress notifications
    progress-notification-files: 100       # Notify every N files
    progress-notification-interval-ms: 30000  # Or every N milliseconds

    # Incremental indexing
    reconciliation-enabled: true           # Skip unchanged files, remove orphans (default: true)

    # Search passages
    max-passages: 3                        # Max highlighted passages per search result (default: 3)
    max-passage-char-length: 200           # Max character length per passage; longer ones are truncated (default: 200, 0 = no limit)

Supported File Formats:

  • PDF documents (.pdf)
  • Microsoft Office: Word (.doc, .docx), Excel (.xls, .xlsx), PowerPoint (.ppt, .pptx)
  • OpenOffice/LibreOffice: Writer (.odt), Calc (.ods), Impress (.odp)
  • Plain text files (.txt)
  • Email: Outlook (.msg), EML (.eml)
  • Markup: Markdown (.md), reStructuredText (.rst), HTML (.html, .htm)
  • Rich Text Format (.rtf)
  • E-books: EPUB (.epub)

Complete Configuration Example:

lucene:
  index:
    path: /Users/yourname/lucene-index
  crawler:
    # Add your document directories here
    directories:
      - "/Users/yourname/Documents"
      - "/Users/yourname/Downloads"
      - "/Volumes/ExternalDrive/Archive"

    # Include only these file types
    include-patterns:
      - "*.pdf"
      - "*.docx"
      - "*.xlsx"

    # Exclude these directories
    exclude-patterns:
      - "**/node_modules/**"
      - "**/.git/**"

    # Performance tuning
    thread-pool-size: 8              # Use more threads for faster indexing
    batch-size: 200                  # Larger batches for better throughput

    # Auto-start crawler
    crawl-on-startup: true

    # Real-time monitoring
    watch-enabled: true

    # No content limit (index full documents)
    max-content-length: -1

Available MCP Tools

indexAdmin

An MCP App that provides a visual user interface for index maintenance tasks directly inside your MCP client (e.g. Claude Desktop). When invoked, the app is rendered inline in the conversation and offers one-click access to administrative operations without requiring manual tool calls.

Index Administration App

Available actions:

  • Unlock Index -- Removes a stale write.lock file after an unclean shutdown (equivalent to calling unlockIndex with confirm=true)
  • Optimize Index -- Merges index segments for improved search performance (equivalent to calling optimizeIndex)
  • Purge Index -- Deletes all documents from the index (equivalent to calling purgeIndex with confirm=true)

Each action shows inline status feedback (success, error, or progress details) directly in the app UI.

Example:

Ask Claude: "Can you invoke the indexAdmin tool please?"

search

Search the Lucene full-text index using lexical matching (exact word forms only).

Parameters:

  • query (required): The search query using Lucene query syntax
  • filterField (optional): Field name to filter results (use facets to discover available values)
  • filterValue (optional): Value for the filter field
  • page (optional): Page number, 0-based (default: 0)
  • pageSize (optional): Results per page (default: 10, max: 100)
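
For reference, a typical tool invocation might pass parameters like these (values are illustrative):

query: "(invoice OR bill*)"
filterField: "language"
filterValue: "en"
page: 0
pageSize: 20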

🤖 AI-Powered Synonym Expansion:

This server is designed to work with AI assistants like Claude. Instead of using traditional Lucene synonym files, the AI generates context-appropriate synonyms automatically by constructing OR queries.

Why this is better than traditional synonyms:

  • Context-aware: The AI understands your intent and picks relevant synonyms (e.g., "contract" in legal context vs. "contract" in construction)
  • No maintenance: No need to maintain static synonym configuration files
  • Domain-adaptive: Works across legal, technical, medical, or casual language automatically
  • Multilingual: Generates synonyms in any language without configuration

When you ask Claude to "find documents about cars", it automatically searches for (car OR automobile OR vehicle) - giving you better results than a static synonym list.

⚠️ Technical Details (Lexical Matching):

The index uses a custom UnicodeNormalizingAnalyzer built on Lucene's ICUFoldingFilter, which provides:

  • ✅ Tokenization and lowercasing
  • ✅ Unicode normalization (NFKC) -- full-width characters are mapped to their standard equivalents
  • ✅ Diacritic folding -- accented characters are mapped to their ASCII base forms (e.g., "ä" → "a", "ö" → "o", "ü" → "u", "ñ" → "n")
  • ✅ Ligature expansion -- PDF ligatures are expanded correctly (e.g., the "fi" ligature → "fi", the "fl" ligature → "fl")
  • ✅ Efficient leading wildcard queries -- a content_reversed field stores reversed tokens, so *vertrag is internally rewritten as a trailing wildcard on the reversed field (gartrev*), avoiding costly full-index scans
  • ❌ No automatic synonym expansion at the index level
  • ❌ No phonetic matching (e.g., "Smith" won't match "Smyth")
  • ❌ No stemming (e.g., "running" won't match "run")

The AI assistant compensates for the remaining limitations by expanding your queries intelligently.

💡 Best Practices for Better Results:

  1. Generate Synonyms Yourself: Use OR to combine related terms:

    • Instead of: contract
    • Use: (contract OR agreement OR deal)
  2. Use Wildcards for Variations: Handle different word forms:

    • Instead of: contract
    • Use: contract* (matches contracts, contracting, contracted)
  3. Leverage Facets: Use the returned facet values to discover exact terms in the index:

    • Check facets.author to find exact author names
    • Check facets.language to see available languages
    • Use these exact values for filtering
  4. Combine Techniques:

    (contract* OR agreement*) AND (sign* OR execut*) AND author:"John Doe"
    

Supported Query Syntax:

  • Simple terms: hello world (implicit AND between terms)
  • Phrase queries: "exact phrase" (preserves word order)
  • Boolean operators: term1 AND term2, term1 OR term2, NOT term
  • Trailing wildcard: contract* matches contracts, contracting, contracted
  • Leading wildcard: *vertrag efficiently finds Arbeitsvertrag, Kaufvertrag (optimised via reverse token field)
  • Infix wildcard: *vertrag* finds both Vertragsbedingungen and Arbeitsvertrag
  • Single char wildcard: te?t matches test, text
  • Fuzzy search: term~2 finds terms within Levenshtein edit distance 2 (default: 2)
  • Proximity search: "term1 term2"~5 finds terms within 5 words of each other
  • Field-specific search: title:hello content:world
  • Grouping: (contract OR agreement) AND signed
  • Range queries: modified_date:[1609459200000 TO 1640995200000] (timestamps in milliseconds)

German Compound Word Search:

German compound words (e.g., "Arbeitsvertrag", "Vertragsbedingungen") can be searched effectively using wildcards:

  • *vertrag -- finds words ending in "vertrag" (Arbeitsvertrag, Kaufvertrag, Mietvertrag)
  • vertrag* -- finds words starting with "vertrag" (Vertragsbedingungen, Vertragsklausel)
  • *vertrag* -- finds words containing "vertrag" anywhere (combines both)

Leading wildcard queries are optimised internally using a reverse token index (content_reversed field), so they execute as fast as trailing wildcards.
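
For example, a search for German contract documents could combine a compound-word wildcard with a language filter (values are illustrative):

query: "*vertrag* AND (arbeit* OR kauf* OR miet*)"
filterField: "language"
filterValue: "de"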

Returns:

  • Paginated document results, each containing a passages array with highlighted text and quality metadata
  • Document-level relevance scores
  • Facets with actual values and counts from the result set
  • Search execution time in milliseconds (searchTimeMs)

getIndexStats

Get statistics about the Lucene index.

Returns:

  • documentCount: Total number of documents in the index
  • indexPath: Path to the index directory
  • schemaVersion: Current index schema version
  • softwareVersion: Server software version
  • buildTimestamp: Server build timestamp
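
Example response (values and formats are illustrative):

{
  "documentCount": 1234,
  "indexPath": "/Users/yourname/.mcplucene/luceneindex",
  "schemaVersion": 3,
  "softwareVersion": "0.0.1-SNAPSHOT",
  "buildTimestamp": "2024-01-29T12:00:00Z"
}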

startCrawl

Start crawling configured directories to index documents.

Parameters:

  • fullReindex (optional): If true, clears the index before crawling (default: false). When false and reconciliation-enabled is true, an incremental crawl is performed instead.

Features:

  • Automatically extracts content from PDFs, Office documents, and OpenOffice files
  • Detects document language
  • Extracts metadata (author, title, creation date, etc.)
  • Multi-threaded processing for fast indexing
  • Progress notifications during crawling
  • Incremental mode (default): Only new or modified files are indexed; deleted files are removed from the index automatically. Falls back to a full crawl if reconciliation encounters an error.
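
Example:

Ask Claude: "Start crawling my documents"
Ask Claude: "Reindex all documents from scratch"

The first request performs the default incremental crawl (fullReindex=false); the second maps to startCrawl(fullReindex=true) and clears the index before re-crawling.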

getCrawlerStats

Get real-time statistics about the crawler progress.

Returns:

  • filesFound: Total files discovered
  • filesProcessed: Files processed so far
  • filesIndexed: Files successfully indexed
  • filesFailed: Files that failed to process
  • bytesProcessed: Total bytes processed
  • filesPerSecond: Processing throughput
  • megabytesPerSecond: Data throughput
  • elapsedTimeMs: Time elapsed since crawl started
  • perDirectoryStats: Statistics breakdown per directory
  • orphansDeleted: Number of index entries removed because the file no longer exists on disk (incremental mode)
  • filesSkippedUnchanged: Number of files skipped because they were not modified since the last crawl (incremental mode)
  • reconciliationTimeMs: Time spent comparing the index against the filesystem (incremental mode)
  • crawlMode: Either "full" or "incremental"
  • lastCrawlCompletionTimeMs: Unix timestamp (ms) of the last successful crawl completion (null if no previous crawl)
  • lastCrawlDocumentCount: Number of documents in the index after the last successful crawl (null if no previous crawl)
  • lastCrawlMode: Mode of the last crawl - "full" or "incremental" (null if no previous crawl)
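
Example response (abridged; values are illustrative and perDirectoryStats is omitted):

{
  "filesFound": 5000,
  "filesProcessed": 1234,
  "filesIndexed": 1200,
  "filesFailed": 34,
  "bytesProcessed": 734003200,
  "filesPerSecond": 85.0,
  "megabytesPerSecond": 48.2,
  "elapsedTimeMs": 14500,
  "orphansDeleted": 12,
  "filesSkippedUnchanged": 3766,
  "reconciliationTimeMs": 850,
  "crawlMode": "incremental",
  "lastCrawlCompletionTimeMs": 1706540400000,
  "lastCrawlDocumentCount": 4988,
  "lastCrawlMode": "incremental"
}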

listIndexedFields

List all field names present in the Lucene index.

Returns:

  • fields: Array of field names available for searching and filtering

Example response:

{
  "success": true,
  "fields": [
    "file_name",
    "file_path",
    "title",
    "author",
    "content",
    "language",
    "file_extension",
    "file_type",
    "created_date",
    "modified_date"
  ]
}

pauseCrawler

Pause an ongoing crawl operation. The crawler can be resumed later with resumeCrawler.

resumeCrawler

Resume a paused crawl operation.

getCrawlerStatus

Get the current state of the crawler.

Returns:

  • state: One of IDLE, CRAWLING, PAUSED, or WATCHING

listCrawlableDirectories

List all configured crawlable directories.

Returns:

  • success: Boolean indicating operation success
  • directories: List of absolute directory paths currently configured
  • totalDirectories: Count of configured directories
  • configPath: Path to the configuration file (~/.mcplucene/config.yaml)
  • environmentOverride: Boolean indicating if LUCENE_CRAWLER_DIRECTORIES env var is set

Example response:

{
  "success": true,
  "directories": [
    "/Users/yourname/Documents",
    "/Users/yourname/Downloads"
  ],
  "totalDirectories": 2,
  "configPath": "/Users/yourname/.mcplucene/config.yaml",
  "environmentOverride": false
}

addCrawlableDirectory

Add a directory to the crawler configuration.

Parameters:

  • path (required): Absolute path to the directory to crawl
  • crawlNow (optional): If true, immediately starts crawling the new directory (default: false)

Returns:

  • success: Boolean indicating operation success
  • message: Confirmation message
  • totalDirectories: Updated count of configured directories
  • directories: Updated list of all directories
  • crawlStarted (optional): Present if crawlNow=true, indicates crawl was triggered

Validation:

  • Directory must exist and be accessible
  • Path must be a directory (not a file)
  • Duplicate directories are prevented
  • Fails if LUCENE_CRAWLER_DIRECTORIES environment variable is set

Example:

Ask Claude: "Add /Users/yourname/Documents as a crawlable directory"
Ask Claude: "Add /path/to/research and crawl it now"

Configuration Persistence: The directory is immediately saved to ~/.mcplucene/config.yaml and will be automatically crawled on future server restarts.
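
Example response (field values are illustrative):

{
  "success": true,
  "message": "Directory added: /Users/yourname/Research",
  "totalDirectories": 3,
  "directories": [
    "/Users/yourname/Documents",
    "/Users/yourname/Downloads",
    "/Users/yourname/Research"
  ],
  "crawlStarted": true
}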

removeCrawlableDirectory

Remove a directory from the crawler configuration.

Parameters:

  • path (required): Absolute path to the directory to remove

Returns:

  • success: Boolean indicating operation success
  • message: Confirmation message
  • totalDirectories: Updated count of configured directories
  • directories: Updated list of remaining directories

Important Notes:

  • This does NOT remove already-indexed documents from the removed directory
  • To remove indexed documents, use startCrawl(fullReindex=true) after removing directories
  • Fails if LUCENE_CRAWLER_DIRECTORIES environment variable is set
  • The directory must exist in the current configuration

Example:

Ask Claude: "Stop crawling /Users/yourname/Downloads"
Ask Claude: "Remove /path/to/old/archive from the crawler"

getDocumentDetails

Retrieve all stored fields and full content of a document from the Lucene index by its file path. This tool retrieves document details directly from the index without requiring filesystem access - useful for examining indexed content even if the original file has been moved or deleted.

Parameters:

  • filePath (required): Absolute path to the file (must match exactly the file_path stored in the index)

Returns:

  • success: Boolean indicating operation success
  • document: Object containing all stored fields:
    • file_path: Full path to the file
    • file_name: Name of the file
    • file_extension: File extension (e.g., pdf, docx)
    • file_type: MIME type
    • file_size: File size in bytes
    • title: Document title
    • author: Author name
    • creator: Creator application
    • subject: Document subject
    • keywords: Document keywords/tags
    • language: Detected language code
    • created_date: Creation timestamp
    • modified_date: Modification timestamp
    • indexed_date: Indexing timestamp
    • content_hash: SHA-256 hash of content
    • content: Full extracted text content (limited to 500KB)
    • contentTruncated: Boolean indicating if content was truncated
    • originalContentLength: Original content length (only present if truncated)

Content Size Limit: The content field is limited to 500,000 characters (500KB) to ensure the response stays safely below the 1MB MCP response limit. Check the contentTruncated field to determine if the full content was returned.

Example:

Ask Claude: "Show me the indexed details of /Users/yourname/Documents/report.pdf"
Ask Claude: "What content was extracted from /path/to/contract.docx?"

Example response:

{
  "success": true,
  "document": {
    "file_path": "/Users/yourname/Documents/report.pdf",
    "file_name": "report.pdf",
    "file_extension": "pdf",
    "file_type": "application/pdf",
    "file_size": "125432",
    "title": "Annual Report 2024",
    "author": "John Doe",
    "language": "en",
    "indexed_date": "1706540400000",
    "content_hash": "a1b2c3d4...",
    "content": "This is the full extracted text content of the document...",
    "contentTruncated": false
  }
}

unlockIndex

Remove the write.lock file from the Lucene index directory. This is a dangerous recovery operation - only use if you are certain no other process is using the index.

Parameters:

  • confirm (required): Must be set to true to proceed. This is a safety measure.

Returns:

  • success: Boolean indicating operation success
  • message: Confirmation message
  • lockFileExisted: Boolean indicating if a lock file was present
  • lockFilePath: Path to the lock file

When to use: Use this tool when the server fails to start with a LockObtainFailedException after an unclean shutdown. See Troubleshooting for details.

Example:

Ask Claude: "Unlock the Lucene index - I confirm this is safe"

⚠️ Warning: Unlocking an index that is actively being written to by another process can cause data corruption. Only use this when you are certain the lock is stale.

optimizeIndex

Optimize the Lucene index by merging segments. This is a long-running operation that runs in the background.

Parameters:

  • maxSegments (optional): Target number of segments after optimization (default: 1 for maximum optimization)

Returns:

  • success: Boolean indicating the operation was started
  • operationId: UUID to track the operation
  • targetSegments: The target segment count
  • currentSegments: The current segment count before optimization
  • message: Status message

Behavior:

  • Returns immediately after starting the background operation
  • Use getIndexAdminStatus to poll for progress
  • Cannot run while the crawler is actively crawling
  • Only one admin operation can run at a time

Example:

Ask Claude: "Optimize the search index"
Ask Claude: "What's the status of the optimization?"

Performance Notes:

  • Optimization improves search performance by reducing the number of segments
  • Temporarily increases disk usage during the merge
  • For large indices, this can take several minutes to hours

purgeIndex

Delete all documents from the Lucene index. This is a destructive, long-running operation that runs in the background.

Parameters:

  • confirm (required): Must be set to true to proceed. This is a safety measure.
  • fullPurge (optional): If true, also deletes index files and reinitializes (default: false)

Returns:

  • success: Boolean indicating the operation was started
  • operationId: UUID to track the operation
  • documentsDeleted: Number of documents that will be deleted
  • fullPurge: Whether a full purge was requested
  • message: Status message

Behavior:

  • Returns immediately after starting the background operation
  • Use getIndexAdminStatus to poll for progress
  • Only one admin operation can run at a time

Purge Modes:

  • Standard purge (fullPurge=false): Deletes all documents but keeps index files. Disk space is reclaimed gradually during future merges.
  • Full purge (fullPurge=true): Deletes all documents AND index files, then reinitializes an empty index. Disk space is reclaimed immediately.

Example:

Ask Claude: "Delete all documents from the index - I confirm this"
Ask Claude: "Purge the index completely and reclaim disk space - I confirm this"

⚠️ Warning: This operation cannot be undone. All indexed documents will be permanently deleted. You will need to re-crawl directories to repopulate the index.

getIndexAdminStatus

Get the status of long-running index administration operations (optimize, purge).

Parameters: None

Returns:

  • success: Boolean indicating the status was retrieved
  • state: Current state: IDLE, OPTIMIZING, PURGING, COMPLETED, or FAILED
  • operationId: UUID of the current/last operation
  • progressPercent: Progress percentage (0-100)
  • progressMessage: Human-readable progress message
  • elapsedTimeMs: Time elapsed since operation started (in milliseconds)
  • lastOperationResult: Result message from the last completed operation

Example response (during optimization):

{
  "success": true,
  "state": "OPTIMIZING",
  "operationId": "a1b2c3d4-...",
  "progressPercent": 45,
  "progressMessage": "Merging segments...",
  "elapsedTimeMs": 12500,
  "lastOperationResult": null
}

Example response (idle after completion):

{
  "success": true,
  "state": "IDLE",
  "operationId": null,
  "progressPercent": null,
  "progressMessage": "No admin operation running",
  "elapsedTimeMs": null,
  "lastOperationResult": "Optimization completed successfully. Merged to 1 segment(s)."
}

Example:

Ask Claude: "What's the status of the index optimization?"
Ask Claude: "Is the purge operation complete?"

Index Field Schema

When documents are indexed by the crawler, the following fields are automatically extracted and stored:

Content Fields

  • content: Full text content of the document (analyzed, searchable)
  • content_reversed: Reversed tokens of the content (analyzed with ReverseUnicodeNormalizingAnalyzer, not stored). Used internally for efficient leading wildcard queries -- not directly searchable by users.
  • passages: Array of highlighted passages returned in search results (see Search Response Format below)

File Information

  • file_path: Full path to the file (unique ID)
  • file_name: Name of the file
  • file_extension: File extension (e.g., pdf, docx)
  • file_type: MIME type (e.g., application/pdf)
  • file_size: File size in bytes

Document Metadata

  • title: Document title (extracted from metadata)
  • author: Author name
  • creator: Creator/application that created the document
  • subject: Document subject
  • keywords: Document keywords/tags

Language & Dates

  • language: Auto-detected language code (e.g., en, de, fr)
  • created_date: File creation timestamp
  • modified_date: File modification timestamp
  • indexed_date: When the document was indexed

Technical

  • content_hash: SHA-256 hash for change detection
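
These field names can be used directly in field-specific queries and filters, for example (values are illustrative):

# PDFs by a specific author that mention budgets and were modified in 2021
query: author:"John Doe" AND budget* AND modified_date:[1609459200000 TO 1640995200000]
filterField: "file_extension"
filterValue: "pdf"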

Search Response Format

Search results are optimized for MCP responses (< 1 MB) and include:

{
  "success": true,
  "documents": [
    {
      "score": 0.85,
      "file_name": "example.pdf",
      "file_path": "/path/to/example.pdf",
      "title": "Example Document",
      "author": "John Doe",
      "language": "en",
      "passages": [
        {
          "text": "...relevant <em>search term</em> highlighted in context...",
          "score": 1.0,
          "matchedTerms": ["search term"],
          "termCoverage": 1.0,
          "position": 0.12
        },
        {
          "text": "...another occurrence of <em>search</em> in a later section...",
          "score": 0.75,
          "matchedTerms": ["search"],
          "termCoverage": 0.5,
          "position": 0.67
        }
      ]
    }
  ],
  "totalHits": 42,
  "page": 0,
  "pageSize": 10,
  "totalPages": 5,
  "hasNextPage": true,
  "hasPreviousPage": false,
  "searchTimeMs": 12,
  "facets": {
    "language": [
      { "value": "en", "count": 25 },
      { "value": "de", "count": 12 },
      { "value": "fr", "count": 5 }
    ],
    "file_extension": [
      { "value": "pdf", "count": 30 },
      { "value": "docx", "count": 8 },
      { "value": "xlsx", "count": 4 }
    ],
    "file_type": [
      { "value": "application/pdf", "count": 30 },
      { "value": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "count": 8 }
    ],
    "author": [
      { "value": "John Doe", "count": 15 },
      { "value": "Jane Smith", "count": 10 }
    ]
  }
}

Key Features:

  • Search Performance Metrics: Every search response includes searchTimeMs showing the exact execution time in milliseconds, enabling performance monitoring and optimization.

  • Passages with Highlighting: The full content field is NOT included in search results to keep response sizes manageable. Instead, each document contains a passages array with up to max-passages (default: 3) individually highlighted excerpts. Each passage is a separate sentence-level excerpt (not a single joined string), ordered by relevance (best first). Long passages are truncated to max-passage-char-length (default: 200) centred around the highlighted terms, trimming irrelevant leading/trailing text. Each passage includes:

    • text -- The highlighted excerpt with matched terms wrapped in <em> tags.
    • score -- Normalised relevance score (0.0-1.0), derived from Lucene's BM25 passage scoring. The best passage scores 1.0; other passages are scored relative to the best.
    • matchedTerms -- The distinct query terms that appear in this passage (extracted from the <em> tags). Useful for understanding which parts of a multi-term query a passage satisfies.
    • termCoverage -- The fraction of all query terms present in this passage (0.0-1.0). A value of 1.0 means every query term matched. LLMs can use this to prefer passages that address the full query.
    • position -- Location within the source document (0.0 = start, 1.0 = end), derived from the passage's character offset. Useful for citations or for understanding document structure.
  • Lucene Faceting: The facets object uses Lucene's SortedSetDocValues for efficient faceted search. It shows actual facet values and document counts from the search results, not just available fields. Only facet dimensions that have values in the result set are returned.

  • Facet Dimensions: The following fields are indexed as facets:

    • language - Detected document language (ISO 639-1 code)
    • file_extension - File extension (pdf, docx, etc.)
    • file_type - MIME type
    • author - Document author (multi-valued)
    • creator - Document creator (multi-valued)
    • subject - Document subject (multi-valued)

Faceted Search Examples

Use facets to build drill-down queries and refine search results:

# Filter by file type using facet values
filterField: "file_extension"
filterValue: "pdf"

# Filter by language using facet values
filterField: "language"
filterValue: "de"

# Filter by author using facet values
filterField: "author"
filterValue: "John Doe"

# Combine search query with facet filter
queryString: "contract agreement"
filterField: "file_extension"
filterValue: "pdf"

Facet-Driven Workflow:

  1. Perform initial search with broad query
  2. Review facets in response to see available refinement options
  3. Apply filters using facet values to narrow results
  4. Iterate to drill down into specific subsets

Document Crawler Features

Automatic Crawling

The crawler starts automatically on server startup (if crawl-on-startup: true) and:

  1. Discovers files matching include patterns in configured directories
  2. Extracts content using Apache Tika (supports 100+ file formats)
  3. Detects language automatically for each document
  4. Extracts metadata (author, title, dates, etc.)
  5. Indexes documents in batches for optimal performance
  6. Monitors directories for changes (create, modify, delete)

Incremental Indexing (Reconciliation)

By default (reconciliation-enabled: true), every crawl that is not a full reindex performs an incremental pass first. This makes repeated crawls significantly faster because unchanged files are never re-processed.

How it works:

  1. Index snapshot -- All (file_path, modified_date) pairs are read from the Lucene index.
  2. Filesystem snapshot -- The configured directories are walked and the current (file_path, mtime) pairs are collected (no content extraction at this stage).
  3. Four-way diff is computed:
    • DELETE -- paths in the index that no longer exist on disk (orphans).
    • ADD -- paths on disk that are not yet in the index.
    • UPDATE -- paths where the on-disk mtime is newer than the stored modified_date.
    • SKIP -- paths that are identical; these are never touched.
  4. Orphan deletions are applied first (bulk delete via a single Lucene query).
  5. Only ADD and UPDATE files are crawled, extracted, and indexed.
  6. On successful completion, the crawl state (timestamp, document count, mode) is persisted to ~/.mcplucene/crawl-state.yaml.
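
The diff step can be pictured with a small sketch like the following (illustrative only, not the server's actual implementation; all names are hypothetical):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the four-way diff between the index snapshot and the filesystem snapshot.
class ReconciliationSketch {

    // indexSnapshot: file_path -> stored modified_date; fsSnapshot: file_path -> current mtime
    static Map<String, List<String>> diff(Map<String, Long> indexSnapshot, Map<String, Long> fsSnapshot) {
        Map<String, List<String>> plan = new HashMap<>();
        for (String action : List.of("DELETE", "ADD", "UPDATE", "SKIP")) {
            plan.put(action, new ArrayList<>());
        }
        // Orphans: indexed paths that no longer exist on disk.
        for (String path : indexSnapshot.keySet()) {
            if (!fsSnapshot.containsKey(path)) {
                plan.get("DELETE").add(path);
            }
        }
        // Paths on disk are new, modified, or unchanged relative to the index.
        for (Map.Entry<String, Long> e : fsSnapshot.entrySet()) {
            Long indexedMtime = indexSnapshot.get(e.getKey());
            if (indexedMtime == null) {
                plan.get("ADD").add(e.getKey());
            } else if (e.getValue() > indexedMtime) {
                plan.get("UPDATE").add(e.getKey());
            } else {
                plan.get("SKIP").add(e.getKey());
            }
        }
        return plan;
    }
}

Only the ADD and UPDATE buckets go through content extraction and indexing; DELETE is applied as a bulk Lucene delete, and SKIP entries are never touched.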

Fallback behaviour: If reconciliation fails for any reason (I/O error reading the index, filesystem walk failure, etc.) the system automatically falls back to a full crawl. No data is lost and no manual intervention is required.

Disabling incremental indexing: Set reconciliation-enabled: false in application.yaml to always perform a full crawl. Alternatively, pass fullReindex: true to startCrawl to force a single full crawl without changing the default.

Persisted state file:

~/.mcplucene/crawl-state.yaml

This file records the last successful crawl's completion time, document count, and mode. It is written only after a crawl completes successfully.

Schema Version Management

The server tracks the index schema version to detect when the schema changes between software updates. This eliminates the need for manual reindexing after upgrades.

How it works:

  1. Each release embeds a SCHEMA_VERSION constant that reflects the current index field schema.
  2. The schema version is persisted in Lucene's commit metadata alongside the software version.
  3. On startup, the server compares the stored schema version with the current one.
  4. If they differ (or if a legacy index has no version), a full reindex is triggered automatically.
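
As an illustration of the mechanism (a minimal sketch using Lucene's commit user data, not necessarily this server's exact implementation; names and values are hypothetical):

import java.io.IOException;
import java.nio.file.Path;
import java.util.Map;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

class SchemaVersionSketch {
    static final String SCHEMA_VERSION_KEY = "schemaVersion";
    static final String SCHEMA_VERSION = "3"; // hypothetical current value

    // Stamp the current schema version into the next commit's metadata.
    static void stampVersion(IndexWriter writer) throws IOException {
        writer.setLiveCommitData(Map.of(SCHEMA_VERSION_KEY, SCHEMA_VERSION).entrySet());
        writer.commit();
    }

    // A mismatch (or a missing value in a legacy index) means a full reindex is required.
    static boolean reindexRequired(Path indexPath) throws IOException {
        try (Directory dir = FSDirectory.open(indexPath);
             DirectoryReader reader = DirectoryReader.open(dir)) {
            Map<String, String> userData = reader.getIndexCommit().getUserData();
            return !SCHEMA_VERSION.equals(userData.get(SCHEMA_VERSION_KEY));
        }
    }
}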

What triggers a schema version bump:

  • Adding or removing indexed fields
  • Changing field analyzers
  • Modifying field indexing options (stored, term vectors, etc.)

Checking version information: Use getIndexStats to see the current schema version, software version, and build timestamp.

Real-time Monitoring

With directory watching enabled (watch-enabled: true):

  • New files are automatically indexed when added
  • Modified files are re-indexed with updated content
  • Deleted files are removed from the index

Performance Optimization

Multi-threading:

  • Crawls multiple directories in parallel (configurable thread pool)
  • Each directory is processed by a separate thread

Batch Processing:

  • Documents are indexed in batches (default: 100 documents)
  • Reduces I/O overhead and improves indexing speed

NRT (Near Real-Time) Optimization:

  • Normal operation: 100ms refresh interval for fast search updates
  • Bulk indexing (>1000 files): Automatically slows to 5s to reduce overhead
  • Restores to 100ms after bulk operation completes

Progress Notifications:

  • Updates every 100 files OR every 30 seconds (whichever comes first)
  • Shows throughput (files/sec, MB/sec) and progress
  • Non-blocking: Appear in system notification area without interrupting workflow
    • macOS: Notifications appear in Notification Center (top-right corner)
    • Windows: Toast notifications in system tray area
    • Linux: Uses notify-send for desktop notifications

Error Handling

  • Failed files are logged but don't stop the crawl
  • Statistics track successful vs. failed files
  • Large documents are fully indexed (no truncation by default)
  • Corrupted or inaccessible files are skipped gracefully

Troubleshooting

Where to find logs?

When running with the deployed profile, console logging is disabled to ensure clean STDIO communication with MCP clients. Instead, logs are written to files in:

~/.mcplucene/log/mcplucene.log

The log directory is ${user.home}/.mcplucene/log by default (configured in logback.xml). Log files are automatically rotated:

  • Maximum 10MB per file
  • Up to 5 log files retained
  • Total size capped at 50MB

To view recent logs:

# View the current log file
cat ~/.mcplucene/log/mcplucene.log

# Follow logs in real-time
tail -f ~/.mcplucene/log/mcplucene.log

# View last 100 lines
tail -n 100 ~/.mcplucene/log/mcplucene.log

When developing (without the deployed profile), logs are written to the console instead of files.

Schema version changes and automatic reindexing

The server now includes automatic schema version management. When you upgrade to a new version that changes the index schema (e.g., adds new fields, changes analyzers, or modifies field indexing options), the server detects the version mismatch on startup and automatically triggers a full reindex.

What happens:

  1. On startup, the server compares the stored schema version with the current version
  2. If they differ, a full reindex is triggered automatically
  3. You'll see a log message: Schema version changed - triggering full reindex
  4. The reindex runs in the background; you can check progress with getCrawlerStats

Manual reindex: If you need to force a manual reindex for any reason, you can still trigger it:

Ask Claude: "Reindex all documents from scratch"

This calls startCrawl(fullReindex: true), which clears the existing index and re-crawls all configured directories.

Version information: Use getIndexStats to see the current schema version, software version, and build timestamp.

Index lock file prevents startup (write.lock)

Symptom: The server fails to start with an error like Lock held by another program or LockObtainFailedException.

Cause: When the MCP server doesn't shut down cleanly (e.g., the process was forcefully killed, the system crashed, or Claude Desktop was terminated abruptly), Lucene may leave behind a write.lock file in the index directory. This lock file is used to prevent multiple processes from writing to the same index simultaneously. When it's left behind after an unclean shutdown, it blocks the server from starting because Lucene thinks another process is still using the index.

Solution: Delete the lock file manually:

# Remove the write.lock file from the index directory
rm ~/.mcplucene/luceneindex/write.lock

After removing the lock file, the server should start normally.

Prevention: Try to shut down Claude Desktop gracefully when possible. If you need to force-quit, be aware that you may need to remove the lock file before the next startup.

Note: The default index path is ~/.mcplucene/luceneindex. If you've configured a custom index path via LUCENE_INDEX_PATH or application.yaml, look for the write.lock file in that directory instead.

Server shows as "running" but tools don't work

This usually indicates STDIO communication issues:

  1. Ensure the -Dspring.profiles.active=deployed argument is present in the config
  2. Check that no other output is being written to stdout
  3. Verify the JAR path is an absolute path, not relative
  4. If you modified the configuration, ensure the "deployed" profile settings are correct

Claude Desktop doesn't show the server

  1. Verify the JAR file path in the configuration is correct and absolute
  2. Check that Java 21+ is installed: java -version
  3. Validate the JSON syntax in the config file
  4. Check Claude Desktop logs for error messages
  5. Try running the JAR manually to check for startup errors:
    java -jar /path/to/luceneserver-0.0.1-SNAPSHOT.jar

Server fails to start

  1. Ensure the Lucene index directory path is valid
  2. Check that no other process is locking the index directory
  3. Verify sufficient disk space for the index

Empty search results

The index may be empty for several reasons:

  1. No directories configured: Add directories to application.yaml under lucene.crawler.directories
  2. Crawler not started: Use the startCrawl MCP tool or enable crawl-on-startup: true
  3. No matching files: Check that your directories contain files matching the include patterns
  4. Files failed to index: Check the logs for errors, use getCrawlerStats to see failed file count

Crawler not indexing files

  1. Check directory paths: Ensure paths in application.yaml are absolute and exist
  2. Verify file permissions: The server needs read access to all files
  3. Check include patterns: Files must match at least one include pattern
  4. Check exclude patterns: Files must not match any exclude pattern
  5. Monitor crawler status: Use getCrawlerStatus and getCrawlerStats MCP tools
  6. Check logs: Look for parsing errors or I/O exceptions

Out of memory errors during indexing

If you encounter OOM errors with very large documents:

  1. Set content limit: Change max-content-length in application.yaml (e.g., 5242880 for 5MB)
  2. Increase JVM heap: Add -Xmx2g to JVM arguments in Claude Desktop config
  3. Reduce thread pool: Lower thread-pool-size to reduce concurrent processing
  4. Reduce batch size: Lower batch-size to commit more frequently
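
For example, a more conservative crawler configuration for memory-constrained machines might look like this (values are illustrative):

lucene:
  crawler:
    max-content-length: 5242880   # cap extracted text at ~5 MB per document
    thread-pool-size: 2           # fewer parallel extraction threads
    batch-size: 50                # commit smaller batches more often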

Slow indexing performance

  1. Increase thread pool: Raise thread-pool-size (default: 4)
  2. Increase batch size: Raise batch-size for fewer commits (default: 100)
  3. Disable language detection: Set detect-language: false if not needed
  4. Disable metadata extraction: Set extract-metadata: false if not needed
  5. Check disk I/O: Slow disk can bottleneck indexing

Usage Examples

Example 1: Index Your Documents Folder

  1. Edit application.yaml:

lucene:
  crawler:
    directories:
      - "/Users/yourname/Documents"
    crawl-on-startup: true

  2. Start the server:

java -jar target/luceneserver-0.0.1-SNAPSHOT.jar

  3. The crawler automatically starts and indexes all supported documents in your Documents folder.

Example 2: Search with Filtering

Ask Claude:

Search for "machine learning" in PDF documents only

Claude will use:

query: "machine learning"
filterField: "file_extension"
filterValue: "pdf"

Example 3: Find Documents by Author

Ask Claude:

Find all documents written by John Doe

Claude will use:

query: "*"
filterField: "author"
filterValue: "John Doe"

Example 4: Monitor Crawler Progress

Ask Claude:

Show me the crawler statistics

Claude calls getCrawlerStats() and shows:

  • Files processed: 1,234 / 5,000
  • Throughput: 85 files/sec
  • Indexed: 1,200 (98%)
  • Failed: 34 (2%)

Example 5: Manual Crawl with Full Reindex

Ask Claude:

Reindex all documents from scratch

Claude calls startCrawl(fullReindex: true), which:

  1. Clears the existing index
  2. Re-crawls all configured directories
  3. Indexes all documents fresh

Example 6: Language-Specific Search

Ask Claude:

Find German documents about "Technologie"

Claude uses:

query: "Technologie"
filterField: "language"
filterValue: "de"

Example 7: Search with Passages

Search results include a passages array with highlighted excerpts and quality metadata:

{
  "file_name": "report.pdf",
  "passages": [
    {
      "text": "...discusses the impact of <em>machine learning</em> on modern software development. The study shows...",
      "score": 1.0,
      "matchedTerms": ["machine learning"],
      "termCoverage": 1.0,
      "position": 0.08
    },
    {
      "text": "...<em>machine learning</em> algorithms were applied to the dataset in Section 4...",
      "score": 0.75,
      "matchedTerms": ["machine learning"],
      "termCoverage": 1.0,
      "position": 0.45
    }
  ]
}

This allows you to see relevant excerpts without downloading the full document. The metadata fields help LLMs quickly identify the best passage: prefer passages with high termCoverage (covers more of the query) and use position for document-structure context.

Example 8: Managing Crawlable Directories at Runtime

Ask Claude to manage directories without editing configuration files:

"What directories are currently being crawled?"
# Claude calls listCrawlableDirectories()
# Response: Shows all configured directories and config file location

"Add /Users/yourname/Research as a crawlable directory"
# Claude calls addCrawlableDirectory(path="/Users/yourname/Research")
# Directory is added to ~/.mcplucene/config.yaml

"Add /Users/yourname/Projects and start crawling it now"
# Claude calls addCrawlableDirectory(path="/Users/yourname/Projects", crawlNow=true)
# Directory is added and crawl starts immediately

"Stop crawling /Users/yourname/Downloads"
# Claude calls removeCrawlableDirectory(path="/Users/yourname/Downloads")
# Directory is removed from config (indexed documents remain)

Configuration Persistence:

The directories you add via MCP tools are saved to ~/.mcplucene/config.yaml:

lucene:
  crawler:
    directories:
      - /Users/yourname/Documents
      - /Users/yourname/Research
      - /Users/yourname/Projects

This configuration persists across server restarts - no need to reconfigure each time.

Environment Variable Override:

If you set the LUCENE_CRAWLER_DIRECTORIES environment variable, it takes precedence:

{
  "mcpServers": {
    "lucene-search": {
      "command": "java",
      "args": ["-Dspring.profiles.active=deployed", "-jar", "/path/to/jar"],
      "env": {
        "LUCENE_CRAWLER_DIRECTORIES": "/path1,/path2"
      }
    }
  }
}

When this is set, addCrawlableDirectory and removeCrawlableDirectory will return an error message indicating the environment override is active.

Example 9: Working with Lexical Search (Synonyms and Variations)

Note: When using this server through Claude or another AI assistant, synonym expansion happens automatically - the AI constructs OR queries for you based on your natural language request. The examples below show the underlying query syntax for reference or direct API usage.

Since the search engine performs exact lexical matching without automatic synonym expansion, you need to explicitly include synonyms and word variations in your query:

❌ Basic search (might miss relevant results):

query: "car"

This will ONLY match documents containing the exact word "car", missing documents with "automobile", "vehicle", etc.

✅ Better: Include synonyms with OR:

query: "(car OR automobile OR vehicle)"

✅ Best: Combine synonyms with wildcards for variations:

query: "(car* OR automobile* OR vehicle*)"

This matches: car, cars, automobile, automobiles, vehicle, vehicles, etc.

Real-world example - Finding contracts:

query: "(contract* OR agreement* OR deal*) AND (sign* OR execut* OR finali*)"
filterField: "file_extension"
filterValue: "pdf"

This will find documents containing variations like:

  • "contract signed", "agreement executed", "deal finalized"
  • "contracts signing", "agreements execute", "deals finalizing"

💡 Tip: Use the facets in the search response to discover the exact terms used in your documents, then refine your query accordingly.

Development

Running for Development

When developing and debugging in your IDE, run the server without the "deployed" profile to get full logging:

In your IDE (IntelliJ, Eclipse, VS Code), run the main application class directly, or from the command line:

# No profile needed - you'll see full console logging and debug output
java -jar target/luceneserver-0.0.1-SNAPSHOT.jar

This gives you:

  • ✅ Complete logging output for debugging
  • ✅ Configuration loaded from classpath and user config
  • ✅ All debug information visible in console

For production/Claude Desktop deployment:

# Use the deployed profile for clean STDIO
java --enable-native-access=ALL-UNNAMED -Xmx2g -Dspring.profiles.active=deployed -jar target/luceneserver-0.0.1-SNAPSHOT.jar

Adding Documents to the Index

Recommended approach: Use the document crawler by configuring directories in application.yaml. The crawler automatically handles content extraction, metadata, and language detection.

Programmatic approach: For custom document types or direct indexing:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// indexService is the LuceneIndexService instance obtained from your application context
public void addDocument(LuceneIndexService indexService, String title, String content) throws IOException {
    Document doc = new Document();
    doc.add(new TextField("title", title, Field.Store.YES));      // analyzed, searchable, stored
    doc.add(new TextField("content", content, Field.Store.YES));  // analyzed, searchable, stored
    doc.add(new StringField("file_path", "/custom/path", Field.Store.YES)); // unique ID, not analyzed
    indexService.getIndexWriter().addDocument(doc);
    indexService.getIndexWriter().commit();
}

For the full field schema, see the Index Field Schema section.
