A Model Context Protocol (MCP) server that exposes Apache Lucene fulltext search capabilities with automatic document crawling and indexing. This server uses STDIO transport for communication and can be integrated with Claude Desktop or any other MCP-compatible client.
✨ Automatic Document Crawling
- Automatically indexes PDFs, Microsoft Office, and OpenOffice documents
- Multi-threaded crawling for fast indexing
- Real-time directory monitoring for automatic updates
- Incremental indexing with full reconciliation (skips unchanged files, removes orphans)
🔍 Powerful Search
- Full Lucene query syntax support (wildcards, boolean operators, phrase queries)
- Field-specific filtering (by author, language, file type, etc.)
- Structured passages with quality metadata for LLM consumption
- Paginated results with filter suggestions
📄 Rich Metadata Extraction
- Automatic language detection
- Author, title, creation date extraction
- File type and size information
- SHA-256 content hashing for change detection
⚡ Performance Optimized
- Batch processing for efficient indexing
- NRT (Near Real-Time) search with dynamic optimization
- Configurable thread pools for parallel processing
- Progress notifications during bulk operations
🔧 Easy Integration
- STDIO transport for seamless MCP client integration
- Comprehensive MCP tools for search and crawler control
- Flexible configuration via YAML
- Cross-platform notifications (macOS Notification Center, Windows Toast, Linux notify-send)
- Quick Start
- Configuration Options
- Available MCP Tools
- Index Field Schema
- Document Crawler Features
- Usage Examples
- Troubleshooting
- Development
Get up and running with MCP Lucene Server in three steps.
- Java 21 or later - Required to run the server
- Maven 3.9+ (only if building from source)
Option A: Download Pre-built JAR (Recommended)
- Go to the Actions tab
- Click on the most recent successful workflow run
- Scroll down to "Artifacts" and download `luceneserver-X.X.X-SNAPSHOT`
- Extract the ZIP file to get the JAR
For tagged releases, you can also download from the Releases page.
Option B: Build from Source
./mvnw clean package -DskipTests
This creates an executable JAR at target/luceneserver-0.0.1-SNAPSHOT.jar.
Locate your Claude Desktop configuration file:
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
- Linux: `~/.config/Claude/claude_desktop_config.json`
Add the Lucene MCP server to the mcpServers section:
{
"mcpServers": {
"lucene-search": {
"command": "java",
"args": [
"--enable-native-access=ALL-UNNAMED",
"-Xmx2g",
"-Dspring.profiles.active=deployed",
"-jar",
"/absolute/path/to/luceneserver-0.0.1-SNAPSHOT.jar"
]
}
}
}
Important: Replace /absolute/path/to/luceneserver-0.0.1-SNAPSHOT.jar with the actual absolute path to your JAR file.
The -Dspring.profiles.active=deployed flag is required for clean STDIO communication (disables console logging, web server, and startup banner).
- Restart Claude Desktop to load the new configuration
- Verify the server is running in Claude Desktop's developer settings
- Tell Claude to add your documents:
"Add /Users/yourname/Documents as a crawlable directory and start crawling"
That's it! The configuration is saved to ~/.mcplucene/config.yaml and persists across restarts. You can now search your documents through Claude.
Example searches:
- "Search for machine learning papers"
- "Find all PDFs by John Doe"
- "What documents mention quarterly reports?"
Note: The Quick Start above uses zero-configuration. This section covers advanced customization options.
The server can be configured via environment variables and application.yaml:
The server supports two logging profiles, selected via the same spring.profiles.active system property that Spring Boot uses (kept for backwards compatibility):
| Profile | Usage | Logging Output |
|---|---|---|
| default | Development in IDE | Console logging enabled |
| deployed | Production/Claude Desktop | File logging only |
Default profile (no profile specified):
- Full logging enabled to console
- Suitable for debugging and development
Deployed profile (-Dspring.profiles.active=deployed):
- Console logging disabled (required for STDIO transport)
- File logging enabled (`~/.mcplucene/log/mcplucene.log`)
- Used when running under Claude Desktop or other MCP clients
| Environment Variable | Default | Description |
|---|---|---|
| `LUCENE_INDEX_PATH` | `${user.home}/.mcplucene/luceneindex` | Path to the Lucene index directory |
| `LUCENE_CRAWLER_DIRECTORIES` | none | Comma-separated list of directories to crawl (overrides config file) |
| `SPRING_PROFILES_ACTIVE` | none (default) | Set to `deployed` for production use |
Note on LUCENE_CRAWLER_DIRECTORIES:
When this environment variable is set, it takes precedence over ~/.mcplucene/config.yaml and application.yaml. The MCP configuration tools (addCrawlableDirectory, removeCrawlableDirectory) will refuse to modify configuration while this override is active. To use runtime configuration, remove this environment variable.
The crawler directories can be configured in three ways, with the following priority (highest to lowest):
- Environment Variable: `LUCENE_CRAWLER_DIRECTORIES` (comma-separated paths)
- Runtime Configuration: `~/.mcplucene/config.yaml` (managed via MCP tools)
- Application Default: `src/main/resources/application.yaml`
The server provides MCP tools to manage crawlable directories at runtime without editing configuration files:
listCrawlableDirectories - List all configured directories
Ask Claude: "What directories are being crawled?"
addCrawlableDirectory - Add a new directory to crawl
Ask Claude: "Add /Users/yourname/Documents as a crawlable directory"
Ask Claude: "Add /path/to/folder and start crawling it immediately"
removeCrawlableDirectory - Remove a directory from crawling
Ask Claude: "Stop crawling /Users/yourname/Downloads"
Benefits of Runtime Configuration:
- No need to rebuild the JAR or restart the server
- Configuration persists across restarts in `~/.mcplucene/config.yaml`
- Easy to distribute pre-built JARs
- Conversational interface via Claude
Configuration File Location:
~/.mcplucene/config.yaml
Example config.yaml:
lucene:
crawler:
directories:
- /Users/yourname/Documents
- /Users/yourname/Downloads
Configure the document crawler in src/main/resources/application.yaml:
lucene:
index:
path: ${LUCENE_INDEX_PATH:./lucene-index}
crawler:
# Directories to crawl and index
directories:
- "/path/to/your/documents"
- "/another/path/to/index"
# File patterns to include
include-patterns:
- "*.pdf"
- "*.doc"
- "*.docx"
- "*.odt"
- "*.ppt"
- "*.pptx"
- "*.xls"
- "*.xlsx"
- "*.ods"
- "*.txt"
- "*.eml"
- "*.msg"
- "*.md"
- "*.rst"
- "*.html"
- "*.htm"
- "*.rtf"
- "*.epub"
# File patterns to exclude
exclude-patterns:
- "**/node_modules/**"
- "**/.git/**"
- "**/target/**"
- "**/build/**"
# Performance settings
thread-pool-size: 4 # Parallel crawling threads
batch-size: 100 # Documents per batch
batch-timeout-ms: 5000 # Batch processing timeout
# Directory watching
watch-enabled: true # Monitor directories for changes
watch-poll-interval-ms: 2000 # Watch polling interval
# NRT optimization
bulk-index-threshold: 1000 # Files before NRT slowdown
slow-nrt-refresh-interval-ms: 5000 # NRT interval during bulk indexing
# Content extraction
max-content-length: -1 # -1 = unlimited, or max characters
extract-metadata: true # Extract author, title, etc.
detect-language: true # Auto-detect document language
# Auto-crawl
crawl-on-startup: true # Start crawling on server startup
# Progress notifications
progress-notification-files: 100 # Notify every N files
progress-notification-interval-ms: 30000 # Or every N milliseconds
# Incremental indexing
reconciliation-enabled: true # Skip unchanged files, remove orphans (default: true)
# Search passages
max-passages: 3 # Max highlighted passages per search result (default: 3)
max-passage-char-length: 200 # Max character length per passage; longer ones are truncated (default: 200, 0 = no limit)
Supported File Formats:
- PDF documents (`.pdf`)
- Microsoft Office: Word (`.doc`, `.docx`), Excel (`.xls`, `.xlsx`), PowerPoint (`.ppt`, `.pptx`)
- OpenOffice/LibreOffice: Writer (`.odt`), Calc (`.ods`), Impress (`.odp`)
- Plain text files (`.txt`)
- Email: Outlook (`.msg`), EML (`.eml`)
- Markup: Markdown (`.md`), reStructuredText (`.rst`), HTML (`.html`, `.htm`)
- Rich Text Format (`.rtf`)
- E-books: EPUB (`.epub`)
Complete Configuration Example:
lucene:
index:
path: /Users/yourname/lucene-index
crawler:
# Add your document directories here
directories:
- "/Users/yourname/Documents"
- "/Users/yourname/Downloads"
- "/Volumes/ExternalDrive/Archive"
# Include only these file types
include-patterns:
- "*.pdf"
- "*.docx"
- "*.xlsx"
# Exclude these directories
exclude-patterns:
- "**/node_modules/**"
- "**/.git/**"
# Performance tuning
thread-pool-size: 8 # Use more threads for faster indexing
batch-size: 200 # Larger batches for better throughput
# Auto-start crawler
crawl-on-startup: true
# Real-time monitoring
watch-enabled: true
# No content limit (index full documents)
max-content-length: -1
An MCP App that provides a visual user interface for index maintenance tasks directly inside your MCP client (e.g. Claude Desktop). When invoked, the app is rendered inline in the conversation and offers one-click access to administrative operations without requiring manual tool calls.
Available actions:
- Unlock Index -- Removes a stale `write.lock` file after an unclean shutdown (equivalent to calling `unlockIndex` with `confirm=true`)
- Optimize Index -- Merges index segments for improved search performance (equivalent to calling `optimizeIndex`)
- Purge Index -- Deletes all documents from the index (equivalent to calling `purgeIndex` with `confirm=true`)
Each action shows inline status feedback (success, error, or progress details) directly in the app UI.
Example:
Ask Claude: "Can you invoke the indexAdmin tool please?"
Search the Lucene fulltext index using lexical matching (exact word forms only).
Parameters:
- `query` (required): The search query using Lucene query syntax
- `filterField` (optional): Field name to filter results (use facets to discover available values)
- `filterValue` (optional): Value for the filter field
- `page` (optional): Page number, 0-based (default: 0)
- `pageSize` (optional): Results per page (default: 10, max: 100)
🤖 AI-Powered Synonym Expansion:
This server is designed to work with AI assistants like Claude. Instead of using traditional Lucene synonym files, the AI generates context-appropriate synonyms automatically by constructing OR queries.
Why this is better than traditional synonyms:
- Context-aware: The AI understands your intent and picks relevant synonyms (e.g., "contract" in legal context vs. "contract" in construction)
- No maintenance: No need to maintain static synonym configuration files
- Domain-adaptive: Works across legal, technical, medical, or casual language automatically
- Multilingual: Generates synonyms in any language without configuration
When you ask Claude to "find documents about cars", it automatically searches for (car OR automobile OR vehicle) - giving you better results than a static synonym list.
The index uses a custom UnicodeNormalizingAnalyzer built on Lucene's ICUFoldingFilter, which provides:
- ✅ Tokenization and lowercasing
- ✅ Unicode normalization (NFKC) -- full-width characters are mapped to their standard equivalents
- ✅ Diacritic folding -- accented characters are mapped to their ASCII base forms (e.g., "ä" → "a", "ö" → "o", "ü" → "u", "ñ" → "n")
- ✅ Ligature expansion -- PDF ligatures are expanded correctly (e.g., the "ﬁ" ligature → "fi", the "ﬂ" ligature → "fl")
- ✅ Efficient leading wildcard queries -- a `content_reversed` field stores reversed tokens, so `*vertrag` is internally rewritten as a trailing wildcard on the reversed field (`gartrev*`), avoiding costly full-index scans
- ❌ No automatic synonym expansion at the index level
- ❌ No phonetic matching (e.g., "Smith" won't match "Smyth")
- ❌ No stemming (e.g., "running" won't match "run")
The AI assistant compensates for the remaining limitations by expanding your queries intelligently.
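For reference, an equivalent analysis chain can be assembled from Lucene's `lucene-analysis-icu` module. The following is a minimal, illustrative sketch, not the server's actual `UnicodeNormalizingAnalyzer` (class name and exact filter order are assumptions):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch of a folding analyzer: tokenize, lowercase, then ICU folding
// (NFKC normalization, diacritic removal, ligature expansion).
public class FoldingAnalyzerSketch extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(tokenizer);
        stream = new ICUFoldingFilter(stream); // "ä" -> "a", "ﬁ" -> "fi", full-width forms normalized
        return new TokenStreamComponents(tokenizer, stream);
    }
}
```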
💡 Best Practices for Better Results:
- Generate Synonyms Yourself: Use OR to combine related terms:
  - Instead of: `contract`
  - Use: `(contract OR agreement OR deal)`
- Use Wildcards for Variations: Handle different word forms:
  - Instead of: `contract`
  - Use: `contract*` (matches contracts, contracting, contracted)
- Leverage Facets: Use the returned facet values to discover exact terms in the index:
  - Check `facets.author` to find exact author names
  - Check `facets.language` to see available languages
  - Use these exact values for filtering
- Combine Techniques: `(contract* OR agreement*) AND (sign* OR execut*) AND author:"John Doe"`
Supported Query Syntax:
- Simple terms: `hello world` (implicit AND between terms)
- Phrase queries: `"exact phrase"` (preserves word order)
- Boolean operators: `term1 AND term2`, `term1 OR term2`, `NOT term`
- Trailing wildcard: `contract*` matches contracts, contracting, contracted
- Leading wildcard: `*vertrag` efficiently finds Arbeitsvertrag, Kaufvertrag (optimised via reverse token field)
- Infix wildcard: `*vertrag*` finds both Vertragsbedingungen and Arbeitsvertrag
- Single char wildcard: `te?t` matches test, text
- Fuzzy search: `term~2` finds terms within Levenshtein edit distance 2 (default: 2)
- Proximity search: `"term1 term2"~5` finds terms within 5 words of each other
- Field-specific search: `title:hello content:world`
- Grouping: `(contract OR agreement) AND signed`
- Range queries: `modified_date:[1609459200000 TO 1640995200000]` (timestamps in milliseconds)
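At the Lucene level, query strings like these are handled by a query parser. The sketch below uses the classic `QueryParser` purely for illustration; the server's own query construction (e.g. the leading-wildcard rewrite described below) may differ:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class QueryParsingSketch {
    public static void main(String[] args) throws ParseException {
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        parser.setAllowLeadingWildcard(true); // needed for queries like *vertrag
        Query query = parser.parse("(contract* OR agreement*) AND author:\"John Doe\"");
        System.out.println(query); // prints the parsed Lucene query
    }
}
```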
German Compound Word Search:
German compound words (e.g., "Arbeitsvertrag", "Vertragsbedingungen") can be searched effectively using wildcards:
- `*vertrag` -- finds words ending in "vertrag" (Arbeitsvertrag, Kaufvertrag, Mietvertrag)
- `vertrag*` -- finds words starting with "vertrag" (Vertragsbedingungen, Vertragsklausel)
- `*vertrag*` -- finds words containing "vertrag" anywhere (combines both)
Leading wildcard queries are optimised internally using a reverse token index (content_reversed field), so they execute as fast as trailing wildcards.
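The idea behind that rewrite can be sketched in a few lines; this illustrates the technique only, not the server's actual code:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class LeadingWildcardSketch {
    // Rewrite a pure leading wildcard ("*vertrag") onto the reversed-token field,
    // where it becomes a cheap trailing wildcard ("gartrev*").
    public static Query rewrite(String pattern) {
        if (pattern.startsWith("*") && !pattern.endsWith("*")) {
            String reversed = new StringBuilder(pattern.substring(1)).reverse().toString();
            return new WildcardQuery(new Term("content_reversed", reversed + "*"));
        }
        return new WildcardQuery(new Term("content", pattern));
    }
}
```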
Returns:
- Paginated document results, each containing a `passages` array with highlighted text and quality metadata
- Document-level relevance scores
- Facets with actual values and counts from the result set
- Search execution time in milliseconds (`searchTimeMs`)
Get statistics about the Lucene index.
Returns:
- `documentCount`: Total number of documents in the index
- `indexPath`: Path to the index directory
- `schemaVersion`: Current index schema version
- `softwareVersion`: Server software version
- `buildTimestamp`: Server build timestamp
Start crawling configured directories to index documents.
Parameters:
- `fullReindex` (optional): If true, clears the index before crawling (default: false). When false and `reconciliation-enabled` is true, an incremental crawl is performed instead.
Features:
- Automatically extracts content from PDFs, Office documents, and OpenOffice files
- Detects document language
- Extracts metadata (author, title, creation date, etc.)
- Multi-threaded processing for fast indexing
- Progress notifications during crawling
- Incremental mode (default): Only new or modified files are indexed; deleted files are removed from the index automatically. Falls back to a full crawl if reconciliation encounters an error.
Get real-time statistics about the crawler progress.
Returns:
- `filesFound`: Total files discovered
- `filesProcessed`: Files processed so far
- `filesIndexed`: Files successfully indexed
- `filesFailed`: Files that failed to process
- `bytesProcessed`: Total bytes processed
- `filesPerSecond`: Processing throughput
- `megabytesPerSecond`: Data throughput
- `elapsedTimeMs`: Time elapsed since crawl started
- `perDirectoryStats`: Statistics breakdown per directory
- `orphansDeleted`: Number of index entries removed because the file no longer exists on disk (incremental mode)
- `filesSkippedUnchanged`: Number of files skipped because they were not modified since the last crawl (incremental mode)
- `reconciliationTimeMs`: Time spent comparing the index against the filesystem (incremental mode)
- `crawlMode`: Either `"full"` or `"incremental"`
- `lastCrawlCompletionTimeMs`: Unix timestamp (ms) of the last successful crawl completion (null if no previous crawl)
- `lastCrawlDocumentCount`: Number of documents in the index after the last successful crawl (null if no previous crawl)
- `lastCrawlMode`: Mode of the last crawl - `"full"` or `"incremental"` (null if no previous crawl)
List all field names present in the Lucene index.
Returns:
- `fields`: Array of field names available for searching and filtering
Example response:
{
"success": true,
"fields": [
"file_name",
"file_path",
"title",
"author",
"content",
"language",
"file_extension",
"file_type",
"created_date",
"modified_date"
]
}
Pause an ongoing crawl operation. The crawler can be resumed later with `resumeCrawler`.
Resume a paused crawl operation.
Get the current state of the crawler.
Returns:
- `state`: One of `IDLE`, `CRAWLING`, `PAUSED`, or `WATCHING`
List all configured crawlable directories.
Returns:
- `success`: Boolean indicating operation success
- `directories`: List of absolute directory paths currently configured
- `totalDirectories`: Count of configured directories
- `configPath`: Path to the configuration file (`~/.mcplucene/config.yaml`)
- `environmentOverride`: Boolean indicating if the `LUCENE_CRAWLER_DIRECTORIES` env var is set
Example response:
{
"success": true,
"directories": [
"/Users/yourname/Documents",
"/Users/yourname/Downloads"
],
"totalDirectories": 2,
"configPath": "/Users/yourname/.mcplucene/config.yaml",
"environmentOverride": false
}
Add a directory to the crawler configuration.
Parameters:
- `path` (required): Absolute path to the directory to crawl
- `crawlNow` (optional): If true, immediately starts crawling the new directory (default: false)
Returns:
- `success`: Boolean indicating operation success
- `message`: Confirmation message
- `totalDirectories`: Updated count of configured directories
- `directories`: Updated list of all directories
- `crawlStarted` (optional): Present if `crawlNow=true`, indicates a crawl was triggered
Validation:
- Directory must exist and be accessible
- Path must be a directory (not a file)
- Duplicate directories are prevented
- Fails if the `LUCENE_CRAWLER_DIRECTORIES` environment variable is set
Example:
Ask Claude: "Add /Users/yourname/Documents as a crawlable directory"
Ask Claude: "Add /path/to/research and crawl it now"
Configuration Persistence:
The directory is immediately saved to ~/.mcplucene/config.yaml and will be automatically crawled on future server restarts.
Remove a directory from the crawler configuration.
Parameters:
- `path` (required): Absolute path to the directory to remove
Returns:
- `success`: Boolean indicating operation success
- `message`: Confirmation message
- `totalDirectories`: Updated count of configured directories
- `directories`: Updated list of remaining directories
Important Notes:
- This does NOT remove already-indexed documents from the removed directory
- To remove indexed documents, use `startCrawl(fullReindex=true)` after removing directories
- Fails if the `LUCENE_CRAWLER_DIRECTORIES` environment variable is set
- The directory must exist in the current configuration
Example:
Ask Claude: "Stop crawling /Users/yourname/Downloads"
Ask Claude: "Remove /path/to/old/archive from the crawler"
Retrieve all stored fields and full content of a document from the Lucene index by its file path. This tool retrieves document details directly from the index without requiring filesystem access - useful for examining indexed content even if the original file has been moved or deleted.
Parameters:
- `filePath` (required): Absolute path to the file (must match exactly the `file_path` stored in the index)
Returns:
- `success`: Boolean indicating operation success
- `document`: Object containing all stored fields:
  - `file_path`: Full path to the file
  - `file_name`: Name of the file
  - `file_extension`: File extension (e.g., `pdf`, `docx`)
  - `file_type`: MIME type
  - `file_size`: File size in bytes
  - `title`: Document title
  - `author`: Author name
  - `creator`: Creator application
  - `subject`: Document subject
  - `keywords`: Document keywords/tags
  - `language`: Detected language code
  - `created_date`: Creation timestamp
  - `modified_date`: Modification timestamp
  - `indexed_date`: Indexing timestamp
  - `content_hash`: SHA-256 hash of content
  - `content`: Full extracted text content (limited to 500KB)
  - `contentTruncated`: Boolean indicating if content was truncated
  - `originalContentLength`: Original content length (only present if truncated)
Content Size Limit:
The content field is limited to 500,000 characters (500KB) to ensure the response stays safely below the 1MB MCP response limit. Check the contentTruncated field to determine if the full content was returned.
Example:
Ask Claude: "Show me the indexed details of /Users/yourname/Documents/report.pdf"
Ask Claude: "What content was extracted from /path/to/contract.docx?"
Example response:
{
"success": true,
"document": {
"file_path": "/Users/yourname/Documents/report.pdf",
"file_name": "report.pdf",
"file_extension": "pdf",
"file_type": "application/pdf",
"file_size": "125432",
"title": "Annual Report 2024",
"author": "John Doe",
"language": "en",
"indexed_date": "1706540400000",
"content_hash": "a1b2c3d4...",
"content": "This is the full extracted text content of the document...",
"contentTruncated": false
}
}
Remove the write.lock file from the Lucene index directory. This is a dangerous recovery operation - only use it if you are certain no other process is using the index.
Parameters:
- `confirm` (required): Must be set to `true` to proceed. This is a safety measure.
Returns:
- `success`: Boolean indicating operation success
- `message`: Confirmation message
- `lockFileExisted`: Boolean indicating if a lock file was present
- `lockFilePath`: Path to the lock file
When to use:
Use this tool when the server fails to start with a LockObtainFailedException after an unclean shutdown. See Troubleshooting for details.
Example:
Ask Claude: "Unlock the Lucene index - I confirm this is safe"
Optimize the Lucene index by merging segments. This is a long-running operation that runs in the background.
Parameters:
- `maxSegments` (optional): Target number of segments after optimization (default: 1 for maximum optimization)
Returns:
- `success`: Boolean indicating the operation was started
- `operationId`: UUID to track the operation
- `targetSegments`: The target segment count
- `currentSegments`: The current segment count before optimization
- `message`: Status message
Behavior:
- Returns immediately after starting the background operation
- Use `getIndexAdminStatus` to poll for progress
- Cannot run while the crawler is actively crawling
- Only one admin operation can run at a time
Example:
Ask Claude: "Optimize the search index"
Ask Claude: "What's the status of the optimization?"
Performance Notes:
- Optimization improves search performance by reducing the number of segments
- Temporarily increases disk usage during the merge
- For large indices, this can take several minutes to hours
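Under the hood, segment merging boils down to Lucene's `IndexWriter.forceMerge`. The following is only an illustration of that API call, not the server's background implementation:

```java
import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class OptimizeSketch {
    // Merge the index at indexPath down to maxSegments segments.
    public static void optimize(Path indexPath, int maxSegments) throws IOException {
        try (FSDirectory dir = FSDirectory.open(indexPath);
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.forceMerge(maxSegments); // blocks until the merges finish
        }
    }
}
```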
Delete all documents from the Lucene index. This is a destructive, long-running operation that runs in the background.
Parameters:
- `confirm` (required): Must be set to `true` to proceed. This is a safety measure.
- `fullPurge` (optional): If `true`, also deletes index files and reinitializes (default: `false`)
Returns:
- `success`: Boolean indicating the operation was started
- `operationId`: UUID to track the operation
- `documentsDeleted`: Number of documents that will be deleted
- `fullPurge`: Whether a full purge was requested
- `message`: Status message
Behavior:
- Returns immediately after starting the background operation
- Use `getIndexAdminStatus` to poll for progress
- Only one admin operation can run at a time
Purge Modes:
- Standard purge (`fullPurge=false`): Deletes all documents but keeps index files. Disk space is reclaimed gradually during future merges.
- Full purge (`fullPurge=true`): Deletes all documents AND index files, then reinitializes an empty index. Disk space is reclaimed immediately.
Example:
Ask Claude: "Delete all documents from the index - I confirm this"
Ask Claude: "Purge the index completely and reclaim disk space - I confirm this"
Get the status of long-running index administration operations (optimize, purge).
Parameters: None
Returns:
- `success`: Boolean indicating the status was retrieved
- `state`: Current state: `IDLE`, `OPTIMIZING`, `PURGING`, `COMPLETED`, or `FAILED`
- `operationId`: UUID of the current/last operation
- `progressPercent`: Progress percentage (0-100)
- `progressMessage`: Human-readable progress message
- `elapsedTimeMs`: Time elapsed since operation started (in milliseconds)
- `lastOperationResult`: Result message from the last completed operation
Example response (during optimization):
{
"success": true,
"state": "OPTIMIZING",
"operationId": "a1b2c3d4-...",
"progressPercent": 45,
"progressMessage": "Merging segments...",
"elapsedTimeMs": 12500,
"lastOperationResult": null
}
Example response (idle after completion):
{
"success": true,
"state": "IDLE",
"operationId": null,
"progressPercent": null,
"progressMessage": "No admin operation running",
"elapsedTimeMs": null,
"lastOperationResult": "Optimization completed successfully. Merged to 1 segment(s)."
}
Example:
Ask Claude: "What's the status of the index optimization?"
Ask Claude: "Is the purge operation complete?"
When documents are indexed by the crawler, the following fields are automatically extracted and stored:
- `content`: Full text content of the document (analyzed, searchable)
- `content_reversed`: Reversed tokens of the content (analyzed with `ReverseUnicodeNormalizingAnalyzer`, not stored). Used internally for efficient leading wildcard queries -- not directly searchable by users.
- `passages`: Array of highlighted passages returned in search results (see Search Response Format below)
- `file_path`: Full path to the file (unique ID)
- `file_name`: Name of the file
- `file_extension`: File extension (e.g., `pdf`, `docx`)
- `file_type`: MIME type (e.g., `application/pdf`)
- `file_size`: File size in bytes
- `title`: Document title (extracted from metadata)
- `author`: Author name
- `creator`: Creator/application that created the document
- `subject`: Document subject
- `keywords`: Document keywords/tags
- `language`: Auto-detected language code (e.g., `en`, `de`, `fr`)
- `created_date`: File creation timestamp
- `modified_date`: File modification timestamp
- `indexed_date`: When the document was indexed
- `content_hash`: SHA-256 hash for change detection
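For illustration, a change-detection hash like this can be computed with the JDK alone; whether the server hashes the raw file bytes or the extracted text is an implementation detail not shown here:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class ContentHashSketch {
    // SHA-256 over the file bytes, hex-encoded (illustrative only).
    public static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        return HexFormat.of().formatHex(digest.digest(Files.readAllBytes(file)));
    }
}
```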
Search results are optimized for MCP responses (< 1 MB) and include:
{
"success": true,
"documents": [
{
"score": 0.85,
"file_name": "example.pdf",
"file_path": "/path/to/example.pdf",
"title": "Example Document",
"author": "John Doe",
"language": "en",
"passages": [
{
"text": "...relevant <em>search term</em> highlighted in context...",
"score": 1.0,
"matchedTerms": ["search term"],
"termCoverage": 1.0,
"position": 0.12
},
{
"text": "...another occurrence of <em>search</em> in a later section...",
"score": 0.75,
"matchedTerms": ["search"],
"termCoverage": 0.5,
"position": 0.67
}
]
}
],
"totalHits": 42,
"page": 0,
"pageSize": 10,
"totalPages": 5,
"hasNextPage": true,
"hasPreviousPage": false,
"searchTimeMs": 12,
"facets": {
"language": [
{ "value": "en", "count": 25 },
{ "value": "de", "count": 12 },
{ "value": "fr", "count": 5 }
],
"file_extension": [
{ "value": "pdf", "count": 30 },
{ "value": "docx", "count": 8 },
{ "value": "xlsx", "count": 4 }
],
"file_type": [
{ "value": "application/pdf", "count": 30 },
{ "value": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "count": 8 }
],
"author": [
{ "value": "John Doe", "count": 15 },
{ "value": "Jane Smith", "count": 10 }
]
}
}
Key Features:
- Search Performance Metrics: Every search response includes `searchTimeMs` showing the exact execution time in milliseconds, enabling performance monitoring and optimization.
- Passages with Highlighting: The full `content` field is NOT included in search results to keep response sizes manageable. Instead, each document contains a `passages` array with up to `max-passages` (default: 3) individually highlighted excerpts. Each passage is a separate sentence-level excerpt (not a single joined string), ordered by relevance (best first). Long passages are truncated to `max-passage-char-length` (default: 200) centred around the highlighted terms, trimming irrelevant leading/trailing text. Each passage includes:
  - `text` -- The highlighted excerpt with matched terms wrapped in `<em>` tags.
  - `score` -- Normalised relevance score (0.0-1.0), derived from Lucene's BM25 passage scoring. The best passage scores 1.0; other passages are scored relative to the best.
  - `matchedTerms` -- The distinct query terms that appear in this passage (extracted from the `<em>` tags). Useful for understanding which parts of a multi-term query a passage satisfies.
  - `termCoverage` -- The fraction of all query terms present in this passage (0.0-1.0). A value of 1.0 means every query term matched. LLMs can use this to prefer passages that address the full query.
  - `position` -- Location within the source document (0.0 = start, 1.0 = end), derived from the passage's character offset. Useful for citations or for understanding document structure.
- Lucene Faceting: The `facets` object uses Lucene's SortedSetDocValues for efficient faceted search. It shows actual facet values and document counts from the search results, not just available fields. Only facet dimensions that have values in the result set are returned.
- Facet Dimensions: The following fields are indexed as facets:
  - `language` - Detected document language (ISO 639-1 code)
  - `file_extension` - File extension (pdf, docx, etc.)
  - `file_type` - MIME type
  - `author` - Document author (multi-valued)
  - `creator` - Document creator (multi-valued)
  - `subject` - Document subject (multi-valued)
Use facets to build drill-down queries and refine search results:
# Filter by file type using facet values
filterField: "file_extension"
filterValue: "pdf"
# Filter by language using facet values
filterField: "language"
filterValue: "de"
# Filter by author using facet values
filterField: "author"
filterValue: "John Doe"
# Combine search query with facet filter
queryString: "contract agreement"
filterField: "file_extension"
filterValue: "pdf"
Facet-Driven Workflow:
- Perform initial search with broad query
- Review `facets` in the response to see available refinement options
- Apply filters using facet values to narrow results
- Iterate to drill down into specific subsets
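On the Lucene side, such a facet filter corresponds to a drill-down query. A minimal sketch, assuming the facet dimensions above are indexed as SortedSetDocValues facet fields (the server's actual filtering code may differ):

```java
import org.apache.lucene.facet.DrillDownQuery;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.search.Query;

public class FacetFilterSketch {
    // Narrow a base query to documents whose file_extension facet value is "pdf".
    public static Query filterToPdf(FacetsConfig config, Query baseQuery) {
        DrillDownQuery drillDown = new DrillDownQuery(config, baseQuery);
        drillDown.add("file_extension", "pdf"); // maps to filterField="file_extension", filterValue="pdf"
        return drillDown;
    }
}
```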
The crawler starts automatically on server startup (if crawl-on-startup: true) and:
- Discovers files matching include patterns in configured directories
- Extracts content using Apache Tika (supports 100+ file formats; see the sketch after this list)
- Detects language automatically for each document
- Extracts metadata (author, title, dates, etc.)
- Indexes documents in batches for optimal performance
- Monitors directories for changes (create, modify, delete)
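The Tika-based extraction mentioned above boils down to a parse call that yields the full text plus metadata. A minimal, illustrative sketch (not the server's crawler code):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractionSketch {
    public static void extract(Path file) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no content length limit
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(file)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        String content = handler.toString();                     // full extracted text
        String title = metadata.get(TikaCoreProperties.TITLE);   // may be null for some formats
        String author = metadata.get(TikaCoreProperties.CREATOR);
        System.out.printf("%s by %s (%d chars)%n", title, author, content.length());
    }
}
```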
By default (reconciliation-enabled: true), every crawl that is not a full reindex performs an incremental pass first. This makes repeated crawls significantly faster because unchanged files are never re-processed.
How it works:
- Index snapshot -- All `(file_path, modified_date)` pairs are read from the Lucene index.
- Filesystem snapshot -- The configured directories are walked and the current `(file_path, mtime)` pairs are collected (no content extraction at this stage).
- Four-way diff is computed (see the sketch after this list):
  - DELETE -- paths in the index that no longer exist on disk (orphans).
  - ADD -- paths on disk that are not yet in the index.
  - UPDATE -- paths where the on-disk mtime is newer than the stored `modified_date`.
  - SKIP -- paths that are identical; these are never touched.
- Orphan deletions are applied first (bulk delete via a single Lucene query).
- Only ADD and UPDATE files are crawled, extracted, and indexed.
- On successful completion, the crawl state (timestamp, document count, mode) is persisted to `~/.mcplucene/crawl-state.yaml`.
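The four-way diff itself is simple to picture. A simplified sketch, assuming both snapshots are available as path-to-timestamp maps (names are illustrative, not the server's actual classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ReconciliationSketch {
    public record Diff(List<String> delete, List<String> add, List<String> update, List<String> skip) {}

    // indexSnapshot: file_path -> modified_date from Lucene; fsSnapshot: file path -> mtime from disk.
    public static Diff diff(Map<String, Long> indexSnapshot, Map<String, Long> fsSnapshot) {
        List<String> delete = new ArrayList<>(), add = new ArrayList<>();
        List<String> update = new ArrayList<>(), skip = new ArrayList<>();
        for (String path : indexSnapshot.keySet()) {
            if (!fsSnapshot.containsKey(path)) delete.add(path);               // orphan: gone from disk
        }
        for (Map.Entry<String, Long> entry : fsSnapshot.entrySet()) {
            Long indexedMtime = indexSnapshot.get(entry.getKey());
            if (indexedMtime == null) add.add(entry.getKey());                 // new file
            else if (entry.getValue() > indexedMtime) update.add(entry.getKey()); // modified since last crawl
            else skip.add(entry.getKey());                                     // unchanged, never touched
        }
        return new Diff(delete, add, update, skip);
    }
}
```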
Fallback behaviour: If reconciliation fails for any reason (I/O error reading the index, filesystem walk failure, etc.) the system automatically falls back to a full crawl. No data is lost and no manual intervention is required.
Disabling incremental indexing:
Set reconciliation-enabled: false in application.yaml to always perform a full crawl. Alternatively, pass fullReindex: true to startCrawl to force a single full crawl without changing the default.
Persisted state file:
~/.mcplucene/crawl-state.yaml
This file records the last successful crawl's completion time, document count, and mode. It is written only after a crawl completes successfully.
The server tracks the index schema version to detect when the schema changes between software updates. This eliminates the need for manual reindexing after upgrades.
How it works:
- Each release embeds a `SCHEMA_VERSION` constant that reflects the current index field schema.
- The schema version is persisted in Lucene's commit metadata alongside the software version.
- On startup, the server compares the stored schema version with the current one.
- If they differ (or if a legacy index has no version), a full reindex is triggered automatically.
What triggers a schema version bump:
- Adding or removing indexed fields
- Changing field analyzers
- Modifying field indexing options (stored, term vectors, etc.)
Checking version information:
Use getIndexStats to see the current schema version, software version, and build timestamp.
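The mechanism can be pictured via Lucene's commit user data. A hedged sketch follows; the constant name, metadata key, and version value below are assumptions, not the server's actual identifiers:

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class SchemaVersionSketch {
    static final String KEY = "schemaVersion"; // hypothetical commit-metadata key
    static final String SCHEMA_VERSION = "3";  // hypothetical current version

    // Compare the version stored in the last commit with the current one.
    static boolean reindexNeeded(Directory dir) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            String stored = reader.getIndexCommit().getUserData().get(KEY);
            return stored == null || !stored.equals(SCHEMA_VERSION); // legacy index or schema change
        }
    }

    // Persist the current version alongside the next commit.
    static void storeVersion(IndexWriter writer) throws IOException {
        writer.setLiveCommitData(Map.of(KEY, SCHEMA_VERSION).entrySet());
        writer.commit();
    }
}
```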
With directory watching enabled (watch-enabled: true):
- New files are automatically indexed when added
- Modified files are re-indexed with updated content
- Deleted files are removed from the index
Multi-threading:
- Crawls multiple directories in parallel (configurable thread pool)
- Each directory is processed by a separate thread
Batch Processing:
- Documents are indexed in batches (default: 100 documents)
- Reduces I/O overhead and improves indexing speed
NRT (Near Real-Time) Optimization:
- Normal operation: 100ms refresh interval for fast search updates
- Bulk indexing (>1000 files): Automatically slows to 5s to reduce overhead
- Restores to 100ms after bulk operation completes (a simplified sketch follows)
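Conceptually this is a `SearcherManager` whose refresh cadence is widened during bulk indexing. A simplified sketch with illustrative names, not the server's implementation:

```java
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.SearcherManager;

public class NrtRefreshSketch {
    private final SearcherManager searcherManager;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile long refreshIntervalMs = 100; // normal operation
    private volatile long lastRefresh = 0;

    public NrtRefreshSketch(IndexWriter writer) throws IOException {
        this.searcherManager = new SearcherManager(writer, null);
        scheduler.scheduleWithFixedDelay(this::refresh, 0, 100, TimeUnit.MILLISECONDS);
    }

    public void enterBulkMode() { refreshIntervalMs = 5000; } // slow NRT during bulk indexing
    public void leaveBulkMode() { refreshIntervalMs = 100; }  // back to fast refresh

    private void refresh() {
        long now = System.currentTimeMillis();
        if (now - lastRefresh < refreshIntervalMs) return;     // honour the current interval
        try {
            searcherManager.maybeRefresh();                    // make recent writes searchable
            lastRefresh = now;
        } catch (IOException e) {
            // a real implementation would log this
        }
    }
}
```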
Progress Notifications:
- Updates every 100 files OR every 30 seconds (whichever comes first)
- Shows throughput (files/sec, MB/sec) and progress
- Non-blocking: Appear in system notification area without interrupting workflow
- macOS: Notifications appear in Notification Center (top-right corner)
- Windows: Toast notifications in system tray area
- Linux: Uses notify-send for desktop notifications
- Failed files are logged but don't stop the crawl
- Statistics track successful vs. failed files
- Large documents are fully indexed (no truncation by default)
- Corrupted or inaccessible files are skipped gracefully
When running with the deployed profile, console logging is disabled to ensure clean STDIO communication with MCP clients. Instead, logs are written to files in:
~/.mcplucene/log/mcplucene.log
The log directory is ${user.home}/.mcplucene/log by default (configured in logback.xml). Log files are automatically rotated:
- Maximum 10MB per file
- Up to 5 log files retained
- Total size capped at 50MB
To view recent logs:
# View the current log file
cat ~/.mcplucene/log/mcplucene.log
# Follow logs in real-time
tail -f ~/.mcplucene/log/mcplucene.log
# View last 100 lines
tail -n 100 ~/.mcplucene/log/mcplucene.log
When developing (without the deployed profile), logs are written to the console instead of files.
The server now includes automatic schema version management. When you upgrade to a new version that changes the index schema (e.g., adds new fields, changes analyzers, or modifies field indexing options), the server detects the version mismatch on startup and automatically triggers a full reindex.
What happens:
- On startup, the server compares the stored schema version with the current version
- If they differ, a full reindex is triggered automatically
- You'll see a log message: `Schema version changed - triggering full reindex`
- The reindex runs in the background; you can check progress with `getCrawlerStats`
Manual reindex: If you need to force a manual reindex for any reason, you can still trigger it:
Ask Claude: "Reindex all documents from scratch"
This calls startCrawl(fullReindex: true), which clears the existing index and re-crawls all configured directories.
Version information:
Use getIndexStats to see the current schema version, software version, and build timestamp.
Symptom: The server fails to start with an error like Lock held by another program or LockObtainFailedException.
Cause: When the MCP server doesn't shut down cleanly (e.g., the process was forcefully killed, the system crashed, or Claude Desktop was terminated abruptly), Lucene may leave behind a write.lock file in the index directory. This lock file is used to prevent multiple processes from writing to the same index simultaneously. When it's left behind after an unclean shutdown, it blocks the server from starting because Lucene thinks another process is still using the index.
Solution: Delete the lock file manually:
# Remove the write.lock file from the index directory
rm ~/.mcplucene/luceneindex/write.lock
After removing the lock file, the server should start normally.
Prevention: Try to shut down Claude Desktop gracefully when possible. If you need to force-quit, be aware that you may need to remove the lock file before the next startup.
Note: The default index path is ~/.mcplucene/luceneindex. If you've configured a custom index path via LUCENE_INDEX_PATH or application.yaml, look for the write.lock file in that directory instead.
This usually indicates STDIO communication issues:
- Ensure the `-Dspring.profiles.active=deployed` argument is present in the config
- Check that no other output is being written to stdout
- Verify the JAR path is an absolute path, not relative
- If you modified the configuration, ensure the "deployed" profile settings are correct
- Verify the JAR file path in the configuration is correct and absolute
- Check that Java 21+ is installed: `java -version`
- Validate the JSON syntax in the config file
- Check Claude Desktop logs for error messages
- Try running the JAR manually to check for startup errors:
java -jar /path/to/luceneserver-0.0.1-SNAPSHOT.jar
- Ensure the Lucene index directory path is valid
- Check that no other process is locking the index directory
- Verify sufficient disk space for the index
The index may be empty for several reasons:
- No directories configured: Add directories to `application.yaml` under `lucene.crawler.directories`
- Crawler not started: Use the `startCrawl` MCP tool or enable `crawl-on-startup: true`
- No matching files: Check that your directories contain files matching the include patterns
- Files failed to index: Check the logs for errors, use `getCrawlerStats` to see the failed file count
- Check directory paths: Ensure paths in `application.yaml` are absolute and exist
- Verify file permissions: The server needs read access to all files
- Check include patterns: Files must match at least one include pattern
- Check exclude patterns: Files must not match any exclude pattern
- Monitor crawler status: Use the `getCrawlerStatus` and `getCrawlerStats` MCP tools
- Check logs: Look for parsing errors or I/O exceptions
If you encounter OOM errors with very large documents:
- Set content limit: Change `max-content-length` in `application.yaml` (e.g., `5242880` for 5MB)
- Increase JVM heap: Add `-Xmx2g` to the JVM arguments in the Claude Desktop config
- Reduce thread pool: Lower `thread-pool-size` to reduce concurrent processing
- Reduce batch size: Lower `batch-size` to commit more frequently
- Increase thread pool: Raise `thread-pool-size` (default: 4)
- Increase batch size: Raise `batch-size` for fewer commits (default: 100)
- Disable language detection: Set `detect-language: false` if not needed
- Disable metadata extraction: Set `extract-metadata: false` if not needed
- Check disk I/O: Slow disk can bottleneck indexing
- Edit `application.yaml`:
lucene:
crawler:
directories:
- "/Users/yourname/Documents"
crawl-on-startup: true
- Start the server:
java -jar target/luceneserver-0.0.1-SNAPSHOT.jar
- The crawler automatically starts and indexes all supported documents in your Documents folder.
Ask Claude:
Search for "machine learning" in PDF documents only
Claude will use:
query: "machine learning"
filterField: "file_extension"
filterValue: "pdf"
Ask Claude:
Find all documents written by John Doe
Claude will use:
query: "*"
filterField: "author"
filterValue: "John Doe"
Ask Claude:
Show me the crawler statistics
Claude calls getCrawlerStats() and shows:
- Files processed: 1,234 / 5,000
- Throughput: 85 files/sec
- Indexed: 1,200 (98%)
- Failed: 34 (2%)
Ask Claude:
Reindex all documents from scratch
Claude calls startCrawl(fullReindex: true), which:
- Clears the existing index
- Re-crawls all configured directories
- Indexes all documents fresh
Ask Claude:
Find German documents about "Technologie"
Claude uses:
query: "Technologie"
filterField: "language"
filterValue: "de"
Search results include a passages array with highlighted excerpts and quality metadata:
{
"file_name": "report.pdf",
"passages": [
{
"text": "...discusses the impact of <em>machine learning</em> on modern software development. The study shows...",
"score": 1.0,
"matchedTerms": ["machine learning"],
"termCoverage": 1.0,
"position": 0.08
},
{
"text": "...<em>machine learning</em> algorithms were applied to the dataset in Section 4...",
"score": 0.75,
"matchedTerms": ["machine learning"],
"termCoverage": 1.0,
"position": 0.45
}
]
}
This allows you to see relevant excerpts without downloading the full document. The metadata fields help LLMs quickly identify the best passage: prefer passages with high `termCoverage` (covers more of the query) and use `position` for document-structure context.
Ask Claude to manage directories without editing configuration files:
"What directories are currently being crawled?"
# Claude calls listCrawlableDirectories()
# Response: Shows all configured directories and config file location
"Add /Users/yourname/Research as a crawlable directory"
# Claude calls addCrawlableDirectory(path="/Users/yourname/Research")
# Directory is added to ~/.mcplucene/config.yaml
"Add /Users/yourname/Projects and start crawling it now"
# Claude calls addCrawlableDirectory(path="/Users/yourname/Projects", crawlNow=true)
# Directory is added and crawl starts immediately
"Stop crawling /Users/yourname/Downloads"
# Claude calls removeCrawlableDirectory(path="/Users/yourname/Downloads")
# Directory is removed from config (indexed documents remain)
Configuration Persistence:
The directories you add via MCP tools are saved to ~/.mcplucene/config.yaml:
lucene:
crawler:
directories:
- /Users/yourname/Documents
- /Users/yourname/Research
- /Users/yourname/Projects
This configuration persists across server restarts - no need to reconfigure each time.
Environment Variable Override:
If you set the LUCENE_CRAWLER_DIRECTORIES environment variable, it takes precedence:
{
"mcpServers": {
"lucene-search": {
"command": "java",
"args": ["-Dspring.profiles.active=deployed", "-jar", "/path/to/jar"],
"env": {
"LUCENE_CRAWLER_DIRECTORIES": "/path1,/path2"
}
}
}
}
When this is set, `addCrawlableDirectory` and `removeCrawlableDirectory` will return an error message indicating the environment override is active.
Note: When using this server through Claude or another AI assistant, synonym expansion happens automatically - the AI constructs OR queries for you based on your natural language request. The examples below show the underlying query syntax for reference or direct API usage.
Since the search engine performs exact lexical matching without automatic synonym expansion, you need to explicitly include synonyms and word variations in your query:
❌ Basic search (might miss relevant results):
query: "car"
This will ONLY match documents containing the exact word "car", missing documents with "automobile", "vehicle", etc.
✅ Better: Include synonyms with OR:
query: "(car OR automobile OR vehicle)"
✅ Best: Combine synonyms with wildcards for variations:
query: "(car* OR automobile* OR vehicle*)"
This matches: car, cars, automobile, automobiles, vehicle, vehicles, etc.
Real-world example - Finding contracts:
query: "(contract* OR agreement* OR deal*) AND (sign* OR execut* OR finali*)"
filterField: "file_extension"
filterValue: "pdf"
This will find documents containing variations like:
- "contract signed", "agreement executed", "deal finalized"
- "contracts signing", "agreements execute", "deals finalizing"
💡 Tip: Use the facets in the search response to discover the exact terms used in your documents, then refine your query accordingly.
When developing and debugging in your IDE, run the server without the "deployed" profile to get full logging:
In your IDE (IntelliJ, Eclipse, VS Code):
# Just run the main class directly - no profile needed
# You'll see full console logging and debug output
java -jar target/luceneserver-0.0.1-SNAPSHOT.jar
This gives you:
- ✅ Complete logging output for debugging
- ✅ Configuration loaded from classpath and user config
- ✅ All debug information visible in console
For production/Claude Desktop deployment:
# Use the deployed profile for clean STDIO
java --enable-native-access=ALL-UNNAMED -Xmx2g -Dspring.profiles.active=deployed -jar target/luceneserver-0.0.1-SNAPSHOT.jar
Recommended approach: Use the document crawler by configuring directories in application.yaml. The crawler automatically handles content extraction, metadata, and language detection.
Programmatic approach: For custom document types or direct indexing:
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// LuceneIndexService is provided by the server (e.g. injected as a Spring bean).
public void addDocument(LuceneIndexService indexService, String title, String content) throws IOException {
    Document doc = new Document();
    doc.add(new TextField("title", title, Field.Store.YES));
    doc.add(new TextField("content", content, Field.Store.YES));
    doc.add(new StringField("file_path", "/custom/path", Field.Store.YES));
    indexService.getIndexWriter().addDocument(doc);
    indexService.getIndexWriter().commit();
}
For the full field schema, see the Index Field Schema section.
