New MCP tools and other improvements

Latest

alexsku released this 19 Nov 02:27

8761474

Response Token Budget Management

New TokenCountEstimator class for fast token counting using character-based heuristics
Automatic result truncation via _select_results_within_budget() to prevent context window issues
Configurable token limits:
- TOOL_RESPONSE_TOKEN_LIMIT environment variable (default: 80,000 tokens)
- ENTITY_SCHEMA_TOKEN_BUDGET environment variable (default: 16,000 tokens per entity)
90% safety buffer to account for token estimation inaccuracies
Ensures at least one result is always returned
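
The budget logic above can be sketched roughly as follows. This is a minimal illustration, not the actual TokenCountEstimator or _select_results_within_budget() code: the 4-characters-per-token ratio, the reading of the 90% buffer as "use 90% of the limit", and the function names estimate_tokens, select_within_budget, and configured_limit are all assumptions.

```python
import os
from typing import Iterable, Iterator

# Assumption: roughly 4 characters per token. The release notes do not state
# the ratio the real estimator uses.
CHARS_PER_TOKEN = 4
# Assumption: the "90% safety buffer" means only 90% of the limit is spent.
SAFETY_FACTOR = 0.9


def estimate_tokens(text: str) -> int:
    """Cheap character-based estimate; no tokenizer is involved."""
    return len(text) // CHARS_PER_TOKEN + 1


def select_within_budget(results: Iterable[str], token_limit: int) -> Iterator[str]:
    """Yield results until the buffered budget is spent.

    The first result is always yielded, even if it exceeds the budget alone.
    """
    budget = int(token_limit * SAFETY_FACTOR)
    used = 0
    for index, result in enumerate(results):
        cost = estimate_tokens(result)
        if index > 0 and used + cost > budget:
            break
        used += cost
        yield result


def configured_limit() -> int:
    """Read the limit from the environment, falling back to the documented default."""
    return int(os.environ.get("TOOL_RESPONSE_TOKEN_LIMIT", "80000"))
```

Because the selection is a generator, a caller can stop consuming early and results past the budget are never serialized.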

Enhanced Search Capabilities

Enhanced Keyword Search:
- Supports pagination via the start parameter
- Adds viewUrn for view-based filtering
- Adds sortInput for custom sorting

Query Entity Support

Native QueryEntity type support (SQL queries as first-class entities)
New query_entity.gql GraphQL query
Optimized entity retrieval with specialized query for QueryEntity types
Includes query statement, subjects (datasets/fields), and platform information

GraphQL Compatibility

Adaptive field detection for newer GMS versions
Caching mechanism for GMS version detection
Graceful fallback when newer fields aren't available
Support for #[CLOUD] and #[NEWER_GMS] conditional field markers
DISABLE_NEWER_GMS_FIELD_DETECTION environment variable override
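
One way the conditional markers might be applied is sketched below. The release notes do not describe the marker syntax, so this assumes a marker sits in a trailing comment on the line of the field it guards, and that the whole line is dropped when the feature is unavailable. The function name strip_unsupported_fields and the field names in the sample query are hypothetical.

```python
from typing import Set


def strip_unsupported_fields(query: str, unsupported_markers: Set[str]) -> str:
    """Drop every query line tagged with a marker the server cannot serve.

    Assumption: markers such as #[CLOUD] or #[NEWER_GMS] appear as trailing
    comments on the line that declares the guarded field.
    """
    kept = [
        line
        for line in query.splitlines()
        if not any(marker in line for marker in unsupported_markers)
    ]
    return "\n".join(kept)


# Hypothetical query text; fieldA, fieldB, and fieldC are placeholders.
SAMPLE_QUERY = """query {
  fieldA
  fieldB  #[NEWER_GMS]
  fieldC  #[CLOUD]
}"""

# Fallback for an older GMS: only the #[NEWER_GMS] line is removed.
older_gms_query = strip_unsupported_fields(SAMPLE_QUERY, {"#[NEWER_GMS]"})
```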

Schema Field Optimization

Smart field prioritization to stay within token budgets:
1. Primary key fields (isPartOfKey=true)
2. Partitioning key fields (isPartitioningKey=true)
3. Fields with descriptions
4. Fields with tags or glossary terms
5. Alphabetically by field path
Generator-based approach for memory efficiency
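
The priority order above could be expressed as a sort key, as in this rough sketch. It assumes the five rules form lexicographic tiers, that fields are plain dicts with keys named after the attributes mentioned above (description, tags, glossaryTerms, and fieldPath are assumed key names), and it reuses an assumed 4-characters-per-token cost. The names priority_key and prioritized_fields are illustrative.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Field = Dict[str, Any]


def priority_key(field: Field) -> Tuple[bool, bool, bool, bool, str]:
    """Lower tuples sort first; False sorts before True, so each test is negated."""
    return (
        not field.get("isPartOfKey", False),
        not field.get("isPartitioningKey", False),
        not field.get("description"),
        not (field.get("tags") or field.get("glossaryTerms")),
        field.get("fieldPath", ""),
    )


def prioritized_fields(fields: Iterable[Field], token_budget: int) -> Iterator[Field]:
    """Yield fields in priority order until the budget is spent."""
    used = 0
    for field in sorted(fields, key=priority_key):
        # Assumed cost model: about 4 characters of serialized field per token.
        cost = len(str(field)) // 4 + 1
        if used + cost > token_budget:
            break
        used += cost
        yield field
```

Note that sorted() materializes the field list once; only the budgeted selection is lazy in this sketch.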

Error Handling & Security

Enhanced error logging with full stack traces in the async_background wrapper
Logs function name, args, and kwargs on failures
ReDoS protection in HTML sanitization with bounded regex patterns
Query truncation function (configurable via QUERY_LENGTH_HARD_LIMIT, default: 5,000 chars)
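
A query truncation helper along these lines might look like the following. This is only a sketch: the function name truncate_query and the suffix text are hypothetical, and the real function may mark truncation differently.

```python
import os
from typing import Optional

DEFAULT_QUERY_LIMIT = 5000
# Hypothetical marker appended so readers can tell the query was cut.
TRUNCATION_SUFFIX = "... [truncated]"


def truncate_query(query: str, limit: Optional[int] = None) -> str:
    """Cut a query statement at the hard limit, leaving short queries untouched."""
    if limit is None:
        limit = int(os.environ.get("QUERY_LENGTH_HARD_LIMIT", DEFAULT_QUERY_LIMIT))
    if len(query) <= limit:
        return query
    return query[:limit] + TRUNCATION_SUFFIX
```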

Default Views Support

Automatic default view application for all search operations
Fetches the organization's default global view from DataHub
5-minute caching (configurable via VIEW_CACHE_TTL_SECONDS)
Can be disabled via DATAHUB_MCP_DISABLE_DEFAULT_VIEW environment variable
Ensures search results respect the organization's data governance policies
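
The time-based caching described above can be illustrated with a small standard-library sketch. The class name DefaultViewCache and the injected fetch callable are hypothetical, and the actual server may implement this differently (for example with cachetools). The 300-second default mirrors the 5-minute TTL.

```python
import os
import time
from typing import Callable, Optional


class DefaultViewCache:
    """Cache the result of a fetch callable for a fixed number of seconds."""

    def __init__(
        self,
        fetch: Callable[[], Optional[str]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at: Optional[float] = None

    def get(self) -> Optional[str]:
        """Return the cached view URN, refetching once the TTL has elapsed."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


def configured_ttl() -> float:
    """Read the TTL from the environment, defaulting to 5 minutes."""
    return float(os.environ.get("VIEW_CACHE_TTL_SECONDS", "300"))
```

Injecting the clock keeps the expiry logic testable without sleeping.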

Dependencies

Added cachetools>=5.0.0: For GMS field detection caching
Added types-cachetools (dev): Type stubs for mypy

Performance

Memory efficiency: Generator-based result selection avoids loading all results into memory
Caching: GMS version detection cached per graph instance
Fast token estimation: Character-based heuristic (no tokenizer overhead)
Smart truncation: Truncates less important schema fields first