refactor: rewrite ERP MCP server to FastMCP v3 with Google OAuth#1
refactor: rewrite ERP MCP server to FastMCP v3 with Google OAuth#1HusnainRazaGhaffar merged 11 commits intomainfrom
Conversation
Replace the 8-file Playwright-based codebase with a stateless 2-file server (server.py + erp_client.py) using FastMCP v3 GoogleProvider for MCP-native OAuth. Eliminates browser automation, SSE transport, and global singleton auth in favor of per-request token exchange with SHA-256 cached keys, domain restriction, and structured audit logging. - server.py: 11 @mcp.tool declarations, GoogleProvider auth, HTTP transport - erp_client.py: stateless ERPClient with httpx, TTLCache, all ERP API calls - Dockerfile: python:3.12.8-slim-bookworm (~150MB vs ~1.2GB Playwright image) - 116 unit tests (75 erp_client + 41 server) - Security controls: SEC-01 through SEC-07 from audit findings - Delete 15 old files (api_client, auth, oauth_server, browser auth, SSE, etc.)
Triggers on v* tags, authenticates via OIDC (id-token: write), builds Docker image and pushes to AWS ECR with version tag + latest.
Critical:
- Fix check_person_week_project_exists to actually compare target project_id
- Add /health endpoint for Docker HEALTHCHECK via @mcp.custom_route
- Replace hardcoded absolute path in test with inspect.getfile()
- Pin all GitHub Actions to commit SHAs (supply chain hardening)
- Replace unreliable atexit handler with FastMCP lifespan context manager
- Add test job as prerequisite for ECR build-and-push
High:
- Defer GoogleProvider construction (os.environ.get + __main__ validation)
- Pin dependencies with compatible-release specifiers (~=)
- Replace magic number 66 with _DEFAULT_LABEL_ID constant
- Fix _match_project_in_week_log to use case-insensitive exact match
- Narrow _request JSON parse catch to (JSONDecodeError, ValueError)
- Add async context manager (__aenter__/__aexit__) to ERPClient
- Add timeout-minutes, concurrency group to CI workflow
- Remove unconditional latest tag from ECR push
Medium:
- Extract _find_active_project helper (deduplicate create/delete)
- Replace datetime.strptime with date.fromisoformat throughout
- Raise ConnectionError instead of ValueError for transport errors
- Fix hours precision with divmod instead of truncation
- Fix test client fixture to close httpx client on teardown
- Add pyproject.toml with pytest/mypy/ruff configuration
- Add PR/branch triggers, workflow-level permissions: {}, GHA cache
- Add missing .dockerignore entries
- TTLCache: generic PEP 695 syntax, asyncio.Lock for async-safe access - server.py: exception chaining (B904), lifespan validation, MCP_HOST env var - CI: setup-buildx-action, ref-scoped concurrency, pip caching, semver validation - Docker: base image digest pin, MCP_HOST=0.0.0.0, version label - Tests: remove redundant decorators, fix imports, add encoding param - Linting: ruff A rule, per-file-ignores, all checks clean
…ty, CI/Docker hardening) - HIGH: Add date range cap (366 days) on get_logs_for_date_range - HIGH: Add hours validation (0 < h <= 24) on write tools - HIGH: Rename TTLCache get/put to _get/_put, add aclear() with lock - HIGH: Prefer exact match over substring in resolve_project_id - MEDIUM: Add observability logging to _get_erp_token - MEDIUM: Validate allowed_domain not empty, guard email without '@' - MEDIUM: Add /health version field, BuildKit cache mount, APP_VERSION build arg - MEDIUM: Create requirements-dev.txt, .env.example - LOW: Anchor semver regex, add pip to dependabot, improve error messages - NIT: Add credential patterns and dev caches to .dockerignore - Tests: 21 new tests (137 total), all passing, ruff/mypy clean
…oped cache - Add workflow_dispatch trigger with push_image input for dev builds from any branch - Dev images tagged dev-<short-sha> + dev-latest, with build-only mode option - Add Trivy vulnerability scanner (HIGH/CRITICAL, SARIF upload) to both release and dev jobs - Switch to branch-scoped GHA cache to prevent cache pollution between branches
- Restructure CI to scan-before-push (Trivy gate with exit-code 1) - Add Trivy scan for no-push dev builds and cache fallback to main - Scope cancel-in-progress concurrency to test job only - Convert Dockerfile to multi-stage build, strip pip from runtime image - Wire reproducible builds via requirements.lock / requirements-dev.lock - Add _check_erp_result() helper for error-dict-to-ToolError conversion - Move ERPClient construction into _lifespan context manager - Add logging in _request for HTTP/transport errors - Return partial_error/error status from fill_logs_for_days - Raise ValueError in resolve_label_id on unmatched label name - Add input validations (description length, week_starting Monday check) - Harden .dockerignore with .env.* and token.json patterns - Add dependabot reviewers
…dd CONTRIBUTING.md
Apply all findings from the self-audit and extensibility assessment:
server.py:
- Replace 11 identical try/except blocks with _tool_error_handler decorator
- Add _get_erp() singleton guard (RuntimeError if lifespan not started)
- Remove version from /health endpoint response
- Initialize module-level erp as None with proper type annotation
erp_client.py:
- Add _MAX_TOKEN_LENGTH (4096) guard on token exchange
- Sanitize transport/unexpected error messages (no internal details)
- Replace _LOCALHOST_HOSTS frozenset with _is_loopback() using ipaddress module
- TTLCache.__len__ now evicts expired entries before counting
- Reduce TTL from 1800s to 900s, add explicit verify=True
CI/CD (.github/workflows/build-and-push.yml):
- Tighten tag filter to v[0-9]+.[0-9]+.[0-9]+ (reject malformed tags)
- Add mutual exclusivity guard on release job
- Remove floating :latest tag from release builds
- Raise Trivy gate to HIGH+CRITICAL severity
- Fix SARIF upload condition, add dev cache-from main fallback
- Pin Python to 3.12.8, normalize ${{ }} syntax
Docker:
- Remove conflicting --no-cache-dir (let BuildKit cache work)
- Strip pip/setuptools/wheel from venv in builder stage
- Add timeout=3 to healthcheck urllib call
- Add BUILD_DATE label
- Add requirements-dev.lock/txt to .dockerignore
Tests:
- Add _discover_tool_names() and _discover_write_tool_names() AST helpers
- Auto-discover tools in TestNoEmailParameter (remove hardcoded list)
- Add TestAuditLogCoverage.test_all_write_tools_covered guard test
- Add 6 new ERPClient tests (TTL len, ambiguous substring, recursion, etc.)
- Expand ruff rules (A, N, S, C4, PT, RUF), fix all lint issues
Docs:
- Add CONTRIBUTING.md with tool-addition walkthrough, security checklist,
and pre-submit checklist
erp_client.py (12 fixes): - Fix TTLCache.__len__ to be read-only (no mutation without lock) - Add per-key lock to prevent token exchange thundering herd - Change `if not week_log_id` to `if week_log_id is None` (lines 808, 890) - Standardize description matching to strip().lower() in both paths - Fix per-project minutes overflow with carry-over logic - Truncate response text to 500 chars on JSON parse failure - Split httpx timeout: connect=5s, read/write=30s, pool=5s - Validate ERP token format before caching (regex check) - Remove overly permissive bidirectional substring matching - Truncate forwarded ERP error messages to 500 chars - Align TTLCache default TTL to 900s (matches actual usage) - Clarify TTLCache docstring re async safety server.py (7 fixes): - Preserve function signatures through decorator (ParamSpec/TypeVar) - Add logging.basicConfig with LOG_LEVEL env var - Move OAuth credential validation into lifespan (fail-fast for uvicorn) - Add SecurityHeadersMiddleware for defense-in-depth - Reject sub-minute hours that round to 0 minutes - Reject simultaneous project_id + project_name on all tools - Document module-level singleton constraint CI/CD (8 fixes): - Add ruff + mypy steps to test job - Harden release cache scope to main-only (prevent cache poisoning) - Add docker ecosystem to dependabot.yml - Make release concurrency group per-tag - Fix SARIF upload condition in dev job (always() pattern) - Update codeql-action version comment - Document Trivy ignore-unfixed trade-off - Add JUnit XML output to pytest Dockerfile (4 fixes): - Add apt-get upgrade in runtime stage for CVE patching - Fix pip uninstall error suppression (; true -> || true) - Document health check port and APP_VERSION ENV Tests (7 additions): - Oversized token rejection, non-Monday date, reversed date range, oversized description (3 tools), realistic expires_at in fixture Dependencies: - Add ruff and mypy to requirements-dev.txt and regenerate lock file
Match the redirect URI already requested in Google Cloud Console (https://dev-workstream.arbisoft.com/callback) instead of FastMCP's default /auth/callback.
| raise ValueError("allowed_domain must not be empty") | ||
| self._token_cache: TTLCache[tuple[str, str]] = TTLCache(maxsize=500) | ||
| # Per-key locks to coalesce concurrent token exchanges for the same key. | ||
| self._exchange_locks: dict[str, asyncio.Lock] = {} |
There was a problem hiding this comment.
[💡 Suggestion]: _exchange_locks dict grows unboundedly
Every unique Google token adds an asyncio.Lock to self._exchange_locks that is never removed. Over the lifetime of the process, this dict accumulates one entry per distinct cache_key (SHA-256 hash). While the TTLCache evicts stale tokens, the corresponding locks remain forever.
In a long-running single-worker deployment this is a slow memory leak. Consider cleaning up locks after use, or use a bounded TTLCache for locks as well:
Suggested fix:
# Option A: clean up after use
async with lock:
... # existing logic
# After releasing the lock, remove it if no one else is waiting
if not lock.locked():
self._exchange_locks.pop(cache_key, None)
# Option B: use setdefault (more idiomatic, still needs cleanup)
lock = self._exchange_locks.setdefault(cache_key, asyncio.Lock())
erp_client.py
Outdated
|
|
||
| result = await self.get_log_labels(token) | ||
| if result.get("status") != "success": | ||
| return None |
There was a problem hiding this comment.
[💡 Suggestion]: resolve_label_id silently falls back to default label on API failure
When a user provides label_name but the get_log_labels API call fails, this method returns None (line 1130). The caller in create_or_update_log then substitutes _DEFAULT_LABEL_ID, silently ignoring the user's label preference.
The user explicitly asked for a specific label but gets the default without any warning. Consider raising a ValueError instead, consistent with how resolve_project_id handles API failures (line 1076).
Suggested fix:
result = await self.get_log_labels(token)
if result.get("status") != "success":
raise ValueError(
f"Failed to fetch log labels (cannot resolve '{label_name}'): "
f"{result.get('message')}"
)
NIXKnight
left a comment
There was a problem hiding this comment.
See inline comments for details.
…ling - Clean up per-key exchange locks after release to prevent unbounded dict growth over the process lifetime - Raise ValueError in resolve_label_id when the labels API fails, instead of silently falling back to the default label
Summary
Complete rewrite of the ERP MCP server from raw MCP SDK to FastMCP v3 with native Google OAuth:
FastMCP(stateless_http=True)and nativeGoogleProviderfor OAuthERPClientclass withTTLCache(SHA-256 hashed keys, LRU eviction, TTL 15 min), per-key coalescing locks, andhttpxasync transportcreate_or_update_loguses PATCH on existing week logs or POST fallback, with project resolution via case-insensitive substring matchinghdclaim, no HTTP redirects, no stack traces in responses, ERP token format validation,SecurityHeadersMiddlewareget_week_logs,get_logs_for_date_range,create_or_update_log,delete_log,fill_logs_for_days,get_active_projects,check_person_week_project_existsDeleted:
main.py,tools/,utils/,auth/,services/— the entire legacy module tree replaced byserver.py+erp_client.pyContext
The legacy server used raw MCP SDK with a custom OAuth implementation spread across multiple modules. This rewrite consolidates everything into two files (
server.pyfor tool definitions and OAuth,erp_client.pyfor ERP API communication) using FastMCP v3's built-in patterns. The ERP API contract is unchanged — this is a transport and framework rewrite, not a backend change.