Skip to content

refactor: rewrite ERP MCP server to FastMCP v3 with Google OAuth#1

Merged
HusnainRazaGhaffar merged 11 commits intomainfrom
refactor/fastmcp-v3
Feb 25, 2026
Merged

refactor: rewrite ERP MCP server to FastMCP v3 with Google OAuth#1
HusnainRazaGhaffar merged 11 commits intomainfrom
refactor/fastmcp-v3

Conversation

@HusnainRazaGhaffar
Copy link

Summary

Complete rewrite of the ERP MCP server from raw MCP SDK to FastMCP v3 with native Google OAuth:

  • FastMCP v3 framework — replaces hand-rolled MCP protocol handling with FastMCP(stateless_http=True) and native GoogleProvider for OAuth
  • ERPClient refactor — extracted all ERP HTTP logic into a standalone ERPClient class with TTLCache (SHA-256 hashed keys, LRU eviction, TTL 15 min), per-key coalescing locks, and httpx async transport
  • Write-path dual strategycreate_or_update_log uses PATCH on existing week logs or POST fallback, with project resolution via case-insensitive substring matching
  • Security hardening — no email parameters on tools (identity from Google token only), domain restriction via hd claim, no HTTP redirects, no stack traces in responses, ERP token format validation, SecurityHeadersMiddleware
  • 7 MCP toolsget_week_logs, get_logs_for_date_range, create_or_update_log, delete_log, fill_logs_for_days, get_active_projects, check_person_week_project_exists
  • 149 passing tests — full coverage of ERPClient (respx-mocked HTTP), server tools (mocked ERPClient), TTLCache, and security controls
  • CI/CD pipeline — GitHub Actions with lint (ruff), type check (mypy), test, Docker build, Trivy vulnerability scanning, and ECR push (OIDC, no long-lived credentials)

Deleted: main.py, tools/, utils/, auth/, services/ — the entire legacy module tree replaced by server.py + erp_client.py

Context

The legacy server used raw MCP SDK with a custom OAuth implementation spread across multiple modules. This rewrite consolidates everything into two files (server.py for tool definitions and OAuth, erp_client.py for ERP API communication) using FastMCP v3's built-in patterns. The ERP API contract is unchanged — this is a transport and framework rewrite, not a backend change.

Replace the 8-file Playwright-based codebase with a stateless 2-file
server (server.py + erp_client.py) using FastMCP v3 GoogleProvider for
MCP-native OAuth. Eliminates browser automation, SSE transport, and
global singleton auth in favor of per-request token exchange with
SHA-256 cached keys, domain restriction, and structured audit logging.

- server.py: 11 @mcp.tool declarations, GoogleProvider auth, HTTP transport
- erp_client.py: stateless ERPClient with httpx, TTLCache, all ERP API calls
- Dockerfile: python:3.12.8-slim-bookworm (~150MB vs ~1.2GB Playwright image)
- 116 unit tests (75 erp_client + 41 server)
- Security controls: SEC-01 through SEC-07 from audit findings
- Delete 15 old files (api_client, auth, oauth_server, browser auth, SSE, etc.)
Triggers on v* tags, authenticates via OIDC (id-token: write),
builds Docker image and pushes to AWS ECR with version tag + latest.
Critical:
- Fix check_person_week_project_exists to actually compare target project_id
- Add /health endpoint for Docker HEALTHCHECK via @mcp.custom_route
- Replace hardcoded absolute path in test with inspect.getfile()
- Pin all GitHub Actions to commit SHAs (supply chain hardening)
- Replace unreliable atexit handler with FastMCP lifespan context manager
- Add test job as prerequisite for ECR build-and-push

High:
- Defer GoogleProvider construction (os.environ.get + __main__ validation)
- Pin dependencies with compatible-release specifiers (~=)
- Replace magic number 66 with _DEFAULT_LABEL_ID constant
- Fix _match_project_in_week_log to use case-insensitive exact match
- Narrow _request JSON parse catch to (JSONDecodeError, ValueError)
- Add async context manager (__aenter__/__aexit__) to ERPClient
- Add timeout-minutes, concurrency group to CI workflow
- Remove unconditional latest tag from ECR push

Medium:
- Extract _find_active_project helper (deduplicate create/delete)
- Replace datetime.strptime with date.fromisoformat throughout
- Raise ConnectionError instead of ValueError for transport errors
- Fix hours precision with divmod instead of truncation
- Fix test client fixture to close httpx client on teardown
- Add pyproject.toml with pytest/mypy/ruff configuration
- Add PR/branch triggers, workflow-level permissions: {}, GHA cache
- Add missing .dockerignore entries
- TTLCache: generic PEP 695 syntax, asyncio.Lock for async-safe access
- server.py: exception chaining (B904), lifespan validation, MCP_HOST env var
- CI: setup-buildx-action, ref-scoped concurrency, pip caching, semver validation
- Docker: base image digest pin, MCP_HOST=0.0.0.0, version label
- Tests: remove redundant decorators, fix imports, add encoding param
- Linting: ruff A rule, per-file-ignores, all checks clean
…ty, CI/Docker hardening)

- HIGH: Add date range cap (366 days) on get_logs_for_date_range
- HIGH: Add hours validation (0 < h <= 24) on write tools
- HIGH: Rename TTLCache get/put to _get/_put, add aclear() with lock
- HIGH: Prefer exact match over substring in resolve_project_id
- MEDIUM: Add observability logging to _get_erp_token
- MEDIUM: Validate allowed_domain not empty, guard email without '@'
- MEDIUM: Add /health version field, BuildKit cache mount, APP_VERSION build arg
- MEDIUM: Create requirements-dev.txt, .env.example
- LOW: Anchor semver regex, add pip to dependabot, improve error messages
- NIT: Add credential patterns and dev caches to .dockerignore
- Tests: 21 new tests (137 total), all passing, ruff/mypy clean
…oped cache

- Add workflow_dispatch trigger with push_image input for dev builds from any branch
- Dev images tagged dev-<short-sha> + dev-latest, with build-only mode option
- Add Trivy vulnerability scanner (HIGH/CRITICAL, SARIF upload) to both release and dev jobs
- Switch to branch-scoped GHA cache to prevent cache pollution between branches
- Restructure CI to scan-before-push (Trivy gate with exit-code 1)
- Add Trivy scan for no-push dev builds and cache fallback to main
- Scope cancel-in-progress concurrency to test job only
- Convert Dockerfile to multi-stage build, strip pip from runtime image
- Wire reproducible builds via requirements.lock / requirements-dev.lock
- Add _check_erp_result() helper for error-dict-to-ToolError conversion
- Move ERPClient construction into _lifespan context manager
- Add logging in _request for HTTP/transport errors
- Return partial_error/error status from fill_logs_for_days
- Raise ValueError in resolve_label_id on unmatched label name
- Add input validations (description length, week_starting Monday check)
- Harden .dockerignore with .env.* and token.json patterns
- Add dependabot reviewers
…dd CONTRIBUTING.md

Apply all findings from the self-audit and extensibility assessment:

server.py:
- Replace 11 identical try/except blocks with _tool_error_handler decorator
- Add _get_erp() singleton guard (RuntimeError if lifespan not started)
- Remove version from /health endpoint response
- Initialize module-level erp as None with proper type annotation

erp_client.py:
- Add _MAX_TOKEN_LENGTH (4096) guard on token exchange
- Sanitize transport/unexpected error messages (no internal details)
- Replace _LOCALHOST_HOSTS frozenset with _is_loopback() using ipaddress module
- TTLCache.__len__ now evicts expired entries before counting
- Reduce TTL from 1800s to 900s, add explicit verify=True

CI/CD (.github/workflows/build-and-push.yml):
- Tighten tag filter to v[0-9]+.[0-9]+.[0-9]+ (reject malformed tags)
- Add mutual exclusivity guard on release job
- Remove floating :latest tag from release builds
- Raise Trivy gate to HIGH+CRITICAL severity
- Fix SARIF upload condition, add dev cache-from main fallback
- Pin Python to 3.12.8, normalize ${{ }} syntax

Docker:
- Remove conflicting --no-cache-dir (let BuildKit cache work)
- Strip pip/setuptools/wheel from venv in builder stage
- Add timeout=3 to healthcheck urllib call
- Add BUILD_DATE label
- Add requirements-dev.lock/txt to .dockerignore

Tests:
- Add _discover_tool_names() and _discover_write_tool_names() AST helpers
- Auto-discover tools in TestNoEmailParameter (remove hardcoded list)
- Add TestAuditLogCoverage.test_all_write_tools_covered guard test
- Add 6 new ERPClient tests (TTL len, ambiguous substring, recursion, etc.)
- Expand ruff rules (A, N, S, C4, PT, RUF), fix all lint issues

Docs:
- Add CONTRIBUTING.md with tool-addition walkthrough, security checklist,
  and pre-submit checklist
erp_client.py (12 fixes):
- Fix TTLCache.__len__ to be read-only (no mutation without lock)
- Add per-key lock to prevent token exchange thundering herd
- Change `if not week_log_id` to `if week_log_id is None` (lines 808, 890)
- Standardize description matching to strip().lower() in both paths
- Fix per-project minutes overflow with carry-over logic
- Truncate response text to 500 chars on JSON parse failure
- Split httpx timeout: connect=5s, read/write=30s, pool=5s
- Validate ERP token format before caching (regex check)
- Remove overly permissive bidirectional substring matching
- Truncate forwarded ERP error messages to 500 chars
- Align TTLCache default TTL to 900s (matches actual usage)
- Clarify TTLCache docstring re async safety

server.py (7 fixes):
- Preserve function signatures through decorator (ParamSpec/TypeVar)
- Add logging.basicConfig with LOG_LEVEL env var
- Move OAuth credential validation into lifespan (fail-fast for uvicorn)
- Add SecurityHeadersMiddleware for defense-in-depth
- Reject sub-minute hours that round to 0 minutes
- Reject simultaneous project_id + project_name on all tools
- Document module-level singleton constraint

CI/CD (8 fixes):
- Add ruff + mypy steps to test job
- Harden release cache scope to main-only (prevent cache poisoning)
- Add docker ecosystem to dependabot.yml
- Make release concurrency group per-tag
- Fix SARIF upload condition in dev job (always() pattern)
- Update codeql-action version comment
- Document Trivy ignore-unfixed trade-off
- Add JUnit XML output to pytest

Dockerfile (4 fixes):
- Add apt-get upgrade in runtime stage for CVE patching
- Fix pip uninstall error suppression (; true -> || true)
- Document health check port and APP_VERSION ENV

Tests (7 additions):
- Oversized token rejection, non-Monday date, reversed date range,
  oversized description (3 tools), realistic expires_at in fixture

Dependencies:
- Add ruff and mypy to requirements-dev.txt and regenerate lock file
Match the redirect URI already requested in Google Cloud Console
(https://dev-workstream.arbisoft.com/callback) instead of FastMCP's
default /auth/callback.
raise ValueError("allowed_domain must not be empty")
self._token_cache: TTLCache[tuple[str, str]] = TTLCache(maxsize=500)
# Per-key locks to coalesce concurrent token exchanges for the same key.
self._exchange_locks: dict[str, asyncio.Lock] = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[💡 Suggestion]: _exchange_locks dict grows unboundedly

Every unique Google token adds an asyncio.Lock to self._exchange_locks that is never removed. Over the lifetime of the process, this dict accumulates one entry per distinct cache_key (SHA-256 hash). While the TTLCache evicts stale tokens, the corresponding locks remain forever.

In a long-running single-worker deployment this is a slow memory leak. Consider cleaning up locks after use, or use a bounded TTLCache for locks as well:

Suggested fix:

# Option A: clean up after use
async with lock:
    ...  # existing logic
# After releasing the lock, remove it if no one else is waiting
if not lock.locked():
    self._exchange_locks.pop(cache_key, None)

# Option B: use setdefault (more idiomatic, still needs cleanup)
lock = self._exchange_locks.setdefault(cache_key, asyncio.Lock())

erp_client.py Outdated

result = await self.get_log_labels(token)
if result.get("status") != "success":
return None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[💡 Suggestion]: resolve_label_id silently falls back to default label on API failure

When a user provides label_name but the get_log_labels API call fails, this method returns None (line 1130). The caller in create_or_update_log then substitutes _DEFAULT_LABEL_ID, silently ignoring the user's label preference.

The user explicitly asked for a specific label but gets the default without any warning. Consider raising a ValueError instead, consistent with how resolve_project_id handles API failures (line 1076).

Suggested fix:

result = await self.get_log_labels(token)
if result.get("status") != "success":
    raise ValueError(
        f"Failed to fetch log labels (cannot resolve '{label_name}'): "
        f"{result.get('message')}"
    )

NIXKnight
NIXKnight previously approved these changes Feb 24, 2026
Copy link
Collaborator

@NIXKnight NIXKnight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comments for details.

…ling

- Clean up per-key exchange locks after release to prevent unbounded
  dict growth over the process lifetime
- Raise ValueError in resolve_label_id when the labels API fails,
  instead of silently falling back to the default label
@HusnainRazaGhaffar HusnainRazaGhaffar merged commit 19ee561 into main Feb 25, 2026
3 checks passed
@HusnainRazaGhaffar HusnainRazaGhaffar deleted the refactor/fastmcp-v3 branch February 25, 2026 01:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants