
Conversation


domdomegg commented Aug 6, 2025

Adds a CI action to build Docker images and publish them to GHCR on every commit. We can then use these when deploying the registry.

This turned out to be a blocker for building the deployment infra, and something like this is needed for either deployment approach we're exploring.


Summary

  • Add comprehensive documentation for pre-built Docker images in README
  • Include usage examples and configuration guidance
  • Add GitHub Actions workflow for automated Docker image publishing

🤖 Generated with Claude Code

Add documentation for pre-built Docker images including usage examples and configuration guidance. Include GitHub Actions workflow for automated Docker image publishing.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
domdomegg merged commit f04c995 into main Aug 7, 2025
5 checks passed
domdomegg deleted the adamj/add-docker-images-documentation branch August 7, 2025 15:25
domdomegg added a commit that referenced this pull request Aug 8, 2025
domdomegg added a commit that referenced this pull request Aug 11, 2025

Original PR: #227

- Add Pulumi-based infrastructure as code for deploying MCP Registry to
Kubernetes
- Support for both local development (minikube) and Azure Kubernetes
Service (AKS)
- Complete deployment orchestration, including:
  - cluster setup: e.g. you point this at an Azure account and it can set
    up and manage the cluster for you (K8s version, number of nodes, type
    of nodes, ...)
  - cloud-agnostic K8s services: cert-manager, nginx-ingress
  - app services: MongoDB, and the registry application (currently using
    nginx as a placeholder, blocked on #225 (as is #190), but should be a
    one-line change - see the sketch after this list)
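
To make the app-services layer concrete, here's a minimal sketch of what the registry Deployment could look like in Pulumi's Go SDK. It is illustrative only: the resource names, labels and the nginx placeholder mirror the description above, and it is not the actual `deploy/pkg/k8s` code.

```go
package main

import (
	appsv1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/apps/v1"
	corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/core/v1"
	metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/meta/v1"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		labels := pulumi.StringMap{"app": pulumi.String("registry")}
		// nginx:alpine is the placeholder described above; once #225 publishes
		// the real image to GHCR, swapping the Image line is the "one-line change".
		_, err := appsv1.NewDeployment(ctx, "registry", &appsv1.DeploymentArgs{
			Metadata: &metav1.ObjectMetaArgs{Labels: labels},
			Spec: appsv1.DeploymentSpecArgs{
				Replicas: pulumi.Int(2),
				Selector: &metav1.LabelSelectorArgs{MatchLabels: labels},
				Template: &corev1.PodTemplateSpecArgs{
					Metadata: &metav1.ObjectMetaArgs{Labels: labels},
					Spec: &corev1.PodSpecArgs{
						Containers: corev1.ContainerArray{
							corev1.ContainerArgs{
								Name:  pulumi.String("registry"),
								Image: pulumi.String("nginx:alpine"),
								Ports: corev1.ContainerPortArray{
									corev1.ContainerPortArgs{ContainerPort: pulumi.Int(80)},
								},
							},
						},
					},
				},
			},
		})
		return err
	})
}
```

Because this is plain Go, swapping the placeholder for the image published by #225 really is a one-line change to the `Image` field.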

## How is this different to #190

- Supports cluster setup and management. This enables:
  - Non-hosting maintainers managing many devops workflows themselves
    (e.g. scaling up the cluster, or bumping K8s versions). Without this,
    we'd need to bug/page the organisation hosting the registry whenever
    we need these things changed.
  - Easily spinning up staging/temporary clusters, and contributors
    replicating the stack exactly on their own Azure accounts.
- Sets up cloud-agnostic services. For example, rather than using the
  Azure-managed ingresses and CA, we install nginx-ingress and
  cert-manager. This enables:
  - Running the entire infra stack locally (e.g. in minikube, k3s,
    orbstack, colima), making it much easier for contributors to test
    infra changes.
  - Moving between cloud providers much more easily, e.g. we could shift
    from Azure to GCP/AWS/other with minimal hassle.
- Everything stays written in Go, rather than Helm templates. This means
  we get things like type-checking etc. for free (which from my experience
  makes AI tools wayyy better at editing K8s stuff), and contributors
  don't need to learn a new language if they're already using Go. See the
  sketch after this list.
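
To give a flavour of the "Go rather than Helm templates" point, here's a rough sketch of installing cert-manager through Pulumi's typed Helm support in Go. The chart version, namespace and values are illustrative assumptions rather than what the deploy code actually pins.

```go
package main

import (
	helmv3 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/helm/v3"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Install cert-manager from the upstream Jetstack chart. Because this
		// is ordinary Go, the values and wiring are type-checked at compile time.
		_, err := helmv3.NewChart(ctx, "cert-manager", helmv3.ChartArgs{
			Chart:     pulumi.String("cert-manager"),
			Version:   pulumi.String("v1.14.4"), // illustrative pin
			Namespace: pulumi.String("cert-manager"),
			FetchArgs: helmv3.FetchArgs{
				Repo: pulumi.String("https://charts.jetstack.io"),
			},
			Values: pulumi.Map{
				"installCRDs": pulumi.Bool(true),
			},
		})
		return err
	})
}
```

The same pattern applies to nginx-ingress, and because nothing here is Azure-specific, the identical code runs against minikube and AKS.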

## Testing

I've got this running well:
- locally in minikube
- on cloud in Azure (my personal Azure account)

<details><summary>Claude written architecture and security
review</summary>
<p>

## Deployment Review & Assessment

### Current Architecture Strengths
**Pulumi IaC Approach**
- Well-structured infrastructure as code using Pulumi
- Multi-provider support (AKS, local) with clean abstraction
- Good separation of concerns in `pkg/` directory

**Security Fundamentals**
- Non-root container execution (`appuser` with UID 10001)
- Secrets properly managed via Kubernetes secrets
- TLS/SSL certificate management with cert-manager and Let's Encrypt

### Critical Issues & High-Priority Improvements

**1. Production Deployment Not Ready** 🚨
The registry deployment uses `nginx:alpine` placeholder image instead of
the actual MCP registry:
- `deploy/pkg/k8s/registry.go:67` - TODO comments indicate incomplete
setup
- Health probes are commented out
- Port mapping doesn't match actual application (80 vs 8080)

**Fix:** Build and publish actual registry container image to GHCR,
update deployment

**2. Database Security Considerations** 🔒
- MongoDB deployed without authentication
- No backup/disaster recovery strategy
- Database credentials hardcoded

*Note: MongoDB is not exposed externally (ClusterIP service), so this is
not a critical security risk but should be addressed for production.*
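
A rough sketch of one way to remove the hardcoded credentials: read them from Pulumi's encrypted stack config and surface them via a Kubernetes Secret. The config key names and wiring are assumptions; `MONGO_INITDB_ROOT_USERNAME`/`MONGO_INITDB_ROOT_PASSWORD` are the standard variables the official mongo image uses to enable authentication on first start.

```go
package k8s

import (
	corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/core/v1"
	metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/meta/v1"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi/config"
)

// mongoSecret reads the MongoDB password from encrypted stack config
// (`pulumi config set --secret mongoPassword ...`) instead of hardcoding it.
func mongoSecret(ctx *pulumi.Context) (*corev1.Secret, error) {
	cfg := config.New(ctx, "")
	return corev1.NewSecret(ctx, "mongo-credentials", &corev1.SecretArgs{
		Metadata: &metav1.ObjectMetaArgs{Name: pulumi.String("mongo-credentials")},
		StringData: pulumi.StringMap{
			"username": pulumi.String("registry"),
			"password": cfg.RequireSecret("mongoPassword"),
		},
	})
}

// mongoEnv wires the secret into the mongo container's environment.
func mongoEnv() corev1.EnvVarArray {
	ref := func(key string) *corev1.EnvVarSourceArgs {
		return &corev1.EnvVarSourceArgs{
			SecretKeyRef: &corev1.SecretKeySelectorArgs{
				Name: pulumi.String("mongo-credentials"),
				Key:  pulumi.String(key),
			},
		}
	}
	return corev1.EnvVarArray{
		corev1.EnvVarArgs{Name: pulumi.String("MONGO_INITDB_ROOT_USERNAME"), ValueFrom: ref("username")},
		corev1.EnvVarArgs{Name: pulumi.String("MONGO_INITDB_ROOT_PASSWORD"), ValueFrom: ref("password")},
	}
}
```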

**3. Monitoring & Observability Gaps** 📊
- No Prometheus/Grafana monitoring stack
- No log aggregation (ELK/Loki)
- No application metrics/health dashboards
- No alerting configured

**4. High Availability & Reliability** ⚠️
- Single MongoDB instance (no replication)
- No persistent volume backup strategy
- Fixed 10Gi storage without growth planning
- Only 2 replicas for registry service
- No pod disruption budgets
- No horizontal pod autoscaling
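
One of these gaps is cheap to close in the existing Go stack: a PodDisruptionBudget for the registry pods. A minimal sketch, with labels and thresholds as assumptions:

```go
package k8s

import (
	metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/meta/v1"
	policyv1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/policy/v1"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// registryPDB keeps at least one registry pod available during voluntary
// disruptions (node drains, upgrades), which matters with only 2 replicas.
func registryPDB(ctx *pulumi.Context) (*policyv1.PodDisruptionBudget, error) {
	labels := pulumi.StringMap{"app": pulumi.String("registry")}
	return policyv1.NewPodDisruptionBudget(ctx, "registry-pdb", &policyv1.PodDisruptionBudgetArgs{
		Spec: policyv1.PodDisruptionBudgetSpecArgs{
			MinAvailable: pulumi.Int(1),
			Selector:     &metav1.LabelSelectorArgs{MatchLabels: labels},
		},
	})
}
```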

### Recommended Improvements

**Immediate (High Priority)**
1. Complete Registry Deployment - Build proper container image pipeline,
enable health checks
2. Secure MongoDB - Add authentication credentials, implement backup
strategy

**Medium Priority**
3. Add Monitoring Stack - Prometheus, Grafana deployment
4. Security Hardening (Nice to Have) - RBAC policies, Network Policies,
Pod Security Standards
5. CI/CD Pipeline Enhancement - Container image building/publishing,
automated deployment

**Lower Priority**
6. High Availability - MongoDB replica set, HPA for registry pods
7. Operational Excellence - Kubernetes dashboard, cost optimization

### Configuration Issues
- Production config has test credentials: `deploy/Pulumi.prod.yaml:4-5`
- Missing environment-specific resource sizing
- Hardcoded domain names (`example.com`)

The deployment setup shows good architectural foundations but needs
significant work before production readiness. The most critical issue is
the placeholder nginx container - priority should be completing the
actual registry application deployment before addressing the other
improvements. Security measures like RBAC and Network Policies are nice
to have but not strictly necessary given that MongoDB is not exposed
externally.

🤖 Generated with [Claude Code](https://claude.ai/code)

</p>
</details> 


## Metadata

Working towards #91

---------

Co-authored-by: Claude <noreply@anthropic.com>
domdomegg mentioned this pull request Aug 12, 2025
domdomegg added a commit that referenced this pull request Aug 12, 2025
Adds the Pulumi code to:
- Deploy the registry (and associated services, e.g. MongoDB) to Google
  Cloud Platform (GCP), on top of Google Kubernetes Engine (GKE) - see
  the sketch after this list
- Set up proper environments and secrets management
- Use the real container image, now that it's published in #225. At the
  moment it tracks `latest`; we might want to pin the version later (or
  perhaps always use `latest` in staging, and pin prod)
- Use real domains (`staging.registry.modelcontextprotocol.io`) rather
  than placeholders
## Motivation and Context

Setting up infrastructure to deploy the registry. I set something up in
Azure in #227, although it's not super robust (e.g. no service accounts
etc.). I think we will use GCP, as:
- the maintainers have experience with GCP, but none with Azure
- costs are quite low, and Anthropic is happy to cover them in the short
term
- means we only have to maintain one login system (just Google Cloud
Identity), not two (Google Workspace + Azure)

## How Has This Been Tested?

Deployed this to a staging and production cluster. Try it yourself at:

```bash
curl -H "Host: staging.registry.modelcontextprotocol.io" -k https://35.222.36.75/v0/ping
```

(will be sorting out domains very soon)

## Breaking Changes

N/A - this just adds support for GCP deployment

## Types of changes
<!-- What types of changes does your code introduce? Put an `x` in all
the boxes that apply: -->
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)
- [ ] Documentation update

## Checklist
<!-- Go over all the following points, and put an `x` in all the boxes
that apply. -->
- [x] I have read the [MCP
Documentation](https://modelcontextprotocol.io)
- [x] My code follows the repository's style guidelines
- [ ] New and existing tests pass locally
- [x] I have added appropriate error handling
- [x] I have added or updated documentation as needed

## Additional context
<!-- Add any other context, implementation notes, or design decisions
-->

Expected follow-ups:
- GitHub Actions setup to deploy to the cluster from GitHub, so deploys
  aren't gated on just the people who hold the secrets.