Welcome to my personal lab for exploring AI/ML, DevOps, and security. I've built a resilient, open-source platform by combining bare-metal servers, virtualization, and container orchestration. It's a place for learning, tinkering, and maybe over-engineering a solution or two.
This project is, first and foremost, a platform for learning and exploration. The core philosophy is to maintain a resilient and reproducible test environment where experimentation is encouraged. While this approach can sometimes lead to over-engineering (here's the counter-argument), the primary goal is to guarantee that any component can be rebuilt from code.
This philosophy is supported by several key principles:
- Everything as Code: All infrastructure, from bare-metal provisioning to application deployment, is defined declaratively and managed through version control. This ensures consistency and enables rapid disaster recovery (see the sketch after this list).
- Monorepo Simplicity: The entire homelab is managed within a single repository, providing a unified view of all services, configurations, and documentation.
- Open Source First: I prioritize the use of open-source software to maintain flexibility and support the community.
- Accelerated AI/ML: The environment is specifically tailored for AI/ML workloads, with a focus on leveraging AMD and Intel GPU acceleration for inference.
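To make the Everything as Code principle concrete, here is a minimal Ansible playbook sketch of the kind used for baseline host configuration; the inventory group, package, and service are hypothetical placeholders, not the repository's actual playbooks.

```yaml
# Hypothetical baseline play; group name, package and service are illustrative.
- name: Baseline configuration for Debian hosts
  hosts: debian_servers          # placeholder inventory group
  become: true
  tasks:
    - name: Install unattended-upgrades
      ansible.builtin.apt:
        name: unattended-upgrades
        state: present
        update_cache: true

    - name: Ensure time synchronisation is running
      ansible.builtin.service:
        name: systemd-timesyncd
        state: started
        enabled: true
```

Because plays like this are declarative and idempotent, re-running them after a rebuild converges a host back to the same state, which is what makes rapid disaster recovery practical.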
Hardware:
- Servers: 5 servers – ‘Method’ (SuperMicro H12SSL‑i), ‘Indy’ (SuperMicro D‑2146NT), ‘Stale’ (X10SDV‑4C‑TLN4F), ‘Nose’ & ‘Tail’ (Framework Mainboard)
- Networking: TP‑Link Omada switches & Protectli OPNsense firewall
- Accelerated compute: Intel Arc A310, AMD Radeon AI Pro R9700, AMD Ryzen AI MAX+ 395 “Strix Halo”
- Management: UPS, PiKVM
Software Stack:
- Operating Systems: Debian, Proxmox, Talos, NixOS, TrueNAS
- Storage: Ceph cluster (hot storage) and TrueNAS (cold storage)
- Container Orchestration: Ephemeral Talos Kubernetes clusters and Harbor proxy/registry
- Automation: OpenTofu, Ansible, ArgoCD, NixOS, Argo Events and Argo Workflows
- Security: SOPS, HashiCorp Vault, Authelia, Traefik, VLANs
- Observability: Kube Prometheus Stack, Alloy, LangSmith
AI/ML Capabilities:
- 🤖 GPU device management through the Intel GPU device plugin and the AMD ROCm operator (see the Pod sketch after this list)
- 🖼️ Immich machine learning & Jellyfin transcoding with Intel Arc A310
- 📦 `llm-models` Helm chart – KubeElasti scale‑to‑zero Llama.cpp inference routed through LiteLLM
- 🧠 Embedding model inference with AMD Radeon AI Pro R9700
- ⚡ Dense & MoE inference on two AMD Ryzen AI MAX+ 395
- ☁️ GCP Vertex AI for larger ML inference
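To show how the device plugins expose accelerators to workloads, here is a hedged Pod sketch requesting an AMD GPU for Llama.cpp inference. The resource name `amd.com/gpu` is the AMD device plugin's default (the Intel plugin exposes `gpu.intel.com/i915`); the image, PVC name, and arguments are assumptions rather than this repository's actual manifests.

```yaml
# Illustrative only: image, names and model path are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: llama-cpp-inference
spec:
  containers:
    - name: llama-cpp
      image: ghcr.io/ggml-org/llama.cpp:server    # placeholder; a ROCm build would be used for AMD GPUs
      args: ["--model", "/models/model.gguf", "--port", "8080"]
      resources:
        limits:
          amd.com/gpu: 1                          # surfaced by the AMD device plugin / ROCm operator
      volumeMounts:
        - name: models
          mountPath: /models
  volumes:
    - name: models
      persistentVolumeClaim:
        claimName: llm-models                     # hypothetical PVC name
```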
Automation:
- Infrastructure as Code with OpenTofu
- Debian, Proxmox and OPNsense management with Ansible
- GitOps deployment with ArgoCD (see the Application sketch after this list)
- Blue/green deployment strategies
- Container registry and proxy with Harbor
- Argo Events and Argo Workflows for backups, secret management and CI/CD pipelines
- NixOS for Framework 13 laptop and Aorus gaming desktop
- Common Helm chart
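For reference, GitOps deployment with ArgoCD comes down to Application resources like the minimal sketch below; the repository URL, path, and namespaces are placeholders.

```yaml
# Minimal ArgoCD Application sketch; repo URL, path and namespaces are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/homelab.git   # placeholder repository
    targetRevision: main
    path: kubernetes/apps/example-app                 # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: example-app
  syncPolicy:
    automated:
      prune: true        # delete resources removed from Git
      selfHeal: true     # revert out-of-band changes
    syncOptions:
      - CreateNamespace=true
```

ArgoCD then keeps the cluster reconciled against whatever is committed to the repository.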
Storage & Backups:
- Ceph backbone
- SeaweedFS for PVC hot storage
- TrueNAS / MinIO cold storage
- Offsite replication to Cloudflare R2
- Automated backups with Argo Workflows, k8up and CloudNativePG (see the Schedule sketch below)
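A minimal sketch of an automated backup definition, assuming k8up v2's `k8up.io/v1` API; the bucket, endpoint, secret names, and schedules are placeholders, not the actual configuration.

```yaml
# Hypothetical k8up Schedule backing up PVCs to an S3-compatible bucket (e.g. Cloudflare R2).
apiVersion: k8up.io/v1
kind: Schedule
metadata:
  name: nightly-backup
spec:
  backend:
    repoPasswordSecretRef:          # restic repository password
      name: backup-repo
      key: password
    s3:
      endpoint: https://<account-id>.r2.cloudflarestorage.com   # placeholder R2 endpoint
      bucket: k8up-backups                                       # placeholder bucket
      accessKeyIDSecretRef:
        name: backup-credentials
        key: access-key-id
      secretAccessKeySecretRef:
        name: backup-credentials
        key: secret-access-key
  backup:
    schedule: '0 3 * * *'           # nightly at 03:00
  prune:
    schedule: '0 4 * * 0'           # weekly retention pruning
    retention:
      keepLast: 7
      keepWeekly: 4
```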
Security:
- Network segmentation with OPNsense and inter-VLAN routing with TP-Link Omada
- Secrets management with SOPS and Vault
- Automated TLS certificates with cert-manager and Cloudflare
- OIDC/MFA authentication with Authelia
- Middleware and encrypted ingress with Traefik
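As a sketch of how Traefik and Authelia fit together, a forward-auth Middleware along these lines sits in front of protected ingress routes; the API version, namespace, and Authelia service address are assumptions (the verification path also differs between Authelia versions).

```yaml
# Hypothetical Traefik Middleware delegating authentication to Authelia.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: authelia
  namespace: traefik
spec:
  forwardAuth:
    # Placeholder in-cluster Authelia endpoint; older Authelia releases use /api/verify instead.
    address: http://authelia.authelia.svc.cluster.local:9091/api/authz/forward-auth
    trustForwardHeader: true
    authResponseHeaders:
      - Remote-User
      - Remote-Groups
      - Remote-Email
      - Remote-Name
```

Routes that reference this middleware only reach the backing service after Authelia's OIDC/MFA checks pass.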
Disaster Recovery:
- Infrastructure-as-Code for rapid rebuilding
- Automated backup restoration workflows and GitOps redeployment
- Regular disaster recovery testing with blue/green clusters
- 3-2-1 backup strategy
Comprehensive documentation is available in the Docker Docs Site directory, covering architecture, deployments, operations, security, and AI/ML implementations.
Roadmap:
- Short‑term: WireGuard VPN & Cloudflare Tunnels to publish the docs site
- Mid‑term: Personal website & a lighthearted write‑up of this project
- Long‑term: Fine‑tuning & building generative models, Home Assistant
The GitHub issues are more up to date than this roadmap.