Commissure is a distributed, high-performance runtime that brings large-scale LLM inference to Google Cloud Run - without leaving the serverless experience.
It allows you to run models that would normally exceed a single container's memory or GPU capacity (e.g., Gemma-3-27B, ~55 GB of weights) by splitting the model into cooperating Cloud Run services that communicate through gRPC.
Each service loads only its assigned range of layers, and together they behave as one large model.
In neuroscience, a commissure is the bridge between the brain's hemispheres - the structure that lets separate regions act as one.
In the same way, Commissure bridges multiple Cloud Run services to act as a single distributed model.
Commissure demonstrates how Cloud Run can be used for HPC-style distributed inference, without resorting to custom clusters or VM orchestration.
Instead of scaling up to larger machines, Commissure scales across multiple GPU services that share work in real time.
The project was built for the Google Cloud Run Hackathon (GPU Category), using the NVIDIA L4 GPUs available in Cloud Run.
It runs Gemma-3-27B fully unquantized and achieves real-time streaming inference through gRPC boundary tensor passing in bfloat16 format.
Commissure divides the model into three cooperative stages, each running as an independent Cloud Run service:
| Stage | Role | Responsibilities |
|---|---|---|
| Stage A | Frontend | Public HTTP/SSE endpoint (/v1/chat/completions), tokenization, embeddings, and first K₁ layers |
| Stage B | Middle | Transformer layers between K₁ and K₂; receives activations from A, streams transformed activations to C |
| Stage C | Back | Final layers + norm + LM head; produces logits and selects next token |
Each stage runs on its own Cloud Run GPU instance, loading only the subset of model weights that fits comfortably within its memory budget.
Intermediate activations are serialized as raw bfloat16 tensors and streamed between services via gRPC - ensuring compact transfer and low latency.
From the outside, users still see one API endpoint, but under the hood, the model is running cooperatively across three containers.
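As a minimal sketch of that wire format (the helper names are illustrative, not the actual utils.py API), a boundary tensor can be reinterpreted as raw 16-bit values and shipped byte-for-byte, then rebuilt on the receiving stage:

```python
# Sketch of the bf16 boundary-tensor round trip described above.
# Function names and the toy d_model are illustrative, not the real utils.py API.
import numpy as np
import torch

def tensor_to_wire(x: torch.Tensor) -> bytes:
    """Serialize a bfloat16 activation tensor to raw bytes (2 bytes per element)."""
    assert x.dtype == torch.bfloat16
    # Reinterpret bf16 as uint16 so the exact activation bits go on the wire.
    return x.contiguous().view(torch.uint16).cpu().numpy().tobytes()

def wire_to_tensor(buf: bytes, shape: tuple[int, ...], device: str = "cpu") -> torch.Tensor:
    """Rebuild the bfloat16 tensor from raw bytes plus its known shape."""
    u16 = torch.from_numpy(np.frombuffer(buf, dtype=np.uint16).copy()).reshape(shape)
    return u16.view(torch.bfloat16).to(device)

if __name__ == "__main__":
    x1 = torch.randn(1, 8, 1024, dtype=torch.bfloat16)   # toy B x S x d_model
    assert torch.equal(wire_to_tensor(tensor_to_wire(x1), tuple(x1.shape)), x1)
```

Because the bytes are never converted to another float format, the receiving stage sees bit-identical activations at 2 bytes per element.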
───────────────────────────────────────────────────────────────────────────
                           BUILD & DEPLOY PIPELINE
───────────────────────────────────────────────────────────────────────────
Developer Workstation
  (Commissure repo)
        │  ./commissure up
        ▼
┌──────────────────────────────────────────────────────────────────────────
│ Cloud Build
│   • Builds single GPU image (Stage A/B/C via STAGE env)
│   • Runs grpc_tools.protoc to generate boundary_pb2[_grpc].py
│   • Installs PyTorch, Transformers, gRPC, FastAPI, etc.
└──────────────────────────────────────────────────────────────────────────
        │  docker push
        ▼
┌──────────────────────────────────────────────────────────────────────────
│ Artifact Registry
│   • Stores Commissure runtime image
│   • Image later reused for all three Cloud Run services (A, B, C)
└──────────────────────────────────────────────────────────────────────────
        │  HF snapshot_download (local) + gsutil rsync
        ▼
┌──────────────────────────────────────────────────────────────────────────
│ Secret Manager
│   • Secret: HUGGING_FACE_HUB_TOKEN / HF_TOKEN
│   • Mounted into Cloud Run env as HF_TOKEN / HUGGING_FACE_HUB_TOKEN
└──────────────────────────────────────────────────────────────────────────
        │
        ▼
┌──────────────────────────────────────────────────────────────────────────
│ Cloud Storage (GCS)
│   • Bucket: gs://<hf_bucket>
│   • Flat copy of Gemma-3-27B weights (safetensors + config.json)
│   • Mounted into Cloud Run as volume /cache/huggingface
└──────────────────────────────────────────────────────────────────────────
        │  gcloud run deploy … --add-volume=hf-cache
        ▼
┌──────────────────────────────────────────────────────────────────────────
│ IAM / Service Accounts
│   • SA: commissure-runtime@<project>.iam.gserviceaccount.com
│   • Roles:
│       – read from Artifact Registry
│       – read from Cloud Storage bucket
│       – access Secret Manager (HF token)
│       – write logs to Cloud Logging
└──────────────────────────────────────────────────────────────────────────
        │
        ▼
┌──────────────────────────────────────────────────────────────────────────
│ Cloud Run (GPU, europe-west1)
│   • Deploys three services from the SAME image:
│       – commissure-a (STAGE=a)
│       – commissure-b (STAGE=b)
│       – commissure-c (STAGE=c)
│   • Each attached to:
│       – L4 GPU
│       – hf-cache volume (GCS bucket)
│       – HF secrets from Secret Manager
│       – Runtime service account
└──────────────────────────────────────────────────────────────────────────
Cloud Logging / Cloud Monitoring:
  • All three services emit logs + metrics for warmup, latency, errors, etc.
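For reference, the weight-staging step above (snapshot_download locally, then rsync into the bucket) might look roughly like this sketch; the repo id, filter patterns, and bucket placeholder are assumptions rather than the CLI's exact behaviour:

```python
# Rough sketch of staging the model weights: download the checkpoint locally,
# then mirror it into the GCS bucket that Cloud Run later mounts as a volume.
import os
import subprocess
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="google/gemma-3-27b-it",                 # illustrative repo id
    allow_patterns=["*.safetensors", "*.json"],      # weights + config files
    token=os.environ["HUGGING_FACE_HUB_TOKEN"],
)

# Mirror the snapshot into the bucket that gets mounted at /cache/huggingface.
subprocess.run(
    ["gsutil", "-m", "rsync", "-r", local_dir, "gs://<hf_bucket>"],
    check=True,
)
```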
───────────────────────────────────────────────────────────────────────────
                              RUNTIME DATA FLOW
───────────────────────────────────────────────────────────────────────────
┌──────────────────────────────────────────────────────────────────────────
│ User Request (HTTPS)
│   • curl /v1/chat/completions (OpenAI-compatible)
│   • Browser / CLI / app client
└──────────────────────────────────────────────────────────────────────────
        │
        ▼
┌──────────────────────────────────────────────────────────────────────────
│ STAGE A – FRONT (Cloud Run Service – L4 GPU)
│   • FastAPI HTTP / SSE endpoint (public-facing, OpenAI-compatible)
│   • Uses Cloud Run GPU service + hf-cache volume (GCS-mounted)
│   • Tokenizer (chat templates, stop IDs, text → token IDs)
│   • Embeddings + decoder layers 0..K₁−1 (front of Gemma-3-27B)
│   • Maintains its own DynamicCache KV for layers 0..K₁−1
│   • Computes boundary activations:
│       x₁ ∈ ℝ^{B×S×d_model}  (bf16 activations on GPU)
│   • Serializes x₁ to bf16 wire format (uint16)
│   • Streams x₁ chunks over gRPC to Stage B
└──────────────────────────────────────────────────────────────────────────
        │  gRPC stream (bf16-serialized x₁)
        ▼
┌──────────────────────────────────────────────────────────────────────────
│ STAGE B – MIDDLE (Cloud Run Service – L4 GPU)
│   • gRPC bidirectional streaming server (Boundary.Decode)
│   • Runs on Cloud Run GPU with same image and hf-cache volume
│   • Middle transformer layers K₁..K₂−1
│   • DynamicCache KV for its own layer range
│   • Receives boundary tensor x₁ (bf16) from Stage A
│   • Computes x₂ = f_B(x₁) through layers K₁..K₂−1
│       x₁, x₂ ∈ ℝ^{B×S×d_model}  (shape preserved)
│   • Serializes x₂ as bf16 and streams to Stage C over gRPC
└──────────────────────────────────────────────────────────────────────────
        │  gRPC stream (bf16-serialized x₂)
        ▼
┌──────────────────────────────────────────────────────────────────────────
│ STAGE C – BACK (Cloud Run Service – L4 GPU)
│   • gRPC bidirectional streaming server (Boundary.Decode)
│   • Runs final decoder layers K₂..L−1 + final LayerNorm + LM Head
│   • DynamicCache KV for its own layers
│   • Receives boundary tensor x₂ (bf16)
│   • Computes logits: x₃ = f_C(x₂), x₃ ∈ ℝ^{B×S×Vocab}  (fp32 logits)
│   • Token sampling (temperature, top-p, greedy fallback)
│   • Returns next_token_id to Stage B over the same gRPC stream
└──────────────────────────────────────────────────────────────────────────
        │  TokenFrame (step_id, next_token_id)
        ▼
┌──────────────────────────────────────────────────────────────────────────
│ STAGE B – CONTINUATION
│   • For S>1 (prefill): only forwards transformed chunks to Stage C
│   • For S=1 (decode): forwards token-by-token, relays next_token_id upstream
│   • Streams TokenFrame back to Stage A
└──────────────────────────────────────────────────────────────────────────
        │  token IDs (gRPC stream)
        ▼
┌──────────────────────────────────────────────────────────────────────────
│ STAGE A – CONTINUATION
│   • Receives next_token_id from the Stage B/C chain
│   • Decodes token IDs → UTF-8 text using the tokenizer
│   • Streams chunks as SSE (/v1/chat/completions, OpenAI-style)
│   • Client sees a single logical model endpoint, even though under the hood
│     three Cloud Run GPU services are cooperating via gRPC.
└──────────────────────────────────────────────────────────────────────────
Each token is generated in real time:
A → B → C → A → B → C → …
The pipeline repeats per token, streaming results back to the client through SSE or plain text.
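To make the per-token cycle concrete, here is a tiny, self-contained toy of the same partitioning idea in plain PyTorch; the dimensions, layer types, and greedy loop are purely illustrative, and in Commissure the x₁ / x₂ hand-offs travel over gRPC between services rather than as in-process calls:

```python
# Toy illustration of the pipeline split: three "stages" each own a contiguous
# slice of decoder blocks and are chained once per generated token.
# For brevity this recomputes the whole prefix each step; the real stages keep
# per-stage DynamicCache KV so only the newest token is processed.
import torch
import torch.nn as nn

d_model, n_layers, vocab = 64, 12, 1000          # tiny toy dimensions
blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True) for _ in range(n_layers)]
)
embed = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab)

K1, K2 = 4, 8                                     # layer boundaries (Stage A / B / C)
stage_a, stage_b, stage_c = blocks[:K1], blocks[K1:K2], blocks[K2:]

def run(stage, x):
    for block in stage:
        x = block(x)
    return x

tokens = torch.tensor([[1, 2, 3]])                # fake prompt token IDs
for _ in range(5):                                # per-token generation loop
    x1 = run(stage_a, embed(tokens))              # Stage A: embeddings + first K1 layers
    x2 = run(stage_b, x1)                         # Stage B: middle layers (gRPC hop in Commissure)
    logits = lm_head(run(stage_c, x2))            # Stage C: final layers + LM head
    next_id = logits[:, -1].argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_id], dim=1)  # append token and repeat A → B → C
print(tokens)
```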
Commissure uses a streaming checkpoint loader (half_loader.py) to materialize only the relevant layer weights directly from .safetensors files - without instantiating the full model in memory.
This enables each service to boot with minimal memory overhead and remain within the limits of a single Cloud Run GPU container.
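The idea can be sketched roughly as follows, assuming the standard Hugging Face safetensors key layout (model.layers.<i>. …); the helper below is an illustration, not the actual half_loader.py implementation:

```python
# Illustrative partial-checkpoint loader: read only the tensors for one layer
# range straight from the .safetensors shards, never materializing the full model.
import glob
import re
from safetensors import safe_open

def load_layer_range(model_dir: str, lo: int, hi: int, device: str = "cpu") -> dict:
    """Return {name: tensor} for decoder layers lo..hi-1 (non-layer keys are kept)."""
    wanted = {}
    layer_re = re.compile(r"\.layers\.(\d+)\.")
    for shard in glob.glob(f"{model_dir}/*.safetensors"):
        with safe_open(shard, framework="pt", device=device) as f:
            for name in f.keys():
                m = layer_re.search(name)
                if m and not (lo <= int(m.group(1)) < hi):
                    continue                      # skip layers owned by other stages
                wanted[name] = f.get_tensor(name)  # only these tensors are read from disk
    return wanted
```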
- **3-Stage gRPC Pipeline**: Each Cloud Run service runs a distinct range of transformer layers, passing intermediate activations over secure gRPC streams.
- **OpenAI-Compatible API**: Stage A exposes `/v1/chat/completions` and `/generate` endpoints, fully compatible with the OpenAI client SDKs and plain curl usage.
- **bfloat16 Wire Format**: Boundary tensors are transmitted as raw bf16 to minimize latency and bandwidth while maintaining numerical fidelity.
- **Dynamic KV Cache per Stage**: Every service maintains its own `DynamicCache`, so attention states are reused locally between tokens.
- **Lazy Loading & Auto-Warmup**: Each stage loads weights on first use; the CLI performs an automatic warm-up so all stages are initialized before requests arrive.
- **Composable Scaling**: The design extends naturally to 2, 3, 4 or N stages by adjusting the layer ranges (K₁, K₂, …) - see the sketch after this list.
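As a toy illustration of that composability, a helper like the one below could derive N contiguous layer ranges; the actual boundaries are whatever K₁/K₂ values are set in commissure.yaml and need not be an even split:

```python
# Toy helper showing how layer boundaries (K1, K2, ...) generalize to N stages.
# Purely illustrative; not part of the Commissure CLI.
def layer_ranges(total_layers: int, num_stages: int) -> list[tuple[int, int]]:
    """Split [0, total_layers) into num_stages contiguous half-open ranges."""
    base, extra = divmod(total_layers, num_stages)
    ranges, start = [], 0
    for i in range(num_stages):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

print(layer_ranges(62, 3))   # e.g. a 62-layer checkpoint -> [(0, 21), (21, 42), (42, 62)]
```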
Commissure is fully automated through the provided CLI (commissure) and YAML configuration file (commissure.yaml).
The CLI uses Cloud Build, Artifact Registry, and Cloud Run to build, push, and deploy all stages in sequence.
- **Build and Deploy (One Command)**

      # One-time authentication
      gcloud auth login

      # Install dependencies on your workstation (preferably inside a virtual environment)
      python3 -m pip install --upgrade huggingface_hub pyyaml

      # Provide the HF token
      export HUGGING_FACE_HUB_TOKEN=hf_xxx

      # Build and deploy
      chmod +x ./commissure
      ./commissure up
This performs:
- Enabling required Google Cloud APIs
- Creating Artifact Registry & Cloud Storage bucket
- Building the container image with Cloud Build
- Uploading model weights to GCS
- Deploying Stage C → Stage B → Stage A (with correct endpoints)
- Automatic warm-up
- **Manual Control (Optional)**

      ./commissure build
      ./commissure deploy
      ./commissure ask "Hello Cloud Run"

- **Runtime Volumes**: Each service mounts a Cloud Storage bucket (`hf-cache`) as a volume for the model weights: `--add-volume=name=hf-cache,type=cloud-storage,bucket="gs://<bucket>"`
- **Environment Variables** (see the sketch after this list):
  - `K1`, `K2` – layer boundaries
  - `MODEL_DIR` – GCS-mounted model path
  - `DEVICE=cuda` – selects the GPU
  - `B_ENDPOINT`, `C_ENDPOINT` – downstream gRPC targets
- **Secrets**: Hugging Face access tokens are stored in Secret Manager and mounted automatically: `--set-secrets=HUGGING_FACE_HUB_TOKEN=<secret>:latest`
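To make the configuration concrete, a stage entrypoint might read these variables roughly as sketched below; the defaults shown match the demo split but are illustrative, not the actual stage code:

```python
# Illustrative stage configuration read from the environment variables listed above.
import os

STAGE = os.environ.get("STAGE", "a")                # a | b | c (selects the stage role)
K1 = int(os.environ.get("K1", "20"))                # first layer boundary
K2 = int(os.environ.get("K2", "44"))                # second layer boundary
MODEL_DIR = os.environ.get("MODEL_DIR", "/cache/huggingface")  # GCS-mounted weights
DEVICE = os.environ.get("DEVICE", "cuda")           # run on the attached L4 GPU
B_ENDPOINT = os.environ.get("B_ENDPOINT", "")       # Stage A -> Stage B gRPC target
C_ENDPOINT = os.environ.get("C_ENDPOINT", "")       # Stage B -> Stage C gRPC target
```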
For the demo deployment, we used:
- Model: Gemma-3-27B-Instruct (unquantized, ≈ 55 GB of weights)
- Stages: A = layers 0–19, B = 20–43, C = 44–end
- GPUs: NVIDIA L4 (Cloud Run GPU)
- Wire Format: bf16 boundary frames over gRPC
- Endpoint: OpenAI-compatible HTTP API
When invoked via curl:
    curl -N https://<stage-a-url>/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"role":"user","content":"Explain Google Cloud Run"}],"stream":true}'

Each token returned to the client was generated through a full A → B → C → B → A round trip, yet streaming latency remained interactive.
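Because the endpoint is OpenAI-compatible, the same stream can also be consumed with the OpenAI Python SDK; in this sketch the base URL, API key, and model name are placeholders:

```python
# Streaming the OpenAI-compatible Stage A endpoint with the OpenAI SDK (v1.x).
from openai import OpenAI

client = OpenAI(base_url="https://<stage-a-url>/v1", api_key="unused")
stream = client.chat.completions.create(
    model="gemma-3-27b",     # illustrative name; Stage A serves a single model
    messages=[{"role": "user", "content": "Explain Google Cloud Run"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```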
Languages & Frameworks
- Python 3.11
- FastAPI / Uvicorn
- gRPC / Protocol Buffers
- PyTorch 2.8 (CUDA 12.8, bfloat16)
- Hugging Face Transformers + Safetensors
- Accelerate for meta-initialization
- Cloud Build + Artifact Registry + Cloud Run
Google Cloud Integration
- Cloud Run GPU (L4, europe-west1)
- Cloud Storage (volume mounts)
- Secret Manager (HF token)
- IAM-secured Service Accounts
- Autoscaling + revisioned deployments
Commissure proves that Cloud Run can serve models far larger than any single container can hold by treating each GPU-enabled service as a neural region of a distributed brain.
- HPC-like performance, serverless simplicity
- Dynamic scaling and isolation per stage
- Managed networking and IAM out of the box
- Composable architecture ready for larger checkpoints and future quantized variants
This approach unlocks new possibilities for developers who want to deploy large open-source LLMs while keeping the convenience of Cloud Run - no manual clusters, no VM orchestration, no external load balancers.
- Per-Stage Quantization: Mix bf16, int8, and nf4 to fit larger models.
- Distributed Training Prototype: Leverage the same gRPC fabric for backprop experiments.
- Unified CLI Dashboard: Realtime health and token stream monitoring.
app/
├── stage_a.py          # Front (HTTP + tokenizer + first layers)
├── stage_b_mid.py      # Middle transformer layers
├── stage_c.py          # Final layers + logits
├── half_loader.py      # Streaming checkpoint loader
├── model_loader.py     # Stage wrapper classes
├── utils.py            # bf16 serialization utilities
├── tokenizer.py        # Shared tokenizer logic
└── boundary.proto      # gRPC schema
commissure_cli.py       # Cloud Run deploy/build CLI
Dockerfile              # Runtime image definition
docker/entrypoint.sh    # Stage selector
commissure.yaml         # Deployment config
requirements.txt        # Requirements for the Docker image
Commissure shows that large-scale model inference doesn't have to leave the serverless world.
By treating Cloud Run services as the hemispheres of a single distributed brain, we can scale open-source LLMs beyond traditional container limits - without sacrificing simplicity, security, or autoscaling.
**Built for the Google Cloud Run Hackathon.**