Commissure

Commissure is a distributed, high-performance runtime that brings large-scale LLM inference to Google Cloud Run - without leaving the serverless experience.

It allows you to run models that would normally exceed a single container’s memory or GPU capacity (e.g., Gemma-3-27B, ~55 GB of weights) by splitting the model into cooperating Cloud Run services that communicate through gRPC.
Each service loads only its assigned range of layers, and together they behave as one large model.

In neuroscience, a commissure is the bridge between the brain’s hemispheres - the structure that lets separate regions act as one.
In the same way, Commissure bridges multiple Cloud Run services to act as a single distributed model.


Overview

Commissure demonstrates how Cloud Run can be used for HPC-style distributed inference, without resorting to custom clusters or VM orchestration.
Instead of scaling up to larger machines, Commissure scales across multiple GPU services that share work in real time.

The project was built for the Google Cloud Run Hackathon (GPU Category), using the NVIDIA L4 GPUs available in Cloud Run.
It runs Gemma-3-27B fully unquantized and achieves real-time streaming inference through gRPC boundary tensor passing in bfloat16 format.


How It Works

1. Split-by-Stage Inference

Commissure divides the model into three cooperative stages, each running as an independent Cloud Run service:

Stage    Role      Responsibilities
Stage A  Frontend  Public HTTP/SSE endpoint (/v1/chat/completions), tokenization, embeddings, and the first K₁ layers (0 … K₁−1)
Stage B  Middle    Transformer layers K₁ … K₂−1; receives activations from A, streams transformed activations to C
Stage C  Back      Final layers (K₂ … L−1) + norm + LM head; produces logits and selects the next token

Each stage runs on its own Cloud Run GPU instance, loading only the subset of model weights that fits comfortably within its memory budget.
Intermediate activations are serialized as raw bfloat16 tensors and streamed between services via gRPC - ensuring compact transfer and low latency.

From the outside, users still see one API endpoint, but under the hood, the model is running cooperatively across three containers.
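
As a rough sketch of that wire format (the helper names here are illustrative, not the actual utils.py API): bfloat16 shares its bit pattern with the upper half of float32, so a boundary tensor can be reinterpreted as raw 16-bit words, shipped as bytes, and reinterpreted back on the receiving stage.

import numpy as np
import torch

def bf16_to_bytes(x: torch.Tensor) -> bytes:
    # Reinterpret (not cast): bf16 and int16 are both 2 bytes wide,
    # so this is a zero-loss bitcast followed by a host copy.
    assert x.dtype == torch.bfloat16
    return x.contiguous().view(torch.int16).cpu().numpy().tobytes()

def bytes_to_bf16(buf: bytes, shape: tuple, device: str = "cuda") -> torch.Tensor:
    # np.frombuffer returns a read-only view, so copy before handing to torch.
    arr = np.frombuffer(buf, dtype=np.int16).copy()
    return torch.from_numpy(arr).view(torch.bfloat16).reshape(shape).to(device)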


2. Data Flow

┌───────────────────────────────────────────────────────────────────────────────┐
│                            BUILD & DEPLOY PIPELINE                            │
└───────────────────────────────────────────────────────────────────────────────┘

   Developer Workstation
   (Commissure repo)
                  │  ./commissure up
                  ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  Cloud Build                                                                  │
│  • Builds single GPU image (Stage A/B/C via STAGE env)                        │
│  • Runs grpc_tools.protoc to generate boundary_pb2[_grpc].py                  │
│  • Installs PyTorch, Transformers, gRPC, FastAPI, etc.                        │
└───────────────────────────────────────────────────────────────────────────────┘
                  │  docker push
                  ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  Artifact Registry                                                            │
│  • Stores Commissure runtime image                                            │
│  • Image later reused for all three Cloud Run services (A, B, C)              │
└───────────────────────────────────────────────────────────────────────────────┘
                  │  HF snapshot_download (local) + gsutil rsync
                  ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  Secret Manager                                                               │
│  • Secret: HUGGING_FACE_HUB_TOKEN / HF_TOKEN                                  │
│  • Mounted into Cloud Run env as HF_TOKEN/HUGGING_FACE_HUB_TOKEN              │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  Cloud Storage (GCS)                                                          │
│  • Bucket: gs://<hf_bucket>                                                   │
│  • Flat copy of Gemma-3-27B weights (safetensors + config.json)               │
│  • Mounted into Cloud Run as volume /cache/huggingface                        │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │   gcloud run deploy … --add-volume=hf-cache
                              │
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  IAM / Service Accounts                                                       │
│  • SA: commissure-runtime@<project>.iam.gserviceaccount.com                   │
│  • Roles:                                                                     │
│      – read from Artifact Registry                                            │
│      – read from Cloud Storage bucket                                         │
│      – access Secret Manager (HF token)                                       │
│      – write logs to Cloud Logging                                            │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  Cloud Run (GPU, europe-west1)                                                │
│  • Deploys three services from the SAME image:                                │
│      – commissure-a   (STAGE=a)                                               │
│      – commissure-b   (STAGE=b)                                               │
│      – commissure-c   (STAGE=c)                                               │
│  • Each attached to:                                                          │
│      – L4 GPU                                                                 │
│      – hf-cache volume (GCS bucket)                                           │
│      – HF secrets from Secret Manager                                         │
│      – Runtime service account                                                │
└───────────────────────────────────────────────────────────────────────────────┘

Cloud Logging / Cloud Monitoring:
• All three services emit logs + metrics for warmup, latency, errors, etc.

┌───────────────────────────────────────────────────────────────────────────────┐
│                             RUNTIME DATA FLOW                                 │
└───────────────────────────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────────────────────────┐
│                            User Request (HTTPS)                               │
│     • curl /v1/chat/completions (OpenAI-compatible)                           │
│     • Browser / CLI / app client                                              │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  STAGE A – FRONT (Cloud Run Service – L4 GPU)                                 │
│  • FastAPI HTTP / SSE endpoint (public-facing, OpenAI-compatible)             │
│  • Uses Cloud Run GPU service + hf-cache volume (GCS-mounted)                 │
│  • Tokenizer (chat templates, stop IDs, text → token IDs)                     │
│  • Embeddings + decoder layers 0..K₁−1 (front of Gemma-3-27B)                 │
│  • Maintains its own DynamicCache KV for layers 0..K₁−1                       │
│  • Computes boundary activations:                                             │
│        x₀ ∈ ℝ^{B×S×d_model}  (bf16 activations on GPU)                        │
│  • Serializes x₀ to bf16 wire format (uint16)                                 │
│  • Streams x₀ chunks over gRPC to Stage B                                     │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │  gRPC stream (bf16-serialized x₀)
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  STAGE B – MIDDLE (Cloud Run Service – L4 GPU)                                │
│  • gRPC bidirectional streaming server (Boundary.Decode)                      │
│  • Runs on Cloud Run GPU with same image and hf-cache volume                  │
│  • Middle transformer layers K₁..K₂−1                                         │
│  • DynamicCache KV for its own layer range                                    │
│  • Receives boundary tensor x₀ (bf16) from Stage A                            │
│  • Computes x₁ = f_B(x₀) through layers K₁..K₂−1                              │
│       x₀, x₁ ∈ ℝ^{B×S×d_model} (shape preserved)                              │
│  • Serializes x₁ as bf16 and streams to Stage C over gRPC                     │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │  gRPC stream (bf16-serialized x₁)
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  STAGE C – BACK (Cloud Run Service – L4 GPU)                                  │
│  • gRPC bidirectional streaming server (Boundary.Decode)                      │
│  • Runs final decoder layers K₂..L−1 + final LayerNorm + LM Head              │
│  • DynamicCache KV for its own layers                                         │
│  • Receives boundary tensor x₁ (bf16)                                         │
│  • Computes logits: x₂ = f_C(x₁), x₂ ∈ ℝ^{B×S×Vocab} (fp32 logits)            │
│  • Token sampling (temperature, top-p, greedy fallback)                       │
│  • Returns next_token_id to Stage B over the same gRPC stream                 │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │  TokenFrame (step_id, next_token_id)
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  STAGE B – CONTINUATION                                                       │
│  • For S>1 (prefill): only forwards transformed chunks to Stage C             │
│  • For S=1 (decode): forwards token-by-token, relays next_token_id upstream   │
│  • Streams TokenFrame back to Stage A                                         │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │  token IDs (gRPC stream)
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  STAGE A – CONTINUATION                                                       │
│  • Receives next_token_id from Stage B/C chain                                │
│  • Decodes token IDs → UTF-8 text using tokenizer                             │
│  • Streams chunks as SSE (/v1/chat/completions, OpenAI-style)                 │
│  • Client sees a single logical model endpoint, even though under the hood    │
│    three Cloud Run GPU services are cooperating via gRPC.                     │
└───────────────────────────────────────────────────────────────────────────────┘

Each token is generated in real time:

A → B → C → A → B → C → …

The pipeline repeats per token, streaming results back to the client through SSE or plain text.
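
In pseudocode, one request looks roughly like this (stage_a_forward and stage_b_then_c are hypothetical stand-ins for the gRPC-backed stages, shown only to make the prefill/decode split concrete):

def generate(prompt_ids: list[int], max_new_tokens: int, eos_id: int):
    # Prefill (S > 1): the whole prompt flows A -> B -> C in a single pass.
    hidden = stage_a_forward(prompt_ids)   # embeddings + layers 0..K1-1
    next_id = stage_b_then_c(hidden)       # layers K1..L-1, norm, LM head, sampling
    for _ in range(max_new_tokens):
        if next_id == eos_id:
            break
        yield next_id                      # Stage A detokenizes and streams via SSE
        # Decode (S = 1): one more A -> B -> C roundtrip per token; each
        # stage's DynamicCache keeps attention state local between steps.
        hidden = stage_a_forward([next_id])
        next_id = stage_b_then_c(hidden)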


3. Model Loading

Commissure uses a streaming checkpoint loader (half_loader.py) to materialize only the relevant layer weights directly from .safetensors files - without instantiating the full model in memory.
This enables each service to boot with minimal memory overhead and remain within the limits of a single Cloud Run GPU container.
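
A minimal sketch of the idea, assuming Gemma-style parameter names ("model.layers.<i>. …"); the real half_loader.py may differ in detail:

from pathlib import Path

import torch
from safetensors import safe_open

def load_layer_range(model_dir: str, lo: int, hi: int, device: str = "cuda") -> dict:
    """Materialize only decoder layers lo..hi-1, plus shared (non-layer) weights."""
    state = {}
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        with safe_open(str(shard), framework="pt", device="cpu") as f:
            for name in f.keys():
                if ".layers." in name:
                    layer = int(name.split(".layers.")[1].split(".")[0])
                    if not (lo <= layer < hi):
                        continue  # belongs to another stage - never materialized
                # Shared tensors (embeddings, final norm, ...) fall through here;
                # each stage keeps only the ones it actually uses.
                state[name] = f.get_tensor(name).to(device=device, dtype=torch.bfloat16)
    return state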


Architecture Highlights

  • 3-Stage gRPC Pipeline
    Each Cloud Run service executes a distinct range of transformer layers, passing intermediate activations over secure gRPC streams.

  • OpenAI-Compatible API
    Stage A exposes /v1/chat/completions and /generate endpoints, fully compatible with the OpenAI client SDKs and curl usage.

  • bfloat16 Wire Format
    Boundary tensors are transmitted in raw bf16 format to minimize latency and bandwidth while maintaining numerical fidelity.

  • Dynamic KV Cache per Stage
    Every service maintains its own DynamicCache so attention states are reused locally between tokens.

  • Lazy Loading & Auto-Warmup
    Each stage loads weights on first use. The CLI performs automatic warm-up to ensure all stages are initialized before requests.

  • Composable Scaling
    The design naturally extends to 2, 3, 4 or N stages by adjusting layer ranges (K₁, Kβ‚‚, …).
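
A minimal illustration of that generalisation (a hypothetical helper; the repository itself pins K₁ and K₂ in commissure.yaml):

def split_layers(num_layers: int, num_stages: int) -> list[tuple[int, int]]:
    """Return [start, end) decoder-layer ranges, one per stage, as even as possible."""
    base, extra = divmod(num_layers, num_stages)
    ranges, start = [], 0
    for i in range(num_stages):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

# Gemma-3-27B has 62 decoder layers:
# split_layers(62, 3) -> [(0, 21), (21, 42), (42, 62)]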


Deployment

Commissure is fully automated through the provided CLI (commissure) and YAML configuration file (commissure.yaml).
The CLI uses Cloud Build, Artifact Registry, and Cloud Run to build, push, and deploy all stages in sequence.

Steps

  1. Build and Deploy (One Command)

    # One-time authentication
    gcloud auth login
    
    # Install dependencies on your workstation (Preferably inside a virtual environment)
    python3 -m pip install --upgrade huggingface_hub pyyaml
    
    # provide HF token
    export HUGGING_FACE_HUB_TOKEN=hf_xxx
    
    # Build and deploy
    chmod +x ./commissure
    ./commissure up

    This performs:

    • Enabling required Google Cloud APIs
    • Creating Artifact Registry & Cloud Storage bucket
    • Building the container image with Cloud Build
    • Uploading model weights to GCS
    • Deploying Stage C β†’ Stage B β†’ Stage A (with correct endpoints)
    • Automatic warm-up
  2. Manual Control (Optional)

    ./commissure build
    ./commissure deploy
    ./commissure ask "Hello Cloud Run"
  3. Runtime Volumes
    Each service mounts a Cloud Storage bucket (hf-cache) as a volume for model weights:

    --add-volume=name=hf-cache,type=cloud-storage,bucket="gs://<bucket>"
    
  4. Environment Variables (see the env-reading sketch after this list)

    • K1, K2 - layer boundaries
    • MODEL_DIR - GCS-mounted model path
    • DEVICE=cuda - selects GPU
    • B_ENDPOINT, C_ENDPOINT - downstream gRPC targets
  5. Secrets
    Hugging Face access tokens are stored in Secret Manager and automatically mounted:

    --set-secrets=HUGGING_FACE_HUB_TOKEN=<secret>:latest
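
For reference, a stage could read this configuration at startup along these lines (a hypothetical sketch; the actual stage code may differ):

import os

STAGE = os.environ.get("STAGE", "a")                    # a | b | c
K1 = int(os.environ["K1"])                              # end of Stage A's layer range
K2 = int(os.environ["K2"])                              # end of Stage B's layer range
MODEL_DIR = os.environ.get("MODEL_DIR", "/cache/huggingface")
DEVICE = os.environ.get("DEVICE", "cuda")
B_ENDPOINT = os.environ.get("B_ENDPOINT")               # Stage A's downstream gRPC target
C_ENDPOINT = os.environ.get("C_ENDPOINT")               # Stage B's downstream gRPC target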
    

Demonstration

For the demo deployment, we used:

  • Model: Gemma-3-27B-Instruct (unquantized, β‰ˆ 55 GB of weights)
  • Stages: A = layers 0–19, B = 20–43, C = 44–end
  • GPUs: NVIDIA L4 (Cloud Run GPU)
  • Wire Format: bf16 boundary frames over gRPC
  • Endpoint: OpenAI-compatible HTTP API

When invoked via curl:

curl -N https://<stage-a-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Explain Google Cloud Run"}],"stream":true}'

Each token returned to the client was generated through a full A → B → C → B → A roundtrip, yet streaming latency remained interactive.
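
Because the endpoint is OpenAI-compatible, the official OpenAI Python SDK works as well; the base_url and model name below are placeholders for your own deployment:

from openai import OpenAI

client = OpenAI(base_url="https://<stage-a-url>/v1", api_key="unused")

stream = client.chat.completions.create(
    model="gemma-3-27b",  # placeholder; Stage A serves a single fixed model
    messages=[{"role": "user", "content": "Explain Google Cloud Run"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)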


Technology Stack

Languages & Frameworks

  • Python 3.11
  • FastAPI / Uvicorn
  • gRPC / Protocol Buffers
  • PyTorch 2.8 (CUDA 12.8, bfloat16)
  • Hugging Face Transformers + Safetensors
  • Accelerate for meta-initialization
  • Cloud Build + Artifact Registry + Cloud Run

Google Cloud Integration

  • Cloud Run GPU (L4, europe-west1)
  • Cloud Storage (volume mounts)
  • Secret Manager (HF token)
  • IAM-secured Service Accounts
  • Autoscaling + revisioned deployments

Why It Matters

Commissure proves that Cloud Run can serve models far larger than a single container by treating each GPU-enabled service as a neural region of a distributed brain.

  • HPC-like performance, serverless simplicity
  • Dynamic scaling and isolation per stage
  • Managed networking and IAM out of the box
  • Composable architecture ready for larger checkpoints and future quantized variants

This approach unlocks new possibilities for developers who want to deploy large open-source LLMs while keeping the convenience of Cloud Run - no manual clusters, no VM orchestration, no external load balancers.


Roadmap

  • Per-Stage Quantization: Mix bf16, int8, and nf4 to fit larger models.
  • Distributed Training Prototype: Leverage the same gRPC fabric for backprop experiments.
  • Unified CLI Dashboard: Realtime health and token stream monitoring.

Repository Structure

app/
 ├── stage_a.py          # Front (HTTP + tokenizer + first layers)
 ├── stage_b_mid.py      # Middle transformer layers
 ├── stage_c.py          # Final layers + logits
 ├── half_loader.py      # Streaming checkpoint loader
 ├── model_loader.py     # Stage wrapper classes
 ├── utils.py            # bf16 serialization utilities
 ├── tokenizer.py        # Shared tokenizer logic
 └── boundary.proto      # gRPC schema
commissure_cli.py        # Cloud Run deploy/build CLI
Dockerfile               # Runtime image definition
docker/entrypoint.sh     # Stage selector
commissure.yaml          # Deployment config
requirements.txt         # Python dependencies for the Docker image

Conclusion

Commissure shows that large-scale model inference doesn’t have to leave the serverless world.
By treating Cloud Run services as the hemispheres of a single distributed brain, we can scale open-source LLMs beyond traditional container limits - without sacrificing simplicity, security, or autoscaling.


Built for the Google Cloud Run Hackathon.
