Commissure

Commissure is a distributed, high-performance runtime that brings large-scale LLM inference to Google Cloud Run - without leaving the serverless experience.

It allows you to run models that would normally exceed a single container’s memory or GPU capacity (e.g., Gemma-3-27B, ~55 GB of weights) by splitting the model into cooperating Cloud Run services that communicate through gRPC.
Each service loads only its assigned range of layers, and together they behave as one large model.

In neuroscience, a commissure is the bridge between the brain’s hemispheres - the structure that lets separate regions act as one.
In the same way, Commissure bridges multiple Cloud Run services to act as a single distributed model.


Overview

Commissure demonstrates how Cloud Run can be used for HPC-style distributed inference, without resorting to custom clusters or VM orchestration.
Instead of scaling up to larger machines, Commissure scales across multiple GPU services that share work in real time.

The project was built for the Google Cloud Run Hackathon (GPU Category), using the NVIDIA L4 GPUs available in Cloud Run.
It runs Gemma-3-27B fully unquantized and achieves real-time streaming inference through gRPC boundary tensor passing in bfloat16 format.


How It Works

1. Split-by-Stage Inference

Commissure divides the model into three cooperative stages, each running as an independent Cloud Run service:

Stage    Role      Responsibilities
Stage A  Frontend  Public HTTP/SSE endpoint (/v1/chat/completions), tokenization, embeddings, and the first K₁ layers (0 … K₁−1)
Stage B  Middle    Transformer layers K₁ … K₂−1; receives activations from A, streams transformed activations to C
Stage C  Back      Final layers (K₂ … L−1) + norm + LM head; produces logits and selects the next token

Each stage runs on its own Cloud Run GPU instance, loading only the subset of model weights that fits comfortably within its memory budget.
Intermediate activations are serialized as raw bfloat16 tensors and streamed between services via gRPC - ensuring compact transfer and low latency.

From the outside, users still see one API endpoint, but under the hood, the model is running cooperatively across three containers.
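
As a rough sketch of that wire format (the helper names here are illustrative, not the actual utils.py API): bfloat16 shares its bit pattern with the upper half of float32, so a boundary tensor can be reinterpreted as raw 16-bit words, shipped as bytes, and reinterpreted back on the receiving stage.

import numpy as np
import torch

def bf16_to_bytes(x: torch.Tensor) -> bytes:
    # Reinterpret (not cast): bf16 and int16 are both 2 bytes wide,
    # so this is a zero-loss bitcast followed by a host copy.
    assert x.dtype == torch.bfloat16
    return x.contiguous().view(torch.int16).cpu().numpy().tobytes()

def bytes_to_bf16(buf: bytes, shape: tuple, device: str = "cuda") -> torch.Tensor:
    # np.frombuffer returns a read-only view, so copy before handing to torch.
    arr = np.frombuffer(buf, dtype=np.int16).copy()
    return torch.from_numpy(arr).view(torch.bfloat16).reshape(shape).to(device)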


2. Data Flow

┌───────────────────────────────────────────────────────────────────────────────┐
│                            BUILD & DEPLOY PIPELINE                            │
└───────────────────────────────────────────────────────────────────────────────┘

   Developer Workstation
   (Commissure repo)
                  │  ./commissure up
                  ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  Cloud Build                                                                  │
│  • Builds single GPU image (Stage A/B/C via STAGE env)                        │
│  • Runs grpc_tools.protoc to generate boundary_pb2[_grpc].py                  │
│  • Installs PyTorch, Transformers, gRPC, FastAPI, etc.                        │
└───────────────────────────────────────────────────────────────────────────────┘
                  │  docker push
                  ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  Artifact Registry                                                            │
│  • Stores Commissure runtime image                                            │
│  • Image later reused for all three Cloud Run services (A, B, C)              │
└───────────────────────────────────────────────────────────────────────────────┘
                  │  HF snapshot_download (local) + gsutil rsync
                  ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  Secret Manager                                                               │
│  • Secret: HUGGING_FACE_HUB_TOKEN / HF_TOKEN                                  │
│  • Mounted into Cloud Run env as HF_TOKEN/HUGGING_FACE_HUB_TOKEN              │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  Cloud Storage (GCS)                                                          │
│  • Bucket: gs://<hf_bucket>                                                   │
│  • Flat copy of Gemma-3-27B weights (safetensors + config.json)               │
│  • Mounted into Cloud Run as volume /cache/huggingface                        │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │   gcloud run deploy … --add-volume=hf-cache
                              │
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  IAM / Service Accounts                                                       │
│  • SA: commissure-runtime@<project>.iam.gserviceaccount.com                   │
│  • Roles:                                                                     │
│      – read from Artifact Registry                                            │
│      – read from Cloud Storage bucket                                         │
│      – access Secret Manager (HF token)                                       │
│      – write logs to Cloud Logging                                            │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  Cloud Run (GPU, europe-west1)                                                │
│  • Deploys three services from the SAME image:                                │
│      – commissure-a   (STAGE=a)                                               │
│      – commissure-b   (STAGE=b)                                               │
│      – commissure-c   (STAGE=c)                                               │
│  • Each attached to:                                                          │
│      – L4 GPU                                                                 │
│      – hf-cache volume (GCS bucket)                                           │
│      – HF secrets from Secret Manager                                         │
│      – Runtime service account                                                │
└───────────────────────────────────────────────────────────────────────────────┘

Cloud Logging / Cloud Monitoring:
• All three services emit logs + metrics for warmup, latency, errors, etc.

┌───────────────────────────────────────────────────────────────────────────────┐
│                             RUNTIME DATA FLOW                                 │
└───────────────────────────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────────────────────────┐
│                            User Request (HTTPS)                               │
│     • curl /v1/chat/completions (OpenAI-compatible)                           │
│     • Browser / CLI / app client                                              │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  STAGE A – FRONT (Cloud Run Service – L4 GPU)                                 │
│  • FastAPI HTTP / SSE endpoint (public-facing, OpenAI-compatible)             │
│  • Uses Cloud Run GPU service + hf-cache volume (GCS-mounted)                 │
│  • Tokenizer (chat templates, stop IDs, text → token IDs)                     │
│  • Embeddings + decoder layers 0..K₁−1 (front of Gemma-3-27B)                 │
│  • Maintains its own DynamicCache KV for layers 0..K₁−1                       │
│  • Computes boundary activations:                                             │
│        x₀ ∈ ℝ^{B×S×d_model}  (bf16 activations on GPU)                        │
│  • Serializes x₀ to bf16 wire format (uint16)                                 │
│  • Streams x₀ chunks over gRPC to Stage B                                     │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │  gRPC stream (bf16-serialized x₀)
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  STAGE B – MIDDLE (Cloud Run Service – L4 GPU)                                │
│  • gRPC bidirectional streaming server (Boundary.Decode)                      │
│  • Runs on Cloud Run GPU with same image and hf-cache volume                  │
│  • Middle transformer layers K₁..K₂−1                                         │
│  • DynamicCache KV for its own layer range                                    │
│  • Receives boundary tensor x₀ (bf16) from Stage A                            │
│  • Computes x₁ = f_B(x₀) through layers K₁..K₂−1                              │
│       x₀, x₁ ∈ ℝ^{B×S×d_model} (shape preserved)                              │
│  • Serializes x₁ as bf16 and streams to Stage C over gRPC                     │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │  gRPC stream (bf16-serialized x₁)
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  STAGE C – BACK (Cloud Run Service – L4 GPU)                                  │
│  • gRPC bidirectional streaming server (Boundary.Decode)                      │
│  • Runs final decoder layers K₂..L−1 + final LayerNorm + LM Head              │
│  • DynamicCache KV for its own layers                                         │
│  • Receives boundary tensor x₁ (bf16)                                         │
│  • Computes logits: x₂ = f_C(x₁), x₂ ∈ ℝ^{B×S×Vocab} (fp32 logits)            │
│  • Token sampling (temperature, top-p, greedy fallback)                       │
│  • Returns next_token_id to Stage B over the same gRPC stream                 │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │  TokenFrame (step_id, next_token_id)
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  STAGE B – CONTINUATION                                                       │
│  • For S>1 (prefill): only forwards transformed chunks to Stage C             │
│  • For S=1 (decode): forwards token-by-token, relays next_token_id upstream   │
│  • Streams TokenFrame back to Stage A                                         │
└─────────────────────────────┬─────────────────────────────────────────────────┘
                              │  token IDs (gRPC stream)
                              ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│  STAGE A – CONTINUATION                                                       │
│  • Receives next_token_id from Stage B/C chain                                │
│  • Decodes token IDs → UTF-8 text using tokenizer                             │
│  • Streams chunks as SSE (/v1/chat/completions, OpenAI-style)                 │
│  • Client sees a single logical model endpoint, even though under the hood    │
│    three Cloud Run GPU services are cooperating via gRPC.                     │
└───────────────────────────────────────────────────────────────────────────────┘

Each token is generated in real time:

A → B → C → A → B → C → …

The pipeline repeats per token, streaming results back to the client through SSE or plain text.
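
In pseudocode, one request looks roughly like this (stage_a_forward and stage_b_then_c are hypothetical stand-ins for the gRPC-backed stages, shown only to make the prefill/decode split concrete):

def generate(prompt_ids: list[int], max_new_tokens: int, eos_id: int):
    # Prefill (S > 1): the whole prompt flows A -> B -> C in a single pass.
    hidden = stage_a_forward(prompt_ids)   # embeddings + layers 0..K1-1
    next_id = stage_b_then_c(hidden)       # layers K1..L-1, norm, LM head, sampling
    for _ in range(max_new_tokens):
        if next_id == eos_id:
            break
        yield next_id                      # Stage A detokenizes and streams via SSE
        # Decode (S = 1): one more A -> B -> C roundtrip per token; each
        # stage's DynamicCache keeps attention state local between steps.
        hidden = stage_a_forward([next_id])
        next_id = stage_b_then_c(hidden)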


3. Model Loading

Commissure uses a streaming checkpoint loader (half_loader.py) to materialize only the relevant layer weights directly from .safetensors files - without instantiating the full model in memory.
This enables each service to boot with minimal memory overhead and remain within the limits of a single Cloud Run GPU container.
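
A minimal sketch of the idea, assuming Gemma-style parameter names ("model.layers.<i>. …"); the real half_loader.py may differ in detail:

from pathlib import Path

import torch
from safetensors import safe_open

def load_layer_range(model_dir: str, lo: int, hi: int, device: str = "cuda") -> dict:
    """Materialize only decoder layers lo..hi-1, plus shared (non-layer) weights."""
    state = {}
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        with safe_open(str(shard), framework="pt", device="cpu") as f:
            for name in f.keys():
                if ".layers." in name:
                    layer = int(name.split(".layers.")[1].split(".")[0])
                    if not (lo <= layer < hi):
                        continue  # belongs to another stage - never materialized
                # Shared tensors (embeddings, final norm, ...) fall through here;
                # each stage keeps only the ones it actually uses.
                state[name] = f.get_tensor(name).to(device=device, dtype=torch.bfloat16)
    return state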


Architecture Highlights

  • 3-Stage gRPC Pipeline
    Each Cloud Run service executes a distinct range of transformer layers, passing intermediate activations over secure gRPC streams.

  • OpenAI-Compatible API
    Stage A exposes /v1/chat/completions and /generate endpoints, fully compatible with the OpenAI client SDKs and curl usage.

  • bfloat16 Wire Format
    Boundary tensors are transmitted in raw bf16 format to minimize latency and bandwidth while maintaining numerical fidelity.

  • Dynamic KV Cache per Stage
    Every service maintains its own DynamicCache so attention states are reused locally between tokens.

  • Lazy Loading & Auto-Warmup
    Each stage loads weights on first use. The CLI performs automatic warm-up to ensure all stages are initialized before requests.

  • Composable Scaling
    The design naturally extends to 2, 3, 4 or N stages by adjusting layer ranges (K₁, Kβ‚‚, …).
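
A minimal illustration of that generalisation (a hypothetical helper; the repository itself pins K₁ and K₂ in commissure.yaml):

def split_layers(num_layers: int, num_stages: int) -> list[tuple[int, int]]:
    """Return [start, end) decoder-layer ranges, one per stage, as even as possible."""
    base, extra = divmod(num_layers, num_stages)
    ranges, start = [], 0
    for i in range(num_stages):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

# Gemma-3-27B has 62 decoder layers:
# split_layers(62, 3) -> [(0, 21), (21, 42), (42, 62)]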


Deployment

Commissure is fully automated through the provided CLI (commissure) and YAML configuration file (commissure.yaml).
The CLI uses Cloud Build, Artifact Registry, and Cloud Run to build, push, and deploy all stages in sequence.

Steps

  1. Build and Deploy (One Command)

    # One-time authentication
    gcloud auth login
    
    # Install dependencies on your workstation (Preferably inside a virtual environment)
    python3 -m pip install --upgrade huggingface_hub pyyaml
    
    # provide HF token
    export HUGGING_FACE_HUB_TOKEN=hf_xxx
    
    # Build and deploy
    chmod +x ./commissure
    ./commissure up

    This performs:

    • Enabling required Google Cloud APIs
    • Creating Artifact Registry & Cloud Storage bucket
    • Building the container image with Cloud Build
    • Uploading model weights to GCS
    • Deploying Stage C β†’ Stage B β†’ Stage A (with correct endpoints)
    • Automatic warm-up
  2. Manual Control (Optional)

    ./commissure build
    ./commissure deploy
    ./commissure ask "Hello Cloud Run"
  3. Runtime Volumes
    Each service mounts a Cloud Storage bucket (hf-cache) as a volume for model weights:

    --add-volume=name=hf-cache,type=cloud-storage,bucket="gs://<bucket>"
    
  4. Environment Variables (see the env-reading sketch after this list)

    • K1, K2 - layer boundaries
    • MODEL_DIR - GCS-mounted model path
    • DEVICE=cuda - selects GPU
    • B_ENDPOINT, C_ENDPOINT - downstream gRPC targets
  5. Secrets
    Hugging Face access tokens are stored in Secret Manager and automatically mounted:

    --set-secrets=HUGGING_FACE_HUB_TOKEN=<secret>:latest
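
For reference, a stage could read this configuration at startup along these lines (a hypothetical sketch; the actual stage code may differ):

import os

STAGE = os.environ.get("STAGE", "a")                    # a | b | c
K1 = int(os.environ["K1"])                              # end of Stage A's layer range
K2 = int(os.environ["K2"])                              # end of Stage B's layer range
MODEL_DIR = os.environ.get("MODEL_DIR", "/cache/huggingface")
DEVICE = os.environ.get("DEVICE", "cuda")
B_ENDPOINT = os.environ.get("B_ENDPOINT")               # Stage A's downstream gRPC target
C_ENDPOINT = os.environ.get("C_ENDPOINT")               # Stage B's downstream gRPC target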
    

Demonstration

For the demo deployment, we used:

  • Model: Gemma-3-27B-Instruct (unquantized, β‰ˆ 55 GB of weights)
  • Stages: A = layers 0–19, B = 20–43, C = 44–end
  • GPUs: NVIDIA L4 (Cloud Run GPU)
  • Wire Format: bf16 boundary frames over gRPC
  • Endpoint: OpenAI-compatible HTTP API

When invoked via curl:

curl -N https://<stage-a-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Explain Google Cloud Run"}],"stream":true}'

Each token returned to the client was generated through a full A → B → C → B → A roundtrip, yet streaming latency remained interactive.
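
Because the endpoint is OpenAI-compatible, the official OpenAI Python SDK works as well; the base_url and model name below are placeholders for your own deployment:

from openai import OpenAI

client = OpenAI(base_url="https://<stage-a-url>/v1", api_key="unused")

stream = client.chat.completions.create(
    model="gemma-3-27b",  # placeholder; Stage A serves a single fixed model
    messages=[{"role": "user", "content": "Explain Google Cloud Run"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)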


Technology Stack

Languages & Frameworks

  • Python 3.11
  • FastAPI / Uvicorn
  • gRPC / Protocol Buffers
  • PyTorch 2.8 (CUDA 12.8, bfloat16)
  • Hugging Face Transformers + Safetensors
  • Accelerate for meta-initialization
  • Cloud Build + Artifact Registry + Cloud Run

Google Cloud Integration

  • Cloud Run GPU (L4, europe-west1)
  • Cloud Storage (volume mounts)
  • Secret Manager (HF token)
  • IAM-secured Service Accounts
  • Autoscaling + revisioned deployments

Why It Matters

Commissure proves that Cloud Run can serve models far larger than a single container by treating each GPU-enabled service as a neural region of a distributed brain.

  • HPC-like performance, serverless simplicity
  • Dynamic scaling and isolation per stage
  • Managed networking and IAM out of the box
  • Composable architecture ready for larger checkpoints and future quantized variants

This approach unlocks new possibilities for developers who want to deploy large open-source LLMs while keeping the convenience of Cloud Run - no manual clusters, no VM orchestration, no external load balancers.


Roadmap

  • Per-Stage Quantization: Mix bf16, int8, and nf4 to fit larger models.
  • Distributed Training Prototype: Leverage the same gRPC fabric for backprop experiments.
  • Unified CLI Dashboard: Realtime health and token stream monitoring.

Repository Structure

app/
 ├── stage_a.py          # Front (HTTP + tokenizer + first layers)
 ├── stage_b_mid.py      # Middle transformer layers
 ├── stage_c.py          # Final layers + logits
 ├── half_loader.py      # Streaming checkpoint loader
 ├── model_loader.py     # Stage wrapper classes
 ├── utils.py            # bf16 serialization utilities
 ├── tokenizer.py        # Shared tokenizer logic
 └── boundary.proto      # gRPC schema
commissure_cli.py        # Cloud Run deploy/build CLI
Dockerfile               # Runtime image definition
docker/entrypoint.sh     # Stage selector
commissure.yaml          # Deployment config
requirements.txt         # Python dependencies for the Docker image

Conclusion

Commissure shows that large-scale model inference doesn’t have to leave the serverless world.
By treating Cloud Run services as the hemispheres of a single distributed brain, we can scale open-source LLMs beyond traditional container limits - without sacrificing simplicity, security, or autoscaling.


Built for the Google Cloud Run Hackathon.
