Deploying Gemma 4 on Cloud Run: Pay Only When You Use It
Learn how to deploy Google's Gemma 4 on Cloud Run for cost-effective, scalable AI, paying only for the compute you actually use. This post covers the key features, the deployment stack, and why it matters.
Google's Gemma 4 is a major leap in open-access AI models, offering four variants (from 2B to 31B parameters) with advanced features like multimodal input, improved reasoning, and efficient function calling. But the real game-changer is deploying Gemma 4 on Google Cloud Run—so you only pay for compute when the model is actually in use.
Why Cloud Run?
Cloud Run scales to zero when idle, meaning you’re not billed for unused resources. This is a huge improvement over always-on endpoints like Vertex AI Model Garden, where forgetting to shut down can lead to surprise bills. With Cloud Run, you can experiment, test, and even run production workloads with cost control and flexibility.
Gemma 4 Highlights
- Four Model Sizes: E2B, E4B (dense, 128k context), 26B A4B (MoE, 256k context), and 31B (dense, 256k context).
- Mixture of Experts (MoE): The 26B model activates only 4B parameters per token, offering near-26B performance at a fraction of the compute cost.
- Multimodal Input: Supports images, audio, and video as input, with text output. Small models can process video+audio; large models excel at image tasks.
- Advanced Reasoning & Function Calling: Handles complex, multi-step tasks and structured tool calls, making it ideal for agentic pipelines.
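Once the model is served (the deployment stack is described below), function calling is exercised through vLLM's OpenAI-compatible chat endpoint. As an illustration only, with a placeholder service URL, model name, and tool definition:

```shell
# Hypothetical Cloud Run service URL and model name; vLLM exposes an
# OpenAI-compatible /v1/chat/completions route that accepts "tools".
curl https://gemma-service-xyz.run.app/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

When the model decides a tool is needed, the response contains a structured `tool_calls` entry instead of free text, which is what makes it usable in agentic pipelines.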
Deployment Stack
- vLLM: The recommended inference engine for production, supporting efficient batching, KV-cache sharing, and quantization.
- Run:ai Model Streamer: For large models, streams weights from Google Cloud Storage (GCS) during startup, enabling fast cold starts.
- Private Google Access: Critical for fast model loading from GCS—enables internal network speeds and slashes cold start times.
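The pieces above come together in a single deploy command. This is a sketch, not the article's exact invocation: the image, network, and bucket names are placeholders, GPU flags currently sit under the `gcloud beta` surface, and whether your vLLM build streams weights directly from `gs://` paths depends on the version you use.

```shell
# Sketch only: project, network, and bucket names are placeholders.
# Cloud Run GPUs are NVIDIA L4s; GPU flags require the beta command group.
gcloud beta run deploy gemma-26b \
  --image=vllm/vllm-openai:latest \
  --region=us-central1 \
  --gpu=1 --gpu-type=nvidia-l4 \
  --memory=32Gi --cpu=8 \
  --network=gemma-vpc --subnet=gemma-subnet \
  --vpc-egress=all-traffic \
  --args="--model=gs://MY_BUCKET/gemma-4-26b,--load-format=runai_streamer"
```

Routing egress through a subnet with Private Google Access is what lets the Run:ai streamer pull weights from GCS at internal network speeds during cold start.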
Cost and Performance
- Scale-to-Zero: No requests = no cost. Cold starts for large models (e.g., 26B) can be as low as 191 seconds with the right setup.
- Warm Responses: Once running, even the largest models respond in seconds (e.g., 1.61s for 26B MoE).
- Production Flexibility: You can set a minimum instance count for always-on serving, or stick with scale-to-zero for development/testing.
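Switching between the two modes is one flag. As a sketch (the service name is a placeholder):

```shell
# Keep one instance warm for production latency; set back to 0 for dev/test
# to return to pure scale-to-zero billing.
gcloud run services update gemma-26b \
  --region=us-central1 \
  --min-instances=1
```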
Step-by-Step Guide
The original article provides a full deployment guide, including:
- Setting up environment variables
- Enabling required Google Cloud APIs
- Checking GPU quota
- Creating VPC and GCS bucket
- Uploading models
- Deploying with vLLM and Run:ai streamer
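The setup steps above can be sketched roughly as follows. Every name here is a placeholder, and the exact commands may differ from the original article; GPU quota for Cloud Run is easiest to verify in the Cloud console rather than the CLI.

```shell
# Placeholders throughout; adjust to your project.
export PROJECT_ID=my-project
export REGION=us-central1
export BUCKET=gs://${PROJECT_ID}-gemma-weights

gcloud config set project ${PROJECT_ID}

# Enable the required APIs
gcloud services enable run.googleapis.com compute.googleapis.com storage.googleapis.com

# Check GPU quota for Cloud Run in your region via the console quotas page
# before deploying; GPU capacity is granted per project, per region.

# Create a VPC subnet with Private Google Access, and a bucket for weights
gcloud compute networks create gemma-vpc --subnet-mode=custom
gcloud compute networks subnets create gemma-subnet \
  --network=gemma-vpc --region=${REGION} \
  --range=10.0.0.0/24 --enable-private-ip-google-access
gcloud storage buckets create ${BUCKET} --location=${REGION}

# Upload model weights (downloaded beforehand, e.g. from Hugging Face)
gcloud storage cp -r ./gemma-4-26b ${BUCKET}/gemma-4-26b
```

With the weights in GCS and the subnet in place, the final step is the vLLM deploy command shown under the deployment stack.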
Why This Matters
Deploying Gemma 4 on Cloud Run democratizes access to powerful AI, making it affordable and practical for both experimentation and production. You get the benefits of advanced LLMs without the risk of runaway costs.
Read the full guide for detailed steps: Deploy Gemma 4 on Cloud Run (dev.to)
Cloud Run + Gemma 4 = powerful, cost-effective, and production-ready AI on your terms.