Deploying Gemma 4 on Cloud Run: Pay Only When You Use It
Learn how to deploy Google's Gemma 4 on Cloud Run for cost-effective, scalable AI, paying only for the compute you actually use. This post covers the key features, the deployment stack, and why it matters.
Google's Gemma 4 is a major leap in open-access AI models, offering four variants (from 2B to 31B parameters) with advanced features like multimodal input, improved reasoning, and efficient function calling. But the real game-changer is deploying Gemma 4 on Google Cloud Run—so you only pay for compute when the model is actually in use.
Why Cloud Run?
Cloud Run scales to zero when idle, meaning you’re not billed for unused resources. This is a huge improvement over always-on endpoints like Vertex AI Model Garden, where forgetting to shut down can lead to surprise bills. With Cloud Run, you can experiment, test, and even run production workloads with cost control and flexibility.
Gemma 4 Highlights
- Four Model Sizes: E2B, E4B (dense, 128k context), 26B A4B (MoE, 256k context), and 31B (dense, 256k context).
- Mixture of Experts (MoE): The 26B model activates only 4B parameters per token, offering near-26B performance at a fraction of the compute cost.
- Multimodal Input: Supports images, audio, and video as input, with text output. Small models can process video+audio; large models excel at image tasks.
- Advanced Reasoning & Function Calling: Handles complex, multi-step tasks and structured tool calls, making it ideal for agentic pipelines.
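Once the model is served (the deployment stack is described below), function calling is exercised through vLLM's OpenAI-compatible chat endpoint. As an illustration only, with a placeholder service URL, model name, and tool definition:

```shell
# Hypothetical Cloud Run service URL and model name; vLLM exposes an
# OpenAI-compatible /v1/chat/completions route that accepts "tools".
curl https://gemma-service-xyz.run.app/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

When the model decides a tool is needed, the response contains a structured `tool_calls` entry instead of free text, which is what makes it usable in agentic pipelines.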
Deployment Stack
- vLLM: The recommended inference engine for production, supporting efficient batching, KV-cache sharing, and quantization.
- Run:ai Model Streamer: For large models, streams weights from Google Cloud Storage (GCS) during startup, enabling fast cold starts.
- Private Google Access: Critical for fast model loading from GCS—enables internal network speeds and slashes cold start times.
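The pieces above come together in a single deploy command. This is a sketch, not the article's exact invocation: the image, network, and bucket names are placeholders, GPU flags currently sit under the `gcloud beta` surface, and whether your vLLM build streams weights directly from `gs://` paths depends on the version you use.

```shell
# Sketch only: project, network, and bucket names are placeholders.
# Cloud Run GPUs are NVIDIA L4s; GPU flags require the beta command group.
gcloud beta run deploy gemma-26b \
  --image=vllm/vllm-openai:latest \
  --region=us-central1 \
  --gpu=1 --gpu-type=nvidia-l4 \
  --memory=32Gi --cpu=8 \
  --network=gemma-vpc --subnet=gemma-subnet \
  --vpc-egress=all-traffic \
  --args="--model=gs://MY_BUCKET/gemma-4-26b,--load-format=runai_streamer"
```

Routing egress through a subnet with Private Google Access is what lets the Run:ai streamer pull weights from GCS at internal network speeds during cold start.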
Cost and Performance
- Scale-to-Zero: No requests = no cost. Cold starts for large models (e.g., 26B) can be as low as 191 seconds with the right setup.
- Warm Responses: Once running, even the largest models respond in seconds (e.g., 1.61s for 26B MoE).
- Production Flexibility: You can set a minimum instance count for always-on serving, or stick with scale-to-zero for development/testing.
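Switching between the two modes is one flag. As a sketch (the service name is a placeholder):

```shell
# Keep one instance warm for production latency; set back to 0 for dev/test
# to return to pure scale-to-zero billing.
gcloud run services update gemma-26b \
  --region=us-central1 \
  --min-instances=1
```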
Step-by-Step Guide
The original article provides a full deployment guide, including:
- Setting up environment variables
- Enabling required Google Cloud APIs
- Checking GPU quota
- Creating VPC and GCS bucket
- Uploading models
- Deploying with vLLM and Run:ai streamer
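The setup steps above can be sketched roughly as follows. Every name here is a placeholder, and the exact commands may differ from the original article; GPU quota for Cloud Run is easiest to verify in the Cloud console rather than the CLI.

```shell
# Placeholders throughout; adjust to your project.
export PROJECT_ID=my-project
export REGION=us-central1
export BUCKET=gs://${PROJECT_ID}-gemma-weights

gcloud config set project ${PROJECT_ID}

# Enable the required APIs
gcloud services enable run.googleapis.com compute.googleapis.com storage.googleapis.com

# Check GPU quota for Cloud Run in your region via the console quotas page
# before deploying; GPU capacity is granted per project, per region.

# Create a VPC subnet with Private Google Access, and a bucket for weights
gcloud compute networks create gemma-vpc --subnet-mode=custom
gcloud compute networks subnets create gemma-subnet \
  --network=gemma-vpc --region=${REGION} \
  --range=10.0.0.0/24 --enable-private-ip-google-access
gcloud storage buckets create ${BUCKET} --location=${REGION}

# Upload model weights (downloaded beforehand, e.g. from Hugging Face)
gcloud storage cp -r ./gemma-4-26b ${BUCKET}/gemma-4-26b
```

With the weights in GCS and the subnet in place, the final step is the vLLM deploy command shown under the deployment stack.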
Why This Matters
Deploying Gemma 4 on Cloud Run democratizes access to powerful AI, making it affordable and practical for both experimentation and production. You get the benefits of advanced LLMs without the risk of runaway costs.
Read the full guide for detailed steps: Deploy Gemma 4 on Cloud Run (dev.to)
Cloud Run + Gemma 4 = powerful, cost-effective, and production-ready AI on your terms.