# Provisioning a Small GKE Cluster with Terraform and Terragrunt

## Introduction
If you run Google Kubernetes Engine (GKE) in more than one environment, plain Terraform folders tend to sprawl: duplicated backend blocks, copy-pasted variables, and fragile ordering between VPC and cluster. Terragrunt adds a thin orchestration layer: shared remote state, generated providers, and explicit dependency links between stacks.
This guide walks through a realistic small-cluster layout (the kind used in production-style repos that split `modules/` and `environments/`). It focuses on foundation only: bootstrap state, VPC with pod/service secondary ranges, and a small autoscaling node pool. A companion post covers Cloud SQL and Cloud Storage in the same pattern: Connecting GKE to Cloud SQL and Cloud Storage with Terraform.
For baseline GKE concepts and a minimal single-file Terraform example, see the docs: GKE cluster setup and GKE networking.
## Why Terragrunt on Top of Terraform
Terragrunt does not replace Terraform; it composes it:
- DRY remote state — one `remote_state` definition per environment root, with a prefix per child folder (each stack gets its own state key).
- Generated provider — e.g. a single `provider "google"` block emitted as `provider.tf` for every stack.
- Live config vs modules — Terraform code stays in reusable modules; environment-specific values live in `terragrunt.hcl` under `environments/<env>/...`.
- Dependencies — `dependency "vpc" { config_path = "../vpc" }` wires outputs into downstream modules without manual `terraform_remote_state` boilerplate in every child module.
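The generated-provider piece can be sketched as a `generate` block in the environment root `terragrunt.hcl` (project and region values here are placeholders):

```hcl
# Emit one provider.tf into every child stack so no module
# has to declare the provider itself.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite"
  contents  = <<EOF
provider "google" {
  project = "my-project"
  region  = "europe-west1"
}
EOF
}
```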
That matches the structure described in infrastructure repos that document: modules for VPC, GKE, DB, buckets; Terragrunt folders per stack; global for shared IAM or registry resources.
## Repository Layout (Modules + Live Config)
A common top-level shape:
- `bootstrap_gcp/` — one-off Terraform to create the GCS bucket used for Terraform state.
- `modules/` — reusable Terraform: `vpc`, `gke`, `db`, `bucket`, etc.
- `environments/` — Terragrunt live config: `dev/`, `prod/`, optional `global/` for shared resources.
Each environment root `terragrunt.hcl` typically sets:

- A generated `provider.tf`
- A `remote_state` backend (GCS) with an environment-specific prefix
- Shared `inputs`: project, region, networking, GKE sizing, DNS flags, and so on
Child folders such as `vpc/`, `gke/`, `db/`, `bucket/` each contain their own `terragrunt.hcl` pointing at `terraform { source = "../../../modules/<name>" }` and `include` to inherit the parent.
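Put together, the tree might look like this (names are illustrative, not prescriptive):

```text
.
├── bootstrap_gcp/
├── modules/
│   ├── vpc/
│   ├── gke/
│   ├── db/
│   └── bucket/
└── environments/
    ├── dev/
    │   ├── terragrunt.hcl        # provider, remote_state, shared inputs
    │   ├── vpc/terragrunt.hcl
    │   ├── gke/terragrunt.hcl
    │   ├── db/terragrunt.hcl
    │   └── bucket/terragrunt.hcl
    ├── prod/
    └── global/
```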
## Step 1: Bootstrap Remote State

Before `terragrunt run-all`, you need a durable state bucket. A minimal bootstrap stack creates something like:

- A `google_storage_bucket` for Terraform state
- Versioning enabled on that bucket
Apply this stack once per organization/project (or per landing zone), then point every Terragrunt root at that bucket. This keeps state centralized and versioned, which matters as soon as you have more than one stack (VPC, GKE, DB, buckets).
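A minimal sketch of such a bootstrap stack — bucket name and location are placeholders, adjust for your project:

```hcl
resource "google_storage_bucket" "terraform_state" {
  name     = "my-terraform-state"
  location = "EU"

  # Keep old state versions so a bad apply can be rolled back.
  versioning {
    enabled = true
  }

  # State buckets should never be world-readable.
  uniform_bucket_level_access = true
}
```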
## Step 2: Remote State per Stack

In the environment root, `remote_state` often uses the GCS backend with a `prefix` that includes the relative path of each Terragrunt unit, for example:
```hcl
remote_state {
  backend = "gcs"
  config = {
    bucket = "my-terraform-state"
    prefix = "dev/${path_relative_to_include()}"
  }
}
```
That yields separate state files for `dev/vpc`, `dev/gke`, `dev/db`, etc., while sharing one bucket and one naming convention.
## Step 3: VPC With GKE Secondary Ranges

GKE VPC-native clusters need a subnet with secondary IP ranges for pods and services. A typical `modules/vpc` pattern:

- Custom VPC, single regional subnet
- Two `secondary_ip_range` blocks, e.g. named `services-range` and `pods-range` (names are referenced by the cluster)
The cluster module then consumes network ID, subnetwork ID, and the range names so `ip_allocation_policy` can attach pod and service CIDRs correctly.
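A sketch of that subnet shape — CIDRs, names, and the region are illustrative placeholders:

```hcl
resource "google_compute_network" "vpc" {
  name                    = "main-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "gke" {
  name          = "gke-subnet"
  region        = "europe-west1"
  network       = google_compute_network.vpc.id
  ip_cidr_range = "10.0.0.0/20"

  # Secondary ranges consumed by the cluster's ip_allocation_policy;
  # the range names are what the GKE module references.
  secondary_ip_range {
    range_name    = "pods-range"
    ip_cidr_range = "10.4.0.0/14"
  }
  secondary_ip_range {
    range_name    = "services-range"
    ip_cidr_range = "10.8.0.0/20"
  }
}
```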
## Step 4: Wire GKE to VPC With Terragrunt

The `gke` stack's `terragrunt.hcl` usually:

- Points `terraform.source` at `modules/gke`
- Includes the parent folders (provider + remote state + shared inputs)
- Declares a `dependency "vpc"` on `../vpc`
- Passes VPC outputs into the module `inputs`
Conceptually:
```hcl
terraform {
  source = "../../../modules/gke"
}

include {
  path = find_in_parent_folders()
}

dependency "vpc" {
  config_path = "../vpc"

  mock_outputs = {
    network_id          = "network-00000001"
    subnetwork_id       = "subnet-00000001"
    pods_range_name     = "pods-range"
    services_range_name = "services-range"
  }
}

inputs = {
  network_id          = dependency.vpc.outputs.network_id
  subnetwork_id       = dependency.vpc.outputs.subnetwork_id
  pods_range_name     = dependency.vpc.outputs.pods_range_name
  services_range_name = dependency.vpc.outputs.services_range_name
}
```
Note: `mock_outputs` helps `validate` and some planning scenarios when the VPC stack is not applied yet; remove or tighten mocks for real applies.
## Step 5: Cluster and Node Pool Choices for a Small Footprint

### Cluster resource

A compact `google_container_cluster` setup often:

- Sets `remove_default_node_pool = true` and a small `initial_node_count` only to satisfy the API, then manages nodes in a dedicated pool
- Uses `ip_allocation_policy` with `cluster_secondary_range_name` and `services_secondary_range_name` matching the VPC module
- Configures a maintenance window (a weekly recurring window is typical)
This repo style does not always enable private control-plane endpoints or Workload Identity in the same module; treat those as hardening steps and align them with GKE security when you need them.
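A minimal sketch of that cluster shape — the `var.*` names are assumed module inputs, and the maintenance window values are examples:

```hcl
resource "google_container_cluster" "this" {
  name     = var.cluster_name
  location = var.region

  network    = var.network_id
  subnetwork = var.subnetwork_id

  # The API requires a default pool at creation; drop it immediately
  # and manage nodes in a dedicated google_container_node_pool.
  remove_default_node_pool = true
  initial_node_count       = 1

  # VPC-native: attach the secondary ranges created by modules/vpc.
  ip_allocation_policy {
    cluster_secondary_range_name  = var.pods_range_name
    services_secondary_range_name = var.services_range_name
  }

  # Weekly recurring maintenance window (Saturday night, UTC).
  maintenance_policy {
    recurring_window {
      start_time = "2024-01-06T02:00:00Z"
      end_time   = "2024-01-06T06:00:00Z"
      recurrence = "FREQ=WEEKLY;BYDAY=SA"
    }
  }
}
```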
### Node pool

For cost-aware dev or small prod:

- `autoscaling` with a low `total_max_node_count` (e.g. 2) and `min_node_count` of 0 where policy allows — so the pool can scale down when idle
- `preemptible` (or spot) nodes in non-prod to cut cost
- A dedicated node service account with `oauth_scopes` including `cloud-platform`, logging, and monitoring
Common IAM on the node SA includes:

- `roles/container.defaultNodeServiceAccount`
- `roles/artifactregistry.reader` (pull images from Artifact Registry)
- `roles/compute.networkAdmin` (often used for ingress / certificate workflows — review least privilege for your case)
Machine type and disk size come from the environment root inputs (e.g. `e2-standard-2`, 60 GiB disk) so dev and prod can diverge without forking the module.
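A sketch of such a pool; the `var.*` inputs and the `google_service_account.nodes` resource are assumed to be defined elsewhere in the module:

```hcl
resource "google_container_node_pool" "small" {
  name    = "small-pool"
  cluster = google_container_cluster.this.id

  # Low ceiling so a quiet environment stays cheap.
  autoscaling {
    total_min_node_count = 0
    total_max_node_count = 2
  }

  node_config {
    machine_type    = var.machine_type  # e.g. "e2-standard-2"
    disk_size_gb    = var.disk_size_gb  # e.g. 60
    preemptible     = var.preemptible   # true in non-prod to cut cost
    service_account = google_service_account.nodes.email

    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }
}
```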
## Step 6: Apply Order and Day-Two Operations

From the environment directory:

```shell
terragrunt run-all plan
terragrunt run-all apply
```
Terragrunt respects dependencies between units (e.g. GKE after VPC). For brownfield GCP, some repos also ship import scripts mapping existing resources into state — useful when you adopt Terraform after manual console setup.
After apply:

- Fetch credentials with `gcloud container clusters get-credentials ...`
- Install cluster components (ingress, GitOps, etc.) — that layer is usually outside this Terraform/Terragrunt foundation
## How This Relates to the Docs

The site's GKE cluster setup shows a single-file Terraform example including private nodes and Workload Identity. Use that page when you want the canonical resource shape; use Terragrunt when you need multi-stack, multi-environment, and shared backend ergonomics. The two approaches compose: modules can mirror the same `google_container_cluster` arguments.
## Next: Data Layer
Once the cluster and VPC exist, add Cloud SQL (private IP, service networking) and GCS buckets with IAM — see Connecting GKE to Cloud SQL and Cloud Storage with Terraform.
## Summary

| Piece | Role |
|---|---|
| Bootstrap bucket | Durable GCS backend for all stacks |
| Env root `terragrunt.hcl` | Provider generation, remote state, shared inputs |
| `modules/vpc` | Subnet + pod/service secondary ranges |
| `modules/gke` | Cluster + autoscaling node pool + node SA IAM |
| `dependency "vpc"` in `gke/` | Clean wiring without copy-paste |
Terragrunt keeps small GKE setups maintainable: the same modules power dev and prod, while state stays isolated per stack and environment.