Provisioning a Small GKE Cluster with Terraform and Terragrunt

Introduction

If you run Google Kubernetes Engine (GKE) in more than one environment, plain Terraform folders tend to sprawl: duplicated backend blocks, copy-pasted variables, and fragile ordering between VPC and cluster. Terragrunt adds a thin orchestration layer: shared remote state, generated providers, and explicit dependency links between stacks.

This guide walks through a realistic small-cluster layout (the kind used in production-style repos that split modules/ and environments/). It focuses on the foundation only: bootstrap state, a VPC with pod/service secondary ranges, and a small autoscaling node pool. A companion post covers Cloud SQL and Cloud Storage in the same pattern: Connecting GKE to Cloud SQL and Cloud Storage with Terraform.

For baseline GKE concepts and a minimal single-file Terraform example, see the docs: GKE cluster setup and GKE networking.

Why Terragrunt on Top of Terraform

Terragrunt does not replace Terraform; it composes it:

  • DRY remote state — one remote_state definition per environment root, with a prefix per child folder (each stack gets its own state key).
  • Generated provider — e.g. a single provider "google" block emitted as provider.tf for every stack.
  • Live config vs modules — Terraform code stays in reusable modules; environment-specific values live in terragrunt.hcl under environments/<env>/....
  • Dependencies — dependency "vpc" { config_path = "../vpc" } wires outputs into downstream modules without manual terraform_remote_state boilerplate in every child module.

That matches the structure documented in many infrastructure repos: modules for VPC, GKE, DB, and buckets; one Terragrunt folder per stack; and a global folder for shared IAM or registry resources.

Repository Layout (Modules + Live Config)

A common top-level shape:

  • bootstrap_gcp/ — one-off Terraform to create the GCS bucket used for Terraform state.
  • modules/ — reusable Terraform: vpc, gke, db, bucket, etc.
  • environments/ — Terragrunt live config: dev/, prod/, optional global/ for shared resources.

Each environment root terragrunt.hcl typically sets:

  • Generated provider.tf
  • remote_state backend (GCS) with an environment-specific prefix
  • Shared inputs: project, region, networking, GKE sizing, DNS flags, and so on
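
For the generated provider.tf, a root terragrunt.hcl typically carries a generate block along these lines (the project and region values here are placeholders):

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"

  contents = <<EOF
provider "google" {
  project = "my-project"    # placeholder: your GCP project ID
  region  = "us-central1"   # placeholder: your default region
}
EOF
}

Every child stack that includes this root gets the same provider.tf, so no stack carries its own copy.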

Child folders such as vpc/, gke/, db/, bucket/ each contain their own terragrunt.hcl pointing at terraform { source = "../../../modules/<name>" } and include to inherit the parent.

Step 1: Bootstrap Remote State

Before terragrunt run-all, you need a durable state bucket. A minimal bootstrap stack creates something like:

  • google_storage_bucket for Terraform state
  • Versioning enabled on that bucket
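
A minimal sketch of that bootstrap stack (the bucket name is a placeholder and must be globally unique):

resource "google_storage_bucket" "terraform_state" {
  name     = "my-terraform-state"  # placeholder: globally unique bucket name
  location = "US"

  # Keep prior state versions so a bad apply can be rolled back
  versioning {
    enabled = true
  }
}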

Apply this stack once per organization/project (or per landing zone), then point every Terragrunt root at that bucket. This keeps state centralized and versioned, which matters as soon as you have more than one stack (VPC, GKE, DB, buckets).

Step 2: Remote State per Stack

In the environment root, remote_state often uses the GCS backend with a prefix that includes the relative path of each Terragrunt unit, for example:

remote_state {
  backend = "gcs"
  config = {
    bucket = "my-terraform-state"
    prefix = "dev/${path_relative_to_include()}"
  }
}

That yields separate state files for dev/vpc, dev/gke, dev/db, etc., while sharing one bucket and one naming convention.

Step 3: VPC With GKE Secondary Ranges

GKE VPC-native clusters need a subnet with secondary IP ranges for pods and services. A typical modules/vpc pattern:

  • Custom VPC, single regional subnet
  • Two secondary_ip_range blocks, e.g. named services-range and pods-range (names are referenced by the cluster)

The cluster module then consumes network ID, subnetwork ID, and the range names so ip_allocation_policy can attach pod and service CIDRs correctly.
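
A sketch of that module, with placeholder names and CIDRs (adjust the ranges to your address plan):

variable "region" {
  type = string
}

resource "google_compute_network" "vpc" {
  name                    = "gke-vpc"  # placeholder
  auto_create_subnetworks = false      # custom-mode VPC
}

resource "google_compute_subnetwork" "subnet" {
  name          = "gke-subnet"         # placeholder
  region        = var.region
  network       = google_compute_network.vpc.id
  ip_cidr_range = "10.0.0.0/20"        # placeholder primary range for nodes

  # Secondary ranges consumed by the cluster's ip_allocation_policy
  secondary_ip_range {
    range_name    = "pods-range"
    ip_cidr_range = "10.4.0.0/14"      # placeholder pod CIDR
  }

  secondary_ip_range {
    range_name    = "services-range"
    ip_cidr_range = "10.8.0.0/20"      # placeholder service CIDR
  }
}

# Outputs the gke stack consumes via dependency "vpc"
output "network_id" {
  value = google_compute_network.vpc.id
}

output "subnetwork_id" {
  value = google_compute_subnetwork.subnet.id
}

output "pods_range_name" {
  value = "pods-range"
}

output "services_range_name" {
  value = "services-range"
}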

Step 4: Wire GKE to VPC With Terragrunt

The gke stack’s terragrunt.hcl usually:

  1. Points terraform.source at modules/gke
  2. includes the parent folders (provider + remote state + shared inputs)
  3. Declares a dependency "vpc" on ../vpc
  4. Passes VPC outputs into the module inputs

Conceptually:

terraform {
  source = "../../../modules/gke"
}

include {
  path = find_in_parent_folders()
}

dependency "vpc" {
  config_path = "../vpc"
  mock_outputs = {
    network_id          = "network-00000001"
    subnetwork_id       = "subnet-00000001"
    pods_range_name     = "pods-range"
    services_range_name = "services-range"
  }
}

inputs = {
  network_id          = dependency.vpc.outputs.network_id
  subnetwork_id       = dependency.vpc.outputs.subnetwork_id
  pods_range_name     = dependency.vpc.outputs.pods_range_name
  services_range_name = dependency.vpc.outputs.services_range_name
}

Note: mock_outputs lets validate (and some plan scenarios) succeed before the VPC stack has been applied; remove or tighten the mocks for real applies.

Step 5: Cluster and Node Pool Choices for a Small Footprint

Cluster resource

A compact google_container_cluster setup often:

  • Sets remove_default_node_pool = true and a small initial_node_count only to satisfy the API, then manages nodes in a dedicated pool
  • Uses ip_allocation_policy with cluster_secondary_range_name and services_secondary_range_name matching the VPC module
  • Configures a maintenance window (weekly recurring window is typical)
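
A sketch of that cluster shape (names and the maintenance window are placeholders; the variables mirror the Terragrunt inputs from Step 4):

variable "region" { type = string }
variable "network_id" { type = string }
variable "subnetwork_id" { type = string }
variable "pods_range_name" { type = string }
variable "services_range_name" { type = string }

resource "google_container_cluster" "primary" {
  name     = "small-gke"  # placeholder
  location = var.region

  network    = var.network_id
  subnetwork = var.subnetwork_id

  # Satisfy the API, then manage nodes in a dedicated pool
  remove_default_node_pool = true
  initial_node_count       = 1

  # VPC-native: attach the secondary ranges created by modules/vpc
  ip_allocation_policy {
    cluster_secondary_range_name  = var.pods_range_name
    services_secondary_range_name = var.services_range_name
  }

  # Weekly recurring maintenance window (times are placeholders)
  maintenance_policy {
    recurring_window {
      start_time = "2024-01-06T03:00:00Z"
      end_time   = "2024-01-06T07:00:00Z"
      recurrence = "FREQ=WEEKLY;BYDAY=SA"
    }
  }
}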

This repo style does not always enable private control-plane endpoints or Workload Identity in the same module; treat those as hardening steps and align them with GKE security when you need them.

Node pool

For cost-aware dev or small prod:

  • autoscaling with a low total_max_node_count (e.g. 2) and min_node_count of 0 where policy allows — so the pool can scale down when idle
  • preemptible (or spot) nodes in non-prod to cut cost
  • A dedicated node service account with oauth_scopes including cloud-platform, logging, and monitoring

Common IAM on the node SA includes:

  • roles/container.defaultNodeServiceAccount
  • roles/artifactregistry.reader (pull images from Artifact Registry)
  • roles/compute.networkAdmin (often used for ingress / certificate workflows — review least privilege for your case)

Machine type and disk size come from the environment root inputs (e.g. e2-standard-2, 60 GiB disk) so dev and prod can diverge without forking the module.
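
Continuing the cluster sketch above, a pool along those lines (spot is shown for non-prod; the variable names are placeholders fed from the environment root):

variable "machine_type" { type = string }         # e.g. "e2-standard-2"
variable "disk_size_gb" { type = number }         # e.g. 60
variable "node_service_account" { type = string } # dedicated node SA email

resource "google_container_node_pool" "default" {
  name    = "default-pool"  # placeholder
  cluster = google_container_cluster.primary.id

  autoscaling {
    total_min_node_count = 0  # scale to zero when idle, where policy allows
    total_max_node_count = 2  # small-footprint cap
  }

  node_config {
    machine_type    = var.machine_type
    disk_size_gb    = var.disk_size_gb
    spot            = true  # non-prod cost saver; drop or gate per environment
    service_account = var.node_service_account

    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }
}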

Step 6: Apply Order and Day-Two Operations

From the environment directory:

terragrunt run-all plan
terragrunt run-all apply

Terragrunt respects dependencies between units (e.g. GKE after VPC). For brownfield GCP, some repos also ship import scripts mapping existing resources into state — useful when you adopt Terraform after manual console setup.
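
On Terraform 1.5+, one option for that mapping is a declarative import block; the resource address and ID below are hypothetical:

# Adopt a manually created cluster into the gke stack's state
import {
  to = google_container_cluster.primary
  id = "projects/my-project/locations/us-central1/clusters/small-gke"
}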

After apply:

  • Fetch credentials with gcloud container clusters get-credentials ...
  • Install cluster components (ingress, GitOps, etc.) — that layer is usually outside this Terraform/Terragrunt foundation

How This Relates to the Docs

The site’s GKE cluster setup shows a single-file Terraform example including private nodes and Workload Identity. Use that page when you want the canonical resource shape; use Terragrunt when you need multi-stack, multi-environment, and shared backend ergonomics. The two approaches compose: modules can mirror the same google_container_cluster arguments.

Next: Data Layer

Once the cluster and VPC exist, add Cloud SQL (private IP, service networking) and GCS buckets with IAM — see Connecting GKE to Cloud SQL and Cloud Storage with Terraform.

Summary

Piece                        Role
Bootstrap bucket             Durable GCS backend for all stacks
Env root terragrunt.hcl      Provider generation, remote state, shared inputs
modules/vpc                  Subnet + pod/service secondary ranges
modules/gke                  Cluster + autoscaling node pool + node SA IAM
dependency "vpc" in gke/     Clean wiring without copy-paste

Terragrunt keeps small GKE setups maintainable: the same modules power dev and prod, while state stays isolated per stack and environment.