etcd Basics

etcd is a distributed, consistent key-value store that serves as Kubernetes’ backing store. Understanding etcd is crucial because it’s where all cluster state lives—every pod, service, deployment, and configuration is stored in etcd. If etcd fails or loses data, your cluster loses its “memory” and can’t function properly.

Think of etcd like a filing cabinet for Kubernetes. Every piece of information the cluster needs to remember—what pods are running, what services exist, what configurations are set—is stored in etcd. Just as you can’t run a company without records of what’s happening, Kubernetes can’t function without etcd storing its state.

What is etcd?

etcd is a distributed key-value store designed for reliable storage of data that needs to be accessed by a distributed system or cluster of machines. It’s written in Go and uses the Raft consensus algorithm to ensure consistency across a cluster of etcd nodes.

Key Characteristics

  • Distributed - Runs as a cluster of multiple nodes
  • Consistent - Strong consistency guarantees (CP in CAP theorem)
  • Reliable - Fault-tolerant and durable
  • Fast - Optimized for read and write performance
  • Simple - Key-value interface, easy to understand

Why Kubernetes Uses etcd

Kubernetes chose etcd as its backing store because it provides:

Consistency

Kubernetes needs strong consistency—all components must see the same state. etcd provides linearizable reads and writes, meaning all operations appear to happen in a single, well-defined order. This is essential for Kubernetes’ control plane to make correct decisions.

Reliability

etcd is designed to be fault-tolerant. It can survive node failures as long as a majority of nodes (quorum) remain available. For a 3-node cluster, 1 node can fail. For a 5-node cluster, 2 nodes can fail.

Watch Support

Kubernetes components need to be notified when state changes. etcd provides efficient watch functionality that allows clients to subscribe to changes in the key-value store. This is how controllers and other components react to changes.

Performance

etcd is optimized for the read-heavy workload of Kubernetes. Most operations are reads (checking current state), with writes happening less frequently (when resources are created or updated).

etcd in Kubernetes

In Kubernetes, etcd stores everything:

graph TB
    API[API Server] --> etcd[etcd]
    subgraph "etcd Stores"
        Pods[Pods]
        Services[Services]
        Deployments[Deployments]
        Nodes[Nodes]
        Config[ConfigMaps]
        Secrets[Secrets]
        All[All Resources]
    end
    etcd --> Pods
    etcd --> Services
    etcd --> Deployments
    etcd --> Nodes
    etcd --> Config
    etcd --> Secrets
    etcd --> All
    style API fill:#e1f5ff
    style etcd fill:#fff4e1

What Gets Stored

  • All API objects - Pods, services, deployments, nodes, namespaces, etc.
  • Configuration - ConfigMaps, Secrets, RBAC policies
  • State - Current state of all resources (status fields)
  • Metadata - Labels, annotations, resource versions
  • Cluster configuration - Cluster-level settings

Storage Structure

etcd stores data in a hierarchical key structure:

/registry/pods/default/my-pod
/registry/services/default/my-service
/registry/deployments/production/my-app
/registry/nodes/node-1
/registry/namespaces/default

Each resource is stored as a serialized object (binary protobuf by default in current Kubernetes; custom resources are stored as JSON) containing:

  • Spec - Desired state (what you want)
  • Status - Current state (what actually is)
  • Metadata - Name, labels, annotations, etc.
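
For example, you can read one of these keys directly with etcdctl. The endpoint and certificate paths below assume a kubeadm-style control plane, and the value printed is mostly binary protobuf rather than readable JSON:

# Read a stored pod directly from etcd (kubeadm-style cert paths assumed)
ETCDCTL_API=3 etcdctl get /registry/pods/default/my-pod \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key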

etcd Operations

Reads

When you run kubectl get pods, the API server reads from etcd:

  1. API server receives GET request
  2. API server reads from etcd: /registry/pods/default/*
  3. API server filters and processes results
  4. API server returns response to kubectl

Reads are fast because etcd keeps its key index in memory and serves values from a memory-mapped on-disk database, so current state can be returned without heavy disk work.
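
You can watch this flow from the client side by raising kubectl's verbosity; at -v=6 it prints the HTTP request it sends to the API server (the API server's read of etcd happens behind that call):

# Show the GET request kubectl sends to the API server
kubectl get pods -v=6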

Writes

When you create a pod, the API server writes to etcd:

  1. API server receives POST request
  2. API server validates request
  3. API server writes to etcd: /registry/pods/default/my-pod
  4. etcd confirms write
  5. API server returns success

Writes go through the Raft consensus algorithm to ensure all etcd nodes agree.
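
The same verbosity flag shows the write path. Creating a simple pod at -v=6 prints the POST to the API server, which then persists the object to etcd (the pod name and image here are just placeholders):

# Show the POST request behind a pod creation
kubectl run my-pod --image=nginx -v=6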

Watches

Components watch etcd (through the API server) for changes:

  1. Component opens watch on /registry/pods/default
  2. etcd streams changes as they occur
  3. Component receives notification of changes
  4. Component reacts to changes (e.g., controller reconciles)

Watches are efficient—they only send changes, not full state.
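
You can observe the underlying mechanism by opening a watch on a key prefix directly against etcd (again assuming kubeadm-style certificate paths). The command blocks and prints events as pods in the default namespace change:

# Stream changes to pod keys as they happen
ETCDCTL_API=3 etcdctl watch /registry/pods/default/ --prefix \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key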

etcd Clustering

For production, etcd runs as a cluster for high availability and performance.

Cluster Size

Typical etcd cluster sizes:

  • 1 node - Development/testing only (no fault tolerance)
  • 3 nodes - Can survive 1 node failure (minimum for production)
  • 5 nodes - Can survive 2 node failures (for larger clusters)
  • 7 nodes - Rarely needed (for very large clusters)

Odd numbers are recommended: adding a member to make an even-sized cluster does not increase the number of failures it can tolerate, it only adds another member that must participate in quorum.

Quorum

etcd uses Raft consensus, which requires a majority (quorum) for writes:

  • 3 nodes - Need 2 nodes for quorum (can lose 1)
  • 5 nodes - Need 3 nodes for quorum (can lose 2)
  • 7 nodes - Need 4 nodes for quorum (can lose 3)

If quorum is lost, etcd can't accept writes; only stale (serializable) reads from individual members remain possible until quorum is restored.

etcd Cluster Architecture

graph TB
    subgraph "etcd Cluster"
        etcd1[etcd Node 1<br/>Leader]
        etcd2[etcd Node 2<br/>Follower]
        etcd3[etcd Node 3<br/>Follower]
    end
    API[API Server] --> etcd1
    API --> etcd2
    API --> etcd3
    etcd1 <--> etcd2
    etcd2 <--> etcd3
    etcd3 <--> etcd1
    style etcd1 fill:#e1f5ff
    style etcd2 fill:#fff4e1
    style etcd3 fill:#fff4e1

  • Leader - Handles all write requests and replicates to followers
  • Followers - Receive replication from the leader, can handle read requests
  • Election - If the leader fails, followers elect a new leader
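
You can see which member currently holds the leader role with etcdctl (endpoints and TLS flags are omitted here for brevity; on a secured cluster add the same flags as in the backup example later on this page):

# Show per-member status, including the IS LEADER column
ETCDCTL_API=3 etcdctl endpoint status --cluster -w table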

Raft Consensus

etcd uses the Raft consensus algorithm to ensure all nodes agree on the state.

How Raft Works

  1. Leader election - One node becomes leader
  2. Log replication - Leader replicates writes to followers
  3. Commit - Write committed when majority acknowledge
  4. Consistency - All nodes see same committed state

Why Raft?

Raft provides:

  • Strong consistency - All nodes see same state
  • Fault tolerance - Survives node failures
  • Understandability - Simpler than alternatives like Paxos
  • Performance - Efficient for Kubernetes’ workload

etcd and API Server

The API server is the only component that directly communicates with etcd:

graph LR
    User[Users/Components] --> API[API Server]
    API --> etcd[etcd]
    Scheduler[Scheduler] --> API
    CM[Controller Manager] --> API
    Kubelet[Kubelet] --> API
    style API fill:#e1f5ff
    style etcd fill:#fff4e1

Why this design?

  • Security - etcd not exposed directly
  • Abstraction - API server provides higher-level API
  • Validation - All writes go through API server validation
  • Versioning - API server handles API versioning

Components never talk to etcd directly—they always go through the API server.

etcd Performance

Read Performance

etcd is optimized for reads:

  • In-memory index with memory-mapped storage - Fast access to current state
  • Consistent reads - All reads see committed state
  • Efficient watches - Only sends changes, not full state

Write Performance

Writes are slower than reads because they:

  • Go through Raft consensus
  • Must be replicated to majority
  • Are persisted to disk

For Kubernetes, this is acceptable because writes are less frequent than reads.

Size Considerations

etcd performance degrades with size:

  • Recommended - Keep the etcd database to a few gigabytes; 8GB is the commonly suggested upper limit
  • Quota - etcd enforces a backend quota (2GB by default, raisable with --quota-backend-bytes); exceeding it raises a NOSPACE alarm and blocks writes
  • Compaction - etcd compacts old revisions (requested periodically by the API server) to limit growth
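
Database size, and whether the quota alarm has fired, can be checked with etcdctl (endpoints and TLS flags as in the backup example below; the DB SIZE column reflects the on-disk file, which defragmentation shrinks):

# Check database size per member and any active alarms (e.g. NOSPACE)
ETCDCTL_API=3 etcdctl endpoint status --cluster -w table
ETCDCTL_API=3 etcdctl alarm list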

etcd Backup and Restore

Why Backup?

etcd contains all cluster state. If etcd data is lost:

  • Cluster loses all configuration
  • All resource definitions are gone
  • Cluster must be rebuilt from scratch

Regular backups are essential for disaster recovery.

Backup Process

# Backup etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
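
It's worth verifying each snapshot after taking it, for example with the following command (newer etcd releases move this subcommand to the separate etcdutl binary):

# Verify the snapshot (hash, revision, total keys, size)
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db -w table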

Restore Process

# Restore etcd from backup
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restore

Restore should be done carefully and typically requires cluster downtime.

Backup Best Practices

  • Regular backups - Daily or more frequent for production
  • Test restores - Regularly test that backups can be restored
  • Off-site storage - Store backups outside the cluster
  • Automation - Automate backup process
  • Retention - Keep multiple backup versions
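
A minimal sketch of the last two points (automation and retention), assuming the same kubeadm-style certificate paths and an existing /backup/etcd directory, might look like this; run it from cron or a Kubernetes CronJob, and treat the paths and retention count as placeholders:

#!/usr/bin/env bash
# Sketch: timestamped etcd snapshots with simple retention (assumed paths)
set -euo pipefail

BACKUP_DIR=/backup/etcd   # assumed backup location
KEEP=7                    # number of snapshots to retain

ETCDCTL_API=3 etcdctl snapshot save "${BACKUP_DIR}/etcd-$(date +%Y%m%d-%H%M%S).db" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Keep only the newest $KEEP snapshots
ls -1t "${BACKUP_DIR}"/etcd-*.db | tail -n +$((KEEP + 1)) | xargs -r rm --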

etcd Maintenance

Compaction

etcd keeps a history of all changes. Over time, this history grows. Compaction removes old history:

# Compact etcd history up to revision 1000 (older revisions are discarded)
ETCDCTL_API=3 etcdctl compact 1000

The Kubernetes API server normally requests compaction automatically (controlled by its --etcd-compaction-interval flag, 5 minutes by default).
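
If you do need to compact manually, you would normally compact up to the cluster's current revision rather than a hard-coded number. A sketch (TLS flags omitted; add the same flags as in the backup example on a secured cluster):

# Look up the current revision, then compact everything older than it
rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out="json" \
  | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]+')
ETCDCTL_API=3 etcdctl compact "$rev"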

Defragmentation

As etcd writes and deletes data, the database can become fragmented. Defragmentation reorganizes data:

# Defragment the local etcd member (add --endpoints and TLS flags for remote or secured members)
ETCDCTL_API=3 etcdctl defrag

Defragmentation should be done during maintenance windows, one member at a time, because it blocks the member it runs against while the database file is rebuilt.

Health Checks

Monitor etcd health:

  • Node health - Check if nodes are responding
  • Leader health - Ensure leader is functioning
  • Database size - Monitor database growth
  • Performance - Monitor read/write latency
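
Several of these checks map directly onto etcdctl subcommands (endpoints and TLS flags as in the backup example):

# Check that each member responds, and list cluster membership
ETCDCTL_API=3 etcdctl endpoint health --cluster
ETCDCTL_API=3 etcdctl member list -w table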

etcd Security

Authentication

etcd supports authentication:

  • Client certificates - Mutual TLS authentication
  • Username/password - Basic authentication (less secure)

Kubernetes typically uses client certificates.

Encryption

etcd data can be encrypted at rest:

  • Encryption at rest - Encrypt data on disk
  • TLS in transit - Encrypt communication between nodes
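
In Kubernetes, encryption at rest is configured on the API server rather than in etcd itself, via an EncryptionConfiguration file passed with --encryption-provider-config. A minimal sketch follows; the file path and key name are arbitrary, and the secret must be a base64-encoded 32-byte key you generate yourself:

# Sketch: encrypt Secrets at rest before they reach etcd
# Generate a key with: head -c 32 /dev/urandom | base64
cat <<'EOF' > /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: ["secrets"]
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}
EOF
# Then start kube-apiserver with:
#   --encryption-provider-config=/etc/kubernetes/encryption-config.yaml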

Access Control

etcd supports role-based access control (RBAC) to limit what clients can do.

Key Takeaways

  • etcd is Kubernetes’ backing store—all cluster state is stored there
  • etcd provides strong consistency, which is essential for Kubernetes
  • etcd runs as a cluster for high availability (typically 3 or 5 nodes)
  • The API server is the only component that directly talks to etcd
  • etcd uses Raft consensus to ensure all nodes agree on state
  • Regular backups are essential—etcd data loss means cluster data loss
  • etcd is optimized for reads, which matches Kubernetes’ workload

See Also