Jobs

Jobs run one or more pods until a specified number of them terminate successfully. Unlike Deployments and StatefulSets, which run continuously, Jobs are designed for tasks that run to completion: one-time work such as data processing, backups, database migrations, or batch operations.

What Are Jobs?

A Job creates one or more pods and ensures that a specified number of them successfully complete. When pods complete successfully, the Job is considered complete. If a pod fails, the Job can automatically retry by creating new pods.

graph TB
  A[Job Created] --> B[Creates Pod]
  B --> C{Pod Completes?}
  C -->|Success| D{All Pods Succeeded?}
  C -->|Failure| E{Retries Left?}
  E -->|Yes| F[Create New Pod]
  E -->|No| G[Job Failed]
  F --> C
  D -->|Yes| H[Job Succeeded]
  D -->|No| I[Create Next Pod]
  I --> C
  style A fill:#e1f5ff
  style H fill:#e8f5e9
  style G fill:#ffe1e1
  style F fill:#fff4e1

Why Use Jobs?

Jobs are perfect for:

One-time tasks - Run a task once and stop
Batch processing - Process a batch of data
Database migrations - Run migration scripts
Backups - One-time backup operations
Data transformations - ETL jobs and data processing
Parallel processing - Run multiple pods in parallel
Retry logic - Automatic retries on failure

Job vs Deployment

Jobs and Deployments serve different purposes:

graph TB
  subgraph job[Job]
    A[Job] --> B[Run to Completion]
    B --> C[Task Finished]
    C --> D[Job Completed]
    E[Pod Fails] --> F[Retry]
    F --> B
  end
  style A fill:#e1f5ff
  style D fill:#e8f5e9

graph TB
  subgraph deployment[Deployment]
    G[Deployment] --> H[Run Continuously]
    H --> I[Maintain Replica Count]
    I --> J[Pod Fails]
    J --> K[Replace Pod]
    K --> H
  end
  style G fill:#fff4e1
  style H fill:#f3e5f5

Use Jobs when:

  • Task runs to completion
  • One-time or batch work
  • Need to ensure task succeeds

Use Deployments when:

  • Application runs continuously
  • Need to maintain replica count
  • Stateless service

Job Completion

A Job finishes in one of three ways:

  1. Success - The specified number of pods complete successfully
  2. Failure - The backoffLimit is reached or activeDeadlineSeconds is exceeded
  3. Manual deletion - The Job is deleted before finishing (its pods are deleted too, unless --cascade=orphan is used)

graph TD
  A[Job Created] --> B[Pods Running]
  B --> C{Completion Status}
  C -->|All Succeeded| D[Job Succeeded]
  C -->|Max Retries Exceeded| E[Job Failed]
  C -->|Manual Delete| F[Job Deleted]
  D --> G[Completed Pods Remain]
  E --> G
  F --> H[Pods Terminated]
  style A fill:#e1f5ff
  style D fill:#e8f5e9
  style E fill:#ffe1e1
  style G fill:#fff4e1

Basic Job Example

Here’s a simple Job that runs a task to completion (computing π to 2000 decimal places):

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4

Key fields:

  • spec.template - Pod template (required)
  • restartPolicy - Must be Never or OnFailure (not Always)
  • backoffLimit - Maximum number of retries (default: 6)
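
Assuming the manifest above is saved as pi-job.yaml (the filename is just an assumption), running it and reading its output looks like this:

# Create the Job, block until it completes, then read the result
kubectl apply -f pi-job.yaml
kubectl wait --for=condition=complete job/pi --timeout=120s
kubectl logs job/pi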

Job Types

Non-Parallel Jobs

Runs a single pod until successful completion:

apiVersion: batch/v1
kind: Job
metadata:
  name: single-task
spec:
  template:
    spec:
      containers:
      - name: task
        image: busybox
        command: ["sh", "-c", "echo 'Task completed' && sleep 5"]
      restartPolicy: Never

Parallel Jobs with Fixed Completion Count

Runs multiple pods in parallel until a specific number succeed:

apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-fixed
spec:
  completions: 5  # Need 5 successful completions
  parallelism: 2  # Run 2 pods at a time
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo Processing item && sleep 10"]
      restartPolicy: Never
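
To see completions and parallelism in action, watch the Job's COMPLETIONS column climb toward 5 (for example 3/5) while at most 2 pods run at a time:

# Watch progress; COMPLETIONS shows succeeded/required completions
kubectl get job parallel-fixed --watch

# See which pods are currently running
kubectl get pods -l job-name=parallel-fixed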

Parallel Jobs with Work Queue

Multiple pods process items from a shared work queue until it is empty. With completions left unset, the Job is complete once any pod exits successfully and all pods have terminated:

apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-queue
spec:
  parallelism: 3  # Run 3 pods in parallel
  completions: null  # No fixed completion count
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "process-queue-items"]
      restartPolicy: Never
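
The process-queue-items command above is a placeholder. As a minimal sketch, assuming the queue is a Redis list named jobs reachable at host redis (and an image that includes redis-cli, which plain busybox does not), a worker loop might look like this:

#!/bin/sh
# Hypothetical worker: pop items until the queue is empty, then exit successfully
while true; do
  ITEM=$(redis-cli -h redis LPOP jobs)
  if [ -z "$ITEM" ]; then
    echo "Queue empty, exiting"
    exit 0    # exit code 0 marks this pod as succeeded
  fi
  echo "Processing $ITEM"
  # ... real work goes here ...
done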

Job Lifecycle

graph TD
  A[Job Created] --> B[Active Phase]
  B --> C[Pods Created]
  C --> D{Pods Running}
  D --> E{Pod Completes}
  E -->|Success| F{Completions Met?}
  E -->|Failure| G{Retries Left?}
  G -->|Yes| H[Backoff Wait]
  H --> I[Create New Pod]
  I --> D
  G -->|No| J[Job Failed]
  F -->|Yes| K[Job Succeeded]
  F -->|No| L[Create Next Pod]
  L --> D
  style A fill:#e1f5ff
  style K fill:#e8f5e9
  style J fill:#ffe1e1
  style H fill:#fff4e1

Job Completion Modes

NonIndexed (Default)

Each pod is independent. Job completes when the required number of pods succeed:

spec:
  completions: 5
  parallelism: 2
  completionMode: NonIndexed  # Default

Indexed

Each pod gets a unique index (0 to completions-1). Useful for partitioning work:

spec:
  completions: 5
  parallelism: 2
  completionMode: Indexed
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "process-item-$JOB_COMPLETION_INDEX"]

The JOB_COMPLETION_INDEX environment variable contains the pod’s index; the same value is also exposed through the pod annotation batch.kubernetes.io/job-completion-index.
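
As a purely illustrative sketch of partitioning, each indexed pod could pick its own input shard from the index; the /data path and file naming are assumptions:

#!/bin/sh
# Hypothetical sharding: the pod with index N processes /data/input-N.csv
SHARD="/data/input-${JOB_COMPLETION_INDEX}.csv"
echo "Pod ${JOB_COMPLETION_INDEX} processing ${SHARD}"
wc -l "$SHARD"    # stand-in for the real processing step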

Retry and Backoff

Jobs automatically retry failed pods with exponential backoff:

spec:
  backoffLimit: 4  # Retry up to 4 times
  activeDeadlineSeconds: 300  # Fail the job if it runs longer than 5 minutes

Backoff behavior:

  • First retry: 10 seconds
  • Second retry: 20 seconds
  • Third retry: 40 seconds
  • And so on, doubling each time (exponential backoff, capped at six minutes)
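
To see how many attempts have failed so far (the count that is compared against backoffLimit), inspect the Job's status directly:

# Number of failed pod attempts recorded in the Job status
kubectl get job myjob -o jsonpath='{.status.failed}'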

Job Completion and Cleanup

By default, completed Jobs and their pods remain in the cluster. You can configure automatic cleanup:

apiVersion: batch/v1
kind: Job
metadata:
  name: cleanup-example
spec:
  ttlSecondsAfterFinished: 100  # Delete job 100 seconds after completion
  template:
    spec:
      containers:
      - name: task
        image: busybox
        command: ["echo", "done"]
      restartPolicy: Never

Or use a CronJob’s successfulJobsHistoryLimit and failedJobsHistoryLimit for automatic cleanup.

Common Use Cases

1. Database Migration

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  template:
    spec:
      containers:
      - name: migration
        image: postgres:15
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: url
        command:
        - /bin/sh
        - -c
        - |
          psql $DATABASE_URL -f /migrations/001_schema.sql
          psql $DATABASE_URL -f /migrations/002_data.sql
        volumeMounts:
        - name: migrations
          mountPath: /migrations
      volumes:
      - name: migrations
        configMap:
          name: migration-scripts
      restartPolicy: Never
  backoffLimit: 3
  activeDeadlineSeconds: 600
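
The Job above mounts a migration-scripts ConfigMap that is not shown. A minimal sketch of what it might contain (the SQL is purely illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: migration-scripts
data:
  001_schema.sql: |
    CREATE TABLE IF NOT EXISTS users (
      id SERIAL PRIMARY KEY,
      name TEXT NOT NULL
    );
  002_data.sql: |
    INSERT INTO users (name) VALUES ('example');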

2. Data Processing

apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing
spec:
  completions: 10
  parallelism: 3
  template:
    spec:
      containers:
      - name: processor
        image: data-processor:latest
        command: ["process-batch"]
        env:
        - name: BATCH_SIZE
          value: "1000"
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
      restartPolicy: Never

3. Backup Job

apiVersion: batch/v1
kind: Job
metadata:
  name: backup
spec:
  template:
    spec:
      containers:
      - name: backup
        image: backup-tool:latest
        command:
        - /bin/sh
        - -c
        - |
          backup-database
          upload-to-s3 s3://backups/$(date +%Y%m%d).sql
        volumeMounts:
        - name: backup-dir
          mountPath: /backups
      volumes:
      - name: backup-dir
        emptyDir: {}
      restartPolicy: OnFailure
  backoffLimit: 2
  activeDeadlineSeconds: 3600

Best Practices

  1. Set an appropriate restartPolicy - Use Never or OnFailure, not Always

  2. Set backoffLimit - Control how many times failed pods are retried

backoffLimit: 4  # Retry up to 4 times

  3. Use activeDeadlineSeconds - Prevent jobs from running indefinitely

activeDeadlineSeconds: 3600  # 1 hour timeout

  4. Set resource requests and limits - Jobs should have resource constraints

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

  5. Use ttlSecondsAfterFinished - Automatically clean up finished jobs

ttlSecondsAfterFinished: 300  # Delete 5 minutes after the job finishes

  6. Handle failures gracefully - Make sure your application exits with proper exit codes

    • Exit code 0: Success
    • Non-zero: Failure (triggers a retry)

  7. Use parallel jobs wisely - Balance parallelism with resource availability

  8. Monitor job status - Check job completion and pod logs

  9. Use ConfigMaps/Secrets - Store configuration and credentials securely

  10. Consider using CronJobs - For recurring tasks (see the sketch after this list)
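
Since recurring work comes up so often, here is a minimal CronJob sketch. The name nightly-backup, the schedule, and the backup-tool:latest image are illustrative assumptions (the image matches the earlier backup example), not a prescribed setup:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup          # illustrative name
spec:
  schedule: "0 2 * * *"         # every day at 02:00
  successfulJobsHistoryLimit: 3 # keep the last 3 successful Jobs
  failedJobsHistoryLimit: 1     # keep the last failed Job
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          containers:
          - name: backup
            image: backup-tool:latest   # assumed image
            command: ["backup-database"]
          restartPolicy: OnFailure

Each time the schedule fires, the CronJob creates a Job from jobTemplate, and the history limits clean up old Jobs automatically.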

Common Operations

Create a Job

# Create from YAML
kubectl create -f job.yaml

# Create from command
kubectl create job myjob --image=busybox -- echo "Hello"

View Job Status

# List jobs
kubectl get jobs

# Detailed information
kubectl describe job myjob

# View job pods
kubectl get pods -l job-name=myjob

# View pod logs
kubectl logs -l job-name=myjob

Delete a Job

# Delete job (pods are also deleted)
kubectl delete job myjob

# Delete without cascading (orphans pods)
kubectl delete job myjob --cascade=orphan

Check Job Completion

# Check if job succeeded
kubectl get job myjob -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}'

# View job completion time
kubectl get job myjob -o jsonpath='{.status.completionTime}'
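
If you prefer to block until the Job finishes rather than polling, kubectl wait can do that:

# Wait up to 5 minutes for the Job to report the Complete condition
kubectl wait --for=condition=complete job/myjob --timeout=300s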

Troubleshooting

Job Not Starting

# Check job events
kubectl describe job myjob

# Check for resource constraints
kubectl get events --sort-by=.metadata.creationTimestamp

# Verify pod template
kubectl get job myjob -o yaml

Pods Failing

# Check pod logs
kubectl logs -l job-name=myjob

# Check pod events
kubectl describe pod -l job-name=myjob

# Check exit codes
kubectl get pods -l job-name=myjob -o jsonpath='{.items[*].status.containerStatuses[*].state.terminated.exitCode}'

Job Hanging

# Check active deadline
kubectl get job myjob -o jsonpath='{.spec.activeDeadlineSeconds}'

# Check job status
kubectl describe job myjob

# Check if pods are stuck
kubectl get pods -l job-name=myjob

See Also