KubeCon + CloudNativeCon 2018: From “Adoption” to “Operations at Scale”

K8s Guru
9 min read

1) Why this KubeCon matters right now

If 2016 was about Kubernetes becoming a platform and 2017 was about converging on an operating model, 2018 is when that operating model meets production pressure. The ecosystem is no longer debating whether Kubernetes “works.” The harder question now dominates: how do you operate Kubernetes (and the layers around it) safely at scale—across many teams and clusters—without creating a bespoke distribution of one?

The combined view across the 2018 events is revealing because the emphasis shifts within the year. In spring, the energy is still about making the platform usable: shaping workflows, standardizing a minimal platform, and exploring new layers like service-to-service policy. By late in the year, the tone is more operational: upgrades, drift control, ownership boundaries, and what happens when “platform” becomes a dependency for dozens of services. Progress is being measured less by features and more by predictability—interfaces you can depend on, upgrades you can schedule, and failures you can diagnose under pressure.

i Context (2018)
In 2018, many organizations are past their first cluster and into their second or third. That changes incentives. The painful work is day‑2: upgrades, incident response, policy, and keeping platform components coherent. Kubernetes may be stable enough, but pipelines, networking layers, and “platform glue” are where complexity often accumulates.

2) The trends that defined 2018

Trend 1: Kubernetes is becoming a fleet problem, not a cluster problem

The most consistent signal across 2018 is that “Kubernetes adoption” is no longer the finish line. The real technical conversation is about running many clusters with consistent behavior: upgrading on cadence, controlling drift, enforcing policy uniformly, and providing stable interfaces for application teams.

Why it matters:

  • Drift is an incident multiplier. When clusters diverge, every outage becomes archaeology.
  • Lifecycle becomes the security posture. If upgrades are rare, patching and rotation become unrealistic.

Compared to 2016–2017 (bootstrapping and first production clusters), 2018 assumes clusters already exist; the hard part is keeping them operable with predictable upgrades and clear support boundaries.
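
As a concrete illustration, a minimal drift check can be a few dozen lines against the API. The sketch below uses client-go and assumes a recent client library plus a kubeconfig with one context per cluster; the context names are placeholders. It only tallies kubelet versions, but the same pattern extends to add-on versions, feature gates, or admission configuration.

    package main

    import (
        "context"
        "fmt"
        "log"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Hypothetical kubeconfig contexts, one per cluster in the fleet.
        contexts := []string{"prod-us-east", "prod-eu-west", "staging"}

        for _, name := range contexts {
            // Build a client for this context from the default kubeconfig.
            cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
                clientcmd.NewDefaultClientConfigLoadingRules(),
                &clientcmd.ConfigOverrides{CurrentContext: name},
            ).ClientConfig()
            if err != nil {
                log.Printf("%s: %v", name, err)
                continue
            }
            clientset, err := kubernetes.NewForConfig(cfg)
            if err != nil {
                log.Printf("%s: %v", name, err)
                continue
            }

            // Tally kubelet versions; more than one version persisting past an
            // upgrade window is a drift signal worth alerting on.
            nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
            if err != nil {
                log.Printf("%s: %v", name, err)
                continue
            }
            versions := map[string]int{}
            for _, n := range nodes.Items {
                versions[n.Status.NodeInfo.KubeletVersion]++
            }
            fmt.Printf("%s: %v\n", name, versions)
        }
    }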

Trend 2: Interface boundaries (runtime, storage, and networking) are becoming operational contracts

In earlier years, “pluggable” infrastructure was a design aspiration. In 2018, modularity is increasingly treated as an operational necessity: the runtime layer, storage integration, and networking need clearer contracts so upgrades and security improvements don’t require coordinated rewrites.

Why it matters:

  • Decoupled change reduces blast radius. A runtime or storage change should not imply a full platform rebuild.
  • Operational debugging improves when responsibilities are separated (runtime vs kubelet vs CNI vs CSI) and each layer is measurable.
  • Portability becomes more honest. Not “everything runs everywhere with no changes,” but “integration points are known and stable.”

The practical implication is that understanding these interfaces stops being “platform trivia.” It becomes what you need in order to debug real incidents: disk pressure, networking pathologies, runtime behavior, and control-plane backpressure.
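
To make that concrete, the sketch below (client-go again, assuming a reachable kubeconfig and a reasonably current client library) shows where those layer boundaries surface on a Node object: the runtime behind the CRI reports its own version, and kubelet-level stress such as disk or memory pressure shows up as node conditions rather than as application errors.

    package main

    import (
        "context"
        "fmt"
        "log"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Sketch only: load the default kubeconfig from the home directory.
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            log.Fatal(err)
        }
        clientset, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }

        nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
        if err != nil {
            log.Fatal(err)
        }
        for _, n := range nodes.Items {
            // The runtime behind the CRI boundary identifies itself here,
            // e.g. "containerd://1.x" or "docker://18.x".
            info := n.Status.NodeInfo
            fmt.Printf("%s: kubelet=%s runtime=%s\n", n.Name, info.KubeletVersion, info.ContainerRuntimeVersion)

            // The kubelet reports per-layer stress as node conditions:
            // DiskPressure, MemoryPressure, PIDPressure, and (Not)Ready.
            for _, c := range n.Status.Conditions {
                if c.Type != corev1.NodeReady && c.Status == corev1.ConditionTrue {
                    fmt.Printf("  %s is True: %s\n", c.Type, c.Message)
                }
            }
        }
    }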

Trend 3: Policy, identity, and the supply chain are moving into the reliability domain

By 2018, multi-team usage is the norm, and the Kubernetes control plane is treated as a policy surface. The notable shift is that policy and identity are discussed less as “security add-ons” and more as correctness tools: guardrails for shared environments, enforced conventions, and automated checks that scale better than human review.

In parallel, delivery pipelines are being pulled into the same conversation: most outages are change-induced, and both security and reliability depend on a controlled software supply chain.

Why it matters:

  • Multi-tenancy fails by accident. Many severe issues are misconfigurations with large blast radius.
  • Auditability becomes operational. During incidents, teams need to know what changed and why, not just what is running.
  • Identity becomes a dependency. Workload identity and service-to-service auth tie directly into platform design.
  • Controlled promotion reduces outages. Repeatable builds, gated rollouts, and clear provenance reduce “unknown unknowns” during deployments.

This doesn’t mean policy is solved. The hard part is operational: ownership, exceptions, and evolution without blocking delivery.
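
One concrete shape this takes is admission control as code. The sketch below is a minimal validating webhook handler in Go, using the admission/v1 and core/v1 types from k8s.io/api; the endpoint path, port, certificate paths, and the “every container must declare resource limits” rule are illustrative assumptions rather than a recommended policy.

    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "log"
        "net/http"

        admissionv1 "k8s.io/api/admission/v1"
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // validatePods rejects Pods whose containers declare no resource limits.
    func validatePods(w http.ResponseWriter, r *http.Request) {
        body, err := io.ReadAll(r.Body)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }

        var review admissionv1.AdmissionReview
        if err := json.Unmarshal(body, &review); err != nil || review.Request == nil {
            http.Error(w, "expected an AdmissionReview", http.StatusBadRequest)
            return
        }

        var pod corev1.Pod
        if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }

        allowed, msg := true, ""
        for _, c := range pod.Spec.Containers {
            if len(c.Resources.Limits) == 0 {
                allowed = false
                msg = fmt.Sprintf("container %q declares no resource limits", c.Name)
                break
            }
        }

        // Echo the request UID back in the response, as the API server requires.
        review.Response = &admissionv1.AdmissionResponse{
            UID:     review.Request.UID,
            Allowed: allowed,
            Result:  &metav1.Status{Message: msg},
        }
        out, _ := json.Marshal(review)
        w.Header().Set("Content-Type", "application/json")
        w.Write(out)
    }

    func main() {
        http.HandleFunc("/validate", validatePods)
        // Admission webhooks are called over TLS; the cert paths are placeholders.
        log.Fatal(http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil))
    }

The handler is the easy part; the operational questions above (who owns the policy, how exceptions are granted, how the webhook itself is upgraded without blocking deploys) are where the real work is.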

Trend 4: Service networking is consolidating—toward meshes, gateways, and more explicit L7 ownership

By 2018, the community’s language around service networking changes. It’s less about “ingress controllers” as a single component and more about a layered set of concerns: edge routing, internal traffic policy, mTLS and identity, retries/timeouts, and telemetry consistency.

Why it matters:

  • Reliability behavior becomes platform-defined. Timeouts, retries, and circuit breaking are not just app decisions; they shape system behavior under load and failure (see the retry sketch after this list).
  • Security moves into the data plane. Encrypting service-to-service traffic and establishing identity at the workload level becomes a platform capability, not a per-team implementation detail.
  • Uniform telemetry becomes feasible when traffic policy is centralized.
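
Whether this behavior lives in a sidecar, a gateway, or (for now) in application code, the policy being centralized is roughly the same: a per-attempt timeout, a small bounded retry budget with backoff and jitter, and a total deadline so retries cannot amplify load during an outage. A standard-library sketch in Go (the URL, attempt count, and timings are arbitrary):

    package main

    import (
        "context"
        "fmt"
        "math/rand"
        "net/http"
        "time"
    )

    // getWithRetries performs a GET with a per-attempt timeout, a bounded retry
    // budget, and backoff with jitter, all capped by an overall deadline.
    func getWithRetries(ctx context.Context, url string, attempts int) (*http.Response, error) {
        client := &http.Client{Timeout: 2 * time.Second} // per-attempt timeout

        var lastErr error
        for i := 0; i < attempts; i++ {
            req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
            if err != nil {
                return nil, err
            }
            resp, err := client.Do(req)
            if err == nil && resp.StatusCode < 500 {
                return resp, nil // success (or a client error not worth retrying)
            }
            if err == nil {
                resp.Body.Close()
                lastErr = fmt.Errorf("server error: %s", resp.Status)
            } else {
                lastErr = err
            }

            // Exponential backoff with jitter, cut short if the deadline expires.
            backoff := time.Duration(1<<i)*100*time.Millisecond +
                time.Duration(rand.Intn(100))*time.Millisecond
            select {
            case <-ctx.Done():
                return nil, ctx.Err()
            case <-time.After(backoff):
            }
        }
        return nil, lastErr
    }

    func main() {
        // Total budget for all attempts combined.
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        resp, err := getWithRetries(ctx, "https://example.com/healthz", 3)
        if err != nil {
            fmt.Println("request failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }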

How it differs from previous years:

  • In 2017, service mesh ideas were emerging; in 2018, they are debated as something you might actually run—meaning the conversation becomes operational: failure modes, upgrade strategies, performance overhead, and ownership boundaries.

! The “second control plane” tax
Several 2018 solutions to traffic, security, and observability introduce another control plane into production. The value can be real, but so is the cost: upgrades, configuration drift, and unclear ownership. Without a rollback path and clear ownership, you’re likely adding fragility.

A useful framing for 2018 decisions
When evaluating a new platform component, don’t ask “is it popular?” Ask: does it reduce uncertainty? Concretely, does it make upgrades safer, failures easier to diagnose, or ownership boundaries clearer?

3) Signals from CNCF and major ecosystem players (what it actually means)

The most important CNCF signal in 2018 is not any single project announcement. It is the implicit governance model: cloud native is being defined as interoperable primitives with an expectation of operational maturity. The portfolio is growing, but the bar for credibility is shifting toward operability: upgrades, defaults, security posture, and clear integration points.

What that means in practice is straightforward:

  • Standardization and conformance matter. “Boring compatibility” lowers upgrade and integration risk.
  • Operational maturity is a selection criterion. Teams evaluate projects by upgrade story, observability, performance, and security posture.
  • Decision frameworks beat tool encyclopedias. Platform leaders need supported paths, owners, and lifecycle policies.

From major cloud providers and vendors, the meaningful signal is continued alignment on upstream Kubernetes semantics as the stable base, with differentiation moving upward into fleet operations, workflow opinions, and day‑2 tooling.

4) What this means

For engineers

Skills worth learning already in 2018:

  • Distributed systems debugging under churn: retries amplifying load, bounded timeouts, and how partial failures show up in signals.
  • Kubernetes failure modes: DNS/network pathologies, storage semantics, and control-plane/node pressure patterns.
  • Policy/identity basics: RBAC modeling, admission control concepts, and workload identity patterns.
  • Observability discipline: label hygiene and cardinality management (a small example follows this list).
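
On the last point, “label hygiene” has a very concrete meaning: every distinct combination of label values is a separate time series. A small sketch using the Prometheus Go client (prometheus/client_golang; the metric and route names are made up):

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // Good: labels drawn from small, closed sets (HTTP method, status class,
    // route template), so cardinality stays bounded as traffic grows.
    var httpRequests = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "HTTP requests by method, status class, and route template.",
        },
        []string{"method", "status_class", "route"},
    )

    // Bad (avoid): user IDs or raw URLs as label values create one time series
    // per user/URL and will eventually overwhelm the monitoring backend.
    // []string{"user_id", "raw_path"}

    func main() {
        prometheus.MustRegister(httpRequests)

        http.Handle("/metrics", promhttp.Handler())
        http.HandleFunc("/orders/", func(w http.ResponseWriter, r *http.Request) {
            // Record against the route template, not the full request path.
            httpRequests.WithLabelValues(r.Method, "2xx", "/orders/{id}").Inc()
            w.WriteHeader(http.StatusOK)
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }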

Skills starting to lose competitive advantage:

  • “kubectl fluency” without operational reasoning. Knowing commands is table stakes; understanding failure shape is the differentiator.
  • One-off cluster builds without lifecycle. If your automation can’t support upgrades and rollback, it’s not a platform skill—it’s a prototype skill.
  • YAML as a craft without engineering practices. The value is in reviewable change, safe rollout, and ownership boundaries.

For platform teams

Roles that start to emerge more clearly in 2018:

  • Fleet/platform SRE: upgrade cadence, capacity planning, incident response, and measurable reliability for the cluster and its shared services.
  • Policy and identity engineer: RBAC models, admission policies, integration with identity providers, and audit workflows.
  • Developer-experience owner: defining “paved roads” and supported paths so app teams can ship without negotiating infrastructure details.

Operationally, platform teams need explicit product boundaries (supported paths vs “best effort”) and must treat deprecations and upgrades as first-class work.

For companies running Kubernetes in production

Three practical takeaways from the 2018 signals:

  • Treat upgrades as routine work. Budget headcount and time for them. If upgrades are rare, they become risky, and “security posture” becomes mostly aspirational.
  • Standardize a minimal platform set. Most organizations benefit from standardizing networking, ingress/gateway strategy, telemetry, and access control first—then adding higher-level layers only to reduce measurable toil.
  • Measure platform success by outcomes. Upgrade cadence, incident rate, MTTR, and the ability to support many teams safely are better indicators than the number of clusters or tools deployed.

5) What is concerning or raises questions

Two concerns show up repeatedly in 2018 conversations.

First, there are still too few detailed production failure stories. The ecosystem improves fastest through postmortems (control-plane overload, etcd cliffs, network failures, upgrade regressions, and the organizational mistakes that amplify incidents).

Second, the ecosystem risks equating more components with more maturity. Integration tax and ambiguous ownership show up as cross-boundary incidents—app, platform, mesh/gateway, cloud infrastructure—where no one owns end-to-end diagnosis.

From the 2018 signals, a measured forecast for 2019–2020 looks like this:

  • Fleet operations will become the mainstream problem. Expect better tooling and stronger opinions around multi-cluster consistency, upgrades, and centralized policy—because that’s where enterprise pain concentrates.
  • Service networking will consolidate into clearer patterns. Not every team will run a full service mesh, but more organizations will standardize mTLS, identity, and L7 ownership—either through mesh-like approaches or gateway-centric designs. Selective adoption will outperform blanket adoption.
  • Policy-as-code will move from “security initiative” to “platform default.” As cluster usage expands, organizations will encode operational correctness (resource constraints, safe workload patterns, exception handling) into automated checks and admission policies.
  • Supply chain controls will tighten. Provenance, repeatable builds, controlled promotion, and clearer separation between build and deploy responsibilities will increasingly be treated as reliability work, not just security work.

If 2017 was about converging on an operating model, 2018 is about testing that model under real production constraints. The winners over the next two years will not be the teams that adopt the most projects, but the teams that reduce uncertainty: predictable upgrades, clear ownership, constrained and reviewable change, and operational signals that make failures legible.