Kubernetes won the container orchestration war years ago. Today, the question is no longer whether to use Kubernetes — it is whether you are running it well enough. Most production clusters we audit have the same problems: overly permissive security defaults, autoscaling that does not actually scale, and cloud bills that grow faster than the traffic they serve.
The gap between a Kubernetes cluster that works in staging and one that is truly production-ready is enormous. It spans security hardening, intelligent scaling, cost discipline, deployment automation, and observability — each with its own set of decisions that compound over time.
This guide covers the practices that matter most for production Kubernetes today. Not the basics of writing a Deployment manifest, but the operational decisions that determine whether your cluster is secure, cost-efficient, and resilient under real-world conditions. Whether you are tightening an existing production setup or planning your first serious Kubernetes deployment, these are the practices your team needs to get right.
Security: Locking Down Your Cluster
Kubernetes is not secure by default. Out of the box, pods run with more privileges than they need, network traffic flows unrestricted between namespaces, and RBAC policies are often too broad. In production, these defaults are liabilities. The cost of neglecting security debt in Kubernetes compounds quickly — a single misconfigured pod can become the entry point for a cluster-wide compromise.
Pod Security Standards
Pod Security Admission (PSA) is the successor to PodSecurityPolicy, which was deprecated in Kubernetes 1.21 and removed in 1.25. Every production namespace should enforce one of the three built-in levels: Privileged, Baseline, or Restricted. For most workloads, Restricted is the correct choice.
Restricted mode enforces critical constraints: containers must run as non-root, cannot escalate privileges, must drop all Linux capabilities (only NET_BIND_SERVICE may be added back), and must run under a RuntimeDefault or Localhost seccomp profile. Note that a read-only root filesystem is not part of the Restricted profile, but production workloads should enforce it anyway. These are not aspirational guidelines — they are the minimum viable security posture for any container running in production.
- Set `runAsNonRoot: true` on every container. No exceptions unless the workload genuinely requires root access, and even then, use a separate namespace with documented justification.
- Drop all capabilities with `drop: ["ALL"]` and add back only what is needed. Most application containers need zero Linux capabilities.
- Enable read-only root filesystems with `readOnlyRootFilesystem: true`. Use `emptyDir` volumes for any directories the application needs to write to.
- Set `allowPrivilegeEscalation: false` to prevent processes from gaining more privileges than their parent.
- Enforce seccomp profiles using the `RuntimeDefault` profile at minimum, which blocks dangerous system calls without breaking most applications.
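Taken together, the bullets above amount to a short manifest. Here is a sketch of a namespace enforcing the Restricted profile and a pod that satisfies it; the namespace, pod name, and image path are placeholders:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod-apps                                 # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
---
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app                              # hypothetical name
  namespace: prod-apps
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/app:1.0         # placeholder image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: tmp
          mountPath: /tmp                         # writable scratch space
  volumes:
    - name: tmp
      emptyDir: {}                                # backs the writable mount
```

Any pod that violates the Restricted constraints is rejected at admission time in this namespace, rather than failing audits after it is already running.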
RBAC: Least privilege, enforced
Role-Based Access Control is only as strong as the policies you write. The most common RBAC mistake in production clusters is granting cluster-admin to service accounts and human users who do not need it. A developer who needs to deploy to one namespace does not need read access to secrets in every namespace.
- Use Roles over ClusterRoles whenever possible. Namespace-scoped Roles limit blast radius by default.
- Bind service accounts to specific namespaces and never reuse the default service account for application workloads.
- Audit RBAC regularly using tools like `kubectl-who-can` or Fairwinds Polaris. If you cannot explain why every ClusterRoleBinding exists, your RBAC is too permissive.
- Implement just-in-time access for break-glass scenarios rather than giving permanent elevated access to on-call engineers.
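As an illustration of namespace-scoped least privilege, a Role that lets a CI service account manage Deployments in a single namespace might look like the sketch below; the `payments` namespace and `ci-deployer` account are hypothetical:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: payments              # scoped to one namespace, not the cluster
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployer-binding
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: ci-deployer              # dedicated account, not "default"
    namespace: payments
roleRef:
  kind: Role
  name: deployer
  apiGroup: rbac.authorization.k8s.io
```

Note what is absent: no secrets access, no delete verb, no ClusterRole. Each of those would need its own documented justification.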
Network policies: Default deny, explicit allow
Without network policies, every pod can communicate with every other pod in the cluster. In a production environment, this means a compromised pod in a low-security workload can reach your database pods, secrets store, or internal APIs.
Start with a default-deny policy in every namespace, then add explicit ingress and egress rules for each workload. This is tedious upfront but prevents the lateral movement that turns a single vulnerability into a cluster-wide breach.
- Apply a default-deny ingress and egress policy to every production namespace before deploying any workloads.
- Allow traffic explicitly based on pod labels and namespace selectors. Document every allow rule with a comment explaining why it exists.
- Restrict egress to known endpoints. If a pod only needs to talk to a database and an external API, it should not be able to reach anything else.
- Use a CNI that supports network policies natively — Calico, Cilium, or Antrea. The default kubenet CNI does not enforce network policies.
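A minimal sketch of the default-deny-then-allow pattern, assuming hypothetical `web` and `postgres` pod labels in a `payments` namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}                  # empty selector matches every pod
  policyTypes: ["Ingress", "Egress"]
---
# Explicit allow: web pods may reach the database on port 5432.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-to-db
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: web
      ports:
        - protocol: TCP
          port: 5432
```

Policies are additive: once the default-deny policy exists, each allow policy widens access for exactly the pods it selects and nothing else.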
Image security and supply chain
Every container image you deploy is a potential attack vector. Today, supply chain security is not optional — it is a compliance requirement for most regulated industries.
- Scan every image in your CI/CD pipeline with tools like Trivy, Grype, or Snyk. Fail builds on critical and high-severity CVEs.
- Sign images with Cosign or Notary and enforce signature verification at admission time using policy engines like Kyverno or OPA Gatekeeper.
- Use distroless or minimal base images. Every unnecessary package in your base image is an additional attack surface.
- Pin image tags to digests rather than mutable tags like `latest`. A tag can be overwritten; a digest cannot.
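Digest pinning is a one-line change in the container spec. The registry path and digest below are illustrative placeholders:

```yaml
spec:
  containers:
    - name: app
      # Avoid mutable tags in production:
      #   image: registry.example.com/app:latest
      # Pin to the content digest instead; the same bytes are pulled every time:
      image: registry.example.com/app@sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
```

Most CI systems can resolve a tag to its digest at build time and substitute it into the manifest, so developers keep working with readable tags while the cluster only ever pulls immutable references.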
Scaling: HPA, VPA, and Beyond
Autoscaling is the reason most teams adopt Kubernetes in the first place. But poorly configured autoscaling is worse than no autoscaling — it either fails to scale when traffic spikes or scales aggressively and burns through your cloud budget. Getting scaling right requires understanding the three autoscaling mechanisms and when to use each one.
Horizontal Pod Autoscaler (HPA)
HPA is the workhorse of Kubernetes scaling. It adjusts the number of pod replicas based on observed metrics — CPU utilization, memory consumption, or custom application metrics like requests per second or queue depth.
- Always set resource requests on containers that HPA targets. HPA calculates utilization as a percentage of the requested resources. Without requests, HPA has nothing to calculate against.
- Use custom metrics over CPU for application workloads. CPU-based scaling is a proxy for load, but it is an imprecise one. Scaling on requests per second, queue length, or p99 latency gives you more predictable behavior.
- Configure stabilization windows to prevent thrashing. The `behavior` field lets you set scale-down stabilization (default 300 seconds) and scale-up policies that prevent the autoscaler from oscillating.
- Set sensible min and max replica counts. A minimum of 2 replicas ensures availability during node failures. The maximum should reflect your budget ceiling and cluster capacity.
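Pulling these settings together, an `autoscaling/v2` HPA scaling on a custom requests-per-second metric might look like this sketch. It assumes a metrics adapter (such as Prometheus Adapter) already exposes a pods metric named `http_requests_per_second`, and that a Deployment named `web` exists:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2                    # survives single-node failures
  maxReplicas: 20                   # budget and capacity ceiling
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed custom metric
        target:
          type: AverageValue
          averageValue: "100"       # target RPS per replica
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300      # wait 5 min before scaling down
    scaleUp:
      policies:
        - type: Percent
          value: 100                # at most double replicas...
          periodSeconds: 60         # ...per minute
```

The `behavior` block is what separates a calm autoscaler from a thrashing one: scale-up stays responsive while scale-down is deliberately slow.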
Vertical Pod Autoscaler (VPA)
VPA adjusts the CPU and memory requests of individual pods based on actual usage patterns. It solves the problem of resource requests being guesswork — developers either over-provision (wasting money) or under-provision (causing OOMKills and throttling).
- Start VPA in recommendation mode. The `Off` updateMode generates recommendations without applying them. Use this to understand your actual resource consumption before letting VPA auto-adjust.
- Do not run VPA and HPA on the same metric. If HPA scales on CPU and VPA adjusts CPU requests, they will fight each other. Use HPA for horizontal scaling on custom metrics and VPA for right-sizing resource requests.
- Set resource bounds. VPA's `minAllowed` and `maxAllowed` fields prevent it from setting requests too low (causing instability) or too high (wasting resources).
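A VPA object in recommendation mode with bounds, assuming the VPA operator is installed and a Deployment named `web` exists (the bounds are illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"            # recommend only; never evict or mutate pods
  resourcePolicy:
    containerPolicies:
      - containerName: "*"       # applies to all containers in the pod
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi
```

With `updateMode: "Off"`, recommendations appear in the object's status (`kubectl describe vpa web-vpa`) and can feed a right-sizing report without any risk to running workloads.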
KEDA for event-driven scaling
For workloads driven by external events — message queues, cron schedules, database streams — KEDA (Kubernetes Event-Driven Autoscaling) is the standard today. KEDA extends HPA with scalers that understand external event sources.
- Scale to zero for workloads that do not need to run continuously. KEDA can scale deployments down to zero pods when there are no events to process, eliminating idle compute costs entirely.
- Use the right scaler for your event source. KEDA supports 60+ scalers — Kafka, RabbitMQ, AWS SQS, Azure Service Bus, PostgreSQL, and many more. Each scaler understands the native metrics of its event source.
- Configure cooldown periods to prevent premature scale-down when event bursts are intermittent.
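As a sketch, a KEDA ScaledObject for a hypothetical RabbitMQ-driven worker that scales between zero and thirty replicas; the Deployment name, queue name, and authentication reference are assumptions:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker
spec:
  scaleTargetRef:
    name: worker                  # Deployment to scale
  minReplicaCount: 0              # scale to zero when the queue is empty
  maxReplicaCount: 30
  cooldownPeriod: 300             # wait 5 min of inactivity before zero
  triggers:
    - type: rabbitmq
      metadata:
        queueName: orders
        mode: QueueLength
        value: "20"               # target messages per replica
      authenticationRef:
        name: rabbitmq-auth       # TriggerAuthentication holding the connection string
```

KEDA creates and manages the underlying HPA for you; the trigger translates queue depth into the external metric the HPA scales on.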
The biggest scaling mistake we see in production clusters is not a lack of autoscaling — it is autoscaling configured on the wrong metrics. CPU utilization tells you how busy your containers are, not whether your users are having a good experience. Scale on what matters to your application: request latency, queue depth, error rates, or business-specific metrics.
Cost Optimization: Spending Less Without Breaking Things
Kubernetes makes it easy to provision infrastructure. It also makes it easy to waste money. The abstraction layer between developers and cloud resources means teams often run clusters that cost two to three times more than they need to. Cost optimization is not about cutting corners — it is about scaling efficiently so your infrastructure spend grows proportionally with your actual usage.
Resource quotas and limit ranges
Without resource quotas, a single team or workload can consume the entire cluster's capacity. Resource quotas enforce per-namespace limits on CPU, memory, storage, and object counts.
- Set ResourceQuotas on every namespace. Define maximum CPU, memory, and storage consumption per namespace to prevent resource hoarding.
- Use LimitRanges to set defaults. LimitRanges automatically apply default resource requests and limits to containers that do not specify them, preventing pods from running with unbounded resources.
- Review and adjust quotas quarterly. Quotas set at cluster creation often do not reflect actual usage patterns six months later. Use VPA recommendations and historical usage data to right-size quotas.
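A starting-point quota and defaults sketch for a hypothetical `payments` namespace; the numbers are illustrative, not recommendations:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "20"            # total CPU the namespace may request
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"                   # cap object counts, not just resources
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: payments
spec:
  limits:
    - type: Container
      defaultRequest:             # applied when a container sets no request
        cpu: 100m
        memory: 128Mi
      default:                    # applied when a container sets no limit
        cpu: 500m
        memory: 512Mi
```

The two work together: the LimitRange guarantees every container has requests and limits, which is what makes the ResourceQuota enforceable in the first place.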
Spot instances and node pools
Spot instances (AWS), preemptible VMs (GCP), and spot VMs (Azure) offer 60 to 90 percent discounts compared to on-demand pricing. The trade-off is that the cloud provider can reclaim these instances with short notice — typically two minutes on AWS.
| Workload Type | Recommended Node Pool | Reason |
|---|---|---|
| Stateless APIs and web servers | Spot instances | Easily replaced; traffic reroutes to surviving pods instantly |
| Batch processing and CI/CD jobs | Spot instances | Can checkpoint and retry; interruptions add minimal overhead |
| Databases and stateful workloads | On-demand instances | Data durability and failover require stable nodes |
| System components (control plane add-ons) | On-demand instances | Cluster stability depends on these components running continuously |
| ML training jobs | Spot with checkpointing | Large savings on GPU instances; checkpointing handles interruptions |
| Event-driven workers | Spot instances with KEDA | Scale to zero when idle; spot pricing reduces burst costs |
- Use pod disruption budgets (PDBs) to ensure spot instance reclamation does not take down more pods than your application can tolerate. A PDB with `minAvailable: 50%` prevents Kubernetes from evicting more than half your pods simultaneously.
- Implement graceful shutdown handlers. Your application should catch SIGTERM, finish in-flight requests, and shut down cleanly within the termination grace period.
- Diversify instance types. Request multiple instance families and sizes in your spot pool to improve availability. Using a single instance type increases the chance of simultaneous reclamation.
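The PDB described above is only a few lines of YAML; the `app: web` selector is a placeholder:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 50%               # never evict below half the replicas
  selector:
    matchLabels:
      app: web                    # must match the Deployment's pod labels
```

A PDB governs voluntary disruptions such as node drains; it cannot stop the cloud provider from reclaiming a spot node outright, which is why graceful shutdown handling in the application remains essential.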
Right-sizing and waste elimination
The most impactful cost optimization is not switching to spot instances — it is right-sizing your existing workloads. Most clusters have 30 to 50 percent of their allocated resources sitting idle because developers requested more CPU and memory than their containers actually use.
- Deploy Kubecost or OpenCost for real-time cost visibility per namespace, workload, and label. You cannot optimize what you cannot see.
- Use VPA recommendations to right-size requests. Run VPA in recommendation mode across all namespaces and generate a report of the difference between requested and actual resource usage.
- Identify and eliminate zombie workloads — deployments that are running but no longer receive traffic or serve any function. These are common in clusters older than a year.
- Schedule non-production workloads to scale down off-hours. Development and staging environments that run 24/7 are burning money for 16 hours a day when nobody is using them.
GitOps: Declarative Deployments with ArgoCD and Flux
GitOps is the practice of using Git repositories as the single source of truth for your cluster's desired state. Instead of running `kubectl apply` from a developer's laptop or a CI script, a GitOps controller running inside the cluster continuously reconciles the actual state with the declared state in Git. Today, GitOps is the standard deployment model for production Kubernetes. If your team is still deploying through imperative CI/CD pipelines, you are carrying unnecessary operational risk.
ArgoCD vs. Flux: Choosing your controller
Both ArgoCD and Flux are mature, CNCF-graduated projects. The choice between them comes down to operational preferences:
- ArgoCD provides a rich web UI for visualizing application state, sync status, and resource health. It excels in environments where platform teams need visual oversight of deployments across multiple clusters. Its Application and ApplicationSet CRDs make it straightforward to manage hundreds of applications declaratively.
- Flux takes a more CLI-native, composable approach. It uses a set of specialized controllers (source-controller, kustomize-controller, helm-controller) that can be deployed independently. Flux is a strong choice for teams that prefer a headless, API-driven workflow and already have strong observability tooling.
Both support Helm charts, Kustomize overlays, and plain manifests. Both handle multi-cluster deployments. Pick the one that fits your team's workflow and stick with it — the benefits of GitOps come from the practice, not the tool.
GitOps best practices
- Separate application code from deployment manifests. Your Kubernetes manifests should live in a dedicated repository (or a dedicated directory in a monorepo), not embedded in the application source tree. This decouples deployment configuration from application releases.
- Use environment-specific overlays. Kustomize overlays or Helm values files for dev, staging, and production prevent configuration drift between environments while keeping the base manifests DRY.
- Enforce pull request reviews for manifest changes. Every change to production manifests should go through the same review process as application code. This is your audit trail and your safety net.
- Enable automated drift detection and alerting. Both ArgoCD and Flux can detect when the cluster state diverges from the Git state. Configure alerts so your team knows immediately when manual changes bypass the GitOps workflow.
- Implement progressive delivery. Use Argo Rollouts or Flagger to add canary deployments and blue-green strategies on top of your GitOps workflow. This gives you automated rollback based on metrics, not just manual intervention.
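For concreteness, here is a sketch of an ArgoCD Application that implements several of these practices at once: a dedicated manifest repository, a production Kustomize overlay, and automated sync with self-healing drift correction. The repository URL and paths are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-manifests.git  # hypothetical repo
    targetRevision: main
    path: overlays/production        # Kustomize overlay for prod
  destination:
    server: https://kubernetes.default.svc   # the local cluster
    namespace: web
  syncPolicy:
    automated:
      prune: true                    # delete resources removed from Git
      selfHeal: true                 # revert manual drift automatically
```

With `selfHeal` enabled, a manual `kubectl edit` is reverted within the next reconciliation loop, which is exactly the behavior that makes Git the single source of truth rather than one source among several.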
Observability: Seeing What Your Cluster Is Actually Doing
A production Kubernetes cluster without observability is a black box. When something breaks — and in a distributed system, something always breaks — your team's ability to diagnose and resolve the issue depends entirely on the data available. The maturity of your observability practice directly determines your mean time to recovery.
The three pillars, implemented
- Metrics: Prometheus remains the standard for Kubernetes metrics collection today. Deploy it with the kube-prometheus-stack, which bundles Prometheus, Alertmanager, Grafana, and a comprehensive set of recording rules and dashboards for cluster and workload monitoring. Use Thanos or Cortex for long-term storage and multi-cluster aggregation.
- Logs: Centralize logs with a stack like Loki (lightweight, label-based) or the EFK stack (Elasticsearch, Fluentd, Kibana). Structure your application logs as JSON so they are queryable. Set retention policies that balance debugging needs with storage costs.
- Traces: Distributed tracing with OpenTelemetry is critical for understanding request flow across microservices. Instrument your applications with the OpenTelemetry SDK and export traces to Jaeger, Tempo, or a commercial APM. Traces answer the question that metrics and logs cannot: where exactly did this request spend its time?
Alerting that works
The goal of alerting is not to notify your team about every anomaly — it is to notify them about conditions that require human intervention. Noisy alerts train on-call engineers to ignore alerts, which is worse than having no alerts at all.
- Alert on symptoms, not causes. Alert when error rate exceeds your SLO, not when CPU hits 80 percent. CPU at 80 percent might be perfectly normal under load.
- Use multi-window, multi-burn-rate alerts for SLO-based monitoring. This approach, documented in Google's SRE workbook, catches both sudden spikes and slow burns without false positives.
- Require runbooks for every alert. If an alert fires and the on-call engineer does not know what to do, the alert is incomplete. Every alert should link to a runbook with diagnostic steps and remediation actions.
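As an illustration of the fast-burn half of a multi-window SLO alert, a PrometheusRule for a 99.9% availability SLO might look like the sketch below. It assumes the kube-prometheus-stack operator is installed, that requests are counted in an `http_requests_total` metric with a `code` label, and that the runbook URL exists; a companion rule on slower windows (for example 1h/6h at a lower burn rate) would catch slow burns:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-rate
spec:
  groups:
    - name: slo.rules
      rules:
        - alert: HighErrorRateFastBurn
          # 14.4x burn rate on a 0.1% error budget = 1.44% error ratio.
          expr: |
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
          for: 2m
          labels:
            severity: page
          annotations:
            summary: "Error budget burning >14.4x for the 99.9% SLO"
            runbook_url: https://runbooks.example.com/high-error-rate  # hypothetical
```

This alerts on a symptom (user-visible error ratio) rather than a cause, and the mandatory `runbook_url` annotation enforces the runbook rule mechanically.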
Multi-Cluster Strategies
As organizations grow, a single Kubernetes cluster often cannot satisfy all requirements — compliance boundaries, geographic latency, blast radius isolation, or sheer scale. Multi-cluster architectures are increasingly common today, but they introduce significant operational complexity.
When to go multi-cluster
- Regulatory or compliance isolation: When workloads handling PII, healthcare data, or financial transactions must run in physically or logically isolated environments.
- Geographic distribution: When users in different regions need sub-100ms latency and a single-region cluster cannot deliver it.
- Blast radius reduction: When a single cluster failure affecting all services is an unacceptable business risk.
- Scale limits: When your workloads approach the practical limits of a single cluster (around 5,000 nodes or 150,000 pods in most managed Kubernetes services).
Multi-cluster tooling
- Cluster API for provisioning and managing the lifecycle of Kubernetes clusters declaratively. It brings the same infrastructure-as-code principles to cluster management that Kubernetes brings to workload management.
- Service mesh (Istio, Linkerd) for cross-cluster service discovery, traffic management, and mTLS. A service mesh is practically a requirement for multi-cluster architectures where services need to communicate across cluster boundaries.
- ArgoCD ApplicationSets or Flux multi-cluster for deploying workloads across clusters from a single Git repository. This ensures consistency and prevents configuration drift between clusters.
- External DNS and global load balancers for routing user traffic to the nearest healthy cluster. AWS Route 53, GCP Cloud DNS, or Cloudflare all support health-checked, latency-based routing.
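A sketch of an ArgoCD ApplicationSet using the cluster generator to stamp out one Application per cluster registered in ArgoCD; the repository URL and paths are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: web-all-clusters
  namespace: argocd
spec:
  generators:
    - clusters: {}                 # one Application per registered cluster
  template:
    metadata:
      name: 'web-{{name}}'         # {{name}} is the cluster's registered name
    spec:
      project: default
      source:
        repoURL: https://github.com/example/deploy-manifests.git  # hypothetical
        targetRevision: main
        path: overlays/production
      destination:
        server: '{{server}}'       # each cluster's API server URL
        namespace: web
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Registering a new cluster with ArgoCD is then sufficient to deploy the workload there, which keeps cluster fleets consistent by construction rather than by convention.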
Do not adopt multi-cluster prematurely. The operational overhead — managing cluster lifecycles, cross-cluster networking, consistent RBAC, and centralized observability — is substantial. Start with namespace-based isolation within a single cluster and only move to multi-cluster when you have a clear requirement that single-cluster architecture cannot satisfy.
Putting It All Together: A Production Readiness Checklist
The practices in this guide are not independent — they reinforce each other. Pod security standards are more effective when combined with network policies. Autoscaling delivers better results when resource requests are right-sized. GitOps becomes powerful when backed by comprehensive observability.
Here is a condensed checklist for evaluating your production Kubernetes readiness:
- Security: Pod Security Standards enforced at Restricted level. RBAC follows least privilege. Network policies default to deny. Images scanned and signed. Secrets managed externally.
- Scaling: HPA configured on application-specific metrics. VPA running in recommendation mode. Resource requests reflect actual usage. KEDA deployed for event-driven workloads.
- Cost: Resource quotas on every namespace. Spot instances for stateless workloads. Right-sizing reviews quarterly. Non-production environments scale down off-hours. Cost visibility per team and workload.
- Deployments: GitOps controller (ArgoCD or Flux) managing all production deployments. Drift detection enabled. Progressive delivery for critical services. Pull request review required for manifest changes.
- Observability: Metrics, logs, and traces collected and centralized. SLO-based alerting with runbooks. Dashboards for cluster health, workload performance, and cost. On-call rotation with clear escalation paths.
Production Kubernetes is not a destination — it is a continuous practice. The clusters that run reliably at scale are the ones where teams treat operational excellence as a feature, not an afterthought. Every item on this checklist should be reviewed, tested, and improved on a regular cadence, just like your application code.
Conclusion
Running Kubernetes in production demands more than knowing how to write manifests. It requires a disciplined approach to security, a data-driven approach to scaling, a cost-conscious approach to resource management, and automated deployment practices that eliminate human error.
The good news is that none of these practices require proprietary tools or exotic expertise. Pod Security Standards, RBAC, network policies, HPA, VPA, GitOps controllers, and the Prometheus observability stack are all open-source, battle-tested, and well-documented. The challenge is not technology — it is organizational maturity and the discipline to implement these practices consistently across every cluster, every namespace, and every deployment.
Start with the area that represents your biggest risk. If your cluster has no network policies, that is your first priority. If your cloud bill is growing faster than your traffic, focus on right-sizing and spot instances. If deployments are manual and error-prone, implement GitOps. Tackle one area at a time, measure the improvement, and move on to the next.
At DSi, our DevOps and platform engineering teams help organizations build and operate production Kubernetes infrastructure. Whether you need to harden an existing cluster, implement GitOps workflows, or optimize your cloud spend, our engineers embed directly into your team and work alongside your engineers to build lasting operational capability.