DEVOPS & CLOUD INFRASTRUCTURE

Deploying to AWS EKS with CDK: Lessons from Production

Yadisnel Galvez Velázquez · February 9, 2026 · 18 min read

TL;DR

After deploying multiple production workloads to Amazon EKS using AWS CDK, we've learned that the gap between a working cluster and a production-ready one is enormous. This article shares our battle-tested patterns for VPC architecture, node group strategies, security hardening with IRSA and Pod Identity, GitOps bootstrapping with ArgoCD via CDK Blueprints, observability pipelines, and the CDK constructs that saved us from 3 AM production incidents.

Introduction

Amazon EKS is the gravitational center of modern cloud-native infrastructure on AWS. Paired with the AWS Cloud Development Kit, it promises infrastructure as code with the expressiveness of a real programming language — TypeScript, Python, Go — instead of YAML templates that sprawl across thousands of lines.

The promise is real. But the path from cdk init to a production cluster that handles real traffic, survives availability zone failures, and doesn't hemorrhage money is paved with decisions that documentation barely mentions.

At Atbion, we've deployed EKS clusters for fintech platforms processing thousands of transactions per second, multi-tenant SaaS applications serving enterprise clients, and real-time data pipelines handling terabytes daily. Each deployment taught us something the docs didn't.

This article is the guide we wish we had on day one. Not a tutorial — a field manual from production.


The Foundation: VPC Architecture That Doesn't Bite You at 3 AM

Why Default VPCs Are Production Landmines

The first mistake teams make with EKS is deploying into a default VPC or one hastily configured with minimal subnets. In production, your VPC is the foundation everything else stands on. Get it wrong, and you'll be refactoring under pressure when you run out of IP addresses at 2 AM during a traffic spike.

Our standard EKS VPC pattern uses three availability zones — not two. The cost difference is negligible, but the resilience difference is substantial. We learned this the hard way when us-east-1 had an AZ outage that took down a two-AZ cluster for a client. Three AZs survived the same incident for another client.

Each AZ gets three subnet tiers: public (for load balancers and NAT gateways), private with egress (for worker nodes and pods), and isolated (for databases and caches). This separation isn't bureaucratic — it's the difference between a security audit that takes a week and one that takes a month.

TypeScript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as eks from 'aws-cdk-lib/aws-eks';

const vpc = new ec2.Vpc(this, 'EksVpc', {
  maxAzs: 3,
  natGateways: 3, // One per AZ for resilience
  subnetConfiguration: [
    {
      name: 'Public',
      subnetType: ec2.SubnetType.PUBLIC,
      cidrMask: 24,
    },
    {
      name: 'Private',
      subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
      cidrMask: 18, // Large range for pods
    },
    {
      name: 'Isolated',
      subnetType: ec2.SubnetType.PRIVATE_ISOLATED,
      cidrMask: 24,
    },
  ],
});

const cluster = new eks.Cluster(this, 'Production', {
  version: eks.KubernetesVersion.V1_32,
  vpc,
  vpcSubnets: [
    { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS }
  ],
  defaultCapacityType: eks.DefaultCapacityType.NODEGROUP,
});

The /18 CIDR mask for private subnets is intentional. The AWS VPC CNI plugin assigns one IP address per pod from the subnet range. A /24 gives you 251 usable IPs (AWS reserves five addresses in every subnet) — enough for a demo, catastrophic for production. With /18 subnets across three AZs, you get roughly 49,000 pod IPs.
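The arithmetic is easy to sanity-check. A minimal sketch of the capacity math, assuming the standard five-address reservation AWS applies to every subnet:

```typescript
// Pod IP capacity behind the /18 choice. AWS reserves 5 IPs per subnet
// (network address, VPC router, DNS, future use, broadcast), and the VPC CNI
// consumes one subnet IP per pod.
const AWS_RESERVED_IPS = 5;

function usablePodIps(cidrMask: number): number {
  return 2 ** (32 - cidrMask) - AWS_RESERVED_IPS;
}

function podCapacity(cidrMask: number, azCount: number): number {
  return usablePodIps(cidrMask) * azCount;
}

console.log(usablePodIps(24));   // 251 — fine for a demo
console.log(podCapacity(18, 3)); // 49137 pod IPs across three /18 subnets
```

Worth noting: this counts subnet IPs only. Per-node ENI limits cap how many of those IPs a single instance can actually attach, so instance type still matters.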

We deployed a fintech platform with /24 private subnets. At 3 AM on a Monday, the payment processing service tried to scale from 8 to 40 pods during a transaction surge. IP exhaustion cascaded into failed health checks, rolling restarts, and 23 minutes of degraded service. The fix was a VPC migration that took two weeks. The prevention was a larger CIDR mask on day one.


Cluster Configuration: Beyond the Defaults

The Settings That Separate Dev Clusters from Production

A default EKS cluster works. It runs pods, serves traffic, and gives you a functioning Kubernetes API. But a default cluster in production is a security audit failure waiting to happen. Three non-negotiable configurations we enable on every production cluster: envelope encryption for Kubernetes secrets, control plane logging to CloudWatch, and private endpoint access.

TypeScript
import * as kms from 'aws-cdk-lib/aws-kms';

const secretsKey = new kms.Key(this, 'EksSecretsKey', {
  alias: 'eks/secrets-encryption',
  enableKeyRotation: true,
});

const cluster = new eks.Cluster(this, 'Production', {
  version: eks.KubernetesVersion.V1_32,
  vpc,
  secretsEncryptionKey: secretsKey,
  endpointAccess: eks.EndpointAccess.PRIVATE,
  clusterLogging: [
    eks.ClusterLoggingTypes.API,
    eks.ClusterLoggingTypes.AUDIT,
    eks.ClusterLoggingTypes.AUTHENTICATOR,
    eks.ClusterLoggingTypes.CONTROLLER_MANAGER,
    eks.ClusterLoggingTypes.SCHEDULER,
  ],
});

Version pinning deserves special attention. Never use latest or auto for production Kubernetes versions. Pin to a specific version and upgrade deliberately. We maintain a version matrix that maps our Kubernetes version to validated addon versions — CoreDNS, kube-proxy, VPC CNI — and test the entire matrix before upgrading production.
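One way to make the version matrix executable is to pin addons explicitly through the L1 CfnAddon construct. This is a sketch, not our exact production code; the version strings are placeholders, and current values should be resolved with `aws eks describe-addon-versions --kubernetes-version 1.32`:

```typescript
import * as eks from 'aws-cdk-lib/aws-eks';
import { Construct } from 'constructs';

// Pin core addon versions instead of accepting whatever EKS defaults to.
// Version strings below are placeholders, not real releases.
function pinCoreAddons(scope: Construct, cluster: eks.Cluster): void {
  const versionMatrix: Record<string, string> = {
    'coredns': 'v1.11.x-eksbuild.y',    // placeholder
    'kube-proxy': 'v1.32.x-eksbuild.y', // placeholder
    'vpc-cni': 'v1.19.x-eksbuild.y',    // placeholder
  };

  for (const [addonName, addonVersion] of Object.entries(versionMatrix)) {
    new eks.CfnAddon(scope, `Addon-${addonName}`, {
      clusterName: cluster.clusterName,
      addonName,
      addonVersion,
      resolveConflicts: 'OVERWRITE',
    });
  }
}
```

Keeping the matrix as a single object makes the validated combination reviewable in one diff when you bump the Kubernetes version.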


Node Groups: The Decision That Defines Your Bill

Node group strategy is where architecture meets economics. Choose wrong, and you're either overpaying by 60% or suffering from resource starvation during peak load. EKS offers three compute models, and we've used all of them — sometimes in the same cluster.

Managed Node Groups are our default for stateful and predictable workloads. Fargate Profiles are ideal for batch jobs and intermittent workloads. Karpenter is the game-changer we adopted — it provisions exactly the right instance type for pending pods in real-time and reduced our compute costs by 40% on one cluster.

TypeScript
// Managed Node Group with multiple instance types
cluster.addNodegroupCapacity('WorkerNodes', {
  instanceTypes: [
    new ec2.InstanceType('m6i.xlarge'),
    new ec2.InstanceType('m6a.xlarge'),
    new ec2.InstanceType('m5.xlarge'),
  ],
  minSize: 3,
  maxSize: 20,
  desiredSize: 5,
  diskSize: 100,
  labels: { role: 'worker', environment: 'production' },
});

// Fargate for batch processing namespace
cluster.addFargateProfile('BatchProfile', {
  selectors: [
    { namespace: 'batch-jobs' },
    { namespace: 'cron-tasks' },
  ],
});

When to use Managed Node Groups

  • Stateful workloads (databases, caches)
  • Predictable baseline traffic
  • DaemonSet requirements (logging, monitoring)
  • GPU/ML inference workloads
  • Workloads needing privileged containers

When to use Fargate

  • Batch processing and cron jobs
  • Short-lived tasks and CI runners
  • Dev/staging environments
  • Workloads with unpredictable schedules
  • Cost optimization for idle-heavy patterns
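Karpenter, mentioned above, replaces static node groups with just-in-time provisioning driven by NodePool resources. A hedged sketch of applying one through CDK, using Karpenter's v1 NodePool API; the capacity mix, CPU limit, and the `default` EC2NodeClass name are illustrative assumptions, not our production settings:

```typescript
import * as eks from 'aws-cdk-lib/aws-eks';

// Illustrative Karpenter v1 NodePool: spot-first general-purpose capacity
// with a hard CPU ceiling and consolidation enabled.
function addKarpenterNodePool(cluster: eks.Cluster): void {
  cluster.addManifest('GeneralNodePool', {
    apiVersion: 'karpenter.sh/v1',
    kind: 'NodePool',
    metadata: { name: 'general-purpose' },
    spec: {
      template: {
        spec: {
          requirements: [
            { key: 'karpenter.sh/capacity-type', operator: 'In', values: ['spot', 'on-demand'] },
            { key: 'kubernetes.io/arch', operator: 'In', values: ['amd64'] },
          ],
          // References an EC2NodeClass (AMI, subnets, security groups)
          // defined elsewhere — assumed to be named "default" here.
          nodeClassRef: { group: 'karpenter.k8s.aws', kind: 'EC2NodeClass', name: 'default' },
        },
      },
      limits: { cpu: '200' }, // cap total provisioned vCPUs
      disruption: { consolidationPolicy: 'WhenEmptyOrUnderutilized' },
    },
  });
}
```

This assumes the Karpenter controller is already installed in the cluster; the NodePool alone does nothing without it.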

Security: IRSA, Pod Identity, and Least Privilege

The most impactful security decision in an EKS cluster is how pods authenticate to AWS services. IAM Roles for Service Accounts (IRSA) maps each Kubernetes service account to a specific IAM role with precisely scoped permissions. EKS Pod Identity is the evolution — simpler to configure, with cross-account support and session tags.

TypeScript
import * as iam from 'aws-cdk-lib/aws-iam';

// IRSA: Service account with scoped IAM role
const sa = cluster.addServiceAccount('AppServiceAccount', {
  name: 'app-service-account',
  namespace: 'production',
});

// Grant only what the pod needs — nothing more
sa.addToPrincipalPolicy(new iam.PolicyStatement({
  actions: ['s3:GetObject', 's3:ListBucket'],
  resources: [
    bucket.bucketArn,
    `${bucket.bucketArn}/*`,
  ],
}));

sa.addToPrincipalPolicy(new iam.PolicyStatement({
  actions: ['sqs:SendMessage', 'sqs:ReceiveMessage'],
  resources: [queue.queueArn],
}));
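For comparison, a sketch of the same binding done with EKS Pod Identity. The role trusts the pods.eks.amazonaws.com service principal (with session tagging), and a CfnPodIdentityAssociation links it to the service account; names mirror the IRSA example and are illustrative:

```typescript
import * as eks from 'aws-cdk-lib/aws-eks';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

// Pod Identity wiring: no OIDC provider per cluster, just a role trusted by
// the EKS Pod Identity service and an association to a service account.
function addPodIdentity(scope: Construct, cluster: eks.Cluster): void {
  const role = new iam.Role(scope, 'AppPodRole', {
    // withSessionTags() adds sts:TagSession to the trust policy,
    // which Pod Identity requires alongside sts:AssumeRole.
    assumedBy: new iam.ServicePrincipal('pods.eks.amazonaws.com').withSessionTags(),
  });

  new eks.CfnPodIdentityAssociation(scope, 'AppPodIdentity', {
    clusterName: cluster.clusterName,
    namespace: 'production',
    serviceAccount: 'app-service-account',
    roleArn: role.roleArn,
  });
}
```

This also assumes the eks-pod-identity-agent addon is installed on the cluster; without the agent, pods cannot exchange the association for credentials.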

Network policies are the second critical security layer. By default, every pod can communicate with every other pod. We deploy the VPC CNI's built-in network policy support to enforce namespace isolation and prevent lateral movement between services. Security Groups for Pods extend AWS-native security groups to individual pods — particularly valuable in regulated industries.
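A default-deny ingress policy is the usual starting point for the namespace isolation described above. A minimal sketch applied through CDK, assuming network policy enforcement is enabled in the VPC CNI; the namespace is whatever you pass in:

```typescript
import * as eks from 'aws-cdk-lib/aws-eks';

// Standard Kubernetes NetworkPolicy: selecting every pod and declaring the
// Ingress policy type with no rules denies all inbound traffic by default.
// Allow-rules for legitimate flows are then added per service.
function addDefaultDenyIngress(cluster: eks.Cluster, namespace: string): void {
  cluster.addManifest(`DefaultDeny-${namespace}`, {
    apiVersion: 'networking.k8s.io/v1',
    kind: 'NetworkPolicy',
    metadata: { name: 'default-deny-ingress', namespace },
    spec: {
      podSelector: {},          // empty selector matches all pods in the namespace
      policyTypes: ['Ingress'], // no ingress rules listed => all inbound denied
    },
  });
}
```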


Networking: CNI, Load Balancers, and Ingress

The AWS VPC CNI plugin gives each pod a real VPC IP address — not an overlay network — which means pods can communicate directly with RDS instances, ElastiCache clusters, and other VPC resources without NAT. The AWS Load Balancer Controller is non-negotiable for production, provisioning ALBs for Ingress resources and NLBs for LoadBalancer services.

For ingress strategy, we use a two-tier approach: a shared ALB for internal services using path-based routing, and dedicated NLBs for external-facing services. One pattern that saved us repeatedly: annotating Ingress resources with alb.ingress.kubernetes.io/group.name to share a single ALB across multiple services — preventing 15 ALBs costing $240/month when they could share one.
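The group.name pattern looks like this in practice. A sketch of an Ingress applied via CDK; hostless path routing, the group name, and the service names are illustrative, while the annotation keys are the AWS Load Balancer Controller's documented ones:

```typescript
import * as eks from 'aws-cdk-lib/aws-eks';

// Every Ingress that carries the same group.name annotation is merged onto
// one shared ALB instead of provisioning its own.
function addSharedIngress(cluster: eks.Cluster, serviceName: string): void {
  cluster.addManifest(`${serviceName}Ingress`, {
    apiVersion: 'networking.k8s.io/v1',
    kind: 'Ingress',
    metadata: {
      name: serviceName,
      namespace: 'production',
      annotations: {
        'alb.ingress.kubernetes.io/group.name': 'internal-shared', // one ALB per group
        'alb.ingress.kubernetes.io/scheme': 'internal',
        'alb.ingress.kubernetes.io/target-type': 'ip', // route straight to pod IPs
      },
    },
    spec: {
      ingressClassName: 'alb',
      rules: [{
        http: {
          paths: [{
            path: `/${serviceName}`,
            pathType: 'Prefix',
            backend: { service: { name: serviceName, port: { number: 80 } } },
          }],
        },
      }],
    },
  });
}
```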


GitOps with ArgoCD via CDK Blueprints

CDK EKS Blueprints transformed how we bootstrap ArgoCD. Instead of manually installing ArgoCD, configuring repositories, and wiring up each addon, Blueprints provides a single construct that installs ArgoCD and configures it to sync from a git repository. Every addon — Metrics Server, Cluster Autoscaler, AWS Load Balancer Controller, VPC CNI — can be managed through ArgoCD's Application resources.

TypeScript
import * as blueprints from '@aws-quickstart/eks-blueprints';
import * as eks from 'aws-cdk-lib/aws-eks';

const addOns: Array<blueprints.ClusterAddOn> = [
  new blueprints.addons.ArgoCDAddOn({
    bootstrapRepo: {
      repoUrl: 'https://github.com/acme/platform-addons',
      path: 'envs/production',
      targetRevision: 'main',
    },
    adminPasswordSecretName: 'argocd-admin-secret',
  }),
  new blueprints.addons.MetricsServerAddOn(),
  new blueprints.addons.ClusterAutoScalerAddOn(),
  new blueprints.addons.AwsLoadBalancerControllerAddOn(),
  new blueprints.addons.VpcCniAddOn(),
  new blueprints.addons.CoreDnsAddOn(),
  new blueprints.addons.KubeProxyAddOn(),
  new blueprints.addons.ContainerInsightsAddOn(),
];

blueprints.EksBlueprint.builder()
  .account(account)
  .region('us-east-1')
  .version(eks.KubernetesVersion.V1_32) // pinned deliberately, never 'auto'
  .addOns(...addOns)
  .useDefaultSecretEncryption(true)
  .enableGitOps(blueprints.GitOpsMode.APPLICATION)
  .build(app, 'production-cluster');

One construct. Twenty lines. That deploys a production EKS cluster with ArgoCD, autoscaling, load balancing, monitoring, and network policy support. Before CDK Blueprints, this was 800+ lines of CloudFormation, three Helm installations, and a prayer that the IRSA roles were configured correctly.


Observability: You Can't Fix What You Can't See

Our standard observability stack has three pillars: Metrics (Container Insights + Prometheus with Grafana), Logs (Fluent Bit shipping to CloudWatch Logs), and Traces (AWS X-Ray with the OpenTelemetry collector). The entire stack runs inside the cluster, costs nothing beyond compute, and gives us the same visibility as expensive SaaS observability platforms.
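As one example of the log pillar, Fluent Bit can be installed from AWS's eks-charts repository through CDK's Helm support. A sketch under stated assumptions: the namespace, log group name, and values layout are illustrative, and chart values vary between aws-for-fluent-bit releases:

```typescript
import * as eks from 'aws-cdk-lib/aws-eks';

// Ship container logs to CloudWatch Logs with the aws-for-fluent-bit chart.
// Values shown are illustrative; check the chart's values.yaml for your version.
function addLogShipping(cluster: eks.Cluster, region: string): void {
  cluster.addHelmChart('FluentBit', {
    chart: 'aws-for-fluent-bit',
    repository: 'https://aws.github.io/eks-charts',
    namespace: 'logging',
    createNamespace: true,
    values: {
      cloudWatchLogs: {
        enabled: true,
        region,
        logGroupName: '/eks/production/application', // illustrative name
      },
    },
  });
}
```

The Fluent Bit DaemonSet needs IAM permissions to write to CloudWatch Logs, granted via IRSA or Pod Identity exactly as in the security section above.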


CDK Patterns That Saved Us in Production

Multi-stack architecture: never put your VPC, EKS cluster, and application workloads in the same CDK stack. We use three stacks minimum — NetworkStack, ClusterStack, and WorkloadStack — with cross-stack references via SSM Parameter Store instead of CloudFormation exports. Environment-specific configuration through CDK context keeps the same code with different parameters.

TypeScript
// Multi-stack pattern with SSM Parameter Store
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ssm from 'aws-cdk-lib/aws-ssm';
import { Construct } from 'constructs';

// Stack 1: Network
class NetworkStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: cdk.StackProps) {
    super(scope, id, props);
    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 3 });
    new ssm.StringParameter(this, 'VpcId', {
      parameterName: '/eks/production/vpc-id',
      stringValue: vpc.vpcId,
    });
  }
}

// Stack 2: Cluster (reads VPC from SSM)
class ClusterStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: cdk.StackProps) {
    super(scope, id, props);
    // Vpc.fromLookup needs a concrete value at synth time, so use
    // valueFromLookup (a context lookup) rather than
    // valueForStringParameter, which returns a deploy-time token
    const vpcId = ssm.StringParameter.valueFromLookup(
      this, '/eks/production/vpc-id'
    );
    const vpc = ec2.Vpc.fromLookup(this, 'Vpc', { vpcId });
    // ... cluster definition
  }
}
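The environment-specific configuration mentioned above can live in a plain typed map. This is a hypothetical sketch: the parameter names and values are illustrative, and in a real stack the environment name would come from `this.node.tryGetContext('environment')`, supplied at deploy time with `cdk deploy -c environment=production`:

```typescript
// Hypothetical per-environment parameters keyed by context value.
interface EnvConfig {
  natGateways: number;
  nodeMinSize: number;
  nodeMaxSize: number;
}

const ENVIRONMENTS: Record<string, EnvConfig> = {
  development: { natGateways: 1, nodeMinSize: 1, nodeMaxSize: 3 },
  production: { natGateways: 3, nodeMinSize: 3, nodeMaxSize: 20 },
};

function configFor(name: string): EnvConfig {
  const cfg = ENVIRONMENTS[name];
  if (!cfg) {
    // Fail fast at synth time on a typo'd or missing context value.
    throw new Error(`Unknown environment '${name}'; expected one of: ${Object.keys(ENVIRONMENTS).join(', ')}`);
  }
  return cfg;
}

console.log(configFor('production').natGateways); // 3
```

Failing loudly on unknown environments matters: a silent fallback to defaults is how a development-sized node group ends up serving production traffic.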

The Numbers: Cost and Performance Impact

After 18 months of running production EKS clusters with CDK, here's what the numbers tell us:

  • -40% compute costs with Karpenter
  • 99.97% cluster uptime across all clients
  • 4 min average deploy to production
  • -70% infrastructure code with CDK Blueprints
  • < 8 min mean time to recovery


Lessons Learned

01. Start with CDK Blueprints, not raw constructs

We spent three months building custom CDK constructs before discovering CDK EKS Blueprints did everything we built — and more. Build on top of Blueprints; don't reinvent it.

02. Separate your stacks or pay the price

A single-stack EKS deployment works in development. In production, a failed Helm release rollback cascaded into a CloudFormation rollback that tried to delete the VPC.

03. Over-provision your VPC CIDR, not your compute

IP addresses are free. Expanding a VPC CIDR after deployment ranges from painful to impossible. Use /16 for the VPC and /18 for private subnets.

04. Observability before traffic, not after outages

Our first production EKS cluster had no Container Insights, no Prometheus, and minimal logging. The first incident took 4 hours to diagnose. Deploy the full observability stack before any application workload.

05. Test your disaster recovery before you need it

We run quarterly DR drills: destroy a staging cluster, rebuild from CDK, restore data, validate. Every drill reveals something the code doesn't capture.


Conclusion

Deploying to AWS EKS with CDK is not a weekend project. It's a series of decisions — VPC sizing, node group strategy, security boundaries, addon management, observability depth — that compound into either a resilient platform or a fragile one.

The tools are extraordinary. CDK gives you infrastructure with the ergonomics of application code. EKS Blueprints compresses weeks of addon wiring into composable constructs. ArgoCD turns deployments into git commits. Karpenter turns capacity planning into something the cluster handles for you.

At Atbion, we've learned these lessons so our clients don't have to. Every production cluster we deploy carries the accumulated wisdom of every incident we've survived. That's the real value of experience — not avoiding mistakes entirely, but making each mistake only once.

Need a production-ready EKS deployment?

We design, deploy, and operate EKS clusters for teams that need Kubernetes without the operational overhead. Let's talk about your infrastructure goals.

Get in Touch

Yadisnel Galvez Velázquez

Founder & CEO at Atbion

Software engineer specialized in distributed systems and cloud architecture. Building Atbion to prove that AI-first engineering delivers better software, faster.