Intent:
- Increase awareness of architectural best practices
- Address foundational areas that are often neglected
- Provide a consistent approach to evaluating architectures
- Influence future architectures
Areas:
- Application checklist for Kubernetes
- Cluster ready checklist for Kubernetes
- Operational considerations for Kubernetes
Standard WAR pillars:
- Security
- Cost Optimization
- Operational Excellence
- Performance Efficiency
- Reliability
Application checklist for Kubernetes:
- Pod readiness checks
- Liveness checks
- Metrics instrumentation (e.g. Prometheus, New Relic, Datadog)
- Dashboards (standard Kubernetes Dashboard, Grafana, or alternatives)
- Playbooks and Runbooks
- Limits and Requests
- Labels and annotations
- Pod placement
  - How many pods per application?
  - Taints and Tolerations
  - Pod affinity / anti-affinity
  - Node selectors
- Alerting
- Structured logging output (ELK stack or commercial options)
- Tracing (X-Ray, Zipkin, Lightstep, Appdash, Jaeger)
- Graceful shutdowns (e.g. how does the app respond to SIGTERM?)
- Graceful dependencies (apps should not assume dependencies are available)
- ConfigMaps (apps should use them for dependency injection)
- Images tagged with the commit SHA (do not use the "latest" tag)
- Locked-down runtime context (e.g. no root user)
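- Consider using Pod Security Policy (PSP); note that PSP was deprecated in Kubernetes 1.21 and removed in 1.25 in favor of Pod Security Admission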
- Consider using AppArmor or SELinux security context
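Several of the application checklist items above (readiness and liveness checks, limits and requests, image tags, non-root runtime context) can be checked mechanically. The following is a minimal sketch, assuming the pod spec has already been parsed into a Python dict (for example from a YAML manifest). The names `image_tag` and `check_pod_spec` and the wording of the findings are hypothetical; the field names follow the Kubernetes Pod API. It is illustrative only and not an exhaustive policy.

```python
def image_tag(image):
    """Return the tag of an image reference, or "" if no tag is given."""
    last_segment = image.rsplit("/", 1)[-1]  # skip any registry host:port prefix
    return last_segment.partition(":")[2]


def check_pod_spec(pod_spec):
    """Return a list of checklist findings for a parsed pod spec.

    `pod_spec` is assumed to be the dict under `spec` of a Pod, or under
    `spec.template.spec` of a Deployment.
    """
    findings = []
    pod_security = pod_spec.get("securityContext", {})

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")

        # Pod readiness checks and liveness checks
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                findings.append(f"{name}: missing {probe}")

        # Limits and requests
        resources = container.get("resources", {})
        for key in ("requests", "limits"):
            if not resources.get(key):
                findings.append(f"{name}: missing resources.{key}")

        # Image tag: a digest reference is accepted, otherwise the tag
        # must be present and must not be "latest"
        image = container.get("image", "")
        if "@" not in image and image_tag(image) in ("", "latest"):
            findings.append(f"{name}: image tag is missing or 'latest'")

        # Locked-down runtime context: the container-level setting
        # overrides the pod-level setting
        container_security = container.get("securityContext", {})
        non_root = container_security.get(
            "runAsNonRoot", pod_security.get("runAsNonRoot", False)
        )
        if not non_root:
            findings.append(f"{name}: runAsNonRoot is not set")

    return findings
```

A check like this could run in the CI portion of the build pipeline, failing the build when the returned list is not empty.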
Cluster ready checklist for Kubernetes:
- Build Pipeline - CI portion (Jenkins, Travis, CircleCI, CodeBuild)
- Deployment Pipeline - CD portion (GitOps using Weave Cloud and Flux)
- Image Registry (Docker Hub, JFrog or ECR)
  - Private repos require credential storage
- Monitoring infrastructure by collecting and storing metrics (Prometheus or CloudWatch)
- Databases or Stateful Apps
- Storage
  - CSI drivers for block storage
  - CSI drivers for shared file storage (EFS)
  - OpenEBS
  - Portworx
  - Rook / Ceph
- Secrets Management
  - Bitnami Sealed Secrets
  - GoDaddy External Secrets
  - SOPS
  - kubesec
  - Jeremy's prototype
  - HashiCorp Vault
- Ingress Controller (ALB, NGINX, Kong, Solo Gloo, Traefik, HAProxy, Ambassador, etc.)
- Service Mesh (App Mesh, Istio, Linkerd)
- Service Catalog functionality
  - Service Catalog
  - AWS Operator
- User and Pod Authorization
  - IAM users or roles
  - SAML Federation
  - IRSA for Pods, KIAM or Kube2IAM
- Network Policies (Tigera Calico)
- Static or Dynamic Image/Runtime Scanning
  - ECR (static only)
  - Twistlock
  - Aqua Security
  - StackRox
  - Sysdig
- Log Aggregation
  - CloudWatch / Container Insights (e.g. Fluent Bit or Fluentd forwarder)
  - Splunk
  - Others
- Horizontal Pod Autoscaling
  - Metrics Server
  - Use AWS CloudWatch or external metrics
- Vertical Pod Autoscaling
- Cluster autoscaling or AWS native ASG
  - Cluster Autoscaler is not AZ aware
  - Cluster Autoscaler will dynamically move pods around and terminate instances
  - Cluster Autoscaler is reactive
  - Cluster Autoscaler assumes node groups are homogeneous
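As a rough illustration of how Horizontal Pod Autoscaling turns a metric into a replica count, the sketch below applies the scaling formula described in the Kubernetes documentation: desired = ceil(current replicas x current metric / target metric). The function name `desired_replicas` and the default bounds are hypothetical; the 0.1 tolerance mirrors the documented default but is configurable on a real cluster. The real controller also handles missing metrics, pod readiness, and stabilization windows, none of which are modeled here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Within the tolerance band the replica count is left unchanged
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas averaging 90% CPU against a 60% target scale to 6 replicas, while the same 4 replicas at 62% stay at 4 because the ratio is within the tolerance.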
Operational considerations for Kubernetes:
- How do you bootstrap a new cluster?
  - Helm files
  - GitOps
  - Scripts
- Utilizing Namespaces for team/developer isolation
- How do you create clusters using IaC?
  - eksctl
  - CloudFormation / CDK
  - Terraform
  - Pulumi
  - Other
- How do you upgrade your node groups?
  - eksctl
  - AWS managed node groups
  - Manually
- Do you have SSH access into your worker nodes?
  - SSH keys
  - SSM agent
- AMI choice
  - Amazon Linux 2 / EKS-optimized Amazon Linux
  - Ubuntu
  - Custom AMI
- VPC design
  - The AWS VPC CNI requires a large number of IP addresses
  - Overlay CIDR block possible
  - 6x /20 subnets recommended (3 public and 3 private) across 3 Availability Zones
  - Public or private DNS settings
  - Private endpoints require additional work
  - Hybrid design
- Worker Nodes
  - Instance sizing impacts the number of pods/IPs
  - Fargate for EKS
- DNS (CoreDNS)
  - EKS runs only 2 CoreDNS pods by default
  - A DaemonSet may be better
  - DaemonSet with local access only
- External DNS
- Control Plane logging - CloudTrail (off by default)
- Disaster Recovery
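The VPC design and worker node items above involve some arithmetic that is easy to sketch. The example below carves six /20 subnets (3 public, 3 private) across 3 Availability Zones and estimates pod capacity per node. The VPC CIDR `10.0.0.0/16`, the AZ names, and the function names `plan_subnets` and `max_pods` are assumptions for illustration. The pod formula is the commonly documented one for the AWS VPC CNI without prefix delegation; the ENI and per-ENI address limits for a given instance type should be taken from the EC2 documentation.

```python
import ipaddress

# AWS reserves 5 addresses in every subnet
AWS_RESERVED_PER_SUBNET = 5


def plan_subnets(vpc_cidr, azs, new_prefix=20):
    """Carve one public and one private subnet per AZ out of the VPC CIDR."""
    network = ipaddress.ip_network(vpc_cidr)
    blocks = list(network.subnets(new_prefix=new_prefix))
    needed = 2 * len(azs)
    if len(blocks) < needed:
        raise ValueError(f"{vpc_cidr} cannot hold {needed} /{new_prefix} subnets")

    plan = []
    index = 0
    for tier in ("public", "private"):
        for az in azs:
            subnet = blocks[index]
            index += 1
            plan.append({
                "tier": tier,
                "az": az,
                "cidr": str(subnet),
                "usable_ips": subnet.num_addresses - AWS_RESERVED_PER_SUBNET,
            })
    return plan


def max_pods(enis, ipv4_per_eni):
    """Estimate pods per node for the AWS VPC CNI (no prefix delegation).

    Each ENI gives up its primary address to the node itself; the final +2
    accounts for pods that use host networking.
    """
    return enis * (ipv4_per_eni - 1) + 2


# Hypothetical inputs: a /16 VPC and three AZs
for subnet in plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b", "us-east-1c"]):
    print(subnet)

# An instance type with 3 ENIs and 10 IPv4 addresses per ENI
print(max_pods(3, 10))
```

Each /20 subnet provides 4091 usable addresses under these assumptions, which shows why pod-per-IP networking drives the subnet sizing recommendation.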
Security:
Includes the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies
- Apply security at all layers
- Enable traceability
- Implement a principle of least privilege
- Focus on securing your system
- AWS Shared Responsibility Model
- Automate security best practices
- Detective Controls
- Infrastructure Protection
- Data Protection
- Incident Response
- IAM
  - Root account
    - MFA
    - Not used
  - Key rotation
  - IAM role
  - Federation
- Encryption
  - At rest
  - In transit
  - Key storage
    - KMS
    - CloudHSM
    - Other
- Network / VPC
  - Security Groups
  - NACLs
  - Pen tests
  - Host-based firewalls
  - WAF
- Monitoring and Logging
  - CloudTrail
  - CloudWatch Logs
  - VPC Flow Logs
  - Third-party systems (Splunk, AppMonitoring)
Reliability:
The ability of a system to recover from infrastructure or service failures, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues
- Test recovery procedures
- Automatically recover from failure
- Scale horizontally to increase aggregate system availability
- Stop guessing capacity
- Manage change using automation
- Limits monitoring
- HA/Failover
- Autoscaling
- Monitoring
- Change management
- GitOps or Infrastructure as Code
- Chaos Testing
- Backup and recovery
- Planning for DR
- Do you have Enterprise Support?
Performance Efficiency:
The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve
- Democratize advanced technologies
- Go global in minutes
- Use serverless architectures
- Experiment more often
- Mechanical sympathy
- Instance selection
- Instance monitoring
- Autoscaling
- Database selection
- Load testing
Cost Optimization:
The ability to avoid or eliminate unneeded cost or suboptimal resources while meeting your functional requirements
- Cost-effective resources
- Matching supply with demand
- Expenditure awareness
- Optimizing over time
- Governance
- Spend monitoring
- Usage to spend monitoring
- Storage usage
- CDN
- RIs (Reserved Instances)
- Spot
- Use higher-level services
  - SQS
  - DynamoDB (DDB)
  - SNS
  - etc.
- Cleanup/decommissioning
Operational Excellence:
The ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures
- Preparation
- Operation
- Response
- What best practices for cloud operations are you using?
- How are you doing configuration management for your workload?
- How are you evolving your workload while minimizing the impact of change?
- How do you monitor your workload to ensure it is operating as expected?
- How do you respond to unplanned operational events?
- How is escalation managed when responding to unplanned operational events?