The moment you mention Kubernetes in a DevOps interview, expect a deep dive.
Here are 17 Kubernetes questions I was asked, covering architecture, troubleshooting, and real-world decision-making:
1. Your pod keeps getting stuck in CrashLoopBackOff, but logs show no errors. How would you approach debugging and resolution?
2. You have a StatefulSet deployed with persistent volumes, and one of the pods is not recreating properly after deletion. What could be the reasons, and how do you fix it without data loss?
3. Your cluster autoscaler is not scaling up even though pods are in Pending state. What would you investigate?
4. A network policy is blocking traffic between services in different namespaces. How would you design and debug the policy to allow only specific communication paths?
5. One of your microservices has to connect to an external database via a VPN inside the cluster. How would you architect this in Kubernetes with HA and security in mind?
6. You're running a multi-tenant platform on a single EKS cluster. How do you isolate workloads and ensure security, quotas, and observability for each tenant?
7. You notice the kubelet is constantly restarting on a particular node. What steps would you take to isolate the issue and ensure node stability?
8. A critical pod in production gets evicted due to node pressure. How would you prevent this from happening again, and how do QoS classes play a role?
9. You need to deploy a service that requires TCP and UDP on the same port. How would you configure this in Kubernetes using Services and Ingress?
10. An application upgrade caused downtime even though you had rolling updates configured. What advanced strategies would you apply to ensure zero-downtime deployments next time?
11. Your service mesh sidecar (e.g., Istio Envoy) is consuming more resources than the app itself. How do you analyze and optimize this setup?
12. You need to create a Kubernetes operator to automate complex application lifecycle events. How do you design the CRD and controller loop logic?
13. Multiple nodes are showing high disk IO usage due to container logs. What Kubernetes features or practices can you apply to avoid this scenario?
14. Your Kubernetes cluster's etcd performance is degrading. What are the root causes and how do you ensure etcd high availability and tuning?
15. You want to enforce that all images used in the cluster must come from a trusted internal registry. How do you implement this at the policy level?
16. You're managing multi-region deployments using a single Kubernetes control plane. What architectural considerations must you address to avoid cross-region latency and single points of failure?
17. During peak traffic, your ingress controller fails to route requests efficiently. How would you diagnose and scale ingress resources effectively under heavy load?
Answer Approach (Question 1):
- Check resource limits and requests (CPU/memory constraints)
- Examine readiness and liveness probes configuration
- Verify container startup commands and entry points
- Check for missing dependencies or configuration files
- Review environment variables and secrets
- Analyze previous container logs: kubectl logs <pod> --previous
- Use kubectl describe pod <pod> for detailed events
- Check container image compatibility and startup time requirements (see the startupProbe sketch at the end of this answer)
Key Steps:
- kubectl get pods -o wide - Check pod status and node
- kubectl logs <pod> --previous - Previous container logs
- kubectl describe pod <pod> - Event timeline
- Check resource quotas and limits
- Verify health check configurations
- Test container locally if possible
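If the container simply needs more startup time than the probes allow, the kubelet can kill it before it ever logs an error; a startupProbe gives it breathing room. A minimal sketch, with the endpoint path and port as assumptions:

containers:
- name: app
  startupProbe:
    httpGet:
      path: /healthz        # assumed health endpoint
      port: 8080            # assumed container port
    periodSeconds: 10
    failureThreshold: 30    # tolerate up to 30 x 10s = 5 minutes of startup time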
Answer Approach (Question 2):
- Understand that StatefulSets maintain pod identity and PV binding
- Check PVC status and binding to correct PV
- Verify storage class and provisioner health
- Examine node affinity and zone constraints
- Review PV reclaim policy and access modes
Key Steps:
- kubectl get pvc,pv - Check volume binding status
- kubectl describe pvc <pvc-name> - Check binding events
- Verify node has available storage and proper labels
- Check if PV is still bound to the old pod identity
- Examine storage class and provisioner logs
- Validate volume mount paths and permissions
Data Loss Prevention:
- Never delete PVCs directly unless intentional
- Use kubectl patch statefulset <name> -p '{"spec":{"replicas":0}}' to scale down safely
- Back up data before troubleshooting
- Check PV reclaim policy (should be Retain for production)
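To guard against accidental data loss while troubleshooting, the reclaim policy of the bound PV can be switched to Retain up front (the PV name is a placeholder):

kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'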
Answer Approach (Question 3):
- Check autoscaler configuration and node group limits
- Verify resource requests are specified on pending pods
- Examine node taints, tolerations, and affinity rules
- Review autoscaler logs for scaling decisions
- Check cloud provider quotas and instance availability
Investigation Steps:
- kubectl get pods --field-selector=status.phase=Pending - List pending pods
- kubectl describe pod <pending-pod> - Check scheduling constraints
- Review autoscaler logs: kubectl logs -n kube-system deployment/cluster-autoscaler
- Check node group configuration and limits
- Verify cloud provider quotas and instance types
- Examine pod resource requests and node capacity
Common Issues:
- Missing resource requests on pods
- Node group at maximum size
- Cloud provider quota limits
- Incompatible node selectors or affinity rules
- Autoscaler misconfiguration
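Because scale-up decisions are driven by the resource requests of pending pods, the first issue above is also the most common fix: give every container explicit requests. A minimal sketch (values are placeholders):

containers:
- name: app
  resources:
    requests:
      cpu: 250m        # placeholder - the autoscaler uses requests to decide whether a new node would help
      memory: 512Mi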
Answer Approach (Question 4):
- Understand default deny vs allow behavior
- Design policies using namespace and pod selectors
- Test connectivity systematically
- Use network policy testing tools
Design Strategy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-cross-namespace
  namespace: target-namespace
spec:
  podSelector:
    matchLabels:
      app: target-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: source-namespace      # requires the namespace to carry this label (or use kubernetes.io/metadata.name)
      podSelector:                    # combined with the namespaceSelector (AND): only source-app pods in source-namespace
        matchLabels:
          app: source-app
    ports:
    - protocol: TCP
      port: 8080

Debugging Steps:
- kubectl get networkpolicies --all-namespaces
- Test connectivity: kubectl exec -it <pod> -- nc -zv <target-service> <port>
- Use network policy testing tools
- Check CNI plugin compatibility
- Verify namespace labels for selectors
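Because most CNIs allow all traffic until a policy selects a pod, the default-deny behavior mentioned in the approach has to be created explicitly; the allow rule above then carves out the only permitted path. A minimal sketch:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: target-namespace
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes:
  - Ingress              # no ingress rules listed, so all inbound traffic is denied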
Answer Approach (Question 5):
- Run the VPN gateway as a Deployment or DaemonSet
- Implement connection pooling and failover
- Secure credentials with secrets and service accounts
- Design for high availability across zones
Architecture Components:
- VPN Gateway: Dedicated pods with VPN client configuration
- Database Proxy: Connection pooling (e.g., PgBouncer for PostgreSQL)
- Service Mesh: For traffic management and security
- Secrets Management: External secrets operator or Vault integration
- Network Policies: Restrict database access to authorized services
HA Considerations:
- Multiple VPN gateway replicas across zones
- Database connection failover logic
- Health checks and circuit breakers
- Persistent connections through service endpoints
Security Measures:
- Network policies for database access control
- TLS encryption for all connections
- Service account-based authentication
- Regular credential rotation
- Audit logging for database access
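One way to give applications a single stable endpoint for the external database is a ClusterIP Service in front of the VPN gateway pods, which forward traffic over the tunnel; the names, namespace, and port below are assumptions:

apiVersion: v1
kind: Service
metadata:
  name: external-db        # apps connect to external-db:5432 instead of the raw DB address
  namespace: platform
spec:
  selector:
    app: vpn-gateway       # assumed label on the VPN gateway Deployment (run 2+ replicas across zones)
  ports:
  - name: postgres
    port: 5432             # assumed PostgreSQL port, matching the PgBouncer example above
    targetPort: 5432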
Answer Approach (Question 6):
- Implement namespace-based isolation
- Use RBAC for granular permissions
- Apply resource quotas and limit ranges
- Set up network policies for traffic isolation
- Implement observability per tenant
Isolation Strategies:
- Namespace Isolation: One namespace per tenant
- RBAC: Role-based access control per tenant
- Resource Quotas: CPU, memory, storage limits
- Network Policies: Traffic segmentation
- Pod Security Standards: Security contexts and policies
Security Implementation:
- Service accounts per tenant
- OPA Gatekeeper for policy enforcement
- Image scanning and admission controllers
- Secrets management per tenant
- Audit logging and compliance
Quota Management:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "4"

Observability Per Tenant:
- Prometheus metrics with tenant labels
- Grafana dashboards per tenant
- Log aggregation with tenant filtering
- Cost allocation and chargeback
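The limit ranges mentioned in the approach complement the quota: they give per-container defaults so a single workload cannot consume the tenant's entire allocation. A sketch with placeholder values:

apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-limits
  namespace: tenant-a
spec:
  limits:
  - type: Container
    default:             # applied when a container declares no limits
      cpu: 500m
      memory: 512Mi
    defaultRequest:      # applied when a container declares no requests
      cpu: 250m
      memory: 256Mi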
Answer Approach (Question 7):
- Check kubelet logs and system resources
- Verify node configuration and certificates
- Examine container runtime health
- Review system-level issues (disk space, memory)
Investigation Steps:
- sudo systemctl status kubelet - Check service status
- sudo journalctl -u kubelet -f - View kubelet logs
- Check disk space: df -h
- Verify container runtime: sudo systemctl status containerd
- Check node resources: kubectl top node <node-name>
- Examine kubelet configuration: /var/lib/kubelet/config.yaml
Common Causes:
- Insufficient disk space
- Certificate expiration or rotation issues
- Container runtime problems
- Memory pressure on the node
- Misconfigured kubelet parameters
- Network connectivity issues
Resolution Steps:
- Clean up disk space (images, logs, temp files)
- Restart container runtime if needed
- Check and renew certificates
- Adjust kubelet configuration
- Monitor resource usage patterns
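Certificate problems are easy to confirm directly on the node; on kubeadm-provisioned nodes the kubelet client certificate is typically at the path below (adjust for your distribution):

sudo openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem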
Answer Approach (Question 8):
- Understand QoS classes: Guaranteed, Burstable, BestEffort
- Set appropriate resource requests and limits
- Implement pod disruption budgets
- Use priority classes for critical workloads
QoS Classes:
- Guaranteed: requests = limits for all containers
- Burstable: at least one container has requests < limits
- BestEffort: no requests or limits specified
Prevention Strategies:
apiVersion: v1
kind: Pod
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    resources:
      requests:            # requests < limits makes this pod Burstable; set them equal for Guaranteed QoS
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "2Gi"
        cpu: "1"

Pod Disruption Budget:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: critical-app

Monitoring and Alerting:
- Set up node resource monitoring
- Alert on memory/disk pressure
- Monitor eviction events
- Track QoS class distribution
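The high-priority class referenced in the pod spec above is not built in and has to be created once per cluster; a minimal sketch:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000              # pods with higher values are preempted and evicted last
globalDefault: false
description: "Critical production workloads that must not be evicted under node pressure."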
Answer Approach (Question 9):
- Use multiple Service definitions
- Configure load balancer annotations
- Implement separate ingress rules
- Consider service mesh solutions
Service Configuration:
apiVersion: v1
kind: Service
metadata:
  name: app-tcp
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    protocol: TCP
  selector:
    app: myapp
---
apiVersion: v1
kind: Service
metadata:
  name: app-udp
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    protocol: UDP
  selector:
    app: myapp

Ingress Considerations:
- Most ingress controllers only support HTTP/HTTPS (TCP)
- Use TCP/UDP ingress controllers for Layer 4 traffic
- Consider service mesh for advanced traffic management
Alternative Solutions:
- Use different ports for different protocols
- Implement application-level protocol switching
- Use service mesh with traffic splitting capabilities
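On newer clusters, mixed-protocol LoadBalancer Services (stable since Kubernetes 1.26) let a single Service expose TCP and UDP on the same port, provided the cloud load balancer supports it; a sketch:

apiVersion: v1
kind: Service
metadata:
  name: app-mixed
spec:
  type: LoadBalancer
  selector:
    app: myapp
  ports:
  - name: app-tcp          # each port entry needs a distinct name
    port: 8080
    protocol: TCP
  - name: app-udp
    port: 8080
    protocol: UDP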
Answer Approach (Question 10):
- Implement blue-green deployments
- Use canary deployments with traffic splitting
- Configure proper readiness probes
- Implement pre-stop hooks and graceful shutdowns
Advanced Strategies:
- Blue-Green Deployment: Complete environment switch
- Canary Deployment: Gradual traffic shifting
- Rolling Updates with Traffic Management: Service mesh control
- Feature Flags: Application-level traffic control
Configuration Example:
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    spec:
      containers:
      - name: app
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep", "15"]

Best Practices:
- Never set maxUnavailable to 100%
- Implement comprehensive health checks
- Use graceful shutdown procedures
- Test deployment strategies in staging
- Monitor deployment metrics and rollback triggers
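For the blue-green option above, one lightweight pattern is two Deployments labeled by version with a Service whose selector is flipped once the new version passes its checks; labels and names are assumptions:

apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  selector:
    app: myapp
    version: blue        # cut over by patching this to "green", e.g.
                         # kubectl patch svc app -p '{"spec":{"selector":{"app":"myapp","version":"green"}}}'
  ports:
  - port: 80
    targetPort: 8080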
Answer Approach (Question 11):
- Analyze sidecar resource usage patterns
- Tune sidecar configuration parameters
- Implement resource limits and requests
- Consider sidecar-less architectures for specific workloads
Analysis Steps:
- Monitor sidecar CPU and memory usage
- Analyze traffic patterns and connection counts
- Review sidecar configuration and feature usage
- Check for memory leaks or connection pooling issues
- Benchmark different configuration options
Optimization Techniques:
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-proxy-config
data:
  ProxyStatsMatcher: |
    inclusionRegexps:
    - ".*circuit_breakers.*"
    - ".*upstream_rq_retry.*"
    exclusionRegexps:
    - ".*osconfig.*"

Resource Tuning:
- Adjust proxy concurrency settings
- Tune memory limits for Envoy
- Configure connection pool settings
- Disable unused features and protocols
- Optimize stats collection
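Istio also exposes per-workload annotations for this kind of tuning, so the sidecar's requests, limits, and worker-thread count can be right-sized per Deployment instead of mesh-wide; values below are placeholders:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/proxyCPU: "100m"           # request for the Envoy sidecar
        sidecar.istio.io/proxyMemory: "128Mi"
        sidecar.istio.io/proxyCPULimit: "500m"
        sidecar.istio.io/proxyMemoryLimit: "256Mi"
        proxy.istio.io/config: |
          concurrency: 2                            # Envoy worker threads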
Alternative Approaches:
- Ambient mesh for reduced resource overhead
- Selective sidecar injection
- Service mesh bypass for internal services
- Direct service-to-service communication for low-latency requirements
Answer Approach (Question 12):
- Design Custom Resource Definitions (CRDs)
- Implement controller reconciliation logic
- Handle error scenarios and retries
- Implement proper status reporting and events
CRD Design:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: myapps.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: myapps           # must match the resource part of metadata.name
    singular: myapp
    kind: MyApp
    listKind: MyAppList
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas:
                type: integer
              image:
                type: string
          status:
            type: object
            properties:
              phase:
                type: string

Controller Logic:
- Watch: Monitor CRD changes
- Reconcile: Compare desired vs actual state
- Update: Make necessary changes
- Status: Report current state
- Retry: Handle failures gracefully
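For reference, a resource instance that this controller loop would reconcile, matching the schema above (the name and image are illustrative):

apiVersion: example.com/v1
kind: MyApp
metadata:
  name: sample-app
spec:
  replicas: 3
  image: registry.internal/sample-app:1.2.0    # placeholder image reference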
Best Practices:
- Use controller-runtime framework
- Implement proper logging and metrics
- Handle cascading deletions with finalizers
- Use webhooks for validation and mutation
- Test with various failure scenarios
Answer Approach (Question 13):
- Implement log rotation and retention policies
- Use centralized logging solutions
- Configure log drivers and formatters
- Set up log level management
Solutions:
- Log Rotation: Configure kubelet log rotation
- Centralized Logging: Fluentd, Fluent Bit, or similar
- Structured Logging: JSON format for efficient parsing
- Log Sampling: Reduce verbose logging in production
- Storage Optimization: Use faster storage for log volumes
Configuration Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Daemon        Off
        Log_Level     info
    [INPUT]
        Name          tail
        Path          /var/log/containers/*.log
        Parser        cri
        Tag           kube.*
        Mem_Buf_Limit 5MB

Kubelet Configuration:
- Set containerLogMaxSize and containerLogMaxFiles
- Configure log rotation frequency
- Use log drivers that support compression
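Those settings live in the KubeletConfiguration file (commonly /var/lib/kubelet/config.yaml); a sketch with example values:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi     # rotate a container's log once it reaches 10Mi
containerLogMaxFiles: 5       # keep at most 5 rotated files per container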
Monitoring and Alerting:
- Monitor disk usage on nodes
- Alert on high IO wait times
- Track log volume growth rates
- Set up automated cleanup procedures
Answer Approach (Question 14):
- Identify etcd performance bottlenecks
- Implement proper backup and restore procedures
- Configure etcd for high availability
- Monitor etcd metrics and health
Performance Optimization:
- Storage: Use fast SSD storage with low latency
- Network: Ensure low-latency network between etcd nodes
- Resource Allocation: Adequate CPU and memory
- Tuning Parameters: Adjust heartbeat and election timeouts
- Compaction: Regular database compaction
HA Configuration:
- Odd number of etcd nodes (3, 5, 7)
- Geographic distribution across availability zones
- Load balancing for etcd clients
- Automated backup and disaster recovery
Monitoring Metrics:
# Key metrics to monitor
- etcd_server_has_leader
- etcd_server_leader_changes_seen_total
- etcd_disk_wal_fsync_duration_seconds
- etcd_network_peer_round_trip_time_seconds
- etcd_mvcc_db_total_size_in_bytes

Backup Strategy:
- Automated snapshot backups
- Cross-region backup replication
- Regular restore testing
- Point-in-time recovery procedures
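An automated snapshot can be as simple as a scheduled etcdctl call; the endpoint and certificate paths below are the usual kubeadm defaults and should be adjusted for your cluster:

ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key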
Answer Approach (Question 15):
- Implement admission controllers for image validation
- Use OPA Gatekeeper or similar policy engines
- Configure image signing and verification
- Set up automated vulnerability scanning
Implementation Methods:
- OPA Gatekeeper: Policy-based admission control
- Admission Webhooks: Custom validation logic
- Pod Security Standards: Built-in security policies
- Image Policy Webhook: Image-specific validation
OPA Gatekeeper Example:
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: allowedregistries
spec:
  crd:
    spec:
      names:
        kind: AllowedRegistries
      validation:
        openAPIV3Schema:
          type: object
          properties:
            registries:
              type: array
              items:
                type: string
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package allowedregistries

      violation[{"msg": msg}] {
        container := input.review.object.spec.containers[_]
        not image_allowed(container.image)
        msg := "Container image not from approved registry"
      }

      # helper: true when the image starts with any approved registry prefix
      image_allowed(image) {
        startswith(image, input.parameters.registries[_])
      }

Security Enhancements:
- Image signature verification with Cosign
- Vulnerability scanning with Trivy or similar
- Runtime security monitoring
- Supply chain security with SLSA framework
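The template above only defines the policy type; it is enforced by creating a Constraint instance that supplies the approved registry prefixes (the registry hostname is a placeholder):

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: AllowedRegistries
metadata:
  name: require-internal-registry
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    registries:
    - "registry.internal.example.com/"    # placeholder internal registry prefix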
Answer Approach (Question 16):
- Design for cross-region latency tolerance
- Implement proper data replication strategies
- Plan for disaster recovery scenarios
- Consider federation or multi-cluster alternatives
Architectural Considerations:
- Network Latency: Impact on etcd performance and API response times
- Data Replication: Cross-region etcd cluster with proper quorum
- Failure Scenarios: Region isolation and split-brain prevention
- Workload Placement: Node affinity and topology constraints
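For the workload-placement point, topology spread constraints keep replicas balanced across zones or regions without pinning them to specific nodes; a sketch using the standard zone label (swap in a region label for cross-region spread):

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: myapp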
Alternative Architectures:
- Cluster Federation: Multiple clusters with centralized management
- Multi-Cluster Service Mesh: Cross-cluster service discovery
- GitOps Multi-Cluster: Centralized configuration management
- Hierarchical Namespaces: Administrative domain separation
Best Practices:
- Implement circuit breakers for cross-region calls
- Use local caching where possible
- Design applications for eventual consistency
- Plan for region-specific compliance requirements
Disaster Recovery:
- Automated backup and restore procedures
- Cross-region data replication
- Runbook for disaster scenarios
- Regular disaster recovery testing
Learning Resources:
- Multi-Cluster Kubernetes Patterns
- Admiralty Multi-Cluster Scheduler
- Cluster API for Multi-Cluster Management
Answer Approach (Question 17):
- Analyze ingress controller resource utilization
- Implement horizontal and vertical scaling
- Optimize configuration for high throughput
- Use multiple ingress controllers with traffic distribution
Performance Analysis:
- Metrics Collection: Monitor request rate, latency, error rate
- Resource Usage: CPU, memory, network utilization
- Connection Patterns: Keep-alive, connection pooling
- Backend Health: Upstream service performance
Scaling Strategies:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress-controller
spec:
  replicas: 5
  template:
    spec:
      containers:
      - name: nginx-ingress-controller
        resources:
          requests:
            cpu: 1000m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 2Gi

Configuration Optimization:
- Tune worker processes and connections
- Configure appropriate buffer sizes
- Implement connection keep-alive
- Use efficient load balancing algorithms
- Enable HTTP/2 and compression
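Instead of a fixed replica count, the controller Deployment above can also scale automatically with an HPA driven by CPU (or by request-rate custom metrics); a minimal sketch:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress-controller
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-ingress-controller
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70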
Advanced Solutions:
- Multiple ingress controller classes
- Geographic traffic distribution
- CDN integration for static content
- Service mesh for advanced traffic management
Monitoring and Alerting:
- Request rate and latency metrics
- Error rate and status code distribution
- Resource utilization alerts
- Capacity planning based on traffic patterns
Remember to practice these scenarios in a lab environment and understand the underlying concepts rather than just memorizing commands.