https://www.linkedin.com/posts/akhilesh-mishra-0ab886124_the-moment-you-mention-%F0%9D%97%9E%F0%9D%98%82%F0%9D%97%AF%F0%9D%97%B2%F0%9D%97%BF%F0%9D%97%BB%F0%9D%97%B2%F0%9D%98%81%F0%9D%97%B2-activity-7343950304729583616-u0Zm/

The moment you mention 𝗞𝘂𝗯𝗲𝗿𝗻𝗲𝘁𝗲𝘀 in a DevOps interview, expect a deep dive.

Here are 17 Kubernetes questions I was asked that dive into architecture, troubleshooting, and real-world decision-making:

  1. Your pod keeps getting stuck in CrashLoopBackOff, but logs show no errors. How would you approach debugging and resolution?

  2. You have a StatefulSet deployed with persistent volumes, and one of the pods is not recreating properly after deletion. What could be the reasons, and how do you fix it without data loss?

  3. Your cluster autoscaler is not scaling up even though pods are in Pending state. What would you investigate?

  4. A network policy is blocking traffic between services in different namespaces. How would you design and debug the policy to allow only specific communication paths?

  5. One of your microservices has to connect to an external database via a VPN inside the cluster. How would you architect this in Kubernetes with HA and security in mind?

  6. You're running a multi-tenant platform on a single EKS cluster. How do you isolate workloads and ensure security, quotas, and observability for each tenant?

  7. You notice the kubelet is constantly restarting on a particular node. What steps would you take to isolate the issue and ensure node stability?

  8. A critical pod in production gets evicted due to node pressure. How would you prevent this from happening again, and how do QoS classes play a role?

  9. You need to deploy a service that requires TCP and UDP on the same port. How would you configure this in Kubernetes using Services and Ingress?

  10. An application upgrade caused downtime even though you had rolling updates configured. What advanced strategies would you apply to ensure zero-downtime deployments next time?

  11. Your service mesh sidecar (e.g., Istio Envoy) is consuming more resources than the app itself. How do you analyze and optimize this setup?

  12. You need to create a Kubernetes operator to automate complex application lifecycle events. How do you design the CRD and controller loop logic?

  13. Multiple nodes are showing high disk IO usage due to container logs. What Kubernetes features or practices can you apply to avoid this scenario?

  14. Your Kubernetes cluster's etcd performance is degrading. What are the root causes and how do you ensure etcd high availability and tuning?

  15. You want to enforce that all images used in the cluster must come from a trusted internal registry. How do you implement this at the policy level?

  16. You're managing multi-region deployments using a single Kubernetes control plane. What architectural considerations must you address to avoid cross-region latency and single points of failure?

  17. During peak traffic, your ingress controller fails to route requests efficiently. How would you diagnose and scale ingress resources effectively under heavy load?

Answers To Learn From

1. CrashLoopBackOff Debugging (No Errors in Logs)

Answer Approach:

  • Check resource limits and requests (CPU/memory constraints)
  • Examine readiness and liveness probes configuration
  • Verify container startup commands and entry points
  • Check for missing dependencies or configuration files
  • Review environment variables and secrets
  • Analyze previous container logs: kubectl logs <pod> --previous
  • Use kubectl describe pod <pod> for detailed events
  • Check container image compatibility and startup time requirements

Key Steps:

  1. kubectl get pods -o wide - Check pod status and node
  2. kubectl logs <pod> --previous - Previous container logs
  3. kubectl describe pod <pod> - Event timeline
  4. Check resource quotas and limits
  5. Verify health check configurations
  6. Test container locally if possible
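
A quick triage sequence for the above (a sketch; pod and container names are placeholders):

# Inspect status, restart count, and the previous container's exit code
kubectl get pod <pod> -o wide
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Logs from the crashed container and the event timeline
kubectl logs <pod> --previous
kubectl describe pod <pod>

# If the image lacks a shell, attach an ephemeral debug container (Kubernetes 1.23+)
kubectl debug -it <pod> --image=busybox --target=<container>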

2. StatefulSet Pod Recreation Issues with Persistent Volumes

Answer Approach:

  • Understand that StatefulSets maintain pod identity and PV binding
  • Check PVC status and binding to correct PV
  • Verify storage class and provisioner health
  • Examine node affinity and zone constraints
  • Review PV reclaim policy and access modes

Key Steps:

  1. kubectl get pvc,pv - Check volume binding status
  2. kubectl describe pvc <pvc-name> - Check binding events
  3. Verify node has available storage and proper labels
  4. Check if PV is still bound to the old pod identity
  5. Examine storage class and provisioner logs
  6. Validate volume mount paths and permissions

Data Loss Prevention:

  • Never delete PVCs directly unless intentional
  • Use kubectl patch statefulset <name> -p '{"spec":{"replicas":0}}' to scale down safely
  • Backup data before troubleshooting
  • Check PV reclaim policy (should be Retain for production)
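
For example, before troubleshooting you can make sure the underlying volume survives even if its claim is deleted, then scale down safely (a sketch; names are placeholders):

# Keep the volume if its claim is ever deleted
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# Scale the StatefulSet to zero before deeper troubleshooting
kubectl scale statefulset <name> --replicas=0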

3. Cluster Autoscaler Not Scaling Despite Pending Pods

Answer Approach:

  • Check autoscaler configuration and node group limits
  • Verify resource requests are specified on pending pods
  • Examine node taints, tolerations, and affinity rules
  • Review autoscaler logs for scaling decisions
  • Check cloud provider quotas and instance availability

Investigation Steps:

  1. kubectl get pods --field-selector=status.phase=Pending - List pending pods
  2. kubectl describe pod <pending-pod> - Check scheduling constraints
  3. Review autoscaler logs: kubectl logs -n kube-system deployment/cluster-autoscaler
  4. Check node group configuration and limits
  5. Verify cloud provider quotas and instance types
  6. Examine pod resource requests and node capacity
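
With default settings the cluster autoscaler also publishes a status ConfigMap that summarizes its scaling decisions (a sketch; names can differ between installations):

# Per-node-group health and recent scale-up activity
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml

# Look for "NotTriggerScaleUp" events on the pending pod
kubectl describe pod <pending-pod> | grep -A 5 Events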

Common Issues:

  • Missing resource requests on pods
  • Node group at maximum size
  • Cloud provider quota limits
  • Incompatible node selectors or affinity rules
  • Autoscaler misconfiguration

4. Network Policy Cross-Namespace Communication

Answer Approach:

  • Understand default deny vs allow behavior
  • Design policies using namespace and pod selectors
  • Test connectivity systematically
  • Use network policy testing tools

Design Strategy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-cross-namespace
  namespace: target-namespace
spec:
  podSelector:
    matchLabels:
      app: target-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    # namespaceSelector and podSelector in the same "from" entry are ANDed;
    # as separate entries they would be ORed, allowing traffic from either source
    - namespaceSelector:
        matchLabels:
          # kubernetes.io/metadata.name is set automatically on every namespace (Kubernetes 1.22+)
          kubernetes.io/metadata.name: source-namespace
      podSelector:
        matchLabels:
          app: source-app
    ports:
    - protocol: TCP
      port: 8080

Debugging Steps:

  1. kubectl get networkpolicies --all-namespaces
  2. Test connectivity: kubectl exec -it <pod> -- nc -zv <target-service> <port>
  3. Use network policy testing tools
  4. Check CNI plugin compatibility
  5. Verify namespace labels for selectors

5. External Database via VPN with HA and Security

Answer Approach:

  • Run the VPN gateway as a Deployment or DaemonSet inside the cluster
  • Implement connection pooling and failover
  • Secure credentials with secrets and service accounts
  • Design for high availability across zones

Architecture Components:

  1. VPN Gateway: Dedicated pods with VPN client configuration
  2. Database Proxy: Connection pooling (e.g., PgBouncer for PostgreSQL)
  3. Service Mesh: For traffic management and security
  4. Secrets Management: External secrets operator or Vault integration
  5. Network Policies: Restrict database access to authorized services
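
One common way to give in-cluster services a stable name for the external database (reached over the VPN gateway) is a selector-less Service backed by a manually managed Endpoints object; a minimal sketch, with the address and port as placeholders:

apiVersion: v1
kind: Service
metadata:
  name: external-postgres
spec:
  ports:
  - port: 5432
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-postgres   # must match the Service name
subsets:
- addresses:
  - ip: 10.20.30.40         # database address reachable over the VPN
  ports:
  - port: 5432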

HA Considerations:

  • Multiple VPN gateway replicas across zones
  • Database connection failover logic
  • Health checks and circuit breakers
  • Persistent connections through service endpoints

Security Measures:

  • Network policies for database access control
  • TLS encryption for all connections
  • Service account-based authentication
  • Regular credential rotation
  • Audit logging for database access

6. Multi-Tenant EKS Platform

Answer Approach:

  • Implement namespace-based isolation
  • Use RBAC for granular permissions
  • Apply resource quotas and limit ranges
  • Set up network policies for traffic isolation
  • Implement observability per tenant

Isolation Strategies:

  1. Namespace Isolation: One namespace per tenant
  2. RBAC: Role-based access control per tenant
  3. Resource Quotas: CPU, memory, storage limits
  4. Network Policies: Traffic segmentation
  5. Pod Security Standards: Security contexts and policies
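
For instance, each tenant's access can be scoped with a namespaced Role and RoleBinding (a sketch; group and namespace names are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-a-developer
  namespace: tenant-a
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "deployments", "services", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-developer-binding
  namespace: tenant-a
subjects:
- kind: Group
  name: tenant-a-devs        # mapped from your identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-a-developer
  apiGroup: rbac.authorization.k8s.io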

Security Implementation:

  • Service accounts per tenant
  • OPA Gatekeeper for policy enforcement
  • Image scanning and admission controllers
  • Secrets management per tenant
  • Audit logging and compliance

Quota Management:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "4"

Observability Per Tenant:

  • Prometheus metrics with tenant labels
  • Grafana dashboards per tenant
  • Log aggregation with tenant filtering
  • Cost allocation and chargeback

7. Kubelet Constantly Restarting

Answer Approach:

  • Check kubelet logs and system resources
  • Verify node configuration and certificates
  • Examine container runtime health
  • Review system-level issues (disk space, memory)

Investigation Steps:

  1. sudo systemctl status kubelet - Check service status
  2. sudo journalctl -u kubelet -f - View kubelet logs
  3. Check disk space: df -h
  4. Verify container runtime: sudo systemctl status containerd
  5. Check node resources: kubectl top node <node-name>
  6. Examine kubelet configuration: /var/lib/kubelet/config.yaml

Common Causes:

  • Insufficient disk space
  • Certificate expiration or rotation issues
  • Container runtime problems
  • Memory pressure on the node
  • Misconfigured kubelet parameters
  • Network connectivity issues

Resolution Steps:

  • Clean up disk space (images, logs, temp files)
  • Restart container runtime if needed
  • Check and renew certificates
  • Adjust kubelet configuration
  • Monitor resource usage patterns
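
A few node-level commands that often help (a sketch; the kubeadm command applies only to kubeadm-managed clusters):

# Reclaim disk space from unused images and old journal logs
sudo crictl rmi --prune
sudo journalctl --vacuum-time=2d

# Check kubelet and client certificate expiry on kubeadm clusters
sudo kubeadm certs check-expiration

# Restart the runtime and kubelet after remediation
sudo systemctl restart containerd kubelet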

8. Pod Eviction Due to Node Pressure and QoS Classes

Answer Approach:

  • Understand QoS classes: Guaranteed, Burstable, BestEffort
  • Set appropriate resource requests and limits
  • Implement pod disruption budgets
  • Use priority classes for critical workloads

QoS Classes:

  1. Guaranteed: every container sets CPU and memory requests equal to its limits
  2. Burstable: at least one container sets a CPU or memory request or limit, but the pod does not meet the Guaranteed criteria
  3. BestEffort: no requests or limits specified

Prevention Strategies:

apiVersion: v1
kind: Pod
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    resources:
      requests:
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "2Gi"
        cpu: "1"

Pod Disruption Budget:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: critical-app

Monitoring and Alerting:

  • Set up node resource monitoring
  • Alert on memory/disk pressure
  • Monitor eviction events
  • Track QoS class distribution

9. TCP and UDP on Same Port

Answer Approach:

  • Use multiple Service definitions
  • Configure load balancer annotations
  • Implement separate ingress rules
  • Consider service mesh solutions

Service Configuration:

apiVersion: v1
kind: Service
metadata:
  name: app-tcp
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    protocol: TCP
  selector:
    app: myapp
---
apiVersion: v1
kind: Service
metadata:
  name: app-udp
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    protocol: UDP
  selector:
    app: myapp

Ingress Considerations:

  • Most ingress controllers handle HTTP/HTTPS only; Layer 4 TCP/UDP traffic needs extra configuration
  • Use TCP/UDP ingress controllers for Layer 4 traffic
  • Consider service mesh for advanced traffic management
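
With ingress-nginx, for example, Layer 4 traffic is exposed through the controller's tcp-services and udp-services ConfigMaps (a sketch; namespace and service names are placeholders, and the controller must be started with the matching --tcp-services-configmap and --udp-services-configmap flags):

apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  "8080": "default/myapp:8080"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: udp-services
  namespace: ingress-nginx
data:
  "8080": "default/myapp:8080"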

Alternative Solutions:

  • Use different ports for different protocols
  • Implement application-level protocol switching
  • Use service mesh with traffic splitting capabilities

10. Zero-Downtime Deployment Strategies

Answer Approach:

  • Implement blue-green deployments
  • Use canary deployments with traffic splitting
  • Configure proper readiness probes
  • Implement pre-stop hooks and graceful shutdowns

Advanced Strategies:

  1. Blue-Green Deployment: Complete environment switch
  2. Canary Deployment: Gradual traffic shifting
  3. Rolling Updates with Traffic Management: Service mesh control
  4. Feature Flags: Application-level traffic control

Configuration Example:

apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    spec:
      containers:
      - name: app
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep", "15"]

Best Practices:

  • Keep maxUnavailable at 0 (as above) so old pods are removed only after new ones are Ready
  • Implement comprehensive health checks
  • Use graceful shutdown procedures
  • Test deployment strategies in staging
  • Monitor deployment metrics and rollback triggers

11. Service Mesh Sidecar Resource Optimization

Answer Approach:

  • Analyze sidecar resource usage patterns
  • Tune sidecar configuration parameters
  • Implement resource limits and requests
  • Consider sidecar-less architectures for specific workloads

Analysis Steps:

  1. Monitor sidecar CPU and memory usage
  2. Analyze traffic patterns and connection counts
  3. Review sidecar configuration and feature usage
  4. Check for memory leaks or connection pooling issues
  5. Benchmark different configuration options

Optimization Techniques:

apiVersion: v1
kind: ConfigMap
metadata:
  name: istio
  namespace: istio-system
data:
  mesh: |
    defaultConfig:
      # limit which extra Envoy stats sidecars collect; more stats means more proxy memory
      proxyStatsMatcher:
        inclusionRegexps:
        - ".*circuit_breakers.*"
        - ".*upstream_rq_retry.*"

Resource Tuning:

  • Adjust proxy concurrency settings
  • Tune memory limits for Envoy
  • Configure connection pool settings
  • Disable unused features and protocols
  • Optimize stats collection
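
In Istio these limits can also be set per workload through pod annotations rather than mesh-wide (a sketch; values are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        # per-pod sidecar resource requests and limits
        sidecar.istio.io/proxyCPU: "100m"
        sidecar.istio.io/proxyMemory: "128Mi"
        sidecar.istio.io/proxyCPULimit: "500m"
        sidecar.istio.io/proxyMemoryLimit: "256Mi"
        # per-pod proxy configuration, e.g. worker thread count
        proxy.istio.io/config: |
          concurrency: 2
    spec:
      containers:
      - name: app
        image: myapp:latest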

Alternative Approaches:

  • Ambient mesh for reduced resource overhead
  • Selective sidecar injection
  • Service mesh bypass for internal services
  • Direct service-to-service communication for low-latency requirements

12. Kubernetes Operator Development

Answer Approach:

  • Design Custom Resource Definitions (CRDs)
  • Implement controller reconciliation logic
  • Handle error scenarios and retries
  • Implement proper status reporting and events

CRD Design:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: myapps.example.com
spec:
  group: example.com
  names:
    plural: myapps
    singular: myapp
    kind: MyApp
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas:
                type: integer
              image:
                type: string
          status:
            type: object
            properties:
              phase:
                type: string

Controller Logic:

  1. Watch: Monitor CRD changes
  2. Reconcile: Compare desired vs actual state
  3. Update: Make necessary changes
  4. Status: Report current state
  5. Retry: Handle failures gracefully
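
In practice this loop is usually scaffolded with Kubebuilder and controller-runtime rather than written from scratch (a sketch; the group, version, and kind are illustrative):

# Scaffold an operator project and a MyApp API/controller pair
kubebuilder init --domain example.com --repo example.com/myapp-operator
kubebuilder create api --group apps --version v1 --kind MyApp

# Regenerate CRD manifests and install them into the cluster
make manifests
make install

# Run the controller locally against the current kubeconfig
make run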

Best Practices:

  • Use controller-runtime framework
  • Implement proper logging and metrics
  • Handle cascading deletions with finalizers
  • Use webhooks for validation and mutation
  • Test with various failure scenarios

13. High Disk IO from Container Logs

Answer Approach:

  • Implement log rotation and retention policies
  • Use centralized logging solutions
  • Configure log drivers and formatters
  • Set up log level management

Solutions:

  1. Log Rotation: Configure kubelet log rotation
  2. Centralized Logging: Fluentd, Fluent Bit, or similar
  3. Structured Logging: JSON format for efficient parsing
  4. Log Sampling: Reduce verbose logging in production
  5. Storage Optimization: Use faster storage for log volumes

Configuration Example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        1
        Daemon       Off
        Log_Level    info
    [INPUT]
        Name         tail
        Path         /var/log/containers/*.log
        Parser       cri
        Tag          kube.*
        Mem_Buf_Limit 5MB

Kubelet Configuration:

  • Set containerLogMaxSize and containerLogMaxFiles
  • Configure log rotation frequency
  • Use log drivers that support compression
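
These settings live in the kubelet's configuration file (a sketch; sizes are illustrative):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# rotate a container's log once it reaches 10Mi, keeping at most 5 files
containerLogMaxSize: 10Mi
containerLogMaxFiles: 5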

Monitoring and Alerting:

  • Monitor disk usage on nodes
  • Alert on high IO wait times
  • Track log volume growth rates
  • Set up automated cleanup procedures

14. etcd Performance and High Availability

Answer Approach:

  • Identify etcd performance bottlenecks
  • Implement proper backup and restore procedures
  • Configure etcd for high availability
  • Monitor etcd metrics and health

Performance Optimization:

  1. Storage: Use fast SSD storage with low latency
  2. Network: Ensure low-latency network between etcd nodes
  3. Resource Allocation: Adequate CPU and memory
  4. Tuning Parameters: Adjust heartbeat and election timeouts
  5. Compaction: Regular database compaction

HA Configuration:

  • Odd number of etcd nodes (3, 5, 7)
  • Geographic distribution across availability zones
  • Load balancing for etcd clients
  • Automated backup and disaster recovery

Monitoring Metrics:

# Key metrics to monitor
- etcd_server_has_leader
- etcd_server_leader_changes_seen_total
- etcd_disk_wal_fsync_duration_seconds
- etcd_network_peer_round_trip_time_seconds
- etcd_mvcc_db_total_size_in_bytes

Backup Strategy:

  • Automated snapshot backups
  • Cross-region backup replication
  • Regular restore testing
  • Point-in-time recovery procedures
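
Snapshot backups are typically scripted with etcdctl (a sketch; endpoints and certificate paths depend on how the control plane is deployed):

# Take a snapshot of the etcd keyspace
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot before shipping it off-cluster
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-$(date +%F).db --write-out=table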

15. Trusted Image Registry Policy Enforcement

Answer Approach:

  • Implement admission controllers for image validation
  • Use OPA Gatekeeper or similar policy engines
  • Configure image signing and verification
  • Set up automated vulnerability scanning

Implementation Methods:

  1. OPA Gatekeeper: Policy-based admission control
  2. Admission Webhooks: Custom validation logic
  3. Pod Security Standards: Built-in security policies
  4. Image Policy Webhook: Image-specific validation

OPA Gatekeeper Example:

apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: allowedregistries
spec:
  crd:
    spec:
      names:
        kind: AllowedRegistries
      validation:
        openAPIV3Schema:
          type: object
          properties:
            registries:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package allowedregistries

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not image_allowed(container.image)
          msg := sprintf("container image %v is not from an approved registry", [container.image])
        }

        # true when the image starts with one of the approved registry prefixes
        image_allowed(image) {
          startswith(image, input.parameters.registries[_])
        }
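
The template is then instantiated by a Constraint that lists the approved registries (a sketch; the registry hostname is a placeholder):

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: AllowedRegistries
metadata:
  name: only-internal-registry
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    registries:
    - "registry.internal.example.com/"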

Security Enhancements:

  • Image signature verification with Cosign
  • Vulnerability scanning with Trivy or similar
  • Runtime security monitoring
  • Supply chain security with SLSA framework

16. Multi-Region Single Control Plane Architecture

Answer Approach:

  • Design for cross-region latency tolerance
  • Implement proper data replication strategies
  • Plan for disaster recovery scenarios
  • Consider federation or multi-cluster alternatives

Architectural Considerations:

  1. Network Latency: Impact on etcd performance and API response times
  2. Data Replication: Cross-region etcd cluster with proper quorum
  3. Failure Scenarios: Region isolation and split-brain prevention
  4. Workload Placement: Node affinity and topology constraints
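
Workload placement across zones and regions can be expressed declaratively, for example with topology spread constraints (a sketch; the label key is the standard topology label):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: regional-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: regional-app
  template:
    metadata:
      labels:
        app: regional-app
    spec:
      topologySpreadConstraints:
      # spread replicas evenly across zones; refuse to schedule if skew would exceed 1
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: regional-app
      containers:
      - name: app
        image: regional-app:latest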

Alternative Architectures:

  • Cluster Federation: Multiple clusters with centralized management
  • Multi-Cluster Service Mesh: Cross-cluster service discovery
  • GitOps Multi-Cluster: Centralized configuration management
  • Hierarchical Namespaces: Administrative domain separation

Best Practices:

  • Implement circuit breakers for cross-region calls
  • Use local caching where possible
  • Design applications for eventual consistency
  • Plan for region-specific compliance requirements

Disaster Recovery:

  • Automated backup and restore procedures
  • Cross-region data replication
  • Runbook for disaster scenarios
  • Regular disaster recovery testing

17. Ingress Controller Performance Under Heavy Load

Answer Approach:

  • Analyze ingress controller resource utilization
  • Implement horizontal and vertical scaling
  • Optimize configuration for high throughput
  • Use multiple ingress controllers with traffic distribution

Performance Analysis:

  1. Metrics Collection: Monitor request rate, latency, error rate
  2. Resource Usage: CPU, memory, network utilization
  3. Connection Patterns: Keep-alive, connection pooling
  4. Backend Health: Upstream service performance

Scaling Strategies:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress-controller
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx-ingress-controller
  template:
    metadata:
      labels:
        app: nginx-ingress-controller
    spec:
      containers:
      - name: nginx-ingress-controller
        # controller image and args omitted for brevity
        resources:
          requests:
            cpu: 1000m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 2Gi
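
Horizontal scaling can also be automated with an HPA on top of the controller Deployment (a sketch, assuming metrics-server is available):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress-controller
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-ingress-controller
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70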

Configuration Optimization:

  • Tune worker processes and connections
  • Configure appropriate buffer sizes
  • Implement connection keep-alive
  • Use efficient load balancing algorithms
  • Enable HTTP/2 and compression

Advanced Solutions:

  • Multiple ingress controller classes
  • Geographic traffic distribution
  • CDN integration for static content
  • Service mesh for advanced traffic management

Monitoring and Alerting:

  • Request rate and latency metrics
  • Error rate and status code distribution
  • Resource utilization alerts
  • Capacity planning based on traffic patterns

Remember to practice these scenarios in a lab environment and understand the underlying concepts rather than just memorizing commands.
