The moment you mention Kubernetes in a DevOps interview, expect a deep dive.
Here are 17 Kubernetes questions I was asked, covering architecture, troubleshooting, and real-world decision-making:
1. Your pod keeps getting stuck in CrashLoopBackOff, but logs show no errors. How would you approach debugging and resolution?
2. You have a StatefulSet deployed with persistent volumes, and one of the pods is not recreating properly after deletion. What could be the reasons, and how do you fix it without data loss?
3. Your cluster autoscaler is not scaling up even though pods are in Pending state. What would you investigate?
4. A network policy is blocking traffic between services in different namespaces. How would you design and debug the policy to allow only specific communication paths?
5. One of your microservices has to connect to an external database via a VPN inside the cluster. How would you architect this in Kubernetes with HA and security in mind?
6. You're running a multi-tenant platform on a single EKS cluster. How do you isolate workloads and ensure security, quotas, and observability for each tenant?
7. You notice the kubelet is constantly restarting on a particular node. What steps would you take to isolate the issue and ensure node stability?
8. A critical pod in production gets evicted due to node pressure. How would you prevent this from happening again, and how do QoS classes play a role?
9. You need to deploy a service that requires TCP and UDP on the same port. How would you configure this in Kubernetes using Services and Ingress?
10. An application upgrade caused downtime even though you had rolling updates configured. What advanced strategies would you apply to ensure zero-downtime deployments next time?
11. Your service mesh sidecar (e.g., Istio Envoy) is consuming more resources than the app itself. How do you analyze and optimize this setup?
12. You need to create a Kubernetes operator to automate complex application lifecycle events. How do you design the CRD and controller loop logic?
13. Multiple nodes are showing high disk IO usage due to container logs. What Kubernetes features or practices can you apply to avoid this scenario?
14. Your Kubernetes cluster's etcd performance is degrading. What are the root causes and how do you ensure etcd high availability and tuning?
15. You want to enforce that all images used in the cluster must come from a trusted internal registry. How do you implement this at the policy level?
16. You're managing multi-region deployments using a single Kubernetes control plane. What architectural considerations must you address to avoid cross-region latency and single points of failure?
17. During peak traffic, your ingress controller fails to route requests efficiently. How would you diagnose and scale ingress resources effectively under heavy load?
Answer Approach (Question 1):
- Check resource limits and requests (CPU/memory constraints)
- Examine readiness and liveness probes configuration
- Verify container startup commands and entry points
- Check for missing dependencies or configuration files
- Review environment variables and secrets
- Analyze previous container logs: kubectl logs <pod> --previous
- Use kubectl describe pod <pod> for detailed events
- Check container image compatibility and startup time requirements (see the startupProbe sketch at the end of this answer)
Key Steps:
- kubectl get pods -o wide - Check pod status and node
- kubectl logs <pod> --previous - Previous container logs
- kubectl describe pod <pod> - Event timeline
- Check resource quotas and limits
- Verify health check configurations
- Test container locally if possible
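If the container simply needs more startup time than the probes allow, the kubelet can kill it before it ever logs an error; a startupProbe gives it breathing room. A minimal sketch, with the endpoint path and port as assumptions:

containers:
- name: app
  startupProbe:
    httpGet:
      path: /healthz        # assumed health endpoint
      port: 8080            # assumed container port
    periodSeconds: 10
    failureThreshold: 30    # tolerate up to 30 x 10s = 5 minutes of startup time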
Answer Approach (Question 2):
- Understand that StatefulSets maintain pod identity and PV binding
- Check PVC status and binding to correct PV
- Verify storage class and provisioner health
- Examine node affinity and zone constraints
- Review PV reclaim policy and access modes
Key Steps:
- kubectl get pvc,pv - Check volume binding status
- kubectl describe pvc <pvc-name> - Check binding events
- Verify node has available storage and proper labels
- Check if PV is still bound to the old pod identity
- Examine storage class and provisioner logs
- Validate volume mount paths and permissions
Data Loss Prevention:
- Never delete PVCs directly unless intentional
- Use kubectl patch statefulset <name> -p '{"spec":{"replicas":0}}' to scale down safely
- Back up data before troubleshooting
- Check PV reclaim policy (should be Retain for production)
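To guard against accidental data loss while troubleshooting, the reclaim policy of the bound PV can be switched to Retain up front (the PV name is a placeholder):

kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'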
Answer Approach (Question 3):
- Check autoscaler configuration and node group limits
- Verify resource requests are specified on pending pods
- Examine node taints, tolerations, and affinity rules
- Review autoscaler logs for scaling decisions
- Check cloud provider quotas and instance availability
Investigation Steps:
- kubectl get pods --field-selector=status.phase=Pending - List pending pods
- kubectl describe pod <pending-pod> - Check scheduling constraints
- Review autoscaler logs: kubectl logs -n kube-system deployment/cluster-autoscaler
- Check node group configuration and limits
- Verify cloud provider quotas and instance types
- Examine pod resource requests and node capacity
Common Issues:
- Missing resource requests on pods
- Node group at maximum size
- Cloud provider quota limits
- Incompatible node selectors or affinity rules
- Autoscaler misconfiguration
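Because scale-up decisions are driven by the resource requests of pending pods, the first issue above is also the most common fix: give every container explicit requests. A minimal sketch (values are placeholders):

containers:
- name: app
  resources:
    requests:
      cpu: 250m        # placeholder - the autoscaler uses requests to decide whether a new node would help
      memory: 512Mi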
Answer Approach (Question 4):
- Understand default deny vs allow behavior
- Design policies using namespace and pod selectors
- Test connectivity systematically
- Use network policy testing tools
Design Strategy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-cross-namespace
  namespace: target-namespace
spec:
  podSelector:
    matchLabels:
      app: target-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: source-namespace      # requires the namespace to carry this label (or use kubernetes.io/metadata.name)
      podSelector:                    # combined with the namespaceSelector (AND): only source-app pods in source-namespace
        matchLabels:
          app: source-app
    ports:
    - protocol: TCP
      port: 8080

Debugging Steps:
- kubectl get networkpolicies --all-namespaces
- Test connectivity: kubectl exec -it <pod> -- nc -zv <target-service> <port>
- Use network policy testing tools
- Check CNI plugin compatibility
- Verify namespace labels for selectors
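Because most CNIs allow all traffic until a policy selects a pod, the default-deny behavior mentioned in the approach has to be created explicitly; the allow rule above then carves out the only permitted path. A minimal sketch:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: target-namespace
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes:
  - Ingress              # no ingress rules listed, so all inbound traffic is denied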
Answer Approach (Question 5):
- Run the VPN gateway as a Deployment or DaemonSet
- Implement connection pooling and failover
- Secure credentials with secrets and service accounts
- Design for high availability across zones
Architecture Components:
- VPN Gateway: Dedicated pods with VPN client configuration
- Database Proxy: Connection pooling (e.g., PgBouncer for PostgreSQL)
- Service Mesh: For traffic management and security
- Secrets Management: External secrets operator or Vault integration
- Network Policies: Restrict database access to authorized services
HA Considerations:
- Multiple VPN gateway replicas across zones
- Database connection failover logic
- Health checks and circuit breakers
- Persistent connections through service endpoints
Security Measures:
- Network policies for database access control
- TLS encryption for all connections
- Service account-based authentication
- Regular credential rotation
- Audit logging for database access
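One way to give applications a single stable endpoint for the external database is a ClusterIP Service in front of the VPN gateway pods, which forward traffic over the tunnel; the names, namespace, and port below are assumptions:

apiVersion: v1
kind: Service
metadata:
  name: external-db        # apps connect to external-db:5432 instead of the raw DB address
  namespace: platform
spec:
  selector:
    app: vpn-gateway       # assumed label on the VPN gateway Deployment (run 2+ replicas across zones)
  ports:
  - name: postgres
    port: 5432             # assumed PostgreSQL port, matching the PgBouncer example above
    targetPort: 5432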
Answer Approach (Question 6):
- Implement namespace-based isolation
- Use RBAC for granular permissions
- Apply resource quotas and limit ranges
- Set up network policies for traffic isolation
- Implement observability per tenant
Isolation Strategies:
- Namespace Isolation: One namespace per tenant
- RBAC: Role-based access control per tenant
- Resource Quotas: CPU, memory, storage limits
- Network Policies: Traffic segmentation
- Pod Security Standards: Security contexts and policies
Security Implementation:
- Service accounts per tenant
- OPA Gatekeeper for policy enforcement
- Image scanning and admission controllers
- Secrets management per tenant
- Audit logging and compliance
Quota Management:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "4"

Observability Per Tenant:
- Prometheus metrics with tenant labels
- Grafana dashboards per tenant
- Log aggregation with tenant filtering
- Cost allocation and chargeback
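The limit ranges mentioned in the approach complement the quota: they give per-container defaults so a single workload cannot consume the tenant's entire allocation. A sketch with placeholder values:

apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-limits
  namespace: tenant-a
spec:
  limits:
  - type: Container
    default:             # applied when a container declares no limits
      cpu: 500m
      memory: 512Mi
    defaultRequest:      # applied when a container declares no requests
      cpu: 250m
      memory: 256Mi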
Answer Approach (Question 7):
- Check kubelet logs and system resources
- Verify node configuration and certificates
- Examine container runtime health
- Review system-level issues (disk space, memory)
Investigation Steps:
- sudo systemctl status kubelet - Check service status
- sudo journalctl -u kubelet -f - View kubelet logs
- Check disk space: df -h
- Verify container runtime: sudo systemctl status containerd
- Check node resources: kubectl top node <node-name>
- Examine kubelet configuration: /var/lib/kubelet/config.yaml
Common Causes:
- Insufficient disk space
- Certificate expiration or rotation issues
- Container runtime problems
- Memory pressure on the node
- Misconfigured kubelet parameters
- Network connectivity issues
Resolution Steps:
- Clean up disk space (images, logs, temp files)
- Restart container runtime if needed
- Check and renew certificates
- Adjust kubelet configuration
- Monitor resource usage patterns
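Certificate problems are easy to confirm directly on the node; on kubeadm-provisioned nodes the kubelet client certificate is typically at the path below (adjust for your distribution):

sudo openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem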
Answer Approach (Question 8):
- Understand QoS classes: Guaranteed, Burstable, BestEffort
- Set appropriate resource requests and limits
- Implement pod disruption budgets
- Use priority classes for critical workloads
QoS Classes:
- Guaranteed: requests = limits for all containers
- Burstable: at least one container has requests < limits
- BestEffort: no requests or limits specified
Prevention Strategies:
apiVersion: v1
kind: Pod
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    resources:
      requests:            # requests < limits makes this pod Burstable; set them equal for Guaranteed QoS
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "2Gi"
        cpu: "1"

Pod Disruption Budget:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: critical-app

Monitoring and Alerting:
- Set up node resource monitoring
- Alert on memory/disk pressure
- Monitor eviction events
- Track QoS class distribution
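The high-priority class referenced in the pod spec above is not built in and has to be created once per cluster; a minimal sketch:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000              # pods with higher values are preempted and evicted last
globalDefault: false
description: "Critical production workloads that must not be evicted under node pressure."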
Answer Approach (Question 9):
- Use multiple Service definitions
- Configure load balancer annotations
- Implement separate ingress rules
- Consider service mesh solutions
Service Configuration:
apiVersion: v1
kind: Service
metadata:
  name: app-tcp
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    protocol: TCP
  selector:
    app: myapp
---
apiVersion: v1
kind: Service
metadata:
  name: app-udp
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    protocol: UDP
  selector:
    app: myapp

Ingress Considerations:
- Most ingress controllers only support HTTP/HTTPS (TCP)
- Use TCP/UDP ingress controllers for Layer 4 traffic
- Consider service mesh for advanced traffic management
Alternative Solutions:
- Use different ports for different protocols
- Implement application-level protocol switching
- Use service mesh with traffic splitting capabilities
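On newer clusters, mixed-protocol LoadBalancer Services (stable since Kubernetes 1.26) let a single Service expose TCP and UDP on the same port, provided the cloud load balancer supports it; a sketch:

apiVersion: v1
kind: Service
metadata:
  name: app-mixed
spec:
  type: LoadBalancer
  selector:
    app: myapp
  ports:
  - name: app-tcp          # each port entry needs a distinct name
    port: 8080
    protocol: TCP
  - name: app-udp
    port: 8080
    protocol: UDP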
Answer Approach (Question 10):
- Implement blue-green deployments
- Use canary deployments with traffic splitting
- Configure proper readiness probes
- Implement pre-stop hooks and graceful shutdowns
Advanced Strategies:
- Blue-Green Deployment: Complete environment switch
- Canary Deployment: Gradual traffic shifting
- Rolling Updates with Traffic Management: Service mesh control
- Feature Flags: Application-level traffic control
Configuration Example:
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    spec:
      containers:
      - name: app
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep", "15"]

Best Practices:
- Never set maxUnavailable to 100%
- Implement comprehensive health checks
- Use graceful shutdown procedures
- Test deployment strategies in staging
- Monitor deployment metrics and rollback triggers
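For the blue-green option above, one lightweight pattern is two Deployments labeled by version with a Service whose selector is flipped once the new version passes its checks; labels and names are assumptions:

apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  selector:
    app: myapp
    version: blue        # cut over by patching this to "green", e.g.
                         # kubectl patch svc app -p '{"spec":{"selector":{"app":"myapp","version":"green"}}}'
  ports:
  - port: 80
    targetPort: 8080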
Answer Approach (Question 11):
- Analyze sidecar resource usage patterns
- Tune sidecar configuration parameters
- Implement resource limits and requests
- Consider sidecar-less architectures for specific workloads
Analysis Steps:
- Monitor sidecar CPU and memory usage
- Analyze traffic patterns and connection counts
- Review sidecar configuration and feature usage
- Check for memory leaks or connection pooling issues
- Benchmark different configuration options
Optimization Techniques:
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-proxy-config
data:
  ProxyStatsMatcher: |
    inclusionRegexps:
    - ".*circuit_breakers.*"
    - ".*upstream_rq_retry.*"
    exclusionRegexps:
    - ".*osconfig.*"

Resource Tuning:
- Adjust proxy concurrency settings
- Tune memory limits for Envoy
- Configure connection pool settings
- Disable unused features and protocols
- Optimize stats collection
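Istio also exposes per-workload annotations for this kind of tuning, so the sidecar's requests, limits, and worker-thread count can be right-sized per Deployment instead of mesh-wide; values below are placeholders:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/proxyCPU: "100m"           # request for the Envoy sidecar
        sidecar.istio.io/proxyMemory: "128Mi"
        sidecar.istio.io/proxyCPULimit: "500m"
        sidecar.istio.io/proxyMemoryLimit: "256Mi"
        proxy.istio.io/config: |
          concurrency: 2                            # Envoy worker threads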
Alternative Approaches:
- Ambient mesh for reduced resource overhead
- Selective sidecar injection
- Service mesh bypass for internal services
- Direct service-to-service communication for low-latency requirements
Answer Approach (Question 12):
- Design Custom Resource Definitions (CRDs)
- Implement controller reconciliation logic
- Handle error scenarios and retries
- Implement proper status reporting and events
CRD Design:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: myapps.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: myapps           # must match the resource part of metadata.name
    singular: myapp
    kind: MyApp
    listKind: MyAppList
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas:
                type: integer
              image:
                type: string
          status:
            type: object
            properties:
              phase:
                type: string

Controller Logic:
- Watch: Monitor CRD changes
- Reconcile: Compare desired vs actual state
- Update: Make necessary changes
- Status: Report current state
- Retry: Handle failures gracefully
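For reference, a resource instance that this controller loop would reconcile, matching the schema above (the name and image are illustrative):

apiVersion: example.com/v1
kind: MyApp
metadata:
  name: sample-app
spec:
  replicas: 3
  image: registry.internal/sample-app:1.2.0    # placeholder image reference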
Best Practices:
- Use controller-runtime framework
- Implement proper logging and metrics
- Handle cascading deletions with finalizers
- Use webhooks for validation and mutation
- Test with various failure scenarios
Answer Approach (Question 13):
- Implement log rotation and retention policies
- Use centralized logging solutions
- Configure log drivers and formatters
- Set up log level management
Solutions:
- Log Rotation: Configure kubelet log rotation
- Centralized Logging: Fluentd, Fluent Bit, or similar
- Structured Logging: JSON format for efficient parsing
- Log Sampling: Reduce verbose logging in production
- Storage Optimization: Use faster storage for log volumes
Configuration Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Daemon        Off
        Log_Level     info
    [INPUT]
        Name          tail
        Path          /var/log/containers/*.log
        Parser        cri
        Tag           kube.*
        Mem_Buf_Limit 5MB

Kubelet Configuration:
- Set containerLogMaxSize and containerLogMaxFiles
- Configure log rotation frequency
- Use log drivers that support compression
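Those settings live in the KubeletConfiguration file (commonly /var/lib/kubelet/config.yaml); a sketch with example values:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi     # rotate a container's log once it reaches 10Mi
containerLogMaxFiles: 5       # keep at most 5 rotated files per container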
Monitoring and Alerting:
- Monitor disk usage on nodes
- Alert on high IO wait times
- Track log volume growth rates
- Set up automated cleanup procedures
Answer Approach (Question 14):
- Identify etcd performance bottlenecks
- Implement proper backup and restore procedures
- Configure etcd for high availability
- Monitor etcd metrics and health
Performance Optimization:
- Storage: Use fast SSD storage with low latency
- Network: Ensure low-latency network between etcd nodes
- Resource Allocation: Adequate CPU and memory
- Tuning Parameters: Adjust heartbeat and election timeouts
- Compaction: Regular database compaction
HA Configuration:
- Odd number of etcd nodes (3, 5, 7)
- Geographic distribution across availability zones
- Load balancing for etcd clients
- Automated backup and disaster recovery
Monitoring Metrics:
# Key metrics to monitor
- etcd_server_has_leader
- etcd_server_leader_changes_seen_total
- etcd_disk_wal_fsync_duration_seconds
- etcd_network_peer_round_trip_time_seconds
- etcd_mvcc_db_total_size_in_bytes

Backup Strategy:
- Automated snapshot backups
- Cross-region backup replication
- Regular restore testing
- Point-in-time recovery procedures
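An automated snapshot can be as simple as a scheduled etcdctl call; the endpoint and certificate paths below are the usual kubeadm defaults and should be adjusted for your cluster:

ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key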
Answer Approach (Question 15):
- Implement admission controllers for image validation
- Use OPA Gatekeeper or similar policy engines
- Configure image signing and verification
- Set up automated vulnerability scanning
Implementation Methods:
- OPA Gatekeeper: Policy-based admission control
- Admission Webhooks: Custom validation logic
- Pod Security Standards: Built-in security policies
- Image Policy Webhook: Image-specific validation
OPA Gatekeeper Example:
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: allowedregistries
spec:
  crd:
    spec:
      names:
        kind: AllowedRegistries
      validation:
        openAPIV3Schema:
          type: object
          properties:
            registries:
              type: array
              items:
                type: string
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package allowedregistries

      violation[{"msg": msg}] {
        container := input.review.object.spec.containers[_]
        not image_allowed(container.image)
        msg := "Container image not from approved registry"
      }

      # helper: true when the image starts with any approved registry prefix
      image_allowed(image) {
        startswith(image, input.parameters.registries[_])
      }

Security Enhancements:
- Image signature verification with Cosign
- Vulnerability scanning with Trivy or similar
- Runtime security monitoring
- Supply chain security with SLSA framework
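The template above only defines the policy type; it is enforced by creating a Constraint instance that supplies the approved registry prefixes (the registry hostname is a placeholder):

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: AllowedRegistries
metadata:
  name: require-internal-registry
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    registries:
    - "registry.internal.example.com/"    # placeholder internal registry prefix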
Answer Approach (Question 16):
- Design for cross-region latency tolerance
- Implement proper data replication strategies
- Plan for disaster recovery scenarios
- Consider federation or multi-cluster alternatives
Architectural Considerations:
- Network Latency: Impact on etcd performance and API response times
- Data Replication: Cross-region etcd cluster with proper quorum
- Failure Scenarios: Region isolation and split-brain prevention
- Workload Placement: Node affinity and topology constraints
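For the workload-placement point, topology spread constraints keep replicas balanced across zones or regions without pinning them to specific nodes; a sketch using the standard zone label (swap in a region label for cross-region spread):

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: myapp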
Alternative Architectures:
- Cluster Federation: Multiple clusters with centralized management
- Multi-Cluster Service Mesh: Cross-cluster service discovery
- GitOps Multi-Cluster: Centralized configuration management
- Hierarchical Namespaces: Administrative domain separation
Best Practices:
- Implement circuit breakers for cross-region calls
- Use local caching where possible
- Design applications for eventual consistency
- Plan for region-specific compliance requirements
Disaster Recovery:
- Automated backup and restore procedures
- Cross-region data replication
- Runbook for disaster scenarios
- Regular disaster recovery testing
Learning Resources:
- Multi-Cluster Kubernetes Patterns
- Admiralty Multi-Cluster Scheduler
- Cluster API for Multi-Cluster Management
Answer Approach (Question 17):
- Analyze ingress controller resource utilization
- Implement horizontal and vertical scaling
- Optimize configuration for high throughput
- Use multiple ingress controllers with traffic distribution
Performance Analysis:
- Metrics Collection: Monitor request rate, latency, error rate
- Resource Usage: CPU, memory, network utilization
- Connection Patterns: Keep-alive, connection pooling
- Backend Health: Upstream service performance
Scaling Strategies:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress-controller
spec:
  replicas: 5
  template:
    spec:
      containers:
      - name: nginx-ingress-controller
        resources:
          requests:
            cpu: 1000m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 2Gi

Configuration Optimization:
- Tune worker processes and connections
- Configure appropriate buffer sizes
- Implement connection keep-alive
- Use efficient load balancing algorithms
- Enable HTTP/2 and compression
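Instead of a fixed replica count, the controller Deployment above can also scale automatically with an HPA driven by CPU (or by request-rate custom metrics); a minimal sketch:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress-controller
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-ingress-controller
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70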
Advanced Solutions:
- Multiple ingress controller classes
- Geographic traffic distribution
- CDN integration for static content
- Service mesh for advanced traffic management
Monitoring and Alerting:
- Request rate and latency metrics
- Error rate and status code distribution
- Resource utilization alerts
- Capacity planning based on traffic patterns
Remember to practice these scenarios in a lab environment and understand the underlying concepts rather than just memorizing commands.