@ganapativs
Last active July 30, 2025 18:32
Platform Engineer: Interview Questions (Advanced)

🧩 Top 6 Simulation Scenarios (Live Debug / On-Call Type)

1. SSH Access Fails to All EC2 Instances

Symptoms: Team reports SSH failure across multiple servers.

Answer:

  • Check the security group rules for port 22 and confirm it is open only to known source IPs.
  • Verify sshd status from outside the SSH path: via EC2 user data on the next boot, or through SSM Session Manager as an out-of-band login.
  • Check for invalid or rotated SSH keys, and for IAM role changes that would break SSM access.
sudo systemctl status sshd
aws ec2 describe-security-groups ...
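
The three checks above can be strung together in one pass. The following is a minimal sketch, not a definitive runbook: the function name ssh_triage and its arguments are hypothetical, and it assumes bash, coreutils timeout, and a configured AWS CLI. The security-group query only matches rules whose ToPort is exactly 22, so wider port ranges and all-traffic rules still need a separate look.

```shell
# ssh_triage HOST SG_ID INSTANCE_ID  (hypothetical helper, not part of any tool)
ssh_triage() {
  local host="$1" sg_id="$2" instance_id="$3"

  # 1. Can a TCP connection to port 22 be opened at all? (bash /dev/tcp, 5 s timeout)
  if timeout 5 bash -c "exec 3<>/dev/tcp/$host/22" 2>/dev/null; then
    echo "port 22 reachable on $host"
  else
    echo "port 22 NOT reachable on $host"
  fi

  # 2. Which CIDR ranges does the security group allow on port 22?
  aws ec2 describe-security-groups --group-ids "$sg_id" \
    --query 'SecurityGroups[].IpPermissions[] | [?ToPort==`22`].IpRanges[].CidrIp' \
    --output text

  # 3. Is the SSM agent online, so Session Manager can stand in for SSH?
  aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$instance_id" \
    --query 'InstanceInformationList[].PingStatus' --output text
}
```

An example call, with placeholder identifiers, would be ssh_triage 10.0.0.5 sg-0123456789abcdef0 i-0123456789abcdef0.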

Scorecard:

  • 5 – Checks SGs, SSHD, IAM/SSM
  • 3 – Suggests only rebooting
  • 1 – No idea how SSH works

For: Junior, Senior


2. Disk Usage 100% on / (System Unresponsive)

Symptoms: Alerts on disk full; app crashes.

Answer:

  • Use du -sh /* to find large directories.
  • Clean /var/log, rotate logs, and clear unused Docker images/layers.
  • Make the fix persistent: cap journald retention and schedule cleanup with cron.
journalctl --vacuum-time=7d
docker system prune -f
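
To make the "find large directories" step repeatable, the du call can be wrapped so the biggest entries come first. This is a sketch under stated assumptions: the function name largest_entries is hypothetical, sizes are in KiB, and the glob skips hidden entries (names starting with a dot).

```shell
# largest_entries [DIR] [N]: print the N largest immediate children of DIR, in KiB.
# (hypothetical helper; defaults to / and 10 entries)
largest_entries() {
  local dir="${1:-/}" count="${2:-10}"
  # -s: one total per argument, -k: KiB units, -x: do not cross filesystem boundaries
  du -skx -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

Running largest_entries /var 5 lists the five largest entries under /var; repeat on the top result to drill down before deleting anything.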

Scorecard:

  • 5 – Uses disk tools, cleans wisely
  • 3 – Deletes logs blindly
  • 1 – Doesn’t identify root cause

For: Junior, Senior


3. New Ansible Deployment Crashed NGINX on 5 Hosts

Symptoms: Sites down after last playbook run.

Answer:

  • Run ansible-playbook --check --diff against the dev group to see exactly what the playbook changes.
  • Look for an overwritten nginx.conf or bad file permissions.
  • Validate the template logic and the rendered config syntax (nginx -t).
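
The nginx -t check is most useful as a gate in front of the reload, so a bad render never replaces a working config in memory. A minimal sketch, assuming a systemd host; the function name safe_nginx_reload is hypothetical.

```shell
# safe_nginx_reload: reload nginx only if the on-disk config parses.
# (hypothetical helper; assumes nginx is managed by systemd)
safe_nginx_reload() {
  if nginx -t; then
    systemctl reload nginx
  else
    echo "nginx -t failed; leaving the running config untouched" >&2
    return 1
  fi
}
```

The same idea can be applied inside the playbook itself by validating the rendered file before it is put in place.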

Scorecard:

  • 5 – Uses dry-run, validates config
  • 3 – Reverts blindly
  • 1 – No Ansible debug awareness

For: Both


4. Latency Spike Between Two EC2s in Same AZ

Symptoms: Web tier can't hit backend reliably.

Answer:

  • Use ping, traceroute, and iperf3 to establish a baseline.
  • Check the EC2 instance type (has a burstable instance hit its limit?).
  • Validate NACL rules and route table entries.
iperf3 -c backend.local
aws ec2 describe-instances ...
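
A baseline is easier to compare over time if it is reduced to a number. The sketch below pulls the average round-trip time out of ping's summary line and compares it to a threshold. The function names are hypothetical, and the parser assumes a summary of the form "rtt min/avg/max/mdev = a/b/c/d ms" (Linux iputils); other ping implementations may print something different.

```shell
# avg_rtt_ms: read ping output on stdin, print the average round-trip time in ms.
avg_rtt_ms() {
  awk '/min\/avg\/max/ { split($0, halves, " = "); split(halves[2], v, "/"); print v[2] }'
}

# rtt_above HOST THRESHOLD_MS: exit 0 if the average RTT to HOST exceeds the threshold,
# 1 if it does not, 2 if no summary line was found. (hypothetical helper)
rtt_above() {
  local host="$1" threshold="$2" avg
  avg="$(ping -c 5 -q "$host" | avg_rtt_ms)"
  [ -n "$avg" ] || return 2  # host unreachable or unexpected output format
  echo "avg rtt to $host: ${avg} ms"
  awk -v a="$avg" -v t="$threshold" 'BEGIN { exit !(a > t) }'
}
```

For example, rtt_above backend.local 1 succeeds only when the average exceeds 1 ms; the threshold is a placeholder to be replaced with a value measured when the system is healthy.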

Scorecard:

  • 5 – Network tools + infra limits
  • 3 – Only looks at app logs
  • 1 – Doesn’t know layers of debugging

For: Senior


5. S3 Uploads Failing with 403 Access Denied

Symptoms: Script that used to work now fails.

Answer:

  • Validate the IAM role's policy (s3:PutObject on the correct bucket ARN).
  • Check the bucket policy for new restrictions (e.g. an encryption requirement).
  • Confirm whether bucket ownership or ACL settings changed.
aws s3api get-bucket-policy --bucket my-bucket
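
Checking both sides means looking at the caller's identity as well as the bucket's settings. The following sketch groups the read-only calls in one place. The function name s3_403_triage is hypothetical, my-bucket is the placeholder already used above, and each call needs its own read permission, so an error from one of them is itself a clue.

```shell
# s3_403_triage BUCKET  (hypothetical helper; read-only calls only)
s3_403_triage() {
  local bucket="$1"
  # Which principal is the script really using? A changed role or profile is a common surprise.
  aws sts get-caller-identity
  # Bucket policy: look for new Deny statements or conditions (e.g. required encryption headers).
  aws s3api get-bucket-policy --bucket "$bucket" --query Policy --output text
  # Default encryption configuration on the bucket.
  aws s3api get-bucket-encryption --bucket "$bucket"
  # Object ownership setting, which controls whether ACLs are honored.
  aws s3api get-bucket-ownership-controls --bucket "$bucket"
}
```

Usage: s3_403_triage my-bucket, then compare the output with the identity policy attached to the role.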

Scorecard:

  • 5 – IAM + S3 policy mastery
  • 3 – Checks one side
  • 1 – Doesn’t know policies

For: Both


6. Application Logs Not Reaching Central Server

Symptoms: No logs from app nodes after restart.

Answer:

  • Check the rsyslog or fluentd config.
  • Validate that the port to the central log collector is open.
  • Look at the forwarder's disk queue for signs of backpressure.
systemctl status rsyslog
netstat -an | grep 514
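
An end-to-end check walks the pipeline in order (config, service, network) and reports the first stage that fails. A minimal sketch for the rsyslog case: the function name check_log_shipping and the return codes are hypothetical, and the network probe only covers TCP, so a collector that listens on UDP 514 needs a different test.

```shell
# check_log_shipping COLLECTOR [PORT]: return 0 if every stage looks healthy,
# or 1 / 2 / 3 for a config / service / network failure. (hypothetical helper)
check_log_shipping() {
  local collector="$1" port="${2:-514}"
  # Stage 1: does the rsyslog configuration parse?
  rsyslogd -N1 >/dev/null 2>&1 || { echo "rsyslog config invalid" >&2; return 1; }
  # Stage 2: is the service actually running after the restart?
  systemctl is-active --quiet rsyslog || { echo "rsyslog not running" >&2; return 2; }
  # Stage 3: can this node open a TCP connection to the collector?
  timeout 5 bash -c "exec 3<>/dev/tcp/$collector/$port" 2>/dev/null \
    || { echo "cannot reach $collector:$port over TCP" >&2; return 3; }
  echo "rsyslog config, service, and TCP path to $collector:$port look OK"
}
```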

Scorecard:

  • 5 – End-to-end log pipeline awareness
  • 3 – Just restarts rsyslog
  • 1 – No idea about log shipping

For: Senior


💬 Behavioral / Operational Excellence Questions (With Expected Answers)


1. Tell me about a time you handled a critical production outage.

Expected:

  • Clear timeline of issue → mitigation → RCA
  • Calm under pressure, involved team, postmortem done
  • Bonus: mentions alerting, prevention afterward

For: Both


2. How do you prioritize infra tasks (automation, security, support)?

Expected:

  • Business impact driven
  • Balances long-term automation with short-term fixes
  • Uses prioritization framework (e.g. impact x effort)

3. Describe a time you automated something tedious. What changed?

Expected:

  • Identified repetitive task
  • Wrote a script/role/tool
  • Saved time, improved reliability, shared with team

4. How do you approach writing documentation for your tools or infra?

Expected:

  • Keeps README or docs/ up to date
  • Explains how and why, not just what
  • Shares usage examples for onboarding

5. Describe a conflict or disagreement with a dev or ops teammate.

Expected:

  • Calm, constructive resolution
  • Focused on aligning goals
  • Outcome-focused rather than personal

6. When do you choose not to automate something?

Expected:

  • Cost vs. benefit
  • Volatility (changing frequently)
  • One-off or experimental cases

7. What do you do when a deployment goes wrong?

Expected:

  • Rollback plan or mitigation
  • Observability: checks metrics/logs
  • Communicates clearly with stakeholders

8. How do you keep systems secure across environments?

Expected:

  • Least privilege, IAM roles
  • Secrets management, hardened images
  • Patching or automated CVE scans

9. How do you support developers while keeping platform guardrails?

Expected:

  • Expose tools via self-service
  • Validate templates, sandboxing
  • Educate devs on best practices

10. What does operational excellence mean to you?

Expected:

  • Reliability, observability, predictability
  • Postmortems, blameless culture
  • Automation with empathy for users

🛠️ CI/CD + Tooling Awareness: What to Ask

🎯 Goal:

Ensure they understand how software moves from code to prod and where platform engineers enable safety, speed, and observability.


🔹 What to Ask:

  1. What CI/CD tools have you used? What was your role in maintaining or extending them?
     ➤ Expected: Jenkins, GitHub Actions, GitLab, etc.
     ➤ Bonus: Wrote pipelines, standardized templates, managed runners.

  2. How do you ensure infrastructure code changes (e.g. Terraform, Ansible) don’t break production?
     ➤ Looks for: Test environments, terraform plan, Ansible --check, staging layers.

  3. How do you handle secrets in your pipelines?
     ➤ Looks for: Vault, AWS Secrets Manager, GitHub Actions secrets, not hardcoded.

  4. What steps should a good CI/CD pipeline include before pushing to production?
     ➤ Linting → Tests → Build → Artifact push → Infra provisioning → Deployment → Smoke test → Notification

  5. Have you ever had to debug a failing pipeline? What was the root cause?
     ➤ Good candidates give specific examples (e.g., test flakiness, permissions issue, environment mismatch).
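
The stage ordering in question 4 implies fail-fast behavior: a later stage should never run after an earlier one fails. One way to sketch that in plain shell is shown below. The runner name run_stages and the stage function names in the usage note are hypothetical stand-ins for real pipeline steps.

```shell
# run_stages STAGE...: run each stage (a command or function name) in order
# and stop at the first failure. (hypothetical helper)
run_stages() {
  local stage
  for stage in "$@"; do
    echo "==> $stage"
    if ! "$stage"; then
      echo "stage failed: $stage" >&2
      return 1
    fi
  done
}
```

With stage functions defined elsewhere, a call such as run_stages lint unit_tests build push_artifact deploy smoke_test would stop before deploy if build fails.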


🔧 Tooling Ecosystem (Bonus Knowledge, Not Mandatory)

🎯 Goal:

Identify additional fluency with tools that enhance a modern platform engineer’s effectiveness.


🔹 Key Areas to Probe:

1. Terraform

  • What’s the difference between a module and a resource block?
  • Have you used terraform plan, apply, and remote backends?
  • How do you manage secrets/state files securely?

2. Docker / Containers

  • Can you write a simple Dockerfile for a Python app?
  • What’s the difference between a container and a VM?
  • How would you reduce image size?

3. Kubernetes

  • What is a pod vs deployment?
  • How do you handle secrets/configs in Kubernetes?
  • Have you used Helm or Kustomize?

4. Monitoring & Observability

  • What tools have you used for metrics/logs (e.g., Prometheus, Grafana, ELK)?
  • What are good alerts and bad alerts?
  • How do you avoid alert fatigue?

🧪 Take-home Test vs Pairing: When & How

| Option | Pros | Cons | When to Use |
|---|---|---|---|
| Take-home Test | Time-flexible, shows independence | May be done by others or take too long | Mid-level roles; early screen |
| Live Pairing | Collaborative, real-time thought process | Pressure may affect fairness | Final rounds; simulating teamwork |

✅ Ideal Take-home Example:

“Write an Ansible role to deploy a static site behind nginx, handle idempotency, and provide a rollback method. Document clearly.”

✅ Ideal Pairing Task:

“Let’s live-debug an Ansible playbook that fails to deploy nginx on Ubuntu 22.04. I’ll be your observer.”


🀝 Team Fit & Engineering Maturity β€” What to Look For

🎯 Goal:

Can they scale themselves, collaborate, and act like a force multiplier?


πŸ”Ή Evaluation Prompts:

  1. Documentation Habit: ➤ Do they document infra decisions or just code?

  2. Mentorship: ➤ Have they onboarded junior teammates or improved workflows?

  3. Feedback Receptiveness: ➤ Can they take architecture feedback without ego?

  4. Proactive Ownership: ➤ Do they improve pain points without being told?

  5. Cross-team Collaboration: ➤ Can they explain a complex system to a dev or PM?


🔹 Scorecard Traits:

| Trait | Strong Signal | Weak Signal |
|---|---|---|
| Communication | Documents tradeoffs, shares context | Only answers technical bits |
| Reliability Mindset | Thinks in SLAs, rollback, metrics | Just wants things to “work” |
| Empathy | Balances dev speed with ops safety | Dismisses dev problems |
| Curiosity | Talks about learning new tools, improving infra | Relies only on what they already know |