Symptoms: Team reports SSH failure across multiple servers.
Answer:
- Check security group rules for port `22` and ensure it's open from known IPs.
- Verify `sshd` status via `EC2 user-data`, or use SSM Session Manager as a backdoor.
- Check for invalid SSH keys or IAM role changes.
sudo systemctl status sshd
aws ec2 describe-security-groups ...
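When several servers fail at once, a quick reachability sweep helps separate a network or security group problem from a per-host `sshd` problem. The following is a minimal sketch, assuming bash (for `/dev/tcp`) and coreutils `timeout` are available; the function names and the 3-second timeout are illustrative choices, not part of the original answer.

```shell
# Sketch: test TCP reachability of port 22 on several hosts.
# Assumes bash (/dev/tcp) and coreutils `timeout`; function names are hypothetical.

check_port() {
    local host="$1" port="${2:-22}"
    # The connect attempt runs in a child bash so a hang is bounded by `timeout`.
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "${host}:${port} open"
    else
        # "unreachable" covers refused, filtered, and timed-out connections alike.
        echo "${host}:${port} unreachable"
        return 1
    fi
}

check_fleet() {
    # Prints one line per host; returns non-zero if any host was unreachable.
    local rc=0 h
    for h in "$@"; do
        check_port "$h" 22 || rc=1
    done
    return "$rc"
}
```

Calling `check_fleet` with a list of host names gives one line per host. If every host is unreachable, the security group or NACL is the likelier culprit; if only some are, look at `sshd` on those hosts.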
Scorecard:
- 5: Checks SGs, SSHD, IAM/SSM
- 3: Suggests only rebooting
- 1: No idea how SSH works
For: Junior, Senior
Symptoms: Alerts on disk full; app crashes.
Answer:
- Use `du -sh /*` to find large directories.
- Clean `/var/log`, rotate logs, clear Docker images/layers.
- Enable persistent `journalctl` trimming and cron cleanup.
journalctl --vacuum-time=7d
docker system prune -f
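A strong answer usually measures before deleting. The sketch below shows one way to check how full a filesystem is and list its largest directories first. It assumes GNU coreutils (`du --max-depth`, `sort -h`); the function names and the 90% default threshold are arbitrary examples.

```shell
# Sketch: report filesystem usage and the largest directories before cleaning anything.
# Assumes GNU coreutils; function names and the default threshold are hypothetical.

usage_pct() {
    # Print the used percentage (digits only) of the filesystem holding $1.
    df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

largest_dirs() {
    # List the largest directories directly under $1, staying on one filesystem.
    # The first line of output is the total for $1 itself.
    du -xh --max-depth=1 "$1" 2>/dev/null | sort -rh | head -n 10
}

report_if_full() {
    local path="$1" threshold="${2:-90}" pct
    pct="$(usage_pct "$path")"
    if [ "$pct" -ge "$threshold" ]; then
        echo "WARN ${path} at ${pct}% (threshold ${threshold}%)"
        largest_dirs "$path"
    else
        echo "OK ${path} at ${pct}%"
    fi
}
```

For example, `report_if_full /var 85` prints a warning and the biggest directories only when `/var` is at least 85% full, which points cleanup at the actual offender instead of at logs in general.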
Scorecard:
- 5: Uses disk tools, cleans wisely
- 3: Deletes logs blindly
- 1: Doesn't identify root cause
For: Junior, Senior
Symptoms: Sites down after last playbook run.
Answer:
- Run `ansible-playbook --check --diff` on the dev group.
- Look for an overwritten `nginx.conf` or bad permissions.
- Validate template logic and syntax (`nginx -t`).
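The dry-run and validation steps above can be wrapped so they are easy to repeat. This is a sketch only: the function names and the `ANSIBLE_PLAYBOOK` and `NGINX` override variables are hypothetical conveniences, and any playbook or group names passed in are placeholders.

```shell
# Sketch: dry-run a playbook against one inventory group, then validate nginx config.
# ANSIBLE_PLAYBOOK and NGINX are hypothetical overrides for the real binaries.

dry_run_playbook() {
    local playbook="$1" group="$2"
    # --check reports what would change without changing it; --diff shows file diffs;
    # --limit restricts the run to the given group or host pattern.
    "${ANSIBLE_PLAYBOOK:-ansible-playbook}" "$playbook" --check --diff --limit "$group"
}

validate_nginx() {
    # nginx -t parses the configuration and exits non-zero on syntax errors.
    "${NGINX:-nginx}" -t
}
```

A call such as `dry_run_playbook site.yml dev` (both names are placeholders) shows the pending diff for `nginx.conf` on the dev group, and `validate_nginx` on a target host confirms whether the rendered file parses.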
Scorecard:
- 5: Uses dry-run, validates config
- 3: Reverts blindly
- 1: No Ansible debug awareness
For: Both
Symptoms: Web tier can't hit backend reliably.
Answer:
- Use `ping`, `traceroute`, `iperf3` to baseline.
- Check EC2 type (burstable limit hit?).
- Validate NACL rules and route table entries.
iperf3 -c backend.local
aws ec2 describe-instances ...
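A baseline is only useful if it produces a number that can be compared later. The sketch below extracts the average round-trip time from `ping` output. It assumes the iputils (Linux) or BSD summary line format; the function names and the default sample count are illustrative.

```shell
# Sketch: capture an average-latency baseline between two tiers.
# Assumes iputils or BSD ping summary output; function names are hypothetical.

parse_ping_avg() {
    # Read ping output on stdin and print the average RTT in ms.
    # Matches "rtt min/avg/max/mdev = a/b/c/d ms" and "round-trip min/avg/max/stddev = ...".
    awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

baseline_latency() {
    local host="$1" count="${2:-10}"
    ping -c "$count" "$host" | parse_ping_avg
}
```

Running `baseline_latency backend.local` from a web node during a healthy period and again during an incident gives two comparable figures, which helps decide whether to keep digging at the network layer or move on to instance limits.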
Scorecard:
- 5: Network tools + infra limits
- 3: Only looks at app logs
- 1: Doesn't know layers of debugging
For: Senior
Symptoms: Script that used to work now fails.
Answer:
- Validate IAM role's policy (`s3:PutObject`, bucket ARN).
- Check bucket policy for new restrictions (e.g. encryption requirement).
- Confirm if bucket ownership/ACLs changed.
aws s3api get-bucket-policy --bucket my-bucket
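Checking both sides means looking at who the script runs as and at what the bucket now requires. The following sketch collects the bucket-side settings in one pass. The function names and the `AWS` override variable are hypothetical; the `aws` subcommands themselves are standard AWS CLI calls.

```shell
# Sketch: gather the settings that commonly break a previously working S3 upload.
# AWS is a hypothetical override for the CLI binary; function names are illustrative.

whoami_aws() {
    # Confirm which IAM principal the script is actually running as.
    "${AWS:-aws}" sts get-caller-identity
}

bucket_checks() {
    local bucket="$1" aws="${AWS:-aws}"
    # Resource policy: look for new Deny statements or encryption conditions.
    "$aws" s3api get-bucket-policy --bucket "$bucket" --query Policy --output text
    # Default encryption configuration.
    "$aws" s3api get-bucket-encryption --bucket "$bucket"
    # Object Ownership setting (BucketOwnerEnforced disables ACLs).
    "$aws" s3api get-bucket-ownership-controls --bucket "$bucket"
}
```

Each call returns an error when the corresponding setting is not configured on the bucket, and that is itself useful information when comparing against how the bucket looked while the script still worked.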
Scorecard:
- 5: IAM + S3 policy mastery
- 3: Checks one side
- 1: Doesn't know policies
For: Both
Symptoms: No logs from app nodes after restart.
Answer:
- Check `rsyslog` or `fluentd` config.
- Validate port open to central log collector.
- Look at disk queue/backpressure.
systemctl status rsyslog
netstat -an | grep 514
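Before testing the network path, it helps to confirm where the node is actually configured to send logs. The sketch below lists forwarding targets from rsyslog config files. It only understands the legacy `selector @host` and `selector @@host` syntax and does not parse RainerScript `action(type="omfwd" ...)` blocks; the function names are hypothetical.

```shell
# Sketch: list forwarding targets declared in legacy-syntax rsyslog config files.
# Does NOT handle RainerScript action(type="omfwd" ...) blocks; names are hypothetical.

forward_targets() {
    # "@host[:port]" forwards over UDP, "@@host[:port]" over TCP.
    # Lines that start with a comment marker are skipped.
    grep -hE '^[^#]*[[:space:]]@@?[^[:space:]]+' "$@" | awk '{ print $NF }'
}

check_rsyslog_config() {
    # rsyslogd -N1 validates the configuration without starting the daemon.
    rsyslogd -N1
}
```

A typical call is `forward_targets /etc/rsyslog.conf /etc/rsyslog.d/*.conf`. An empty result after a restart suggests the forwarding rule was lost, which is a different problem from a blocked port.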
Scorecard:
- 5: End-to-end log pipeline awareness
- 3: Just restarts rsyslog
- 1: No idea about log shipping
For: Senior
Expected:
- Clear timeline of issue → mitigation → RCA
- Calm under pressure, involved team, postmortem done
- Bonus: mentions alerting, prevention afterward
For: Both
Expected:
- Business impact driven
- Balances long-term automation with short-term fixes
- Uses prioritization framework (e.g. impact x effort)
Expected:
- Identified repetitive task
- Wrote a script/role/tool
- Saved time, improved reliability, shared with team
Expected:
- Keeps README or `docs/` up to date
- Explains how and why, not just what
- Shares usage examples for onboarding
Expected:
- Calm, constructive resolution
- Focused on aligning goals
- Outcome-focused rather than personal
Expected:
- Cost vs. benefit
- Volatility (changing frequently)
- One-off or experimental cases
Expected:
- Rollback plan or mitigation
- Observability: checks metrics/logs
- Communicates clearly with stakeholders
Expected:
- Least privilege, IAM roles
- Secrets management, hardened images
- Patching or automated CVE scans
Expected:
- Expose tools via self-service
- Validate templates, sandboxing
- Educate devs on best practices
Expected:
- Reliability, observability, predictability
- Postmortems, blameless culture
- Automation with empathy for users
Ensure they understand how software moves from code to prod and where platform engineers enable safety, speed, and observability.
- What CI/CD tools have you used? What was your role in maintaining or extending them?
  - Expected: Jenkins, GitHub Actions, GitLab, etc.
  - Bonus: Wrote pipelines, standardized templates, managed runners.
- How do you ensure infrastructure code changes (e.g. Terraform, Ansible) don't break production?
  - Looks for: Test environments, `terraform plan`, Ansible `--check`, staging layers.
- How do you handle secrets in your pipelines?
  - Looks for: Vault, AWS Secrets Manager, GitHub Actions secrets, not hardcoded.
- What steps should a good CI/CD pipeline include before pushing to production?
  - Linting → Tests → Build → Artifact push → Infra provisioning → Deployment → Smoke test → Notification
- Have you ever had to debug a failing pipeline? What was the root cause?
  - Good candidates give specific examples (e.g., test flakiness, permissions issue, environment mismatch).
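The stage ordering in the pipeline question matters mostly because each stage gates the next. A minimal sketch of that fail-fast behavior follows. The stage functions are placeholders that only print a message; a real pipeline would call linters, test runners, and deploy tooling in their place.

```shell
# Sketch: run pipeline stages in order and stop at the first failure.
# All stage bodies are placeholders; names are hypothetical.

lint()       { echo "lint ok"; }
unit_tests() { echo "tests ok"; }
build()      { echo "build ok"; }
smoke_test() { echo "smoke ok"; }

run_stages() {
    local stage
    for stage in "$@"; do
        echo "==> ${stage}"
        if ! "$stage"; then
            # Later stages never run, so a failed test cannot reach deployment.
            echo "pipeline stopped: ${stage} failed" >&2
            return 1
        fi
    done
    echo "pipeline passed"
}
```

Calling `run_stages lint unit_tests build smoke_test` runs the stages in sequence. Candidates who describe this gating explicitly tend to be the ones who have debugged a pipeline that skipped it.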
Identify additional fluency with tools that enhance a modern platform engineer's effectiveness.
- What's the difference between a module and a resource block?
- Have you used `terraform plan`, `apply`, and remote backends?
- How do you manage secrets/state files securely?
- Can you write a simple Dockerfile for a Python app?
- What's the difference between a container and a VM?
- How would you reduce image size?
- What is a pod vs deployment?
- How do you handle secrets/configs in Kubernetes?
- Have you used Helm or Kustomize?
- What tools have you used for metrics/logs (e.g., Prometheus, Grafana, ELK)?
- What are good alerts and bad alerts?
- How do you avoid alert fatigue?
Option | Pros | Cons | When to Use |
---|---|---|---|
Take-home Test | Time-flexible, shows independence | May be done by others or take too long | Mid-level roles; early screen |
Live Pairing | Collaborative, real-time thought process | Pressure may affect fairness | Final rounds; simulating teamwork |
"Write an Ansible role to deploy a static site behind nginx, handle idempotency, and provide a rollback method. Document clearly."
"Let's live-debug an Ansible playbook that fails to deploy nginx on Ubuntu 22.04. I'll be your observer."
Can they scale themselves, collaborate, and act like a force multiplier?
- Documentation Habit: Do they document infra decisions or just code?
- Mentorship: Have they onboarded junior teammates or improved workflows?
- Feedback Receptiveness: Can they take architecture feedback without ego?
- Proactive Ownership: Do they improve pain points without being told?
- Cross-team Collaboration: Can they explain a complex system to a dev or PM?
Trait | Strong Signal | Weak Signal |
---|---|---|
Communication | Documents tradeoffs, shares context | Only answers technical bits |
Reliability Mindset | Thinks in SLAs, rollback, metrics | Just wants things to "work" |
Empathy | Balances dev speed with ops safety | Dismisses dev problems |
Curiosity | Talks about learning new tools, improving infra | Relies only on what they already know |