GitHub Infrastructure Engineer Questionnaire

Thanks again for applying to the Infrastructure Engineer job at GitHub! The purpose of this gist is to get a better sense of your technical skills and overall communication style. Take as much time as you need to answer these questions.

Section 1

Engineers at GitHub communicate primarily in written form, via GitHub Issues and Pull Requests. We expect our engineers to communicate clearly and effectively; they should be able to concisely express both their ideas and complex technological concepts.

Please answer the following questions in as much detail as you feel comfortable with. The questions are purposefully open-ended, and we hope you take the opportunity to show us your familiarity with various technologies, tools, and techniques. Limit each answer to half a page if possible; walls of text are not required, and you'll have a chance to discuss your answers in further detail during a phone interview if we move forward in the process. Finally, feel free to use Google, man pages, and other resources if you'd like.

Q1

A service daemon in production has stopped responding to network requests. You receive an alert about the health of the service, and log in to the affected node to troubleshoot. How would you gather more information about the process and what it is doing? What are common reasons a process might appear to be locked up, and how would you rule out each possibility?

A1:

Each problem below is followed by the diagnostic I would use to rule it out:

The problem is on the alerting tool node
See if the service responds from my local machine, or from another node in the network

Node is down, or disconnected from network
ping or ssh

Process isn't running
ps -ef

Process isn't listening
netstat -lntp, or the equivalent

Load is too high for the process to run
top

Process is spinning CPU
CPU utilization in top

Process is in a wait state, during an IO call for example
S or D in STAT field of ps aux

Kernel recv/send buffers for the socket are full and packets are being dropped
Compare buffer sizes from /proc/net/ to the maximum sizes in /proc/sys/net/ipv4, or inspect packets with tcpdump

Server backlog is full, and connections are being rejected
Look for ECONNREFUSED errors on the client, and inspect packets with tcpdump on the server
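As a rough sketch of the "wait state" check above, the first letter of the STAT field can be mapped to its meaning. The function names classify_stat and describe_pid are hypothetical helpers introduced for illustration; the state letters are the ones documented in ps(1) on Linux.

```shell
#!/bin/bash
# Map the first character of a ps STAT value to a short description.
# Anything after the first character (for example "+" or "s") is ignored.
classify_stat() {
  case "$1" in
    R*) echo "running or runnable" ;;
    S*) echo "interruptible sleep (waiting for an event)" ;;
    D*) echo "uninterruptible sleep (usually IO)" ;;
    T*) echo "stopped" ;;
    Z*) echo "zombie" ;;
    *)  echo "other" ;;
  esac
}

# Hypothetical helper: print the state of a PID (defaults to this shell).
describe_pid() {
  local pid="${1:-$$}"
  local stat
  # "stat=" suppresses the header; tr strips any padding around the value.
  stat="$(ps -o stat= -p "$pid" | tr -d ' ')"
  [ -n "$stat" ] || return 1
  echo "$pid: $stat ($(classify_stat "$stat"))"
}
```

Calling describe_pid with the daemon's PID would print its current state. A process that stays in D across repeated checks is most likely stuck on IO, while one that stays in S is waiting on an event such as a lock or a socket.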

Q2

A user on an ubuntu machine runs curl http://github.com. Describe the lifecycle of the curl process and explain what happens in the kernel, over the network, and on github.com's servers before the command completes.

A2:

The shell forks and execs curl with the specified argument. Once scheduled, curl resolves the host (getaddrinfo()) and connects to the resulting IP (socket(), connect()). In the kernel, the connect() syscall performs the three-way SYN/SYN-ACK/ACK TCP handshake with the remote host. Once the handshake completes, the socket enters the ESTABLISHED state and connect() returns. curl then constructs an HTTP GET request with a "Host: github.com" header and sends it over the network (send() or write()). The request probably fits in one packet, so the entire message is sent at once, regardless of congestion or flow control limits.

Meanwhile, the github.com web server has already created a socket (socket()), bound it (bind()) to the public IP to which github.com resolves (or to a private address behind a proxy), and started listening on it (listen()). The kernel completes the handshake with the client, places the connection in the server's backlog queue, and wakes up the server (if it is blocked in select(), poll(), or accept()). The accept() syscall on the server returns a new file descriptor for the connected socket. The server can now hand off the connection to a thread or process for handling, so it can return to accepting connections. The handling process constructs an HTTP response and sends it back to the client (write() or send()).

The client, which has been blocking on recv(), reads the response into a buffer and closes its descriptor (close()), which sends a FIN to the server and begins closing the TCP connection. curl writes the response to stdout, and the process exits.
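To make the request step concrete, here is a minimal sketch of the bytes a client might send once connect() returns. build_request is a hypothetical helper, and the header set is illustrative rather than curl's exact output (curl also sends a User-Agent header, for example).

```shell
#!/bin/bash
# Print a minimal HTTP/1.1 GET request for a host and path.
# Every line ends with CRLF, and an empty line terminates the headers.
build_request() {
  local host="$1"
  local path="${2:-/}"
  printf 'GET %s HTTP/1.1\r\n' "$path"
  printf 'Host: %s\r\n' "$host"
  printf 'Accept: */*\r\n'
  printf 'Connection: close\r\n'
  printf '\r\n'
}
```

The output of build_request github.com / is only a few dozen bytes, far below a typical 1460-byte TCP segment on Ethernet, which is consistent with the request fitting in a single packet.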

Q3

Explain in detail each line of the following shell script. What is the purpose of this script? How would you improve it?

#!/bin/bash
set -e
set -o pipefail
exec sudo ngrep -P ' ' -l -W single -d bond0 -q 'SELECT' 'tcp and dst port 3306' |
  egrep "\[AP\] .\s*SELECT " |
  sed -e 's/^T .*\[AP\?\] .\s*SELECT/SELECT/' -e 's/$/;/' |
  ssh $1 -- 'sudo parallel --recend "\n" -j16 --spreadstdin mysql github_production -f -ss'

A3:

1: On the non-zero exit status of any command, exit immediately.

2: The exit code of a pipeline is that of the right-most command that exited non-zero, rather than always that of the right-most command.

3: For all TCP packets on the bond0 interface destined for the MySQL port (3306) and containing "SELECT", print them to stdout, one packet per line, with non-printable characters shown as a ' ' rather than a '.'.

4: Select only those packets containing "[AP]" followed by a "SELECT", which limits the results to SQL SELECT statements, rather than statements that happen to have a "SELECT" substring.

5: Eliminate the ngrep prefix, thus recreating the original SQL statement, and ensure that all lines end with a semicolon.

6: On a remote node specified on the command line, create 16 worker processes to relay the modified SQL statements to the github_production database.
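The effect of pipefail described in item 2 can be checked with a small sketch. The two function names are hypothetical; each body runs in a subshell so the option does not leak into the calling shell.

```shell
#!/bin/bash
# Report the exit status bash assigns to the same pipeline,
# first with pipefail off, then with pipefail on.
status_without_pipefail() (
  set +o pipefail
  status=0
  (exit 2) | (exit 3) | true || status=$?
  echo "$status"
)

status_with_pipefail() (
  set -o pipefail
  status=0
  (exit 2) | (exit 3) | true || status=$?
  echo "$status"
)

status_without_pipefail   # prints 0: only the right-most command counts
status_with_pipefail      # prints 3: the right-most command that failed
```

With pipefail on, a failure in ngrep, egrep, or sed is no longer hidden by a successful ssh at the end of the pipeline.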

The purpose of the script is to ensure each SQL command is terminated with a semicolon, so that command termination is not ambiguous. I'm not sure how MySQL handles unterminated commands, but I assume there's a danger that it might try to append the next command to the current one.

The script might be improved by batching SQL commands under a single invocation of mysql, to avoid the overhead of starting the client for each SELECT statement.
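The filtering stages (items 4 and 5) can be isolated and tested without capturing traffic. The sketch below assumes GNU grep and GNU sed, since \s and \? are GNU extensions. extract_selects is a hypothetical name, and the sample input lines are made up to resemble ngrep -W single output.

```shell
#!/bin/bash
# Stages 2 and 3 of the original pipeline as one function that reads
# ngrep-style lines on stdin and writes terminated SELECT statements.
# grep -E replaces the obsolescent egrep; the line-buffered and unbuffered
# flags keep statements flowing promptly when the input is a live capture.
extract_selects() {
  grep -E --line-buffered '\[AP\] .\s*SELECT ' |
    sed -u -e 's/^T .*\[AP\?\] .\s*SELECT/SELECT/' -e 's/$/;/'
}

# Made-up sample input: one plain SELECT and one INSERT ... SELECT.
sample_lines() {
  printf 'T 10.0.0.1:5000 -> 10.0.0.2:3306 [AP] !    SELECT 1\n'
  printf 'T 10.0.0.1:5000 -> 10.0.0.2:3306 [AP] !    INSERT INTO t SELECT 1\n'
}

sample_lines | extract_selects   # prints: SELECT 1;
```

Two further small improvements to the original would be quoting "$1" in the ssh invocation, so an unexpected hostname cannot be word-split, and using grep -E there as well.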

Section 2

The following areas map to technologies we use on a regular basis at GitHub. Experience in all of these areas is not a prerequisite for working here. We'd like to know how many of these overlap with your skill set so that we can tailor our interview questions if we move forward in the process.

Please assess your experience in the following areas on a 1-5 scale, where (1) is "no knowledge or experience" and (5) is "extensive professional experience". If you're not sure, feel free to leave it blank. Just place the number next to the corresponding areas listed here:

system administration
- puppet (1)
- ubuntu (3)
- debian packages (3)
- raid (1)
- new hardware burn-in testing (1)
virtualization
- lxc (1)
- xen/kvm (1)
- esx (1)
- aws (3)
troubleshooting
- debuggers (gdb, lldb) (3)
- profilers (perf, oprofile, perftools, strace) (2)
- network flow (tcpdump, pcap) (3)
large system design
- unix processes and threads (3)
- sockets (3)
- signals (3)
- mysql (3)
- redis (1)
- elasticsearch (2)
coding
- comp-sci fundamentals (data structures, big-O notation) (4)
- git usage (4)
- git internals (2)
- c programming (4)
- shell scripting (3)
- ruby programming (2)
- rails (2)
- javascript (3)
- coffeescript (3)
networking
- TCP/UDP (4)
- bgp (2)
- juniper (1)
- arista (1)
- DDoS mitigation strategies and tools (2)
- transit setup and troubleshooting (2)
operational experience
- reading and debugging code you've never seen before (4)
- handling urgent incidents when on-call (2)
- helping other engineers understand and navigate production systems (3)
- handling large scale production incidents (external communications, internal coordination) (3)

mgummelt/gist:e26908fec9eea7078212
