luan-cestari · March 3, 2015 13:57
diff --git a/GitHub Infrastructure Engineer Questionnaire b/GitHub Infrastructure Engineer Questionnaire
 # GitHub Infrastructure Engineer Questionnaire

 Thanks again for applying to the Infrastructure Engineer job at GitHub! The purpose of this gist is to get a better sense of your technical skills and overall communication style. Take as much time as you need to answer these questions.

 ## Section 1

 Engineers at GitHub communicate primarily in written form, via GitHub Issues and Pull Requests. We expect our engineers to communicate clearly and effectively; they should be able to concisely express both their ideas as well as complex technological concepts.

 Please answer the following questions in as much detail as you feel comfortable with. The questions are purposefully open-ended, and we hope you take the opportunity to show us your familiarity with various technologies, tools, and techniques. Limit each answer to half a page if possible; walls of text are not required, and you'll have a chance to discuss your answers in further detail during a phone interview if we move forward in the process. Finally, feel free to use google, man pages and other resources if you'd like.

 ### Q1

 A service daemon in production has stopped responding to network requests. You receive an alert about the health of the service, and log in to the affected node to troubleshoot. How would you gather more information about the process and what it is doing? What are common reasons a process might appear to be locked up, and how would you rule out each possibility?

 ### A1: 

 I would first check whether the process is still running (using `ps` or `top`). If it is, I would run `top -H -p $PID -bn1` to see the threads of that application, probably taking more than one iteration (by raising the count given to `-n`) to confirm that the sample really represents the state of the process, and I would look for a thread stuck at 100% CPU.

 After that I would gather more information to understand the situation: the application log, the system log (for example, `journalctl` on systemd hosts), and details about the environment and the process itself (such as the contents of `/proc/$PID/`, and whether SELinux, cgroups, or similar mechanisms are involved). I would probably build a custom report with `sosreport` to collect all of this.

 If, with all this information, I still have no clue what is happening (for example, there is no evidence that the process is running out of file descriptors or memory, or that there is a network problem), I would assume we might have a bug in production and move the investigation to the next level. For that we would need the debuginfo for the process, and I would generate a core dump or trace the process with SystemTap, `strace`, or DTrace to better understand why the program is behaving this way.

 Also, from the very beginning I assume we have other nodes that can handle the load of the server we are working on. If we have enough evidence that it is a bug that can affect other nodes and our business at any time, I would suggest rolling back to the previous stable version of the application as soon as possible, and I would reach out to the development team to solve the issue (it might be a new issue, or one they have already fixed and need to backport to a stable version).
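
As a rough illustration of the first checks described above, here is a minimal sketch that reads a few facts straight from `/proc`. The function name `triage_pid` is hypothetical, and it assumes a Linux host where `/proc/$PID/status`, `/proc/$PID/limits`, and `/proc/$PID/fd` are readable by the current user.

```shell
#!/bin/bash
# Hypothetical helper (not part of any tool): summarize a process from /proc.
# Assumes Linux and permission to read the target's /proc entries.
triage_pid() {
  local pid="$1"
  if [ ! -d "/proc/$pid" ]; then
    echo "process $pid is not running" >&2
    return 1
  fi
  # Name, scheduler state (R, S, D, Z, ...) and thread count.
  grep -E '^(Name|State|Threads):' "/proc/$pid/status"
  # Open file descriptors versus the soft limit, to rule out FD exhaustion.
  local fds limit
  fds=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
  echo "open_fds: $fds (soft limit: $limit)"
}
```

Calling `triage_pid "$PID"` before reaching for `strace` or a core dump gives a quick view of whether the process is blocked in uninterruptible sleep (state `D`) or close to its descriptor limit.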

 ### Q2

 A user on an ubuntu machine runs `curl http://github.com`. Describe the lifecycle of the curl process and explain what happens in the kernel, over the network, and on github.com's servers before the command completes.


 ### A2: 
 `curl` relies on the libcurl library (which can also be used by other programs and programming languages). The `curl` command passes a set of default options to the library to handle the command line (we could, for example, set the HTTP method, cookies, and so on). Because libcurl relies on the UNIX interface, it uses a set of system calls (*) to handle the request, from starting the process (`execve`) to making a non-blocking network request to the remote server. From the network perspective, the process uses the configured DNS servers to resolve the name to an IP address (provided the firewall rules and network connectivity allow it), then sends the HTTP request and fetches the response (which may be handled by a reverse proxy). From the GitHub backend perspective (so far I was talking about the client side), the external request would be processed by one of a set of load balancers, protected by firewall rules and other security measures, especially if they sit in the DMZ and have to mediate internet access to internal services. The GitHub backend probably has hardware load balancers such as F5 to handle simple requests like this one, redirecting it to the HTTPS service and logging the information for security and BI purposes.

 *The system calls used on my machine, collected with `strace`: poll, mmap, mprotect, open, sendto, read, fstat, close, getsockopt, getpeername, access, execve, brk, getsockname, munmap, rt_sigaction, rt_sigprocmask, ioctl, pipe, madvise, socket, connect, recvfrom, setsockopt, clone, fcntl, getrlimit, statfs, arch_prctl, gettid, futex, set_tid_address, set_robust_list, sendmmsg
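
A list like the one above can be reproduced with a small filter over a saved trace. This is only a sketch: `syscall_names` is a hypothetical helper, and it assumes the trace was written with `strace -f -o FILE`, where each line optionally starts with a PID column followed by the syscall name and an opening parenthesis.

```shell
#!/bin/bash
# Hypothetical helper: print the unique syscall names found in strace output
# read from stdin. Lines that are not syscall entries (signals, exit notes,
# "<... resumed>" continuations) do not match and are skipped.
syscall_names() {
  sed -n -E 's/^([0-9]+ +)?([a-z_0-9]+)\(.*/\2/p' | sort -u
}
```

For example, after `strace -f -o curl.trace curl http://github.com`, running `syscall_names < curl.trace` prints one syscall name per line.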

 ### Q3

 Explain in detail each line of the following shell script. What is the purpose of this script? How would you improve it?

 ```
 #!/bin/bash
 set -e
 set -o pipefail
 exec sudo ngrep -P ' ' -l -W single -d bond0 -q 'SELECT' 'tcp and dst port 3306' |
  egrep "\[AP\] .\s*SELECT " |
  sed -e 's/^T .*\[AP\?\] .\s*SELECT/SELECT/' -e 's/$/;/' |
  ssh $1 -- 'sudo parallel --recend "\n" -j16 --spreadstdin mysql github_production -f -ss'
 ```

 ### A3: 
 Line-by-line comments:
 #1 - tells the system which program should execute the file (here, `/bin/bash`)
 #2 - makes the script stop as soon as a command returns a non-zero exit status that is not handled in the script
 #3 - makes a pipeline (commands joined with `|`) return the exit status of the rightmost command that failed, instead of only the status of the last command; this avoids the confusing case where the pipeline returns 0 even though stdout/stderr shows that a command in the middle of the pipeline failed
 #4 - replaces the current shell with `ngrep` (run through `sudo`), which captures network traffic on the `bond0` interface (a bonded, virtual interface that groups two or more network interfaces), matching TCP packets destined for port 3306 that contain `SELECT`
 #5 - `egrep` (grep with extended regular expressions) filters the content sent through the pipe by the previous command, keeping the lines where the `[AP]` marker is followed by `SELECT`
 #6 - `sed` processes the output of the previous command, stripping everything before `SELECT` and appending a `;` to each line
 #7 - connects over `ssh` to the server given as the script's first argument and runs up to 16 `mysql` jobs in parallel (by default, `parallel` would run one job per CPU core), spreading stdin across them so that the captured `SELECT` statements are replayed against the `github_production` database on that server

 One problem with this script is that it lacks validation of its argument, which the `ssh` command needs. I think it could also send a notification when one of the commands does not work as expected (e.g., it cannot connect to the remote server).
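
The argument validation mentioned above could look roughly like the following sketch. The function name `require_target_host` and the usage message are hypothetical; the checks shown (exactly one non-empty argument that does not start with `-`) are one reasonable choice, not the only one.

```shell
#!/bin/bash
# Hypothetical guard for the replay script: require exactly one non-empty
# target host before any traffic is captured.
require_target_host() {
  if [ "$#" -ne 1 ] || [ -z "$1" ]; then
    echo "usage: ${0##*/} <target-host>" >&2
    return 1
  fi
  # Reject values starting with "-" so ssh cannot read the host as an option.
  case "$1" in
    -*) echo "invalid host: $1" >&2; return 1 ;;
  esac
  printf '%s\n' "$1"
}
```

In the script this would run before the pipeline, for example `host=$(require_target_host "$@") || exit 1`, and the last stage would then use `ssh "$host" -- ...` with the variable quoted.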



 ## Section 2

 The following areas map to technologies we use on a regular basis at GitHub. Experience in all of these areas is not a prerequisite for working here. We'd like to know how many of these overlap with your skill set so that we can tailor our interview questions if we move forward in the process.

 Please assess your experience in the following areas on a 1-5 scale, where (1) is "no knowledge or experience" and (5) is "extensive professional experience". If you're not sure, feel free to leave it blank. Just place the number next to the corresponding areas listed here:

 - system administration 
  - puppet 3
  - ubuntu 4
  - debian packages 4
  - raid 3
  - new hardware burn-in testing 2
 - virtualization
  - lxc 4
  - xen/kvm 3
  - esx 0
  - aws 5
 - troubleshooting
  - debuggers (gdb, lldb) 3
  - profilers (perf, oprofile, perftools, strace) 3
  - network flow (tcpdump, pcap) 4
 - large system design
  - unix processes and threads 5
  - sockets 5
  - signals 5
  - mysql 4
  - redis 3
  - elasticsearch 3
 - coding
  - comp-sci fundamentals (data structures, big-O notation) 5
  - git usage 4
  - git internals 2
  - c programming 4
  - shell scripting 4
  - ruby programming 4
  - rails 3
  - javascript 4
  - coffeescript 3
 - networking
  - TCP/UDP 5
  - bgp 2
  - juniper 0
  - arista 0
  - DDoS mitigation strategies and tools 2
  - transit setup and troubleshooting 2
 - operational experience
  - reading and debugging code you've never seen before 5
  - handling urgent incidents when on-call 4
  - helping other engineers understand and navigate production systems 5
  - handling large scale production incidents (external communications, internal coordination) 5