As you may already know, I really like strace. (It has a whole category on this blog.) So when the people at Big Data Montreal asked if I wanted to give a talk about stracing Hadoop, the answer was YES OBVIOUSLY.
I set up a small Hadoop cluster (1 master, 2 workers, replication set to 1) on Google Compute Engine to get this working, so that's what we'll be talking about. It has one 14GB CSV file, which contains part of this Wikipedia revision history dataset.
Let's start diving into HDFS! (If this is familiar to you, I talked about a lot of this already in Diving into HDFS. There are new things, though! At the end of this we edit the blocks on the data node and see what happens, and it's GREAT.)
$ snakebite ls -h /