http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/est.fa.gz
Human ESTs (expressed sequence tags), in FASTA format, from UCSC's hg19 downloads. I first read about this data set on Jonathan Dursi's blog comparing random vs. streaming I/O performance.
https://github.com/rozim/ChessData/archive/master.zip
Several million "quality chess games" in PGN format. I first read about this data set on Adam Drake's blog comparing Hadoop performance to local multithreaded bash.
http://blog.yhathq.com/posts/7-funny-datasets.html