A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and "big data" database systems, with an eye towards real-world deployments. I figured I'd share the list. It's biased and rather incomplete but maybe of interest to someone. While many are obvious choices (I've omitted several, like MapReduce), I think there are a few underappreciated gems.
###Dataflow Engines:
Dryad--general-purpose distributed parallel dataflow engine
http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf
Spark--in memory dataflow
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Spark Streaming--building streaming on top of a distributed data flow engine
http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf
Nectar--reusing previously computed results in dataflows (HT @squarecog)
http://static.usenix.org/events/osdi10/tech/full_papers/Gunda.pdf
Differential dataflow: fresh take on incremental computation and dataflow
http://research.microsoft.com/pubs/176693/differentialdataflow.pdf
DBToaster: fast, modern materialized view maintenance
http://vldb.org/pvldb/vol5/p968_yanifahmad_vldb2012.pdf
TelegraphCQ: good example of (old) stream processing systems--useful to contrast with, say, Storm
http://sites.google.com/site/sailesh/TCQcidr03.pdf
Borealis: research distributed stream processing system from the 2000s (HT @marcua)
http://www.cs.harvard.edu/~mdw/course/cs260r/papers/borealis-cidr05.pdf
###Full-stack "Database System" Category
C-Store: columnar storage, now Vertica
http://people.csail.mit.edu/tdanford/6830papers/stonebraker-cstore.pdf
Column stores vs row stores
http://www.courses.fas.harvard.edu/~cs265/papers/abadi-2008.pdf
Google Dremel--columnar storage for fast queries on disk (c.f. Impala)
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36632.pdf
Google PowerDrill--columnar storage and some optimizations for fast in-memory queries (HT @squarecog)
http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf
Google Spanner--strongly consistent global database
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/spanner-osdi2012.pdf
Motivation behind VoltDB: lots of overhead in systems besides "useful work"; one of my favorite papers from recent years:
http://nms.csail.mit.edu/~stavros/pubs/OLTP_sigmod08.pdf
Google File System--de facto large scale distributed FS architecture
http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf
DryadLINQ--program collections, not dataflows directly
http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf
FlumeJava (similar to DryadLINQ, but from GOOG)
http://faculty.neu.edu.cn/cc/zhangyf/cloud-bigdata/papers/big%20data%20programming/FlumeJava-pldi-2010.pdf
Google Tenzing--SQL on MR
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/37200.pdf
Shark: Building Hive on Spark
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf
DynaMat: Useful for thinking about when to prematerialize views:
http://idke.ruc.edu.cn/seminars/phd/2008/01.08/DynaMat-%20A%20Dynamic%20View%20Management%20System%20for%20Data%20Warehouses.pdf
Jeff Ullman (among other awesome things, the co-author of the dragon compiler book) and friends give a neat and powerful greedy algorithm for efficient data cubing:
http://www.cs.aau.dk/~simas/dat5_08/papers/P205.pdf
Mesos--scheduling for DCs; some ideas adapted in YARN
http://www.mesosproject.org/papers/nsdi_mesos.pdf
Dominant Resource Fairness: multi-resource scheduling
http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf
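
To give a flavor of the multi-resource scheduling idea in the Dominant Resource Fairness paper, here is a minimal sketch of progressive filling: repeatedly hand one task to the user whose dominant share (their largest fractional use of any single resource) is currently lowest. The function name `drf_allocate` and the data layout are my own for illustration, not taken from the paper or from Mesos, and the sketch simplifies by dropping a user once their next task no longer fits rather than stopping outright.

```python
def drf_allocate(capacity, demands):
    """Sketch of DRF progressive filling (hypothetical helper, not the Mesos code).

    capacity: total amount of each resource, e.g. (cpus, memory_gb)
    demands:  per-user, per-task demand vectors in the same resource order
    Returns the number of tasks launched for each user.
    """
    used = [0.0] * len(capacity)
    tasks = {user: 0 for user in demands}
    dominant = {user: 0.0 for user in demands}
    active = set(demands)

    while active:
        # Pick the active user with the lowest dominant share
        # (sorted() makes tie-breaking deterministic).
        user = min(sorted(active), key=lambda u: dominant[u])
        demand = demands[user]

        # Assumption of this sketch: if the next task does not fit,
        # stop considering this user and keep going with the others.
        if any(u + d > c for u, d, c in zip(used, demand, capacity)):
            active.remove(user)
            continue

        used = [u + d for u, d in zip(used, demand)]
        tasks[user] += 1
        # Dominant share = max over resources of (amount held / total capacity).
        dominant[user] = max(tasks[user] * d / c for d, c in zip(demand, capacity))

    return tasks


if __name__ == "__main__":
    # 9 CPUs and 18 GB total; A's tasks need <1 CPU, 4 GB>, B's need <3 CPUs, 1 GB>.
    print(drf_allocate((9, 18), {"A": (1, 4), "B": (3, 1)}))
```

With those inputs, A ends up memory-dominant and B CPU-dominant, and both finish with the same dominant share of 2/3.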
###Slides and notes
http://rxin.github.com/db-readings/
http://www.cs.berkeley.edu/~istoica/classes/cs294/11/
http://www.courses.fas.harvard.edu/~cs265/syllabus.html
http://www.courses.fas.harvard.edu/~cs265/notes/
Nice. Thanks.
Just one thing, though: you wrote "Google Dremel--columnar storage for fast queries on disk (c.f. Impala)", but Impala has nothing to do with Dremel, nor does Dremel have anything to do with an MPP database. Impala is closer to what Teradata or Netezza do. The closest thing to Dremel is Apache Drill.