Quick and dirty (incomplete) list of interesting, mostly recent data warehousing/"big data" papers

A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and "big data" database systems, with an eye towards real-world deployments. I figured I'd share the list. It's biased and rather incomplete, but it may be of interest to someone. Many are obvious choices (and I've omitted several others, like MapReduce), but I think there are a few underappreciated gems.

###Dataflow Engines:

Dryad--general-purpose distributed parallel dataflow engine
http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf

Spark--in-memory dataflow
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

###Streaming and Matviews

Spark Streaming--building streaming on top of a distributed data flow engine
http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf

Nectar--reusing previously computed results in dataflows (HT @squarecog)
http://static.usenix.org/events/osdi10/tech/full_papers/Gunda.pdf

Differential dataflow: fresh take on incremental computation and dataflow
http://research.microsoft.com/pubs/176693/differentialdataflow.pdf

DBToaster: fast, modern materialized view maintenance
http://vldb.org/pvldb/vol5/p968_yanifahmad_vldb2012.pdf

TelegraphCQ: good example of (old) stream processing systems--useful to contrast to, say, Storm
http://sites.google.com/site/sailesh/TCQcidr03.pdf

Borealis: research distributed stream processing system from the 2000s (HT @marcua)
http://www.cs.harvard.edu/~mdw/course/cs260r/papers/borealis-cidr05.pdf

###Full-stack "Database System" Category

####Mostly OLAP

C-Store: columnar storage, now Vertica
http://people.csail.mit.edu/tdanford/6830papers/stonebraker-cstore.pdf

Column stores vs row stores
http://www.courses.fas.harvard.edu/~cs265/papers/abadi-2008.pdf

Google Dremel--columnar storage for fast queries on disk (cf. Impala)
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36632.pdf

Google PowerDrill--columnar storage and some optimizations for fast in-memory queries (HT @squarecog)
http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf

####Mostly Non-OLAP

Google Spanner--strongly consistent global database
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/spanner-osdi2012.pdf

OLTP Through the Looking Glass--motivation behind VoltDB: traditional OLTP systems spend much of their time on overhead rather than "useful work"; one of my favorite papers from recent years
http://nms.csail.mit.edu/~stavros/pubs/OLTP_sigmod08.pdf

Google File System--de facto large scale distributed FS architecture
http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf

###Languages/programming interfaces:

DryadLINQ--program collections, not dataflows directly
http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf

FlumeJava (similar to DryadLINQ, but from Google)
http://faculty.neu.edu.cn/cc/zhangyf/cloud-bigdata/papers/big%20data%20programming/FlumeJava-pldi-2010.pdf

Google Tenzing--SQL on MR
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/37200.pdf

Shark: Building Hive on Spark
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf

###Brief Data Warehousing details (older stuff)

DynaMat: Useful for thinking about when to prematerialize views:
http://idke.ruc.edu.cn/seminars/phd/2008/01.08/DynaMat-%20A%20Dynamic%20View%20Management%20System%20for%20Data%20Warehouses.pdf

Jeff Ullman (among other awesome things, the co-author of the dragon compiler book) and friends give a neat and powerful greedy algorithm for choosing which data cube views to materialize:
http://www.cs.aau.dk/~simas/dat5_08/papers/P205.pdf
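
For a feel of what the greedy algorithm does, here is a rough Python sketch as I understand it, not code from the paper. Each view is identified by the set of dimensions it groups by, a view can answer any query over a subset of its dimensions, and each round picks the view that saves the most rows scanned given what is already materialized. The function name `greedy_select` and the view sizes in the usage note are made up for illustration.

```python
def greedy_select(sizes, k):
    """Pick up to k views to materialize in addition to the top view.

    sizes: dict mapping frozenset(dimensions) -> estimated row count.
           The view with the most dimensions is treated as the top view
           and is always materialized.
    Returns the list of selected views, top view first.
    """
    top = max(sizes, key=len)
    selected = [top]
    # cost[w] = rows scanned to answer view w from the cheapest selected ancestor
    cost = {w: sizes[top] for w in sizes}

    for _ in range(k):
        best, best_benefit = None, 0
        for v in sizes:
            if v in selected:
                continue
            # Benefit of v: total rows saved over every view w that v can answer
            # (w <= v means w's dimensions are a subset of v's).
            benefit = sum(max(0, cost[w] - sizes[v]) for w in sizes if w <= v)
            if benefit > best_benefit:
                best, best_benefit = v, benefit
        if best is None:  # nothing left that helps
            break
        selected.append(best)
        for w in sizes:
            if w <= best:
                cost[w] = min(cost[w], sizes[best])
    return selected
```

With hypothetical sizes for a part/supplier/customer cube (`psc`: 6000, `pc`: 6000, `ps`: 800, `sc`: 6000, `p`: 200, `s`: 10, `c`: 100, empty grouping: 1) and `k=2`, the sketch picks `ps` first and then `c`: `pc` and `sc` are as large as the top view, so materializing them saves nothing.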

###Scheduling

Mesos--resource scheduling for datacenters; some ideas adapted in YARN
http://www.mesosproject.org/papers/nsdi_mesos.pdf

Dominant Resource Fairness: multi-resource scheduling
http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf
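
The core idea of DRF fits in a few lines: keep handing the next task to whichever user currently has the smallest dominant share (the largest fraction of any one resource they hold). Below is a minimal Python sketch under my own simplifying assumptions: integer capacities and demands, every user has unlimited identical tasks, and allocation stops as soon as the chosen user's next task does not fit. The name `drf_allocate` is mine, not from the paper or from Mesos.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Progressive filling by lowest dominant share.

    capacity: tuple of integer resource totals, e.g. (cpus, gb_of_memory)
    demands:  dict user -> tuple of integer per-task demands, same order
    Returns a dict user -> number of tasks launched.
    """
    if not demands or any(not any(d) for d in demands.values()):
        raise ValueError("need at least one user, each with a nonzero demand")

    used = [0] * len(capacity)
    tasks = {user: 0 for user in demands}

    def dominant_share(user):
        # Largest fraction of any single resource currently held by this user.
        return max(Fraction(tasks[user] * d, c)
                   for d, c in zip(demands[user], capacity))

    while True:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        user = min(sorted(demands), key=dominant_share)
        need = demands[user]
        if any(u + n > c for u, n, c in zip(used, need, capacity)):
            return tasks  # simplification: stop at the first task that does not fit
        used = [u + n for u, n in zip(used, need)]
        tasks[user] += 1
```

For a cluster with 9 CPUs and 18 GB, where user A's tasks need 1 CPU and 4 GB and user B's tasks need 3 CPUs and 1 GB, `drf_allocate((9, 18), {"A": (1, 4), "B": (3, 1)})` returns `{"A": 3, "B": 2}`: both users end up with a dominant share of 2/3 (memory for A, CPU for B).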

###Slides and notes
http://rxin.github.com/db-readings/
http://www.cs.berkeley.edu/~istoica/classes/cs294/11/
http://www.courses.fas.harvard.edu/~cs265/syllabus.html
http://www.courses.fas.harvard.edu/~cs265/notes/
