building a production data warehouse

Building a Production Data Warehouse

So I hear business is going well...customers are flocking to your site/app...people just can't get enough? Congratulations, this is a huge step most companies don't reach.

What now? Well, business analysts, modelers, engineers, customer support and marketing want to access an authoritative, consistent, timely access to your organization's data. Perhaps we can build some beautiful dashboards to visualize KPIs for executive management.

You're in need of a Data Warehouse? Maybe a Data Lake? Where do you start? What sort of considerations & alternatives should you think about.

Turns out there are a variety of things you must take into consideration depending on your requirements. A good place to start is to interview & establish a relationship with team leads to better understand their needs. As you're going through the discovery process, here's an algorithm of the types of things you'll need to consider in your discovery phase.

What is a Data Warehouse? Data Lake?

Data Warehouse is a large store of data accumulated from a wide range of sources within a company and used to guide management decisions.

A Data Lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.

An important distinction I want to emphasize is that a Data Warehouse is a subset of a Data Lake. There are various types/formats of data which may not be available in your Enterprise Data Warehouse by design.

6 Steps to Production Data Warehouse/Lake

###1) Planning During the planning phase, here are some questions you'll want to think about

Cloud

Performance vs Price tradeoff
Qualified vendors
Cost vs on-premise solution
Cloud stack features

On-premise

How long will it take to build?
One-off vs platform
Do we have the staff/expertise?

As a Service

Identify the cons to this cloud agnostic, managed & automated service

Building blocks of a well designed Data Warehouse/Lake uses the following:

Component	AWS
Compute	Elastic Compute Cloud (EC2)
Disk Storage	Elastic Block Storage (EBS)
Object Storage	Amazon Simple Storage (S3)
Network	Virtual Private Cloud (VPC)
Key Management	AWS Key Management Service

###2) Provisioning ####Workload & Node types Bigger nodes aren't always better, find a sweet spot. Figure out if your workload is driven by EBS I/O or compute relative to cost.

Technology

Which technology? Which vendor?

ANSI SQL?
UDF?
What happens when queries run out of memory? Performance cliff?
Concurrency
Hadoop (catch all)
MPP SQL
Spark • Pipeline architecture? – Centralized
Distributed • Storage architecture
Persistent object store vs. instance store? Price/performance impact
Data collection, movement and ingest/extract architecture critical
What's the best way to perform configuration management?

###3) Enterprise Integration

Data Sources

Inernal
Third Party Data
Structured log files (batch or streaming)

Data Kinematics

Ingestion rate?
Update rate?
Compression?

Schema management

What format are files stored in?
Metadata needs to be associated with creation of schema to store the data

Monitoring & Alerting

What scenarios constitute a failure?
Can/How do we recover?
Can you undo?

###4) Compliance/Security

Stand up a VPC
Lock down single tenant VPC
Audit & log everything
Compliance to legal regulations
Encrypt data at rest
Encrypt data going in/out (in-flight)

###5) SLA

End-2-End Monitoring

Performance
Health
Dashboards for data pipelines

###6) Data Pipelines

Refresh rate
Which frameworks to use (if any)?
Would we ever need data marts? If so when?
Scheduler needs to setup to run jobs periodically

The steps above are simply a starting point to get your thinking of how your environment should look like!

donigian/building-a-production-data-warehouse.md