So I hear business is going well...customers are flocking to your site/app...people just can't get enough? Congratulations, this is a huge step most companies don't reach.
What now? Well, business analysts, modelers, engineers, customer support and marketing all want authoritative, consistent, timely access to your organization's data. Perhaps we can build some beautiful dashboards to visualize KPIs for executive management.
Are you in need of a Data Warehouse? Maybe a Data Lake? Where do you start? What sort of considerations & alternatives should you think about?
Turns out there are a variety of things you must take into consideration depending on your requirements. A good place to start is to interview team leads & establish a relationship with them to better understand their needs. As you go through the discovery process, here's a checklist of the types of things you'll need to consider.
A Data Warehouse is a large store of data accumulated from a wide range of sources within a company and used to guide management decisions.
A Data Lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
An important distinction I want to emphasize is that a Data Warehouse is a subset of a Data Lake: by design, various types/formats of data held in the lake may not be available in your Enterprise Data Warehouse.
###1) Planning
During the planning phase, here are some questions you'll want to think about:
- Performance vs Price tradeoff
- Qualified vendors
- Cost vs on-premise solution
- Cloud stack features
- How long will it take to build?
- One-off vs platform
- Do we have the staff/expertise?
- Identify the cons of a cloud-agnostic, managed & automated service
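The cost vs on-premise question above often comes down to a break-even estimate. Here is a minimal sketch, assuming a simple model where on-premise has an upfront spend plus a flat monthly cost and cloud has only a flat monthly cost. The function name `break_even_month` and all figures are made up for illustration; real pricing is rarely this flat.

```python
import math
from typing import Optional


def break_even_month(onprem_upfront: float,
                     onprem_monthly: float,
                     cloud_monthly: float) -> Optional[int]:
    """First month in which cumulative on-premise cost is no higher than cloud.

    Returns None when cloud is never more expensive per month, because the
    upfront spend is then never recovered.
    """
    monthly_saving = cloud_monthly - onprem_monthly
    if monthly_saving <= 0:
        return None
    return math.ceil(onprem_upfront / monthly_saving)


# Made-up figures, for illustration only.
months = break_even_month(onprem_upfront=120_000,
                          onprem_monthly=3_000,
                          cloud_monthly=8_000)
print(months)
```

If the break-even point is further out than you expect the platform to live, the upfront investment is hard to justify on cost alone.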
A well-designed Data Warehouse/Lake uses the following building blocks:
Component | AWS
---|---
Compute | Elastic Compute Cloud (EC2)
Disk Storage | Elastic Block Store (EBS)
Object Storage | Amazon Simple Storage Service (S3)
Network | Virtual Private Cloud (VPC)
Key Management | AWS Key Management Service (KMS)
###2) Provisioning
####Workload & Node types
Bigger nodes aren't always better; find a sweet spot. Figure out whether your workload is driven by EBS I/O or by compute, relative to cost.
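One way to look for that sweet spot is to benchmark a representative workload on a few node sizes and compare cost per unit of work instead of raw speed. The sketch below assumes you have already measured throughput yourself. The `NodeOption` type, the node names and every number are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeOption:
    name: str
    hourly_cost: float      # USD per node-hour (made-up figure)
    rows_per_second: float  # throughput from your own benchmark (made-up figure)


def cost_per_billion_rows(node: NodeOption) -> float:
    """Dollars spent to process one billion rows on this node type."""
    seconds = 1e9 / node.rows_per_second
    return node.hourly_cost * seconds / 3600


def sweet_spot(options: List[NodeOption]) -> NodeOption:
    """Pick the node type with the lowest cost per unit of work."""
    return min(options, key=cost_per_billion_rows)


options = [
    NodeOption("small", 0.50, 40_000),
    NodeOption("medium", 1.00, 100_000),
    NodeOption("large", 4.00, 250_000),
]
best = sweet_spot(options)
print(best.name, round(cost_per_billion_rows(best), 2))
```

With these invented numbers the largest node is the fastest but also the most expensive per row, which is the "bigger isn't always better" effect in miniature.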
Which technology? Which vendor?
- ANSI SQL?
- UDF?
- What happens when queries run out of memory? Performance cliff?
- Concurrency
- Hadoop (catch all)
- MPP SQL
- Spark

Pipeline architecture?
- Centralized
- Distributed

Storage architecture
- Persistent object store vs. instance store? Price/performance impact
- Data collection, movement and ingest/extract architecture critical
- What's the best way to perform configuration management?
###3) Enterprise Integration
- Internal
- Third Party Data
- Structured log files (batch or streaming)
- Ingestion rate?
- Update rate?
- Compression?
- What format are files stored in?
- Metadata needs to be recorded alongside the schema created to store the data
- What scenarios constitute a failure?
- Can/How do we recover?
- Can you undo?
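The failure, recovery and undo questions above are easier to answer when every batch loads atomically and is tagged with an identifier. Here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for the warehouse. The `events` table, the `batch_id` column and the rule that a missing amount counts as a failure are all assumptions made for illustration.

```python
import sqlite3


def load_batch(conn, batch_id, rows):
    """Insert one batch atomically; any bad row rolls back the whole batch."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for user_id, amount in rows:
                if amount is None:  # assumed definition of a failure
                    raise ValueError(f"missing amount for user {user_id}")
                conn.execute(
                    "INSERT INTO events (batch_id, user_id, amount) "
                    "VALUES (?, ?, ?)",
                    (batch_id, user_id, amount),
                )
        return True
    except ValueError:
        return False


def undo_batch(conn, batch_id):
    """Remove every row that a given batch loaded."""
    with conn:
        conn.execute("DELETE FROM events WHERE batch_id = ?", (batch_id,))


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (batch_id TEXT, user_id INTEGER, amount REAL)")

load_batch(conn, "2024-01-01", [(1, 10.0), (2, 5.0)])  # succeeds
load_batch(conn, "2024-01-02", [(3, 1.0), (4, None)])  # fails, nothing is kept
print(row_count(conn))
```

Because the failed batch leaves no partial rows behind, recovery is simply a re-run, and `undo_batch` answers "can you undo?" for batches that loaded cleanly but turned out to be wrong.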
###4) Compliance/Security
- Stand up a VPC
- Lock down single tenant VPC
- Audit & log everything
- Compliance with legal regulations
- Encrypt data at rest
- Encrypt data going in/out (in-flight)
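A checklist like the one above is easier to enforce when it is checked automatically. The sketch below validates a plain configuration dictionary against the controls listed. The key names (`single_tenant_vpc`, `audit_logging` and so on) are hypothetical and do not correspond to any cloud provider's API; in practice you would populate the dictionary from your own infrastructure tooling.

```python
from typing import Dict, List

# Hypothetical control names mirroring the checklist above.
REQUIRED_CONTROLS = {
    "single_tenant_vpc": "Environment runs in a locked-down, single-tenant VPC",
    "audit_logging": "All access is audited and logged",
    "encrypt_at_rest": "Data is encrypted at rest",
    "encrypt_in_flight": "Data is encrypted going in/out",
}


def missing_controls(config: Dict[str, bool]) -> List[str]:
    """Return the required controls that are absent or switched off."""
    return sorted(name for name in REQUIRED_CONTROLS
                  if not config.get(name, False))


# Made-up environment description.
staging = {
    "single_tenant_vpc": True,
    "audit_logging": False,
    "encrypt_at_rest": True,
}
gaps = missing_controls(staging)
for name in gaps:
    print("MISSING:", REQUIRED_CONTROLS[name])
```

A control that is not mentioned at all is treated the same as one that is turned off, so new environments fail the check until someone confirms each item.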
###5) SLA
- Performance
- Health
- Dashboards for data pipelines
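A pipeline health dashboard usually starts with one simple signal: how stale is each dataset compared with what was promised? Below is a minimal sketch of a freshness check. The pipeline names, timestamps and the one-hour thresholds are invented, and a real setup would read the last load times from your pipeline metadata.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def freshness_report(last_loaded: Dict[str, datetime],
                     max_age: Dict[str, timedelta],
                     now: datetime) -> Dict[str, str]:
    """Label each pipeline 'ok' or 'stale' against its agreed maximum age."""
    report = {}
    for pipeline, loaded_at in last_loaded.items():
        age = now - loaded_at
        report[pipeline] = "ok" if age <= max_age[pipeline] else "stale"
    return report


# Made-up example data with a fixed clock so the result is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_loaded = {
    "orders": now - timedelta(minutes=20),
    "clickstream": now - timedelta(hours=3),
}
max_age = {
    "orders": timedelta(hours=1),
    "clickstream": timedelta(hours=1),
}
report = freshness_report(last_loaded, max_age, now)
print(report)
```

Passing `now` in as an argument keeps the check testable and avoids surprises from clock differences between machines.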
###6) Data Pipelines
- Refresh rate
- Which frameworks to use (if any)?
- Would we ever need data marts? If so when?
- A scheduler needs to be set up to run jobs periodically
The steps above are simply a starting point to get you thinking about what your environment should look like!
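Refresh rate and scheduling go hand in hand: each job has an interval, and the scheduler's core question is "which jobs are due right now?". Here is a minimal sketch of that decision, not a replacement for a real scheduler. The `Job` type and the job names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Job:
    name: str
    interval: timedelta                  # desired refresh rate
    last_run: Optional[datetime] = None  # None means it has never run


def due_jobs(jobs: List[Job], now: datetime) -> List[str]:
    """Names of jobs that have never run or whose interval has elapsed."""
    return [job.name for job in jobs
            if job.last_run is None or now - job.last_run >= job.interval]


# Made-up jobs with a fixed clock so the result is reproducible.
now = datetime(2024, 1, 1, 6, 0)
jobs = [
    Job("hourly_orders", timedelta(hours=1), datetime(2024, 1, 1, 5, 0)),
    Job("daily_mart", timedelta(days=1), datetime(2024, 1, 1, 0, 0)),
    Job("new_feed", timedelta(hours=6)),
]
due = due_jobs(jobs, now)
print(due)
```

A production scheduler adds retries, dependencies between jobs and alerting on top of this check, which is where the framework question above comes in.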