Data Engineer's Responsibilities (not all-encompassing):
- Build data platforms
- Define data architecture and data modeling
- Handle data in various formats
- Create ETL or ELT pipelines as well as streaming data pipelines
- Schedule and deploy pipelines
- Build frameworks or code for data management activities
- Make data accessible with the right governance in place
- Enable self-service access to data
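A minimal sketch of the ETL item in the list above, using only the Python standard library. The sample data, the `orders` table, and the function names are all hypothetical, chosen for illustration; a real pipeline would read from files, APIs, or queues and load into a proper warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, API, or queue.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.25
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and cast column types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Query the loaded data the way an analyst might.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

An ELT variant would load the raw rows first and do the cleanup in SQL inside the database instead of in `transform`.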
Why does data engineering exist? It answers questions like these from data analysts and data scientists:
- How do I find my data?
- How do I deal with every dataset having its own format?
- How do I pull and prepare data for my model?
- How can I get the data into an insight-ready format?
Essential skills:
- Python (and/or R programming)
- SQL (SQLZOO)
- Basic Statistics
- Data modeling (ETL/ELT)
- Data cleaning
- A BI tool (at Tuft & Needle: Looker and Metabase)
- Some type of cloud platform experience, e.g. Google, Amazon, Microsoft, IBM (at Tuft & Needle: AWS and Docker containers)
- One of the hurdles in learning data engineering is setting up a distributed cluster to develop on. Amazon provides a free tier that can be used to learn the distributed technologies, rather than just using your local system.
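A small sketch of the data cleaning and basic statistics items from the list above, standard library only. The `raw` readings and the set of "missing" markers are made up for illustration.

```python
import statistics

# Hypothetical raw readings, as they might arrive from a messy source.
raw = [" 12.5", "n/a", "13.0", "", "12.5 ", "100.0", None]


def clean(values):
    """Strip whitespace, drop missing-value markers, and convert to float."""
    out = []
    for v in values:
        if v is None:
            continue
        v = v.strip()
        if v.lower() in {"", "n/a", "null"}:
            continue
        out.append(float(v))
    return out


values = clean(raw)
mean = statistics.mean(values)
median = statistics.median(values)
print(values, mean, median)
```

The mean is pulled far upward by the single large reading while the median is not, which is the kind of thing basic statistics helps you notice before data reaches a model.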
Nice to haves:
- Bayesian statistics and/or machine learning knowledge
Most important books:
- https://github.com/andkret/Cookbook
- Big Data, the book by Nathan Marz, creator of Apache Storm and the Lambda Architecture. Our Fellows have found it really helpful, and the first two chapters are available free online.
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems: https://www.amazon.com/dp/B06XPJML5D/?coliid=I2YK48GJ1AXOIA&colid=2TQE1S60MQO4T&psc=0&ref_=lv_ov_lig_dp_it
Data Engineering Online video courses or MOOCs:
- https://insightfellows.com/data-engineering
- https://learndataengineering.com/p/academy
- https://www.dataengineering.academy/
- https://www.datacamp.com/tracks/data-engineer-with-python
- https://www.thisismetis.com/bootcamps/online-data-engineering-bootcamp
Data Science MOOCs (further education):
- UofM: https://www.coursera.org/specializations/data-science-python?ranMID=40328&ranEAID=KCWgjpGqTUg&ranSiteID=KCWgjpGqTUg-RtsaLUHK0nhxLDPzmj9oOg&siteID=KCWgjpGqTUg-RtsaLUHK0nhxLDPzmj9oOg&utm_content=10&utm_medium=partners&utm_source=linkshare&utm_campaign=KCWgjpGqTUg
- UC San Diego edX: https://www.edx.org/micromasters/uc-san-diegox-data-science
- MIT edX: https://www.edx.org/micromasters/mitx-statistics-and-data-science
How to get into data engineering:
- Look into AWS: Kinesis (buffer), Lambda (processing framework), S3 and/or DynamoDB (storage), Amazon API Gateway
- BI tools: Tableau
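A rough sketch of how the Kinesis (buffer) to Lambda (processing) to S3/DynamoDB (storage) flow above fits together. Kinesis-triggered Lambda events carry base64-encoded payloads under `Records[].kinesis.data`; everything else here is an assumption for illustration: the payloads are assumed to be JSON, and `store` is a hypothetical callable standing in for a real S3 or DynamoDB write so the example runs locally without AWS.

```python
import base64
import json


def decode_kinesis_records(event):
    """Decode the base64-encoded payloads of a Kinesis-triggered Lambda event.

    Assumes each payload is a JSON document (an assumption, not a Kinesis rule).
    """
    payloads = []
    for record in event.get("Records", []):
        data = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(data))
    return payloads


def make_handler(store):
    """Build a Lambda-style handler.

    `store` is a hypothetical callable(key, body) standing in for the
    storage step (an S3 or DynamoDB write in a real deployment).
    """
    def handler(event, context):
        payloads = decode_kinesis_records(event)
        for i, payload in enumerate(payloads):
            store(f"events/{i}.json", json.dumps(payload))
        return {"processed": len(payloads)}

    return handler


# Local dry run: a fake event and an in-memory dict instead of AWS services.
fake_store = {}
handler = make_handler(fake_store.__setitem__)
encoded = base64.b64encode(json.dumps({"user": "a", "clicks": 3}).encode()).decode()
fake_event = {"Records": [{"kinesis": {"data": encoded}}]}
result = handler(fake_event, None)
print(result, fake_store)
```

Keeping the storage step injectable makes the processing logic testable on a laptop before touching the free tier.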
Learning Path - Level 1:
- Programming Language - Python
- SQL
- Data Warehousing Concepts
- Understand Distributed Computing
- When to use a data lake vs. a data warehouse vs. an RDBMS vs. NoSQL
- Master Apache Spark (not sure this is a thing anymore)
- Understand and pick one database, NoSQL or RDBMS
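To make the SQL and data warehousing concepts items above concrete, here is a tiny star-schema sketch using Python's built-in `sqlite3`. The table names (`dim_product`, `fact_sales`) and the rows are invented for illustration; a real warehouse would have more dimensions (date, customer, and so on) and far more data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One dimension table (descriptive attributes) and one fact table (measures).
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT,
    category   TEXT
);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,
    revenue    REAL
);
INSERT INTO dim_product VALUES
    (1, 'Mattress', 'Sleep'),
    (2, 'Pillow',   'Sleep'),
    (3, 'Frame',    'Furniture');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 500.0),
    (2, 2, 2, 150.0),
    (3, 3, 1, 300.0),
    (4, 2, 1,  75.0);
""")

# Typical warehouse query: join facts to a dimension and aggregate.
revenue_by_category = conn.execute("""
    SELECT d.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(revenue_by_category)
```

The same join-and-aggregate pattern carries over to cloud warehouses; only the engine and the scale change.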
Learning Path - Level 2:
- Understand various data architectures (Real Time, Batch, Event Driven, etc.)
- Learn one streaming platform and processing engine
- Pick one cloud provider and master their native data engineering product
- Focus on cloud data warehouses, cloud big data services, and managed Spark services
- Create and deploy pipelines in the cloud with cloud-based CI/CD
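One idea that shows up in most streaming platforms and processing engines mentioned above is windowing: grouping an unbounded stream of events into fixed time buckets. The following is a plain-Python sketch of a tumbling (fixed, non-overlapping) window count, not the API of any particular engine; the function name and the `(timestamp, key)` event shape are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key within fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) tuples.
    Returns a dict mapping (window_start, key) to a count.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Floor the timestamp to the start of its window.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical event stream: (seconds since start, event type).
events = [(0, "click"), (3, "click"), (9, "view"), (10, "click"), (19, "click")]
result = tumbling_window_counts(events, 10)
print(result)
```

A real engine would additionally deal with late and out-of-order events, state that does not fit in memory, and fault tolerance, which is much of what there is to learn at this level.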
Learning Path - Level 3:
- Deep dive into data architectures and data modeling
- Understand and build Cloud Native data architectures and sandboxes (containers and K8s)
- Hybrid Cloud
- Focus on data management and Data Security Architecture
- Build platforms that can democratize data and accelerate analysis