@gabidavila
Last active July 3, 2016 12:28
Why you need a Data Engineer

Tech evolves pretty quickly. When the buzzword Big Data started showing up more and more, the market needed people able to analyse what was collected and give it meaning. For instance, a 2012 Harvard Business Review article was titled: Data Scientist: The Sexiest Job of the 21st Century.

Today we have DBA, Data Scientist, Data Engineer, Data Analyst: a wealth of options with "Data" as a prefix. More often than not, people put everyone in the same basket and assume everyone has the same set of skills.

From my point of view, and feel free to correct me if I'm wrong, these are the differences:

  • DBA - Once the most hated person I ever had on the team. Seriously, why doesn't that human give me the necessary permissions on the database? If I had access, I would have done my job sooner... Well, that was my thinking as a Software Engineer at the time. Turns out DBAs are what my friends and I used to call the database babysitters. Need to tune and figure out why performance is not what it should be? Help with a complicated query? That's the go-to person for it. But notice, this is RDBMS specific and heavily focused on the operational part.
  • Data Scientist - The market usually wants a professional with a PhD in statistics or a heavily math-oriented person. This person will be responsible for creating prediction models based on current data. Do you know how Amazon knows what you should buy next based on your browsing history? Yeah, this individual probably did the programming around that, has machine learning down to a T, and needs to understand Product, Engineering and Statistics.
  • Data Analyst - This person also deals with a bit of statistics, but more in the business sense, creating reports for Business Intelligence. They try to answer business questions and see where data acquisition or quality is failing, for instance.
  • Data Engineer - This role I can explain with more passion. It is what I do, so this will probably be biased. We are the bridge. We help Software Engineers build the application's storage and retrieval in a way that provides the Data Scientists and Data Analysts with the information they need to do their job.

So why do you need a Data Engineer on your team?

We do ETL (Extract-Transform-Load). We put data in the data warehouse, devise the best strategy for caching information, and design database architecture and NoSQL clusters.
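To make ETL a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the raw lines, the field names, and the `purchases` table are made up, and an in-memory SQLite database stands in for a real data warehouse.

```python
import json
import sqlite3

# Hypothetical raw export: one JSON document per line, as an API or log might emit.
RAW_LINES = [
    '{"user": "Ana", "amount": "19.90", "currency": "usd"}',
    '{"user": "bob ", "amount": "5", "currency": "USD"}',
    'not valid json',
]


def extract(lines):
    """Parse each line as JSON, skipping lines that are not valid JSON."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def transform(record):
    """Normalize the user name, convert the amount to integer cents, uppercase the currency."""
    return (
        record["user"].strip().lower(),
        round(float(record["amount"]) * 100),
        record["currency"].upper(),
    )


def load(conn, rows):
    """Insert the cleaned rows into a (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases "
        "(user TEXT, amount_cents INTEGER, currency TEXT)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(lines, conn):
    """Run the three steps in order and return what ended up in the table."""
    load(conn, [transform(record) for record in extract(lines)])
    return conn.execute(
        "SELECT user, amount_cents, currency FROM purchases ORDER BY user"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl(RAW_LINES, sqlite3.connect(":memory:")))
```

Real pipelines add scheduling, retries and schema management on top, but the extract, transform, load split usually stays recognizable.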

Should this JSON returned from the Facebook API really be stored in the relational database? (Short answer: no.) Should this query with a LIKE '%string%' really be running in the application instead of getting its data from Elasticsearch?
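A rough way to see the difference: a `LIKE '%string%'` filter generally has to test every row, while a search engine looks words up in an inverted index built ahead of time. The sketch below is a toy version of that idea in plain Python, not how Elasticsearch is actually implemented; the documents and function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical documents keyed by id.
DOCS = {
    1: "Why you need a Data Engineer",
    2: "Data Scientist: the sexiest job",
    3: "Tuning a slow query",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


def scan(docs, needle):
    """Roughly what LIKE '%needle%' does: test every row, one by one."""
    return {
        doc_id for doc_id, text in docs.items() if needle.lower() in text.lower()
    }


if __name__ == "__main__":
    index = build_index(DOCS)
    print(search(index, "data engineer"))
    print(scan(DOCS, "data"))
```

Note the trade-off: the index matches whole words, so it will not find an arbitrary substring the way `scan` does. In exchange, a lookup touches only the entries for the queried words instead of every document.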

In sum, we work with RDBMS, NoSQL, search engines and cache engines. I personally do a lot of work with RDBMS, since most of my work has been on legacy applications. One of our responsibilities is to lower the load on the RDBMS, for instance the load caused by data that is stored there unnecessarily.

It is still necessary to know stuff like indexing, transactions, query profiling and a bit of performance tuning.
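As a small taste of query profiling and indexing, the sketch below asks SQLite for its query plan before and after adding an index. The `access_log` table, its columns and the index name are hypothetical, and SQLite is only a stand-in here; other databases expose the same idea through their own `EXPLAIN` output.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access_log (id INTEGER PRIMARY KEY, path TEXT, visited_at TEXT)"
)
# Made-up sample rows, just so the table is not empty.
conn.executemany(
    "INSERT INTO access_log (path, visited_at) VALUES (?, ?)",
    [(f"/page/{i % 50}", "2016-07-03") for i in range(1000)],
)

sql = "SELECT * FROM access_log WHERE path = ?"

# Without an index the planner has to read the whole table.
before = query_plan(conn, sql, ("/page/7",))

# With an index on the filtered column it can jump to the matching rows.
conn.execute("CREATE INDEX idx_access_log_path ON access_log (path)")
after = query_plan(conn, sql, ("/page/7",))

print(before)
print(after)
```

The exact wording of the plan varies between SQLite versions, but the first plan should describe a scan of the table and the second a search using `idx_access_log_path`.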

To sum up, we are kind of the wild card in storage technology.

Developers don't usually care about the data. They want what is fast and easy to use. They think about delivery, and not about long-term data retrieval, for instance.

"I am going to store this access log for my website on that table"

They probably didn't stop to think that the table could have millions of records within months. Why not ELK? Cassandra?

So who is curating your data?

## How should you work with one

You know that feature you want to implement? Talk to us first. You can design the application the way you want, but we give you insight into the data layer.

Do not isolate us on your team. Again, we are the bridge.
