I am hesitant to use the term "Big Data" to describe what I work on; having an adjective as part of the definition feels boastful (unless you work at the LHC, where perhaps the adjective is justifiable!).
When asked to describe what "Big Data" is, I use the following description:
Big data is when an organization creates a "data rock" so big that it cannot lift it.
(Image from The Animator's Survival Kit)
This is, of course, a playful paraphrase of the famous omnipotence paradox. By choice, we engineer our data collection to a scale that pushes our limits as we try to lift, roll, dig into, patch, and mould this big rock of data. We do so to find the gem in the mass of data that we call insight.
I was lucky enough to help grow three different data organizations in my past, ranging from an enterprise cybersecurity company to a venture-funded data startup. Despite the differences, I ran into similar blind alleys in each. Getting trapped under the data rock is a suffocating experience, whatever the size of the organization you work with.
Let's start with some quotes, paraphrased from comments I have heard:
- "Let's hire a data scientist; we need a fancy reinforcement learning model X to make sense of our data."
- "Let's spend our time building our collection around this hypothesis X; the data will validate the idea and build us a great case study."
- "Let's build our data pipeline to scale; we will throw in technology X to process all this data."
- "Let's instrument and collect everything; we will be able to make sense of it later."
These are common themes you will hear from leaders of data organizations. I'm pretty sure I have uttered similar phrases in my own career. There are two issues I want to highlight with the statements above.
- The order of these statements. This is summarized well by a quote from Instacart VP Jeremy Stanley:
“Data science requires data to science, and most companies don’t have much data on day one.”
- The bravado of these statements. If you cover up the sub-clause after the semicolon with your hand, what remains are the typical operations of a data organization (data collection, engineering, modelling, and experimentation). But with the sub-clause included, I would be concerned that the team is chasing vanity metrics as opposed to focusing on growing its data capability.
Growing an organization is about investing the right resources to execute on the right tasks. There are two frameworks I use to judge and describe a data organization:
- Data organization hierarchy - describes a suitable order for the statements above
- Data execution cycle - ensures that we stay agile and de-risk our data work in small steps, early
Growing a data team is about investing resources and effort in the right place. These frameworks serve as a guide to what to focus on, and where and when to focus as a team.
Within the field of psychology, there is a well-known way to describe human psychological needs - the famous Maslow's hierarchy of needs.
The hierarchy describes dependencies between psychological needs. For example, people in an environment where safety is threatened (e.g. war) live a difficult life even if they are well respected and held in high esteem. In a similar way, we can capture the gist of the "data scientist without data" issue with this hierarchy:
- Data availability (Physiological): Are you collecting the right data?
- Data accessibility (Safety): Is your data clean, validated, stored, and accessible?
- Descriptive data trust (Belonging): Are you able to produce simple descriptive metrics when asked, building trust in the data? (e.g. can you describe your conversion rate?)
- Tactical data model (Esteem): Are you able to tactically improve organizational efficiency with data? (e.g. carry out an A/B test and confirm the results - see the sketch after this list)
- Strategic data model (Self-actualization): Are you able to strategically improve your organization with data? (e.g. using data to drive a product that delivers business value)
For example, it is futile to invest in data accessibility and engineering challenges for an organization that has no data.
The hierarchy model is a useful framework to help us structure and identify the core needs of an organization. There are exceptions - like a Zen master who doesn't need to eat to transcend and self-actualize. For us mere mortals, however, I'm thankful to be building my life on plenty of safety and love from family and friends.
Let's go back to our analogy of lifting the "data rock". Imagine you are a weightlifter with mighty strong arms that can raise a heavy weight. It would be a pity if your legs couldn't support the same weight and became the bottleneck.
It is similar in a data organization - we need to identify the different parts the organization requires in order to lift its "data rock". It is not just about collecting more data, or having the most scalable data pipeline. For this, I use the second framework, the data execution cycle.
The original blog post from SAS does a far better job than I would at explaining each of the steps. The point to emphasize is that we try to iterate through the cycle as fast as possible.
Take an example. In a previous role at SophosLabs, I built a malware analytics database that aimed to tell me whether a new sample was malware or not. It needed at least:
- a multitude of data extractors for incoming samples
- a few fancy technologies to store and index the data
- fine-tuning and optimizing a model to classify whether a sample is malware.
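Sketched as code, the shape of such a system might look like the following. This is a hypothetical skeleton, not the actual SophosLabs architecture; extract_features, FeatureStore, and classify are illustrative stand-ins for the three parts above.

```python
import hashlib
from typing import Optional

def extract_features(sample_bytes: bytes) -> dict:
    """A data extractor: derive metadata from an incoming sample."""
    return {
        "sha256": hashlib.sha256(sample_bytes).hexdigest(),
        "size": len(sample_bytes),
    }

class FeatureStore:
    """Stand-in for the storage/indexing layer (a dict instead of a real DB)."""
    def __init__(self):
        self._index = {}

    def add(self, features: dict) -> None:
        self._index[features["sha256"]] = features

    def lookup(self, sha256: str) -> Optional[dict]:
        return self._index.get(sha256)

def classify(features: dict, store: FeatureStore) -> str:
    """Stand-in for the model: flag samples we have already indexed."""
    known = store.lookup(features["sha256"])
    return "known sample" if known else "unknown - needs analysis"

store = FeatureStore()
store.add(extract_features(b"previously seen sample"))
print(classify(extract_features(b"new sample"), store))  # unknown - needs analysis
```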
To give you a sense of the scale of the challenge: I started the project in 2007 as a one-man side project. When I left the company in 2012, it was a still-growing project with contributions from multiple analysts and engineers. So how did the project grow?
With hindsight, I was blessed with naivety and resource constraints. My first question was: what data should I collect to bootstrap such an idea? There are many different kinds of metadata I could have spent time collecting for a malware sample collection, and it could have taken me, as a lone analyst, months to build. Should I focus on the data collection step and aim for a big win, or be greedy and grab whatever small win I could?
Being lazy and impatient, I took the greedy small-win path. I duct-taped together a set of scripts that extracted string data from malware samples. It was a simple parser for the strings command - a standard UNIX tool that anybody can use. The initial tool didn't automate detection; it only helped me relate new samples to the existing malware corpus.
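To make that concrete, here is a minimal sketch in the same spirit. It is not the original chkword code; it only illustrates the idea of running the standard strings tool over a sample and ranking known samples by how many strings they share (Jaccard similarity). It assumes strings is on your PATH (it ships with binutils on most UNIX systems).

```python
import subprocess

def extract_strings(path: str, min_len: int = 6) -> set:
    """Run the standard UNIX `strings` tool and return its output as a set."""
    out = subprocess.run(
        ["strings", "-n", str(min_len), path],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.splitlines())

def similarity(a: set, b: set) -> float:
    """Jaccard similarity: shared strings over total distinct strings."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def rank_against_corpus(sample_path: str, corpus: dict) -> list:
    """Rank known samples by how many strings they share with the new one."""
    sample = extract_strings(sample_path)
    scores = [(name, similarity(sample, strs)) for name, strs in corpus.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# corpus maps a sample name to its previously extracted string set, e.g.
# corpus = {"family_a/sample1": extract_strings("samples/family_a/sample1")}
```

A few dozen lines of duct tape like this won't detect anything by itself, but it is enough to surface "this new sample looks a lot like that family" leads for an analyst.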
This simple command-line tool took me 1-2 months to build - chkword became my first iteration of the data cycle. The tool was a far cry from the ultimate vision, but along the way it helped me and my colleagues learn about all the challenges. With hindsight, a lot of what I did with the project coincides with a saying from my current team at Mobify:
"Favour small diffs over big bangs” Zen of Mobify
Our aim is to execute through iterations of the cycle as fast as possible. We try to break our business problem down into small chunks, then validate, model, and deploy as fast as we can.
Another reason to run small iterations is that the data might not support your hypothesis. Data is much more likely to let you down - the signal will often fail to confirm what you hoped for. The hope that a simple change of button colour will increase your conversion rate is frequently rejected by the data.
We call this uncertainty data risk. For data science, you need not only a sufficient quantity of data but also sufficient quality (high signal, low noise). If you are in a greenfield research domain, you will likely get half of your hypothesis bets wrong.
We mitigate this risk by being agile and working in small iterations - in other words, by not putting all our eggs in one basket. If chkword had not worked out, it would not have wasted years of effort, only 1-2 months of my pet-project time. Compare that to the alternative, where I focus on the data collection step and take months to validate the idea, only to fail.
Building a data organization is no overnight task. I have collected these learnings across different organizations, and I came to see having minimal resources as a blessing that forced me to iterate fast:
- Get small wins and move fast - be eager to demonstrate the value of data early.
- Be aware of data risk and de-risk early - e.g. set up your team so that it does not specialize in only one or two modelling techniques, or depend on a single hypothesis in your product plan.
- Identify the bottleneck in your capability along the data hierarchy, and aim for the wins that take you to the next stage.
In a big data team, we have the autonomy to shape the "data rock" that we lift. Vanity metrics will guide us to "build the biggest rock possible" (hello, collect everything) or "use the fanciest technique to lift the rock" (hello, deep/reinforcement learning). It is no different from real life, where we might get stuck in the greed to be rich or famous.
I look at companies such as Etsy or Airbnb, which have pushed to use data to educate themselves and become better - which is the very definition of self-actualization:
Self-actualizing individuals are motivated to continual growth... and to determining how they define the self while maximizing their potential.
May you realize the full potential of your data.



