Last active
August 29, 2015 14:27
Superpower your company's Big Data with Resource Management
| Level up your company's Big Data with Resource Management | |
| Where is Big data now? | |
| ------------------ | |
| Big data was once one of the most hyped technologies, with countless presentations and posts describing how new systems and tools could process large, complex data that traditional tools couldn't handle. At the peak of that hype, most companies were still getting familiar with new data processing frameworks such as Hadoop and new databases such as HBase and Cassandra. Fast forward to now: Big data is still a popular topic, and many companies have already jumped on the bandwagon and are moving past first-generation Hadoop to evaluate newer tools such as Spark and newer databases such as Firebase, NuoDB or MemSQL. What most companies also learn from running these tools is that deploying them, operating them and planning their capacity is hard and complicated. Although many of these tools have matured over time, each one still usually runs in its own independent cluster. | |
| It's also not rare to find multiple Hadoop clusters in the same company, since multi-tenancy isn't built into many of these tools and a few non-critical Big data jobs can overload a shared cluster. | |
| Problems running independent Big data clusters | |
| -------------------------------------------- | |
| Running many independent clusters causes many problems. One is monitoring and visibility: each cluster has its own management tools, and integrating them with the company's shared monitoring and management tools is a huge challenge, especially when onboarding yet another framework with yet another cluster. Another is multi-tenancy. Independent clusters keep another org's job from taking over the whole cluster, but they don't help when a bug in a Hadoop application uses up all the available resources, and debugging that is horrific. A third is utilization: a cluster is rarely 100% utilized, so instances running in Amazon or in your datacenter rack up bills while doing no work. There are many more major pain points than I have room to describe here. | |
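To make the utilization point concrete, here is a minimal back-of-the-envelope sketch. The cluster names, node counts, utilization figures and hourly price are all made-up assumptions for illustration, not measurements from any real deployment.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str           # hypothetical cluster name
    nodes: int          # number of machines in the cluster
    utilization: float  # average fraction of capacity doing useful work (0..1)


def idle_cost_per_hour(clusters, price_per_node_hour):
    """Money spent per hour on capacity that is doing no work."""
    return sum(c.nodes * (1.0 - c.utilization) * price_per_node_hour
               for c in clusters)


# Made-up numbers: three independent clusters, each partly idle.
clusters = [
    Cluster("hadoop", nodes=100, utilization=0.40),
    Cluster("spark", nodes=50, utilization=0.20),
    Cluster("cassandra", nodes=50, utilization=0.50),
]
PRICE = 0.50  # assumed dollars per node-hour

wasted = idle_cost_per_hour(clusters, PRICE)
total = sum(c.nodes for c in clusters) * PRICE
print(f"idle spend: ${wasted:.2f} of ${total:.2f} per hour")
```

With these assumed figures, well over half of the hourly bill pays for idle capacity, which is the cost that sharing resources across frameworks aims to recover.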
| Hadoop v2 | |
| ------------------------------------ | |
| The Hadoop developers and operators saw this problem, and in the 2nd generation of Hadoop they developed a separate resource management tool called YARN: a single management framework that manages all the resources in a Hadoop cluster, enforces the resource limits of jobs, integrates security into the workload and even optimizes the workload by automatically placing jobs closer to the data. This solves a huge problem in operating a Hadoop cluster, and it also lets you consolidate all your Hadoop clusters into one, since it allows finer-grained control over the workload and improves the efficiency of the cluster. | |
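Two of the scheduling ideas in this section, enforcing per-job resource limits and placing work near its data, can be sketched with a toy placement function. This is not YARN's API or its actual algorithm; `Node`, `Task` and `place` are hypothetical names, and the model is simplified to memory only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    free_mem_mb: int
    blocks: set = field(default_factory=set)  # data blocks stored on this node


@dataclass
class Task:
    job: str
    mem_mb: int
    block: str  # the data block this task reads


def place(task, nodes, job_limit_mb, job_usage_mb) -> Optional[Node]:
    """Pick a node for the task, or return None if it cannot run right now."""
    # Enforce the per-job limit before looking at any node.
    # A job with no configured limit gets a limit of 0 and is always rejected.
    used = job_usage_mb.get(task.job, 0)
    if used + task.mem_mb > job_limit_mb.get(task.job, 0):
        return None
    candidates = [n for n in nodes if n.free_mem_mb >= task.mem_mb]
    if not candidates:
        return None
    # Prefer a node that already holds the task's data block.
    # sorted() is stable, so ties keep their original order.
    chosen = sorted(candidates, key=lambda n: task.block not in n.blocks)[0]
    chosen.free_mem_mb -= task.mem_mb
    job_usage_mb[task.job] = used + task.mem_mb
    return chosen


nodes = [
    Node("n1", free_mem_mb=4096, blocks={"b2"}),
    Node("n2", free_mem_mb=4096, blocks={"b1"}),
]
limits = {"etl": 3072}  # assumed per-job memory cap in MB
usage = {}

first = place(Task("etl", 2048, "b1"), nodes, limits, usage)
second = place(Task("etl", 2048, "b1"), nodes, limits, usage)
print(first.name if first else None, second)
```

In this sketch the first task lands on the node holding its block, and the second is refused because it would push the job past its cap, even though the cluster still has free memory.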
| Beyond Hadoop | |
| ---------------------------------------------------- | |
| With the vast and growing number of Big data technologies in the ecosystem, there is a need for a common resource management layer across all of these tools; without one, we run back into the same problems described above. And once all of these frameworks run under one resource management platform, many new options for optimization and resource scheduling become possible. | |
| Here are some examples of what could be possible with one resource management platform: | |
| - Because the platform understands the whole cluster's workload and available resources, it can auto-resize and scale up and down based on workloads across all these tools. It can also resize jobs according to priority. | |
| - The cluster can detect underutilization by other jobs and offer the slack resources to Spark batch jobs without impacting your very important workloads from other frameworks, so you still meet the same business deadlines while saving a lot of cost. | |
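The two examples above can be sketched together as a toy allocation step. This is not the API or algorithm of any real scheduler; the function name, the job tuples and the CPU numbers are hypothetical, and a real system would also have to reclaim slack when the important workloads need it back.

```python
def allocate(total_cpus, jobs):
    """Grant CPUs in priority order; lower-priority jobs only get leftover slack.

    jobs: list of (name, priority, wanted_cpus); a higher priority is served first.
    Returns a dict of name -> granted CPUs.
    """
    remaining = total_cpus
    granted = {}
    for name, _priority, wanted in sorted(jobs, key=lambda j: -j[1]):
        share = min(wanted, remaining)
        granted[name] = share
        remaining -= share
    return granted


# Busy period: important workloads take what they need, batch gets the rest.
busy = [
    ("spark-batch", 1, 64),   # low priority, will use as much as it is offered
    ("web-serving", 10, 20),  # very important workload, served first
    ("hbase", 5, 30),
]
print(allocate(100, busy))

# Quiet period: serving load drops, so more slack flows to the batch job.
quiet = [("spark-batch", 1, 64), ("web-serving", 10, 5), ("hbase", 5, 30)]
print(allocate(100, quiet))
```

The design choice here is strict priority ordering: the batch job never takes CPUs that a higher-priority job asked for, so its share simply grows and shrinks with whatever slack the other frameworks leave behind.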
| In the next post I'll cover Mesos, which is one such resource management system, and how its upcoming features make the optimizations I mentioned possible. | |