Mesos for Spark Users - Data Science MD Meetup 4/10/2015
* Mesos for Spark users
- Ben Hindman, started working on mesos in 2009
- berkeley
- twitter
- multiple 10k+ machines clusters
- Mesos
- 3 Mesos founders now work at Databricks on Spark
- Constraints/Design-goals
- static partitioning considered harmful
- eg. 3 hadoop boxes, 3 spark boxes, etc.
- Time running jobs is highly variable
- Utilization is low
- Some boxes are always idle
- people should be able to build new frameworks on top of common infra
- instead of everyone just using mapreduce
- Frameworks
- 1 coordinator node, multiple workers
- scheduler and tasks
- framework is a synonym for distributed system
- launch tasks
- tasks have commands
- can specify resources needed to run the task, eg. jars from s3
- tasks can have executors
- inversion of control
- executors decide how the task gets launched
- maybe run it in a bunch of threads
- an extra level of indirection for task-running
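- rough sketch (not from the talk) of what the executor side looks like, assuming the legacy Python bindings (mesos.interface / mesos.native); the class name and the work placeholder are illustrative

    import threading
    from mesos.interface import Executor, mesos_pb2
    from mesos.native import MesosExecutorDriver

    class ThreadedExecutor(Executor):
        """Runs each launched task in its own thread inside one executor process."""
        def launchTask(self, driver, task):
            def run():
                running = mesos_pb2.TaskStatus()
                running.task_id.value = task.task_id.value
                running.state = mesos_pb2.TASK_RUNNING
                driver.sendStatusUpdate(running)

                # ... the task's actual work would go here ...

                finished = mesos_pb2.TaskStatus()
                finished.task_id.value = task.task_id.value
                finished.state = mesos_pb2.TASK_FINISHED
                driver.sendStatusUpdate(finished)
            threading.Thread(target=run).start()

    if __name__ == "__main__":
        MesosExecutorDriver(ThreadedExecutor()).run()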
- Adding a layer of indirection between coordinator/tasks
- the Mesos master
- provides
- failure detection
- task distribution
- task starting
- task monitoring
- task killing
- task cleanup
- Analogy to a virtual memory manager before VMMs existed
- Machines are abstracted as Resources
- IaaS is different
- provisions and manages machines
- 1:1 machine mapping
- Mesos is higher level
- Mesos can run on top of IaaS
- openstack
- ec2
- or bare metal
- PaaS is different
- deploy and manage applications/services
- Mesos is lower level
- Can easily build a PaaS on top of mesos
- apache aurora
- Marathon
- Mesos is a Datacenter Kernel
- a distributed system for building other distributed systems
- cluster manager
- multiple masters for high availability
- Comparison to older cluster managers, eg. PBS
- batch submit of a job spec
- application/human-user keeps rescheduling to try to get higher priority
- specifications pitfalls
- hard to specify desires/constraints
- hard to change specs dynamically, eg hadoop jobconfig
- 500 mappers for 100 machines, 10 reducers
- first phase you want 100
- second phase you want 10
- shuffle phase might complicate things
- real specs are dynamic
- alternative - Framework Specifier Request
- what should cluster manager do if it can't satisfy a request?
- wait?
- like existing cluster managers
- allocate the best we can immediately?
- analogy
- non-blocking socket vs blocking
- partial bytes are written immediately vs waiting till everything's done
- resource offers
- current snapshot of available resources
- requests aren't necessary
- Google Omega just offers
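- rough sketch (not from the talk) of a scheduler reacting to offer snapshots, again assuming the legacy Python bindings; the resource thresholds and the zk:// URL are placeholders

    from mesos.interface import Scheduler, mesos_pb2
    from mesos.native import MesosSchedulerDriver

    class OfferScheduler(Scheduler):
        """Inspect each offer: launch a task if it fits, otherwise decline immediately."""
        def resourceOffers(self, driver, offers):
            for offer in offers:
                offered_cpus = sum(r.scalar.value for r in offer.resources if r.name == "cpus")
                offered_mem = sum(r.scalar.value for r in offer.resources if r.name == "mem")
                if offered_cpus < 1 or offered_mem < 128:
                    driver.declineOffer(offer.id)  # master re-offers it to another framework
                    continue
                task = mesos_pb2.TaskInfo()
                task.task_id.value = "task-" + offer.id.value
                task.slave_id.value = offer.slave_id.value
                task.name = "echo"
                task.command.value = "echo hello from mesos"
                cpus = task.resources.add()
                cpus.name = "cpus"
                cpus.type = mesos_pb2.Value.SCALAR
                cpus.scalar.value = 1
                mem = task.resources.add()
                mem.name = "mem"
                mem.type = mesos_pb2.Value.SCALAR
                mem.scalar.value = 128
                driver.launchTasks(offer.id, [task])

    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""  # empty means Mesos fills in the current user
    framework.name = "offer-demo"
    MesosSchedulerDriver(OfferScheduler(), framework, "zk://zk1:2181/mesos").run()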
- Spark on Mesos
- cluster manager lives between driver/spark-context and worker-nodes
- spark uses mesos offers to perform its own scheduling
- 2-level scheduling model
- mesos controls resource allocations to each instance of spark
- spark makes decisions about what to run given those resources
- uses task+executor launch model
- one spark executor has RDD cache
- spark tasks run as threads within the executor process
- can have multiple executors on one spark instance
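- a minimal PySpark config for pointing Spark at a Mesos master (not from the talk); the zk:// URL and the spark.executor.uri tarball path are placeholders

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("mesos-demo")
            .setMaster("mesos://zk://zk1:2181/mesos")
            .set("spark.executor.uri",
                 "hdfs://namenode/dist/spark-1.3.0-bin-hadoop2.4.tgz")
            .set("spark.mesos.coarse", "false"))  # fine-grained mode: Spark tasks map onto Mesos tasks

    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(1000)).map(lambda x: x * x).sum())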
- Resource isolation
- docker, LXC
- can just hand a docker image instead of a task/executor
- without docker, still does resource isolation using cgroups+kernel-facilities
- Mesos can change the resource constraints on containers on the fly
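- rough sketch (not from the talk) of handing Mesos a docker image on a TaskInfo instead of a custom executor; assumes slaves were started with --containerizers=docker,mesos, and the image/command are arbitrary

    from mesos.interface import mesos_pb2

    task = mesos_pb2.TaskInfo()
    task.name = "web"
    task.task_id.value = "web-1"
    # slave_id would be copied from whichever offer is being accepted
    task.command.value = "python -m SimpleHTTPServer 8000"
    task.container.type = mesos_pb2.ContainerInfo.DOCKER
    task.container.docker.image = "python:2.7"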
- Metrics
- each container publishes metrics to a time-series db
- alerting
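- for illustration (not from the talk): each slave exposes per-container resource statistics that a collector could scrape into a time-series db; the host name is a placeholder

    import requests

    for entry in requests.get("http://slave1:5051/monitor/statistics.json").json():
        stats = entry["statistics"]
        print(entry["executor_id"], stats.get("cpus_limit"), stats.get("mem_rss_bytes"))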
- 4 key problems of distributed systems
- configuration/package management
- deployment
- service discovery
- monitoring
- Why pick mesos if you've already got spark/yarn etc.
- multi-tenancy
- multiple frameworks on each box
- hadoop/storm/spark
- multiple instances of spark
- production/experimental
- multiple versions of spark
- fine-grained sharing
- container resource requirements changed on-the-fly
- worthwhile even on a single machine
- fault-tolerance/high-availability
- master election + failover
- everything keeps running if a master fails
- slave failover
- tasks continue to run even if a slave goes down
- restarted slaves reconnect to existing running tasks when they come back up
- great for big batch jobs, minimizing restarts
- eg. large in-memory services, redis
- this doesn't work for docker yet, they're talking with docker guys
- framework failover
- all tasks keep running across framework failover
- what about hadoop? yarn..
- map/reduce
- hdfs
- Myriad project
- collaboration of Mesosphere/MapR+ebay
- lets mesos be a cluster manager for multiple yarn clusters
- how it works (alpha):
1. job gets submitted to yarn
2. yarn asks mesos for resources
3. mesos gives offers
4. launches myriad executor
5. myriad executor launches yarn nodemanager
6. nodemanager talks back to yarn to figure out what to do
- long-lived services and other frameworks
- they're working on making HDFS a separate framework
- lets you run multiple hdfs instances
- spark/chronos/jenkins/marathon all talking to mesos master
- Aurora/Marathon
- PaaS-like frameworks
- easy to launch long-running applications, web-servers
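- for illustration (not from the talk): a minimal Marathon app definition posted to its REST API; the host and all app fields are placeholders

    import requests

    app = {
        "id": "/hello-web",
        "cmd": "python -m SimpleHTTPServer $PORT0",  # Marathon injects assigned ports
        "cpus": 0.25,
        "mem": 64,
        "instances": 3,
    }
    resp = requests.post("http://marathon.example.com:8080/v2/apps", json=app)
    print(resp.status_code, resp.json())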
- Jenkins
- ebay uses it - bit.ly/1frLrLf
- used to have hundreds of jenkins clusters, 1 per development team
- bad resource utilization
- pooled all resources together
- one jenkins framework per team
- stateful frameworks
- reservations
- preset containers for different frameworks
- strong guarantees
- basically immutable
- dynamic reservations
- for things that need to write to disk, ACLs
- persistent volumes
- analogous to docker's volumes
- creates volumes when a framework accepts an offer
- persistent or ephemeral
- jira MESOS-1554
- inverse offers/maintenance
- framework gets deallocation requests as 'inverse offers'
- allows 'maintenance' or planned failures
- drain all the tasks from rack X
- gives a lot more flexibility, automates what humans would do
- jira MESOS-1592
- distributed systems engineers
- ops guys end up writing a lot of code for ops
- product devs write their app code
- mesos is a single abstraction they both can write to
- the datacenter is just another form factor
- don't reinvent the wheel
- Mesosphere
- trying to build the OS on top of the datacenter kernel
- GUIs
- modules
- artifact repositories
- Q&A
- how about something like kafka?
- framework's already being built
- When should I not use mesos?
- Ben's biased
- stateful primitives are still alpha
- don't run a big mysql cluster at the moment b/c you'll have to do static partitioning
- they're addressing this
- when it's addressed should be easy to swap in the right way
- twitter/google
- everything runs through cluster managers
- inverse offers - are resources preemptible?
- there are specifically 'revocable resources', which tie in to the dynamic reservations work
- lots of things you can run that way
- integration tests
- clarify docker restarting?
- all mesos tasks are still in a container
- keep running the container if the mesos agent process goes down
- if docker is running the container
- mesos can't reconnect
- container has to get shut down