Last active
December 2, 2016 13:37
DC/OS metrics system
1. cluster-level metrics and health (mesos-master, mesos-slave,
marathon, marathon-lb, mesos-dns, kafka, ...)
Metrics for cluster components such as mesos-master, mesos-slave and
frameworks (DC/OS services like zookeeper, marathon, marathon-lb,
mesos-dns, kafka, ...).
These will be used to troubleshoot problems at cluster level.
Having each component's version as a metric label could help with
troubleshooting, for example to see the impact of a modified marathon-lb
on the cluster (a graph showing both the old and the updated release).
Cluster-level metric collection, storage and presentation
should not have a hard dependency on any DC/OS cluster component
(marathon, marathon-lb, mesos-dns, zookeeper), since they will also be
used to troubleshoot cluster outages. For example, if zookeeper is
down, the mesos control plane stops, and so do mesos-dns and marathon.
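To illustrate the version-label idea, here is a minimal sketch that groups samples of one metric by a `version` label so that an old and an updated release can be compared side by side. The metric name, the version strings, the sample values and the `mean_by_label` helper are all hypothetical; nothing here is an existing DC/OS API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (metric name, labels, value).
# Metric name, versions and values are made up for illustration.
samples = [
    ("request_latency_ms", {"component": "marathon-lb", "version": "1.4.0"}, 12.0),
    ("request_latency_ms", {"component": "marathon-lb", "version": "1.4.0"}, 14.0),
    ("request_latency_ms", {"component": "marathon-lb", "version": "1.4.1-custom"}, 19.0),
    ("request_latency_ms", {"component": "marathon-lb", "version": "1.4.1-custom"}, 21.0),
]


def mean_by_label(samples, metric, label):
    """Average the values of `metric`, grouped by the value of `label`."""
    groups = defaultdict(list)
    for name, labels, value in samples:
        if name == metric and label in labels:
            groups[labels[label]].append(value)
    return {key: mean(values) for key, values in groups.items()}


by_version = mean_by_label(samples, "request_latency_ms", "version")
for version, latency in sorted(by_version.items()):
    print(version, latency)
```

Because the version is only a label, the same samples can still be aggregated without it when the release does not matter.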
2. node-level metrics (node resources)
Classical, old-style host-based metrics.
* metric labels: nodes could carry related labels and other
  metadata (host/slave-id/ip, node attributes, ...)
* metric values: the usual resource utilisation (cpu/mem utilisation,
  net bandwidth, ...).
  But these should be
  properly labeled (mesos node-id, ip, node attributes, ...) so
  that one can generate aggregated metrics like:
  * "This single app is assigned 30% of the total CPU resources in the cluster and is utilising 43% in reality"
  * "This application is using 90% of the network bandwidth on all nodes where it has instances"
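The first aggregate quoted above could be computed roughly as follows. This is a sketch under stated assumptions: the node and task records, their field names and the `app_cpu_share` helper are hypothetical, and both percentages are taken relative to the total CPU capacity of the cluster, which is one possible reading of the sentence.

```python
# Hypothetical per-node CPU capacity, keyed by node id.
node_cpus = {"node-1": 8.0, "node-2": 8.0, "node-3": 4.0}

# Hypothetical per-task samples, labeled with the app and the node they run on.
task_samples = [
    {"app": "/web", "node": "node-1", "cpus_assigned": 2.0, "cpus_used": 3.0},
    {"app": "/web", "node": "node-2", "cpus_assigned": 2.0, "cpus_used": 2.6},
    {"app": "/web", "node": "node-3", "cpus_assigned": 2.0, "cpus_used": 3.0},
    {"app": "/db", "node": "node-1", "cpus_assigned": 4.0, "cpus_used": 1.0},
]


def app_cpu_share(app, task_samples, node_cpus):
    """Return (assigned, used) CPU of `app` as fractions of cluster capacity.

    Assumption: both values are relative to the cluster's total CPU count.
    """
    total = sum(node_cpus.values())
    own = [s for s in task_samples if s["app"] == app]
    assigned = sum(s["cpus_assigned"] for s in own)
    used = sum(s["cpus_used"] for s in own)
    return assigned / total, used / total


assigned, used = app_cpu_share("/web", task_samples, node_cpus)
print(f"/web is assigned {assigned:.0%} of cluster CPU and uses {used:.0%}")
```

The calculation only works because every sample carries the app and node labels; without them the per-node numbers cannot be attributed to an application.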
3. application-level metrics by containerizer (resource usage
collected from outside of the application's context)
Each executor should configure the app it runs and limit the
resources of each task it runs:
* metric labels: tasks could carry related labels and other
  metadata set by their executors (host/slave-id/ip, marathon
  id/labels, docker id/image/labels, specific ENV values, ...)
* metric values: the resource utilisation (cpu/mem utilisation,
  net bandwidth), plus other framework/executor metrics at
  containerizer level (marathon cpu/mem limit)
Labels should be in sync with the node-level metric labels so that
they can be used to aggregate different metrics.
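Keeping task labels in sync with node labels is what makes a join between the two levels possible. The sketch below assumes a shared `node_id` label and hypothetical field names and values; it relates an app's network traffic to the total traffic of the nodes that host its instances.

```python
# Hypothetical node-level metrics, keyed by the shared label `node_id`.
node_metrics = {
    "node-1": {"net_bytes_per_s": 1000.0},
    "node-2": {"net_bytes_per_s": 500.0},
    "node-3": {"net_bytes_per_s": 800.0},
}

# Hypothetical task-level metrics collected at containerizer level.
task_metrics = [
    {"labels": {"node_id": "node-1", "marathon_id": "/web"}, "net_bytes_per_s": 900.0},
    {"labels": {"node_id": "node-2", "marathon_id": "/web"}, "net_bytes_per_s": 450.0},
    {"labels": {"node_id": "node-1", "marathon_id": "/db"}, "net_bytes_per_s": 100.0},
]


def app_share_of_node_traffic(marathon_id, task_metrics, node_metrics):
    """Fraction of traffic on the app's nodes that the app itself causes."""
    app_total = 0.0
    node_ids = set()
    for task in task_metrics:
        if task["labels"]["marathon_id"] == marathon_id:
            app_total += task["net_bytes_per_s"]
            node_ids.add(task["labels"]["node_id"])
    # Join on the shared node_id label to find the matching node totals.
    node_total = sum(node_metrics[n]["net_bytes_per_s"] for n in node_ids)
    return app_total / node_total if node_total else 0.0


print(app_share_of_node_traffic("/web", task_metrics, node_metrics))
```

If the two levels used different names or formats for the node identifier, the lookup in `node_metrics` would fail, which is the practical reason for keeping the labels aligned.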
4. application-level metrics from the application (components that let
applications push their metrics, or let them expose metrics and
get them collected)
Metrics generated by the application should also be properly labeled
with mesos/framework/node metadata so that they can be used to
generate different levels of aggregation.
The solution should allow metrics to be pushed, or exposed by an
endpoint to be pulled by another component.
Applications should not be expected to be aware of their
orchestrator/containerizer-level metadata; it should be
auto-populated during collection.
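Auto-populating metadata during collection could look roughly like this. The sketch assumes the collector already knows the task's metadata (the lookup itself is out of scope here); the sample layout, the label names and the `enrich` helper are hypothetical. On a label conflict the collector's metadata wins, which is a design assumption rather than a requirement stated above.

```python
def enrich(sample, task_metadata):
    """Return a copy of `sample` with orchestrator metadata added as labels.

    The application only supplies its own labels; the collector adds the
    rest. On conflict the collector's value overrides the application's.
    """
    labels = dict(sample.get("labels", {}))
    labels.update(task_metadata)
    return {**sample, "labels": labels}


# What the application pushes or exposes: no orchestrator knowledge needed.
app_sample = {
    "name": "orders_processed",
    "value": 42,
    "labels": {"endpoint": "/checkout"},
}

# Hypothetical metadata known to the collector for this task.
task_metadata = {"node_id": "node-1", "marathon_id": "/shop"}

enriched = enrich(app_sample, task_metadata)
print(enriched["labels"])
```

The same function applies whether the sample was pushed by the application or pulled from an endpoint, since enrichment happens on the collector side in both cases.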