Q1: The metric node_cpu_temp_celsius reports the current temperature of a node's CPU in Celsius. What query will return the average temperature across all CPUs on a per-node basis? The query should return:
{instance="node1"} 23.5 // average temp across all CPUs on node1
{instance="node2"} 33.5 // average temp across all CPUs on node2
node_cpu_temp_celsius{instance="node1", cpu="0"} 28
node_cpu_temp_celsius{instance="node1", cpu="1"} 19
node_cpu_temp_celsius{instance="node2", cpu="0"} 36
node_cpu_temp_celsius{instance="node2", cpu="1"} 31
A1: avg by(instance) (node_cpu_temp_celsius)
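To see why grouping by instance gives the expected values, here is a minimal Python sketch of what the aggregation does to the four sample series above. The helper name avg_by is hypothetical and only mimics the grouping step; it is not how Prometheus itself is implemented.

```python
from collections import defaultdict

# Sample series from the question: (labels, value)
samples = [
    ({"instance": "node1", "cpu": "0"}, 28),
    ({"instance": "node1", "cpu": "1"}, 19),
    ({"instance": "node2", "cpu": "0"}, 36),
    ({"instance": "node2", "cpu": "1"}, 31),
]

def avg_by(label, series):
    """Group series by one label and average the values in each group."""
    groups = defaultdict(list)
    for labels, value in series:
        groups[labels[label]].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

result = avg_by("instance", samples)
print(result)  # {'node1': 23.5, 'node2': 33.5}
```

The cpu label disappears from the result because it is not part of the grouping, which is why each node ends up with a single value.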
Q2: What method does Prometheus use to collect metrics from targets? A2: pull
Q3: An engineer forgot to address an alert. Based on the Alertmanager config below, how long will they need to wait to see the alert again?
route:
  receiver: pager
  group_by: [alertname]
  group_wait: 10s
  repeat_interval: 4h
  group_interval: 5m
  routes:
    - match:
        team: api
      receiver: api-pager
    - match:
        team: frontend
      receiver: frontend-pager
A3: 4h
Q4: Which query below will get all time series for metric node_disk_read_bytes_total for job=web and job=node?
A4: node_disk_read_bytes_total{job=~"web|node"}
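The =~ matcher treats its value as a fully anchored regular expression, so web|node matches exactly those two job names and nothing longer. The sketch below illustrates this with Python's re.fullmatch as a stand-in; Prometheus uses RE2 syntax, so this is only an approximation that happens to agree for simple patterns like this one. The job names other than web and node are made up for illustration.

```python
import re

def label_matches(pattern, value):
    """Approximate a PromQL =~ matcher: the pattern must match the whole value."""
    return re.fullmatch(pattern, value) is not None

# "webserver" and "db" are hypothetical job names added for contrast.
jobs = ["web", "node", "webserver", "db"]
selected = [job for job in jobs if label_matches("web|node", job)]
print(selected)  # ['web', 'node']
```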
Q5: What type of database does Prometheus use? A5: Time Series
Q6: Analyze the alertmanager configs below. For all the alerts that got generated, how many total notifications will be sent out?
route:
  receiver: general-email
  group_by: [alertname]
  routes:
    - receiver: frontend-email
      group_by: [env]
      matchers:
        - team: frontend
The following alerts get generated by Prometheus with the defined labels.
alert1
  team: frontend
  env: dev
alert2
  team: frontend
  env: dev
alert3
  team: frontend
  env: prod
alert4
  team: frontend
  env: prod
alert5
  team: frontend
  env: staging
A6: 3
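The count follows from how grouping works: every alert matches the frontend route, and that route groups by env, so one notification goes out per distinct env value. A small Python sketch of that counting, assuming the alert names are carried in the alertname label and that the hypothetical helper count_notifications stands in for Alertmanager's grouping:

```python
alerts = [
    {"alertname": "alert1", "team": "frontend", "env": "dev"},
    {"alertname": "alert2", "team": "frontend", "env": "dev"},
    {"alertname": "alert3", "team": "frontend", "env": "prod"},
    {"alertname": "alert4", "team": "frontend", "env": "prod"},
    {"alertname": "alert5", "team": "frontend", "env": "staging"},
]

def count_notifications(matching_alerts, group_by):
    """One notification per distinct combination of the group_by label values."""
    groups = {tuple(a.get(label, "") for label in group_by) for a in matching_alerts}
    return len(groups)

frontend = [a for a in alerts if a["team"] == "frontend"]
notifications = count_notifications(frontend, ["env"])
print(notifications)  # 3
```

Had the alerts stayed on the top-level route, which groups by alertname, the same helper would report five groups instead of three.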
Q7: What is the Prometheus client library used for? A7: Instrumenting applications to generate Prometheus metrics and to push metrics to the Pushgateway
Q8: Management has decided to offer a file upload service where the SLO states that 97% of all uploads should complete within 30s. A histogram metric is configured to track the upload time. Which of the following bucket configurations is recommended for the desired SLO? A8: 10, 25, 27, 30, 32, 35, 49, 50 [since histogram quantiles are approximations, make sure a bucket boundary is specified at the desired SLO value so you can tell whether the SLO has been met]
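With a bucket boundary exactly at 30, the share of uploads finishing within 30s can be read directly from the cumulative counts without any interpolation. The sketch below shows that arithmetic; all bucket counts are invented for illustration, and fraction_within is a hypothetical helper.

```python
# Hypothetical cumulative bucket counts: upper bound (le) -> observations at or below it.
buckets = {10: 400, 25: 900, 27: 930, 30: 975, 32: 985, 35: 990, 49: 998, 50: 1000}
total = 1000  # hypothetical total number of observations (the _count series)

def fraction_within(bound, bucket_counts, total_count):
    """Exact share of observations at or below a bound that is itself a bucket boundary."""
    return bucket_counts[bound] / total_count

slo_met = fraction_within(30, buckets, total) >= 0.97
print(slo_met)  # True
```

Without a boundary at 30, the value would have to be estimated from the neighbouring buckets, which is where the approximation error comes from.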
Q9: Which of the following is not a valid method for reloading alertmanager configuration? A9: hit the reload config button in alertmanager web ui
Q10: What two labels are assigned to every metric by default? A10: instance, job
Q11: What configuration will make it so Prometheus doesn’t scrape targets with a label of team: frontend?
# Option A:
relabel_configs:
  - source_labels: [team]
    regex: frontend
    action: drop
# Option B:
relabel_configs:
  - source_labels: [frontend]
    regex: team
    action: drop
# Option C:
metric_relabel_configs:
  - source_labels: [team]
    regex: frontend
    action: drop
# Option D:
relabel_configs:
  - match: [team]
    regex: frontend
    action: drop
A11: Option A [relabel_configs is where you will define which targets Prometheus should scrape]
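A drop rule joins the values of the source labels and discards the target when the anchored regex matches. The following Python sketch mimics that decision under simplifying assumptions: the addresses are made up, keep_target is a hypothetical helper, and Python's re module stands in for the RE2 engine Prometheus uses.

```python
import re

def keep_target(target_labels, source_labels, regex, separator=";"):
    """Return False when a drop rule matches the joined source label values."""
    joined = separator.join(target_labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, joined) is None

# Hypothetical targets discovered before relabeling.
targets = [
    {"__address__": "10.0.0.1:9100", "team": "frontend"},
    {"__address__": "10.0.0.2:9100", "team": "backend"},
]

# Option A: source label "team", regex "frontend".
scraped = [t["__address__"] for t in targets if keep_target(t, ["team"], "frontend")]
print(scraped)  # ['10.0.0.2:9100']
```

Option B swaps the label name and the value, so the joined source value is empty, the regex never matches, and no target is dropped.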
Q12: Where should alerting rules be defined?
A12: separate rules file
Q13: Which query below will give the 99% quantile of the metric http_requests_total?
A13: histogram_quantile(0.99, http_requests_total_bucket)
Q14: What metric should be used to track the uptime of a server? A14: counter
Q15: Which component of the Prometheus architecture should be used to collect metrics of short-lived jobs? A15: push gateway
Q16: What is the purpose of Prometheus scrape_interval?
A16: Defines how frequently to scrape a target
Q17: What does the following metric_relabel_config do?
scrape_configs:
  - job_name: example
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: database_errors_total
        action: replace
        target_label: __name__
        replacement: database_failures_total
A17: Renames the metric database_errors_total to database_failures_total
Q18: Which component of the Prometheus architecture should be used to automatically discover all nodes in a Kubernetes cluster? A18: service discovery
Q19: For a histogram metric, what are the different submetrics?
A19: _count [total number of observations], _bucket [number of observations for a specific bucket], _sum [sum of all observations]
Q20: What is the default web port of Prometheus? A20: 9090
Q21: Add an annotation to the alert called description that will print out a message that looks like this: Instance has low disk space on filesystem, current free space is at %
groups:
  - name: node
    rules:
      - alert: node_filesystem_free_percent
        expr: 100 * node_filesystem_free_bytes{job="node"} / node_filesystem_size_bytes{job="node"} < 10
# Examples of the two metrics used in the alert can be seen below.
# node_filesystem_free_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}
# node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}
# Choose the correct answer:
# Option A:
description: Instance << $Labels.instance >> has low disk space on filesystem << $Labels.mountpoint >>, current free space is at << .Value >>%
# Option B:
description: Instance {{ .Labels.instance }} has low disk space on filesystem {{ .Labels.mountpoint }}, current free space is at {{ .Value }}%
# Option C:
description: Instance {{ .Labels=instance }} has low disk space on filesystem {{ .Labels=mountpoint }}, current free space is at {{ .Value }}%
# Option D:
description: Instance {{ .instance }} has low disk space on filesystem {{ .mountpoint }}, current free space is at {{ .Value }}%
A21: Option B
Q22: What does the double underscore __ before a label name signify?
A22: The label is a reserved label
Q23: The metric http_errors_total has 3 labels: path, method, error. Which of the following queries will give the total number of errors for a path of /auth, method of POST, and error code of 401?
A23: http_errors_total{path="/auth", method="POST", code="401"}
Q24: What are the different states a Prometheus alert can be in? A24: inactive, pending, firing
Q25: Which of the following components is responsible for collecting metrics from an instance and exposing them in a format Prometheus expects? A25: exporters
Q26: Which of the following is not a valid time value to be used in a range selector? A26: 2mo
Q27: Analyze the example Alertmanager config and determine which receiver an alert with the following labels will be sent to when it arrives on Alertmanager: team: api and severity: critical
route:
  receiver: general-email
  routes:
    - receiver: frontend-email
      matchers:
        - team: frontend
      routes:
        - matchers:
            severity: critical
          receiver: frontend-pager
    - receiver: backend-email
      matchers:
        - team: backend
      routes:
        - matchers:
            severity: critical
          receiver: backend-pager
    - receiver: auth-email
      matchers:
        - team: auth
      routes:
        - matchers:
            severity: critical
          receiver: auth-pager
A27: general-email
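The routing decision can be traced as a depth-first walk: an alert descends into the first child route whose matchers all hold, and falls back to the current route's receiver when no child matches. Below is a minimal Python sketch of that walk, assuming the default behaviour where matching stops at the first matching sibling (no continue). The dictionary layout and the resolve helper are hypothetical simplifications of the config above.

```python
def matches(route, labels):
    """A route matches when every matcher equals the alert's label value."""
    return all(labels.get(k) == v for k, v in route.get("matchers", {}).items())

def resolve(route, labels):
    """Descend into the first matching child; otherwise use this route's receiver."""
    for child in route.get("routes", []):
        if matches(child, labels):
            return resolve(child, labels)
    return route["receiver"]

def team_route(team):
    # Builds one team branch with a nested critical-severity pager route.
    return {
        "receiver": f"{team}-email",
        "matchers": {"team": team},
        "routes": [{"receiver": f"{team}-pager", "matchers": {"severity": "critical"}}],
    }

tree = {
    "receiver": "general-email",
    "routes": [team_route("frontend"), team_route("backend"), team_route("auth")],
}

receiver = resolve(tree, {"team": "api", "severity": "critical"})
print(receiver)  # general-email
```

No branch has a matcher for team: api, so the severity matchers nested inside the team branches are never reached and the alert stays with the top-level receiver.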
Q28: A metric to track requests to an API, http_requests_total, is created. Which of the following would not be a good choice for a label?
A28: email
Q29: Which query below will return a range vector?
A29: node_boot_time_seconds[5m]
Q30: Based on the metrics below, which query will return the same result as the query database_write_timeouts / ignoring(error) database_error_total?
database_write_timeouts{instance="db1", job="db", error="212", type="mysql"} 12
database_error_total{instance="db1", job="db", type="mysql"} 67
A30: database_write_timeouts / on(instance, job, type) database_error_total
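Both modifiers decide which labels are compared when pairing series from the two sides: ignoring removes labels from the comparison, while on lists the only labels to compare. The Python sketch below builds the comparison key each way for the two series in the question; match_key is a hypothetical helper, and the metric name is left out because it does not take part in matching.

```python
def match_key(labels, on=None, ignoring=None):
    """Build the label set used to pair series across a binary operator."""
    if on is not None:
        return frozenset((k, v) for k, v in labels.items() if k in on)
    ignoring = ignoring or ()
    return frozenset((k, v) for k, v in labels.items() if k not in ignoring)

timeouts = {"instance": "db1", "job": "db", "error": "212", "type": "mysql"}
errors = {"instance": "db1", "job": "db", "type": "mysql"}

with_ignoring = match_key(timeouts, ignoring=["error"]) == match_key(errors, ignoring=["error"])
with_on = (match_key(timeouts, on=["instance", "job", "type"])
           == match_key(errors, on=["instance", "job", "type"]))
no_modifier = match_key(timeouts) == match_key(errors)
print(with_ignoring, with_on, no_modifier)  # True True False
```

Both modifiers leave instance, job, and type as the comparison key, so the two queries pair the same series; without a modifier the extra error label prevents any match.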
Q31: What is the purpose of the for attribute in a Prometheus alert rule? A31: Determines how long a rule must be true before firing an alert
Q32: Which query will give the sum of the sizes of all filesystems on the machine? The metric node_filesystem_size_bytes will list out all of the filesystems and their total size.
node_filesystem_size_bytes{device="/dev/sda2", fstype="vfat", instance="192.168.1.168:9100", mountpoint="/boot/efi"} 536834048
node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="192.168.1.168:9100", mountpoint="/"} 13268975616
node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run"} 727924736
node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/lock"} 5242880
node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/snapd/ns"} 727924736
node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/user/1000"} 727920640
A32: sum(node_filesystem_size_bytes{instance="192.168.1.168:9100"})
Q33: What are the 3 components of the Prometheus server? A33: retrieval node, TSDB, HTTP server
Q34: What selector will match on time series whose mountpoint label doesn’t start with /run?
node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", instance="node1", mountpoint="/boot/efi"}
node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", instance="node2", mountpoint="/boot/efi"}
node_filesystem_avail_bytes{device="/dev/sda3", fstype="ext4", instance="node1", mountpoint="/"}
node_filesystem_avail_bytes{device="/dev/sda3", fstype="ext4", instance="node2", mountpoint="/"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/lock"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/snapd/ns"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/user/1000"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/lock"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/snapd/ns"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/user/1000"}
A34: node_filesystem_avail_bytes{mountpoint!~"/run.*"}
Q35: Which statement is true about the rate/irate functions?
A35: rate() calculates the average rate over the entire interval, irate() calculates the rate only between the last two datapoints in an interval
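The difference is easiest to see on a counter that jumps partway through the window. The Python sketch below is a deliberately simplified model: it ignores counter resets and the extrapolation that the real rate() applies at the window edges, and the sample values and function names are invented for illustration.

```python
# Hypothetical counter samples inside one range window: (timestamp_seconds, value)
samples = [(0, 100), (15, 130), (30, 160), (45, 400), (60, 460)]

def simple_rate(points):
    """Average per-second increase between the first and last sample."""
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)

def simple_irate(points):
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = points[-2], points[-1]
    return (v1 - v0) / (t1 - t0)

print(simple_rate(samples))   # 6.0
print(simple_irate(samples))  # 4.0
```

The averaged value smooths over the burst between 30s and 45s, while the instantaneous value reflects only the most recent step, which is why the first suits alerting and slow-moving graphs and the second suits volatile, fast-moving ones.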
Q36: What is the default path Prometheus will scrape to collect metrics?
A36: /metrics
Q37: The following PromQL expression is trying to divide node_filesystem_avail_bytes by node_filesystem_size_bytes: node_filesystem_avail_bytes / node_filesystem_size_bytes. The PromQL expression does not return any results. Fix the expression so that it successfully divides the two metrics. This is what the two metrics look like before the division operation:
node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", class="SSD", instance="192.168.1.168:9100", job="test", mountpoint="/boot/efi"}
node_filesystem_size_bytes{device="/dev/sda2", fstype="vfat", instance="192.168.1.168:9100", job="test", mountpoint="/boot/efi"}
A37: node_filesystem_avail_bytes / ignoring(class) node_filesystem_size_bytes
Q38: What are the 3 components of observability? A38: logging, metrics, traces
Q39: Which of the following statements are true regarding alert labels and annotations?
route:
  receiver: staff
  group_by: ['severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - matchers:
        job: kubernetes
      receiver: infra
      group_by: ['severity']
A39: Alert labels can be used as metadata so alertmanager can match on them and perform routing policies, whereas annotations should be used for cosmetic descriptions of the alerts
Q40: The metric http_errors_total{code="404"} tracks the number of 404 errors a web server has seen. Which query returns the average rate of 404s the server has seen over the past 2 hours? Use a 2m sample range and a query interval of 1m:
A40: avg_over_time(rate(http_errors_total{code="404"}[2m]) [2h:1m])
[since we need the average for the past 2 hours, the first value in the subquery will be 2h and the second number is the query interval]
Q41: Which query will return all time series for the metric node_network_transmit_drop_total that are greater than 20 and less than 100?
A41: node_network_transmit_drop_total > 20 and node_network_transmit_drop_total < 100
Q42: What does the following metric_relabel_config do?
scrape_configs:
  - job_name: example
    metric_relabel_configs:
      - source_labels: [datacenter]
        regex: (.*)
        action: replace
        target_label: location
        replacement: dc-$1
A42: Writes the value of the datacenter label to a location label, with the value prefixed by dc-
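A replace action captures groups from the joined source label values and substitutes them into the replacement string, which is then written to the target label. Here is a rough Python sketch of that step; the label value east is invented, relabel_replace is a hypothetical helper, and Python's re module is only an approximation of the RE2 engine Prometheus uses.

```python
import re

def relabel_replace(labels, source_labels, regex, target_label, replacement, separator=";"):
    """Apply one replace rule and return a new label dict."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # regex did not match: labels are left unchanged
    value = replacement
    # Substitute $1, $2, ... with the captured groups.
    for index, group in enumerate(match.groups(), start=1):
        value = value.replace(f"${index}", group or "")
    updated = dict(labels)
    updated[target_label] = value
    return updated

# "east" is a hypothetical datacenter value.
before = {"datacenter": "east"}
after = relabel_replace(before, ["datacenter"], "(.*)", "location", "dc-$1")
print(after)  # {'datacenter': 'east', 'location': 'dc-east'}
```

In this sketch the source label is left in place, as the replace action only writes the target label; removing datacenter afterwards would need a separate rule.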
Q43: What type of data should Prometheus monitor? A43: numeric
Q44: Which type of observability would be used to track a request/transaction as it traverses a system? A44: traces
Q45: Add an annotation to the alert called description that will print out a message that looks like this: Instance has low disk space on filesystem , current free space is at %
groups:
  - name: node
    rules:
      - alert: node_filesystem_free_percent
        expr: 100 * node_filesystem_free_bytes{job="node"} / node_filesystem_size_bytes{job="node"} < 10
# Examples of the two metrics used in the alert can be seen below
# node_filesystem_free_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}
# node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}
# Choose the correct option:
#Option A:
description: Instance << $Labels.instance >> has low disk space on filesystem << $Labels.mountpoint >>, current free space is at << .Value >>%
#Option B:
description: Instance {{ .Labels.instance }} has low disk space on filesystem {{ .Labels.mountpoint }}, current free space is at {{ .Value }}%
#Option C:
description: Instance {{ .Labels=instance }} has low disk space on filesystem {{ .Labels=mountpoint }}, current free space is at {{ .Value }}%
#Option D:
description: Instance {{ .instance }} has low disk space on filesystem {{ .mountpoint }}, current free space is at {{ .Value }}%
A45: Option B
Q46: Regarding histogram and summary metrics, which of the following are true? A46: histogram is calculated server side and summary is calculated client side [for histograms, quantiles must be calculated server side, thus they are less taxing on client libraries, whereas summary metrics are the opposite]
Q47: What is this an example of? "Service provider guaranteed 99.999% uptime each month or else customer will be awarded $10k" A47: SLA
Q48: Which of the following is Prometheus’ built in dashboarding/visualization feature? A48: Console templates
Q49: Which query below will give the active bytes on instance 10.1.1.1:9100 45m ago?
A49: node_memory_Active_bytes{instance="10.1.1.1:9100"} offset 45m
Q50: What type of metric should be used for measuring internal temperature of a server? A50: gauge
Q51: What is the name of the cli utility that comes with Prometheus? A51: promtool
Q52: How can alertmanager prevent certain alerts from generating notification for a temporary period of time? A52: Configuring a silence
Q53: In the scrape configs for a Pushgateway, what is the purpose of honor_labels: true?
scrape_configs:
  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ["192.168.1.168:9091"]
A53: Allows metrics to specify the instance and job labels instead of pulling it from scrape_configs
Q54: Analyze the example Alertmanager config and determine which receiver an alert with the following labels will be sent to when it arrives on Alertmanager: team: backend and severity: critical
route:
  receiver: general-email
  routes:
    - receiver: frontend-email
      matchers:
        - team: frontend
      routes:
        - matchers:
            severity: critical
          receiver: frontend-pager
    - receiver: backend-email
      matchers:
        - team: backend
      routes:
        - matchers:
            severity: critical
          receiver: backend-pager
    - receiver: auth-email
      matchers:
        - team: auth
      routes:
        - matchers:
            severity: critical
          receiver: auth-pager
A54: backend-pager
Q55: Which of the following would make for a poor SLI? A55: high disk utilization [things like CPU, memory, disk utilization are poor as user may not experience any degradation of service during these events]
Q56: Which of the following is not a valid way to reload Prometheus configuration? A56: promtool config reload
Q57: Which of the following is not something that is tracked in a span within a trace? A57: complexity
Q58: You are writing your own exporter for a Redis database. Which of the following would be the correct name for a metric to represent the memory used by the Redis instance?
A58: redis_mem_used_bytes
[the first should be the app, second metric name, third the unit]
Q59: Which cli command can be used to verify/validate prometheus configurations?
A59: promtool check config
Q60: Which query will return targets who have more than 50 arp entries?
A60: node_arp_entries{job="node"} > 50
Thank you!