Last active
May 24, 2019 15:20
-
-
Save moosh3/8a601bb1f092d31687964aa35cc283ec to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
groups: | |
- name: postgresql.rules | |
rules: | |
- alert: PostgreSQL_UnusedReplicationSlot | |
expr: 'pg_replication_slots_active == 0' | |
for: 30m | |
labels: | |
severity: warn | |
channel: database | |
annotations: | |
description: | | |
Unused {{$labels.slot_type}} slot "{{$labels.slot_name}}" | |
on {{$labels.instance}} | |
- alert: PostgreSQL_XLOGConsumptionTooHigh | |
expr: | | |
rate(pg_xlog_position_bytes[1m]) > 2.0e+07 | |
AND ON(instance) | |
(pg_replication_is_replica == 0) | |
annotations: | |
description: | | |
XLOG is being generated at a rate of | |
{{ $value | humanize1024 }}B/s on {{$labels.instance}} | |
runbook: troubleshooting/postgres.md#replication-is-lagging-or-has-stopped | |
title: 'Postgres is generating XLOG too fast, | |
expect this to cause replication lag' | |
- alert: PostgreSQL_ReplicationLagTooLarge | |
expr: | | |
(pg_replication_lag > 120) | |
AND ON(instance) | |
(pg_replication_is_replica == 1) | |
annotations: | |
description: | | |
Replication lag on server {{$labels.instance}} is currently | |
{{$value | humanizeDuration }} | |
runbook: troubleshooting/postgres.md#replication-is-lagging-or-has-stopped | |
title: 'Postgres Replication lag is over 2 minutes' | |
- alert: PostgreSQL_StatementTimeout_Errors | |
expr: | | |
rate(postgresql_errors_total{type="statement_timeout"}[1m]) > 0.5 | |
for: 5m | |
labels: | |
severity: warn | |
channel: database | |
annotations: | |
descrition: | | |
Database {{$labels.instance}} is logging | |
{{ $value | printf "%.1f" }} statement timeouts per second | |
runbook: troubleshooting/postgresql.md#errors | |
title: 'Postgres transactions showing high rate of statement timeouts' | |
- alert: PostgreSQL_RollbackRateTooHigh | |
expr: | | |
rate(pg_stat_database_xact_rollback{datname="gitlabhq_production"}[5m]) | |
/ ON(instance, datname) | |
rate(pg_stat_database_xact_commit{datname="gitlabhq_production"}[5m]) | |
> 0.02 | |
for: 5m | |
labels: | |
severity: warn | |
channel: database | |
annotations: | |
description: | | |
Ratio of transactions being aborted compared to committed is | |
{{$value | printf "%.2f" }} on {{$labels.instance}} | |
runbook: troubleshooting/postgresql.md#errors | |
title: 'Postgres transaction rollback rate is high' | |
rules: | |
- alert: PostgreSQL_ConnectionsTooHigh | |
expr: | | |
sum(pg_stat_activity_count) BY (environment, instance) | |
> ON(instance) | |
pg_settings_max_connections * 0.75 | |
for: 10m | |
labels: | |
severity: warn | |
channel: database | |
annotations: | |
runbook: troubleshooting/postgresql.md#connections | |
title: | | |
Postgres has {{$value}} connections on {{$labels.instance}} | |
which is close to the maximum | |
- alert: PostgreSQL_CommitRateTooLow | |
expr: | | |
rate(pg_stat_database_xact_commit{datname="gitlabhq_production", | |
environment="prd"}[1m]) < 1000 | |
for: 2m | |
labels: | |
severity: warn | |
channel: database | |
annotations: | |
description: | | |
Commits/s on {{$labels.instance}} database {{$labels.datname}} | |
is {{$value | printf "%.0f" }} which is implausibly low. | |
Perhaps the application is unable to connect | |
runbook: troubleshooting/postgresql.md#availability | |
title: 'Postgres seems to be processing very few transactions' | |
- alert: PostgreSQL_XLOGConsumptionTooLow | |
expr: 'rate(pg_xlog_position_bytes{environment="prd"}[1m]) < 200000' | |
for: 2m | |
labels: | |
severity: warn | |
channel: database | |
annotations: | |
description: | | |
XLOG throughput is {{ $value | humanize1024 }}B/s | |
on {{$labels.instance}} which is unusually low. | |
Perhaps the application is unable to connect | |
runbook: troubleshooting/postgresql.md#availability | |
title: 'Postgres seems to be consuming XLOG very slowly' | |
- alert: PostgreSQL_FleetSizeChange | |
expr: 'postgres:databases != postgres:databases OFFSET 2m' | |
annotations: | |
description: 'There are now {{$value}} databases in "{{$labels.environment}}"' | |
title: 'Number of PostgreSQL Databases in {{$labels.environment}} has changed' | |
- alert: PostgreSQL_RoleChange | |
expr: 'pg_replication_is_replica and changes(pg_replication_is_replica[1m]) > 0' | |
title: 'Postgres Database replica promotion occurred in "{{$labels.environment}}"' | |
- alert: PostgreSQL_ConfigurationChange | |
expr: | | |
{__name__=~"pg_settings_.*"} != | |
ON(__name__, instance) | |
{__name__=~"pg_settings_.*",__name__!="pg_settings_transaction_read_only"} | |
OFFSET 10m | |
- alert: PostgreSQL_SplitBrain | |
expr: 'count(pg_replication_is_replica == 0) BY (environment) != 1' | |
annotations: | |
title: | | |
Split Brain: more than one postgres databases in environment | |
{{$labels.environment}} in read-write (primary) mode | |
- alert: PostgreSQL_SplitBrain_Replicas | |
expr: | | |
count( | |
count(pg_stat_wal_receiver_status >= 0) BY (environment, upstream_host) | |
) BY (environment) > 1 | |
annotations: | |
title: | | |
Split Brain: replicas in environment {{$labels.environment}} | |
have different upstream databases configured | |
- alert: PostgresSQL_XIDConsumptionTooLow | |
expr: 'rate(pg_txid_current{environment="prd"}[1m]) < 5' | |
for: 1m | |
labels: | |
severity: warn | |
channel: database | |
annotations: | |
description: | | |
TXID/s is {{ $value | printf "%.1f" }} on {{$labels.instance}} | |
which is unusually low. Perhaps the application is unable to connect | |
runbook: troubleshooting/postgresql.md#availability | |
title: 'Postgres seems to be consuming transaction IDs very slowly' | |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
pg_replication: | |
query: "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::INT as lag, CASE WHEN pg_is_in_recovery() THEN 1 ELSE 0 END as is_replica" | |
metrics: | |
- lag: | |
usage: "GAUGE" | |
description: "Replication lag behind primary in seconds" | |
- is_replica: | |
usage: "GAUGE" | |
description: "Indicates if this host is a replica" | |
# Note that upstream_host and slot_name labels will cause broken time | |
# series which can be worked around using "ignoring" but it seems | |
# better to have these labels and have to work around them than to | |
# lose the data? | |
# Also in 9.6 and prior this view, like the pg_stat_statements view | |
# below requires a hacky SECURITY DEFINER function to grant access to | |
# the data inside as they filter out data for non-superuser even if | |
# you've granted access explicitly. | |
pg_stat_wal_receiver: | |
query: | | |
SELECT case status when 'stopped' then 0 when 'starting' then 1 when 'streaming' then 2 when 'waiting' then 3 when 'restarting' then 4 when 'stopping' then 5 else -1 end as status, | |
(receive_start_lsn- '0/0') % (2^52)::bigint as receive_start_lsn, | |
receive_start_tli, | |
(received_lsn- '0/0') % (2^52)::bigint as received_lsn, | |
received_tli, | |
extract(epoch from last_msg_send_time) as last_msg_send_time, | |
extract(epoch from last_msg_receipt_time) as last_msg_receipt_time, | |
(latest_end_lsn - '0/0') % (2^52)::bigint as latest_end_lsn, | |
extract(epoch from latest_end_time) as latest_end_time, | |
substring(slot_name from 'repmgr_slot_([0-9]*)') as upstream_node, | |
trim(both '''' from substring(conninfo from 'host=([^ ]*)')) as upstream_host, | |
slot_name | |
FROM pg_stat_wal_receiver | |
metrics: | |
- status: | |
usage: "GAUGE" | |
description: "Activity status of the WAL receiver process (0=stopped 1=starting 2=streaming 3=waiting 4=restarting 5=stopping)" | |
- receive_start_lsn: | |
usage: "COUNTER" | |
description: "First transaction log position used when WAL receiver is started" | |
- receive_start_tli: | |
usage: "GAUGE" | |
description: "First timeline number used when WAL receiver is started" | |
- received_lsn: | |
usage: "COUNTER" | |
description: "Last transaction log position already received and flushed to disk, the initial value of this field being the first log position used when WAL receiver is started" | |
- received_tli: | |
usage: "GAUGE" | |
description: "Timeline number of last transaction log position received and flushed to disk, the initial value of this field being the timeline number of the first log position used when WAL receiver is started" | |
- last_msg_send_time: | |
usage: "COUNTER" | |
description: "Send time of last message received from origin WAL sender" | |
- last_msg_receipt_time: | |
usage: "COUNTER" | |
description: "Receipt time of last message received from origin WAL sender" | |
- latest_end_lsn: | |
usage: "COUNTER" | |
description: "Last transaction log position reported to origin WAL sender" | |
- latest_end_time: | |
usage: "COUNTER" | |
description: "Time of last transaction log position reported to origin WAL sender" | |
- upstream_node: | |
usage: "GAUGE" | |
description: "The repmgr node from the upstream slot name" | |
- upstream_host: | |
usage: "LABEL" | |
description: "The upstream host this node is replicating from" | |
- slot_name: | |
usage: "LABEL" | |
description: "The upstream slot_name this node is replicating from" | |
pg_postmaster: | |
query: "SELECT pg_postmaster_start_time as start_time_seconds from pg_postmaster_start_time()" | |
metrics: | |
- start_time_seconds: | |
usage: "GAUGE" | |
description: "Time at which postmaster started" | |
pg_stat_user_tables: | |
query: "SELECT schemaname, relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd, n_live_tup, n_dead_tup, n_mod_since_analyze, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze, vacuum_count, autovacuum_count, analyze_count, autoanalyze_count FROM pg_stat_user_tables" | |
metrics: | |
- schemaname: | |
usage: "LABEL" | |
description: "Name of the schema that this table is in" | |
- relname: | |
usage: "LABEL" | |
description: "Name of this table" | |
- seq_scan: | |
usage: "COUNTER" | |
description: "Number of sequential scans initiated on this table" | |
- seq_tup_read: | |
usage: "COUNTER" | |
description: "Number of live rows fetched by sequential scans" | |
- idx_scan: | |
usage: "COUNTER" | |
description: "Number of index scans initiated on this table" | |
- idx_tup_fetch: | |
usage: "COUNTER" | |
description: "Number of live rows fetched by index scans" | |
- n_tup_ins: | |
usage: "COUNTER" | |
description: "Number of rows inserted" | |
- n_tup_upd: | |
usage: "COUNTER" | |
description: "Number of rows updated" | |
- n_tup_del: | |
usage: "COUNTER" | |
description: "Number of rows deleted" | |
- n_tup_hot_upd: | |
usage: "COUNTER" | |
description: "Number of rows HOT updated (i.e., with no separate index update required)" | |
- n_live_tup: | |
usage: "GAUGE" | |
description: "Estimated number of live rows" | |
- n_dead_tup: | |
usage: "GAUGE" | |
description: "Estimated number of dead rows" | |
- n_mod_since_analyze: | |
usage: "GAUGE" | |
description: "Estimated number of rows changed since last analyze" | |
- last_vacuum: | |
usage: "GAUGE" | |
description: "Last time at which this table was manually vacuumed (not counting VACUUM FULL)" | |
- last_autovacuum: | |
usage: "GAUGE" | |
description: "Last time at which this table was vacuumed by the autovacuum daemon" | |
- last_analyze: | |
usage: "GAUGE" | |
description: "Last time at which this table was manually analyzed" | |
- last_autoanalyze: | |
usage: "GAUGE" | |
description: "Last time at which this table was analyzed by the autovacuum daemon" | |
- vacuum_count: | |
usage: "COUNTER" | |
description: "Number of times this table has been manually vacuumed (not counting VACUUM FULL)" | |
- autovacuum_count: | |
usage: "COUNTER" | |
description: "Number of times this table has been vacuumed by the autovacuum daemon" | |
- analyze_count: | |
usage: "COUNTER" | |
description: "Number of times this table has been manually analyzed" | |
- autoanalyze_count: | |
usage: "COUNTER" | |
description: "Number of times this table has been analyzed by the autovacuum daemon" | |
pg_statio_user_tables: | |
query: "SELECT schemaname, relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit, toast_blks_read, toast_blks_hit, tidx_blks_read, tidx_blks_hit FROM pg_statio_user_tables" | |
metrics: | |
- schemaname: | |
usage: "LABEL" | |
description: "Name of the schema that this table is in" | |
- relname: | |
usage: "LABEL" | |
description: "Name of this table" | |
- heap_blks_read: | |
usage: "COUNTER" | |
description: "Number of disk blocks read from this table" | |
- heap_blks_hit: | |
usage: "COUNTER" | |
description: "Number of buffer hits in this table" | |
- idx_blks_read: | |
usage: "COUNTER" | |
description: "Number of disk blocks read from all indexes on this table" | |
- idx_blks_hit: | |
usage: "COUNTER" | |
description: "Number of buffer hits in all indexes on this table" | |
- toast_blks_read: | |
usage: "COUNTER" | |
description: "Number of disk blocks read from this table's TOAST table (if any)" | |
- toast_blks_hit: | |
usage: "COUNTER" | |
description: "Number of buffer hits in this table's TOAST table (if any)" | |
- tidx_blks_read: | |
usage: "COUNTER" | |
description: "Number of disk blocks read from this table's TOAST table indexes (if any)" | |
- tidx_blks_hit: | |
usage: "COUNTER" | |
description: "Number of buffer hits in this table's TOAST table indexes (if any)" | |
pg_stat_statements: | |
query: | | |
SELECT | |
pg_get_userbyid(userid) as user, | |
pg_database.datname, | |
pg_stat_statements.queryid, | |
pg_stat_statements.calls, | |
pg_stat_statements.total_time as time_milliseconds, | |
pg_stat_statements.rows, | |
pg_stat_statements.shared_blks_hit, | |
pg_stat_statements.shared_blks_read, | |
pg_stat_statements.shared_blks_dirtied, | |
pg_stat_statements.shared_blks_written, | |
pg_stat_statements.local_blks_hit, | |
pg_stat_statements.local_blks_read, | |
pg_stat_statements.local_blks_dirtied, | |
pg_stat_statements.local_blks_written, | |
pg_stat_statements.temp_blks_read, | |
pg_stat_statements.temp_blks_written, | |
pg_stat_statements.blk_read_time, | |
pg_stat_statements.blk_write_time | |
FROM pg_stat_statements | |
JOIN pg_database | |
ON pg_database.oid = pg_stat_statements.dbid | |
metrics: | |
- user: | |
usage: "LABEL" | |
description: "The user who executed the statement" | |
- datname: | |
usage: "LABEL" | |
description: "The database in which the statement was executed" | |
- queryid: | |
usage: "LABEL" | |
description: "Internal hash code, computed from the statement's parse tree" | |
- calls: | |
usage: "COUNTER" | |
description: "Number of times executed" | |
- time_milliseconds: | |
usage: "COUNTER" | |
description: "Total time spent in the statement, in milliseconds" | |
- rows: | |
usage: "COUNTER" | |
description: "Total number of rows retrieved or affected by the statement" | |
- shared_blks_hit: | |
usage: "COUNTER" | |
description: "Total number of shared block cache hits by the statement" | |
- shared_blks_read: | |
usage: "COUNTER" | |
description: "Total number of shared blocks read by the statement" | |
- shared_blks_dirtied: | |
usage: "COUNTER" | |
description: "Total number of shared blocks dirtied by the statement" | |
- shared_blks_written: | |
usage: "COUNTER" | |
description: "Total number of shared blocks written by the statement" | |
- local_blks_hit: | |
usage: "COUNTER" | |
description: "Total number of local block cache hits by the statement" | |
- local_blks_read: | |
usage: "COUNTER" | |
description: "Total number of local blocks read by the statement" | |
- local_blks_dirtied: | |
usage: "COUNTER" | |
description: "Total number of local blocks dirtied by the statement" | |
- local_blks_written: | |
usage: "COUNTER" | |
description: "Total number of local blocks written by the statement" | |
- temp_blks_read: | |
usage: "COUNTER" | |
description: "Total number of temp blocks read by the statement" | |
- temp_blks_written: | |
usage: "COUNTER" | |
description: "Total number of temp blocks written by the statement" | |
- blk_read_time: | |
usage: "COUNTER" | |
description: "Total time the statement spent reading blocks, in milliseconds" | |
- blk_write_time: | |
usage: "COUNTER" | |
description: "Total time the statement spent writing blocks, in milliseconds" | |
pg_total_relation_size: | |
query: | | |
SELECT relnamespace::regnamespace as schemaname, | |
relname as relname, | |
pg_total_relation_size(oid) bytes | |
FROM pg_class | |
WHERE relkind = 'r'; | |
metrics: | |
- schemaname: | |
usage: "LABEL" | |
description: "Name of the schema that this table is in" | |
- relname: | |
usage: "LABEL" | |
description: "Name of this table" | |
- bytes: | |
usage: "GAUGE" | |
description: "total disk space usage for the specified table and associated indexes" | |
pg_blocked: | |
query: | | |
SELECT | |
count(blocked.transactionid) AS queries, | |
'__transaction__' AS table | |
FROM pg_catalog.pg_locks blocked | |
WHERE NOT blocked.granted AND locktype = 'transactionid' | |
GROUP BY locktype | |
UNION | |
SELECT | |
count(blocked.relation) AS queries, | |
blocked.relation::regclass::text AS table | |
FROM pg_catalog.pg_locks blocked | |
WHERE NOT blocked.granted AND locktype != 'transactionid' | |
GROUP BY relation | |
metrics: | |
- queries: | |
usage: "GAUGE" | |
description: "The current number of blocked queries" | |
- table: | |
usage: "LABEL" | |
description: "The table on which a query is blocked" | |
pg_oldest_blocked: | |
query: | | |
SELECT coalesce(extract('epoch' from max(clock_timestamp() - state_change)), 0) age_seconds | |
FROM pg_stat_activity | |
WHERE wait_event_type = 'Lock' | |
AND state='active' | |
metrics: | |
- age_seconds: | |
usage: "GAUGE" | |
description: "Largest number of seconds any transaction is currently waiting on a lock" | |
pg_slow: | |
query: | | |
SELECT COUNT(*) AS queries | |
FROM pg_stat_activity | |
WHERE state = 'active' AND (now() - query_start) > '1 seconds'::interval | |
metrics: | |
- queries: | |
usage: "GAUGE" | |
description: "Current number of slow queries" | |
pg_vacuum: | |
query: | | |
SELECT | |
COUNT(*) AS queries, | |
MAX(EXTRACT(EPOCH FROM (clock_timestamp() - query_start))) AS age_in_seconds | |
FROM pg_catalog.pg_stat_activity | |
WHERE state = 'active' AND trim(query) ~* '\AVACUUM (?!ANALYZE)' | |
metrics: | |
- queries: | |
usage: "GAUGE" | |
description: "The current number of VACUUM queries" | |
- age_in_seconds: | |
usage: "GAUGE" | |
description: "The current maximum VACUUM query age in seconds" | |
pg_vacuum_analyze: | |
query: | | |
SELECT | |
COUNT(*) AS queries, | |
MAX(EXTRACT(EPOCH FROM (clock_timestamp() - query_start))) AS age_in_seconds | |
FROM pg_catalog.pg_stat_activity | |
WHERE state = 'active' AND trim(query) ~* '\AVACUUM ANALYZE' | |
metrics: | |
- queries: | |
usage: "GAUGE" | |
description: "The current number of VACUUM ANALYZE queries" | |
- age_in_seconds: | |
usage: "GAUGE" | |
description: "The current maximum VACUUM ANALYZE query age in seconds" | |
pg_stuck_idle_in_transaction: | |
query: | | |
SELECT COUNT(*) AS queries | |
FROM pg_stat_activity | |
WHERE state = 'idle in transaction' AND (now() - query_start) > '10 minutes'::interval | |
metrics: | |
- queries: | |
usage: "GAUGE" | |
description: "Current number of queries that are stuck being idle in transactions" | |
# All xid and lsn metrics here are reported mod 2^52 to ensure they | |
# fit within a float for Prometheus :( Really annoying that counters | |
# aren't stored in a 64-bit integer. Note that for queries that report | |
# floats this only works because postgres_exporter does know to set | |
# extra_float_digits (which it sets to 2). So they don't print in | |
# exponential notation and precision is maintained up to 2^53-1. | |
pg_txid: | |
query: | | |
SELECT | |
CASE WHEN pg_is_in_recovery() THEN 'NaN'::float ELSE txid_current() % (2^52)::bigint END AS current, | |
CASE WHEN pg_is_in_recovery() THEN 'NaN'::float ELSE txid_snapshot_xmin(txid_current_snapshot()) % (2^52)::bigint END AS xmin, | |
CASE WHEN pg_is_in_recovery() THEN 'NaN'::float ELSE txid_current() - txid_snapshot_xmin(txid_current_snapshot()) END AS xmin_age | |
metrics: | |
- current: | |
usage: "COUNTER" | |
description: "Current 64-bit transaction id of the query used to collect this metric (truncated to low 52 bits)" | |
- xmin: | |
usage: "COUNTER" | |
description: "Oldest transaction id of a transaction still in progress, i.e. not known committed or aborted (truncated to low 52 bits)" | |
- xmin_age: | |
usage: "GAUGE" | |
description: "Age of oldest transaction still not committed or aborted measured in transaction ids" | |
pg_xlog_position: | |
query: | | |
SELECT CASE | |
WHEN pg_is_in_recovery() | |
THEN (pg_last_wal_replay_lsn() - '0/0') % (2^52)::bigint | |
ELSE (pg_current_wal_lsn() - '0/0') % (2^52)::bigint | |
END AS bytes | |
metrics: | |
- bytes: | |
usage: "COUNTER" | |
description: "Postgres LSN (log sequence number) being generated on primary or replayed on replica (truncated to low 52 bits)" | |
# This should be pushed upstream. We really just want "active" so we | |
# can monitor for orphaned slots causing xlog space usage to grow | |
pg_replication_slots: | |
query: | | |
SELECT slot_name, slot_type, | |
case when active then 1.0 else 0.0 end AS active, | |
age(xmin) AS xmin_age, | |
age(catalog_xmin) AS catalog_xmin_age, | |
CASE WHEN pg_is_in_recovery() THEN pg_last_wal_replay_lsn() ELSE pg_current_wal_lsn() END - restart_lsn AS restart_lsn_bytes, | |
CASE WHEN pg_is_in_recovery() THEN pg_last_wal_replay_lsn() ELSE pg_current_wal_lsn() END - confirmed_flush_lsn AS confirmed_flush_lsn_bytes | |
FROM pg_replication_slots | |
metrics: | |
- slot_name: | |
usage: "LABEL" | |
description: "Slot Name" | |
- slot_type: | |
usage: "LABEL" | |
description: "Slot Type" | |
- active: | |
usage: "GAUGE" | |
description: "Boolean flag indicating whether this slot has a consumer streaming from it" | |
- xmin_age: | |
usage: "GAUGE" | |
description: "Age of oldest transaction that cannot be vacuumed due to this replica" | |
- catalog_xmin_age: | |
usage: "GAUGE" | |
description: "Age of oldest transaction that cannot be vacuumed from catalogs due to this replica (used by logical replication)" | |
- restart_lsn_bytes: | |
usage: "GAUGE" | |
description: "Amount of data on in xlog that must be this replica may need to complete recovery" | |
- confirmed_flush_lsn_bytes: | |
usage: "GAUGE" | |
description: "Amount of data on in xlog that must be this replica has not yet received" | |
pg_stat_ssl: | |
query: | | |
SELECT pid, bits, | |
CASE WHEN ssl THEN 1.0 ELSE 0.0 END AS active, | |
CASE WHEN compression THEN 1.0 ELSE 0.0 END AS compression | |
FROM pg_stat_ssl | |
metrics: | |
- pid: | |
usage: "LABEL" | |
description: "Process ID of a backend or WAL sender process" | |
- active: | |
usage: "GAUGE" | |
description: "Boolean flag indicating if SSL is used on this connection" | |
- bits: | |
usage: "GAUGE" | |
description: "Number of bits in the encryption algorithm is in use" | |
- compression: | |
usage: "GAUGE" | |
description: "Boolean flag indicating if SSL compression is in use" | |
# NOT | |
# DO NOT ADD GENERAL PURPOSE GITLAB MONITORING HERE | |
# NOT | |
# | |
# This file is for PostgreSQL statistics exporting. Keep in mind any metrics exported from here will be: | |
# a) exported from every database including read-only replicas | |
# b) scraped frequently based on database-monitoring needs | |
# c) Queried live on demand for every scrape | |
# | |
# Moreover this file does not have tests and if you get this file | |
# wrong you can easily disable all database monitoring metrics. | |
# | |
# If you want to monitor the Gitlab system as a whole you're almost | |
# certainly looking for gitlab-monitor: | |
# | |
# https://gitlab.com/gitlab-org/gitlab-monitor | |
# | |
# There you will be able to write Ruby code, cache data, access Redis, | |
# etc. | |
# NOT | |
# DO NOT ADD GENERAL PURPOSE GITLAB MONITORING HERE | |
# NOT | |
# | |
# This file is for PostgreSQL statistics exporting. Keep in mind any metrics exported from here will be: | |
# a) exported from every database including read-only replicas | |
# b) scraped frequently based on database-monitoring needs | |
# c) Queried live on demand for every scrape | |
# | |
# Moreover this file does not have tests and if you get this file | |
# wrong you can easily disable all database monitoring metrics. | |
# | |
# If you want to monitor the Gitlab system as a whole you're almost | |
# certainly looking for gitlab-monitor: | |
# | |
# https://gitlab.com/gitlab-org/gitlab-monitor | |
# | |
# There you will be able to write Ruby code, cache data, access Redis, | |
# etc. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
metrics: | |
- user: | |
usage: "LABEL" | |
description: "The user who executed the statement" | |
- datname: | |
usage: "LABEL" | |
description: "The database in which the statement was executed" | |
- queryid: | |
usage: "LABEL" | |
description: "Internal hash code, computed from the statement's parse tree" | |
- calls: | |
usage: "COUNTER" | |
description: "Number of times executed" | |
- time_milliseconds: | |
usage: "COUNTER" | |
description: "Total time spent in the statement, in milliseconds" | |
- rows: | |
usage: "COUNTER" | |
description: "Total number of rows retrieved or affected by the statement" | |
- shared_blks_hit: | |
usage: "COUNTER" | |
description: "Total number of shared block cache hits by the statement" | |
- shared_blks_read: | |
usage: "COUNTER" | |
description: "Total number of shared blocks read by the statement" | |
- shared_blks_dirtied: | |
usage: "COUNTER" | |
description: "Total number of shared blocks dirtied by the statement" | |
- shared_blks_written: | |
usage: "COUNTER" | |
description: "Total number of shared blocks written by the statement" | |
- local_blks_hit: | |
usage: "COUNTER" | |
description: "Total number of local block cache hits by the statement" | |
- local_blks_read: | |
usage: "COUNTER" | |
description: "Total number of local blocks read by the statement" | |
- local_blks_dirtied: | |
usage: "COUNTER" | |
description: "Total number of local blocks dirtied by the statement" | |
- local_blks_written: | |
usage: "COUNTER" | |
description: "Total number of local blocks written by the statement" | |
- temp_blks_read: | |
usage: "COUNTER" | |
description: "Total number of temp blocks read by the statement" | |
- temp_blks_written: | |
usage: "COUNTER" | |
description: "Total number of temp blocks written by the statement" | |
- blk_read_time: | |
usage: "COUNTER" | |
description: "Total time the statement spent reading blocks, in milliseconds" | |
- blk_write_time: | |
usage: "COUNTER" | |
description: "Total time the statement spent writing blocks, in milliseconds" |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
pg_stat_statements: | |
query: | | |
SELECT | |
pg_get_userbyid(userid) as user, | |
pg_database.datname, | |
pg_stat_statements.queryid, | |
pg_stat_statements.calls, | |
pg_stat_statements.total_time as time_milliseconds, | |
pg_stat_statements.rows, | |
pg_stat_statements.shared_blks_hit, | |
pg_stat_statements.shared_blks_read, | |
pg_stat_statements.shared_blks_dirtied, | |
pg_stat_statements.shared_blks_written, | |
pg_stat_statements.local_blks_hit, | |
pg_stat_statements.local_blks_read, | |
pg_stat_statements.local_blks_dirtied, | |
pg_stat_statements.local_blks_written, | |
pg_stat_statements.temp_blks_read, | |
pg_stat_statements.temp_blks_written, | |
pg_stat_statements.blk_read_time, | |
pg_stat_statements.blk_write_time | |
FROM pg_stat_statements | |
JOIN pg_database | |
ON pg_database.oid = pg_stat_statements.dbid |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
pg_replication_slots: | |
query: | | |
SELECT slot_name, slot_type, | |
case when active then 1.0 else 0.0 end AS active, | |
age(xmin) AS xmin_age, | |
age(catalog_xmin) AS catalog_xmin_age, | |
FROM pg_replication_slots | |
metrics: | |
- slot_name: | |
usage: "LABEL" | |
description: "Slot Name" | |
- slot_type: | |
usage: "LABEL" | |
description: "Slot Type" | |
- active: | |
usage: "GAUGE" | |
description: "Boolean flag indicating whether this slot has a consumer streaming from it" | |
- xmin_age: | |
usage: "GAUGE" | |
description: "Age of oldest transaction that cannot be vacuumed due to this replica" | |
- catalog_xmin_age: | |
usage: "GAUGE" | |
description: "Age of oldest transaction that cannot be vacuumed from catalogs due to this replica (used by logical replication)" |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
pg_replication: | |
query: | | |
SELECT EXTRACT(epoch FROM ( | |
now() - pg_last_xact_replay_timestamp() | |
))::int AS lag, | |
CASE WHEN pg_is_in_recovery() THEN 1 ELSE 0 END AS is_replica | |
metrics: | |
- lag: | |
usage: "GAUGE" | |
description: "Replication lag behind primary in seconds" | |
- is_replica: | |
usage: "GAUGE" | |
description: "Indicates if this host is a replica" |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment