PostgreSQL, especially versions 9.4 and later, comes with many system views, functions and features that expose rich information about the internal workings of your PostgreSQL cluster.
Traditionally, there have been scripts – like the venerable check_postgres and others, usually passed on from DBA to DBA and shared on public and private wikis – that collect or check some of these metrics and information.
There hasn’t been a standard, easy way to collect all the information available from your PostgreSQL server, in a form that DBAs can use for troubleshooting and ops folks can use in scripting for automation and monitoring.
pgmetrics
pgmetrics aims to be the simplest way to collect comprehensive information and statistics from a running PostgreSQL cluster. It can display this information in a human-readable text format for troubleshooting and diagnosis, or export it as a JSON file for scripting, automation and monitoring.
We built pgmetrics based on our interactions with the customers of our server, database and service monitoring product OpsDash, as well as from our own needs arising from using PostgreSQL in production.
pgmetrics is open-source (Apache 2.0 licensed), is available now and lives at pgmetrics.io.
Usage
pgmetrics is a single, dependency-free, statically-linked executable that can be easily deployed to any server or machine by just copying it there. It can be invoked just like psql, and takes nearly the same set of command-line parameters and environment variables that you’re used to:
$ pgmetrics --help
pgmetrics collects PostgreSQL information and metrics.
Usage:
pgmetrics [OPTION]... [DBNAME]
General options:
-t, --timeout=SECS individual query timeout in seconds (default: 5)
-S, --no-sizes don't collect tablespace and relation sizes
-i, --input=FILE don't connect to db, instead read and display
this previously saved JSON file
-V, --version output version information, then exit
-?, --help[=options] show this help, then exit
--help=variables list environment variables, then exit
Output options:
-f, --format=FORMAT output format; "human", or "json" (default: "human")
-l, --toolong=SECS for human output, transactions running longer than
this are considered too long (default: 60)
-o, --output=FILE write output to the specified file
--no-pager do not invoke the pager for tty output
Connection options:
-h, --host=HOSTNAME database server host or socket directory (default: "/var/run/postgresql")
-p, --port=PORT database server port (default: 5432)
-U, --username=USERNAME database user name (default: "vagrant")
--no-password never prompt for password
For more information, visit <https://pgmetrics.io>.
For example, to collect metrics for the databases “main” and “reports” on a server, you might use:
$ pgmetrics -h mypgserver -U alice main reports
Password:
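Because the same report can be exported as JSON (-f json), it is easy to drive scripting, automation and monitoring from pgmetrics. Below is a minimal, illustrative Go sketch (not part of pgmetrics itself) that runs pgmetrics with the -f and -o options shown in the help above, decodes the report generically without assuming any particular field names, and lists the top-level sections:

// A minimal sketch: run pgmetrics, then pull the report into a script.
// Assumes pgmetrics is on the PATH and can connect with default settings;
// no particular JSON field names are assumed - the report is decoded
// generically and its top-level sections are listed.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"os/exec"
)

func main() {
	// Write a JSON report to a file, using only the documented -f and -o options.
	cmd := exec.Command("pgmetrics", "--no-password", "-f", "json", "-o", "report.json")
	if err := cmd.Run(); err != nil {
		log.Fatalf("pgmetrics failed: %v", err)
	}

	data, err := os.ReadFile("report.json")
	if err != nil {
		log.Fatal(err)
	}

	// Decode generically so no assumptions are made about the report's schema.
	var report map[string]interface{}
	if err := json.Unmarshal(data, &report); err != nil {
		log.Fatalf("bad JSON: %v", err)
	}

	// List the top-level sections available for scripting and monitoring.
	for key := range report {
		fmt.Println(key)
	}
}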
Have a look at the information and metrics pgmetrics can collect and display in the sections below:
Overall Status
Basic, overall information about the server includes the last checkpoint and last transaction times, checkpoint lag, active backend count and transaction ID age. The spread of transaction IDs in use must stay well within 2 billion (2^31) to avoid transaction ID wraparound issues.
PostgreSQL Cluster:
Name: staging1
Server Version: 10.2
Server Started: 5 Mar 2018 3:35:49 AM (50 minutes ago)
System Identifier: 6529298009807657133
Timeline: 1
Last Checkpoint: 5 Mar 2018 4:25:28 AM (49 seconds ago)
Prior LSN: 0/2808EF98
REDO LSN: 0/29882FC0 (24 MiB since Prior)
Checkpoint LSN: 0/2BDB92C0 (37 MiB since REDO)
Transaction IDs: 548 to 352506 (diff = 351958)
Last Transaction: 5 Mar 2018 4:26:17 AM (now)
Active Backends: 5 (max 100)
Recovery Mode? no
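To make the wraparound limit concrete: the report above shows transaction IDs ranging from 548 to 352506, a span of 351958, which is a tiny fraction of the roughly 2.1 billion (2^31) of headroom available before wraparound becomes a problem. A minimal, illustrative Go sketch of such a headroom check (the 75% alert threshold is an arbitrary example, not anything pgmetrics defines):

package main

import "fmt"

// xidHeadroom reports how much of the ~2^31 transaction ID headroom a
// cluster has consumed, given the oldest and newest transaction IDs from
// the pgmetrics report.
func xidHeadroom(oldest, newest uint64) float64 {
	const limit = 1 << 31 // wraparound trouble begins ~2.1 billion XIDs past the oldest unfrozen XID
	return float64(newest-oldest) / float64(limit)
}

func main() {
	// Values taken from the sample report above.
	used := xidHeadroom(548, 352506)
	fmt.Printf("%.4f%% of wraparound headroom used\n", used*100)
	if used > 0.75 { // arbitrary alert threshold, for illustration only
		fmt.Println("WARNING: transaction ID age is approaching wraparound")
	}
}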
Replication Status
When run on a primary with outgoing streaming replication, pgmetrics displays the state and progress of each replication connection – the lag, in bytes, between the current WAL position and the positions that have been written, flushed and replayed at the destination.
Outgoing Replication Stats:
Destination #1:
User: vagrant
Application: pg_receivewal
Client Address:
State: streaming
Started At: 5 Mar 2018 3:51:20 AM (34 minutes ago)
Sent LSN: 0/2CC2E000
Written Until: 0/2CB88000 (write lag = 664 KiB)
Flushed Until: 0/2C000000 (flush lag = 12 MiB)
Replayed Until:
Sync Priority: 0
Sync State: async
Destination #2:
User: vagrant
Application: pg_recvlogical
Client Address:
State: streaming
Started At: 5 Mar 2018 3:52:22 AM (33 minutes ago)
Sent LSN: 0/2CC2DF60
Written Until: 0/2CC11FF0 (write lag = 112 KiB)
Flushed Until: 0/2CC11FF0 (no flush lag)
Replayed Until:
Sync Priority: 0
Sync State: async
Destination #3:
User: vagrant
Application: walreceiver
Client Address:
State: streaming
Started At: 5 Mar 2018 3:55:26 AM (30 minutes ago)
Sent LSN: 0/2CC2E000
Written Until: 0/2CC2E000 (no write lag)
Flushed Until: 0/2CC2E000 (no flush lag)
Replayed Until: 0/2CC2DF60 (replay lag = 160 B)
Sync Priority: 0
Sync State: async
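The byte figures above (write, flush and replay lag) are simply differences between two LSNs; an LSN such as 0/2CC2E000 is a 64-bit WAL position printed as two hexadecimal halves. A small illustrative Go sketch of that arithmetic, using the sent and written LSNs of Destination #1 above:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts a PostgreSQL LSN of the form "X/Y" (two hex halves of a
// 64-bit WAL position) into a single integer byte offset.
func parseLSN(s string) (uint64, error) {
	parts := strings.Split(s, "/")
	if len(parts) != 2 {
		return 0, fmt.Errorf("bad LSN %q", s)
	}
	hi, err := strconv.ParseUint(parts[0], 16, 32)
	if err != nil {
		return 0, err
	}
	lo, err := strconv.ParseUint(parts[1], 16, 32)
	if err != nil {
		return 0, err
	}
	return hi<<32 | lo, nil
}

func main() {
	// Sent vs. written LSNs for Destination #1 in the sample output above.
	sent, _ := parseLSN("0/2CC2E000")
	written, _ := parseLSN("0/2CB88000")
	fmt.Printf("write lag = %d bytes (%.0f KiB)\n", sent-written, float64(sent-written)/1024)
}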
Replication Slots
On the primary side, a list of replication slots, along with their progress information, is captured.
Physical Replication Slots:
+------------+--------+---------------+-------------+-----------+
| Name | Active | Oldest Txn ID | Restart LSN | Temporary |
+------------+--------+---------------+-------------+-----------+
| slave_slot | yes | | 0/2CC2E000 | no |
| wal_arch | yes | | 0/2C000000 | no |
+------------+--------+---------------+-------------+-----------+
Logical Replication Slots:
+----------+---------------+----------+--------+---------------+-------------+---------------+-----------+
| Name | Plugin | Database | Active | Oldest Txn ID | Restart LSN | Flushed Until | Temporary |
+----------+---------------+----------+--------+---------------+-------------+---------------+-----------+
| logslot1 | test_decoding | bench | yes | | 0/28510670 | 0/2CC11FF0 | no |
+----------+---------------+----------+--------+---------------+-------------+---------------+-----------+
Standby Replication Status
When pgmetrics is run on a hot standby, it collects the recovery and replication status. This shows how much data has been received and replayed at the standby end. If replication slots are used, richer data is available as seen below:
Recovery Status:
Replay paused: no
Received LSN: 0/2EA4A000
Replayed LSN: 0/2EA49FB8 (lag = 72 B)
Last Replayed Txn: 5 Mar 2018 4:26:42 AM (now)
Incoming Replication Stats:
Status: streaming
Received LSN: 0/2EA4A000 (started at 0/11000000, 474 MiB)
Timeline: 1 (was 1 at start)
Latency: 11.505ms
Replication Slot: slave_slot
WAL Archiving
When WAL archiving is enabled, these stats are also collected and displayed. The number of WAL files in the pg_wal directory (pg_xlog in older PostgreSQL versions), as well as the number of “ready” files, is counted.
Relevant configuration settings are also displayed in this section. pgmetrics collects all settings, including any changes from their defaults.
WAL Files:
WAL Archiving? yes
WAL Files: 17
Ready Files: 0
Archive Rate: 0.91 per min
Last Archived: 5 Mar 2018 4:26:05 AM (12 seconds ago)
Last Failure:
Totals: 46 succeeded, 0 failed
Totals Since: 5 Mar 2018 3:35:50 AM (50 minutes ago)
+--------------------+---------------+
| Setting | Value |
+--------------------+---------------+
| wal_level | logical |
| archive_timeout | 120 |
| wal_compression | on |
| max_wal_size | 1024 (16 GiB) |
| min_wal_size | 80 (1.3 GiB) |
| checkpoint_timeout | 60 |
| full_page_writes | on |
| wal_keep_segments | 10 |
+--------------------+---------------+
BG Writer
The stats for the bgwriter process include the distribution of scheduled and requested checkpoints, write frequency and amounts, buffer writes segregated by cause, and other information.
BG Writer:
Checkpoint Rate: 1.05 per min
Average Write: 12 MiB per checkpoint
Total Checkpoints: 47 sched (88.7%) + 6 req (11.3%) = 53
Total Write: 1.6 GiB, @ 540 KiB per sec
Buffers Allocated: 116427 (910 MiB)
Buffers Written: 85061 chkpt (41.6%) + 72964 bgw (35.7%) + 46451 be (22.7%)
Clean Scan Stops: 174
BE fsyncs: 0
Counts Since: 5 Mar 2018 3:35:49 AM (50 minutes ago)
+------------------------------+--------------+
| Setting | Value |
+------------------------------+--------------+
| bgwriter_delay | 200 msec |
| bgwriter_flush_after | 64 (512 KiB) |
| bgwriter_lru_maxpages | 100 |
| bgwriter_lru_multiplier | 2 |
| block_size | 8192 |
| checkpoint_timeout | 60 sec |
| checkpoint_completion_target | 0.5 |
+------------------------------+--------------+
Backends
The report about active backends highlights common causes for concern, like transactions that have been open too long or are idling. Processes waiting for various reasons, including locks, are also called out:
Backends:
Total Backends: 4 (4.0% of max 100)
Problematic: 3 waiting, 2 xact too long, 2 idle in xact
Waiting:
+------+---------+------+-------------+----------+---------------------+-----------------------+
| PID | User | App | Client Addr | Database | Wait | Query Start |
+------+---------+------+-------------+----------+---------------------+-----------------------+
| 7024 | vagrant | psql | | postgres | Client / ClientRead | 5 Mar 2018 5:01:05 AM |
| 7210 | vagrant | psql | | postgres | Lock / relation | 5 Mar 2018 5:00:35 AM |
| 7213 | vagrant | psql | | postgres | Client / ClientRead | 5 Mar 2018 5:00:26 AM |
+------+---------+------+-------------+----------+---------------------+-----------------------+
Long Running (>60 sec) Transactions:
+------+---------+------+-------------+----------+--------------------------------------+
| PID | User | App | Client Addr | Database | Transaction Start |
+------+---------+------+-------------+----------+--------------------------------------+
| 7210 | vagrant | psql | | postgres | 5 Mar 2018 4:59:57 AM (1 minute ago) |
| 7213 | vagrant | psql | | postgres | 5 Mar 2018 5:00:10 AM (1 minute ago) |
+------+---------+------+-------------+----------+--------------------------------------+
Idling in Transaction:
+------+---------+------+-------------+----------+----------+-----------------------+
| PID | User | App | Client Addr | Database | Aborted? | State Change |
+------+---------+------+-------------+----------+----------+-----------------------+
| 7024 | vagrant | psql | | postgres | no | 5 Mar 2018 5:01:05 AM |
| 7213 | vagrant | psql | | postgres | no | 5 Mar 2018 5:00:26 AM |
+------+---------+------+-------------+----------+----------+-----------------------+
Vacuum Progress
Information about ongoing vacuum and autovacuum jobs is also captured by pgmetrics. This is helpful in diagnosing “stuck” vacuum jobs and in making an educated guess about when ongoing jobs will finish (one rough way to do that is sketched after the sample output below).
Vacuum Progress:
Vacuum Process #1:
Phase: scanning heap
Database: postgres
Table:
Scan Progress: 15369 of 21589 (71.2% complete)
Heap Blks Vac'ed: 15368 of 21589
Idx Vac Cycles: 0
Dead Tuples: 53
Dead Tuples Max: 291
+------------------------------+----------------+
| Setting | Value |
+------------------------------+----------------+
| maintenance_work_mem | 65536 (64 MiB) |
| autovacuum | on |
| autovacuum_analyze_threshold | 50 |
| autovacuum_vacuum_threshold | 50 |
| autovacuum_freeze_max_age | 200000000 |
| autovacuum_max_workers | 3 |
| autovacuum_naptime | 60 sec |
| vacuum_freeze_min_age | 50000000 |
| vacuum_freeze_table_age | 150000000 |
+------------------------------+----------------+
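One rough way to estimate when a vacuum will finish from the scan progress above: if the heap scan is a fraction p complete after running for a known elapsed time, the remaining time is about elapsed × (1 − p) / p, assuming a constant scan rate. An illustrative Go sketch (the 10-minute elapsed time is a made-up figure, not part of the pgmetrics output):

package main

import (
	"fmt"
	"time"
)

// estimateRemaining projects how much longer a vacuum scan will take,
// assuming a constant scan rate, from the number of blocks done so far,
// the total block count, and the elapsed running time.
func estimateRemaining(done, total float64, elapsed time.Duration) time.Duration {
	if done <= 0 {
		return 0 // no basis for an estimate yet
	}
	frac := done / total
	return time.Duration(float64(elapsed) * (1 - frac) / frac)
}

func main() {
	// Scan progress from the sample output above (15369 of 21589 blocks);
	// the 10-minute elapsed time is a hypothetical figure for illustration.
	remaining := estimateRemaining(15369, 21589, 10*time.Minute)
	fmt.Printf("estimated time remaining: %s\n", remaining.Round(time.Second))
}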
Roles
All the roles (users and groups) in the cluster, including group memberships and attributes like superuser and connection limit, are also captured by pgmetrics. The display is a bit too wide for this blog post, so we’re omitting it here.
Tablespaces
Each tablespace, its location and the size consumed (as reported by pg_tablespace_size) is collected by pgmetrics. If run locally, pgmetrics also examines the mounted filesystem where the tablespace is located and reports the disk and inode usage for that filesystem.
Tablespaces:
+------------+---------+-------------------------------+---------+----------------------------+-------------------------+
| Name | Owner | Location | Size | Disk Used | Inode Used |
+------------+---------+-------------------------------+---------+----------------------------+-------------------------+
| pg_default | vagrant | $PGDATA = /home/vagrant/data1 | 249 MiB | 3.4 GiB (39.3%) of 8.7 GiB | 59889 (10.3%) of 584064 |
| pg_global | vagrant | $PGDATA = /home/vagrant/data1 | 573 KiB | 3.4 GiB (39.3%) of 8.7 GiB | 59889 (10.3%) of 584064 |
| s1 | vagrant | /dev/shm/s93 | 3.5 MiB | 10 MiB (1.0%) of 1002 MiB | 21 (0.0%) of 256561 |
+------------+---------+-------------------------------+---------+----------------------------+-------------------------+
Databases
The stats for each database include the commit ratio, cache efficiency, the age of the oldest active transaction ID, the size in bytes and more. Also included are the list of installed extensions, the cache efficiency of sequence objects and the list of disabled triggers.
Database #1:
Name: postgres
Owner: vagrant
Tablespace: pg_default
Connections: 3 (no max limit)
Frozen Xid Age: 484454
Transactions: 159 (97.5%) commits, 4 (2.5%) rollbacks
Cache Hits: 99.2%
Rows Changed: ins 83.3%, upd 0.0%, del 16.7%
Total Temp: 1.3 MiB in 1 files
Problems: 0 deadlocks, 0 conflicts
Totals Since: 5 Mar 2018 3:36:08 AM (1 hour ago)
Size: 11 MiB
Sequences:
+---------------+------------+
| Sequence | Cache Hits |
+---------------+------------+
| seqtest_a_seq | 50.0% |
+---------------+------------+
Installed Extensions:
+---------+---------+------------------------------+
| Name | Version | Comment |
+---------+---------+------------------------------+
| plpgsql | 1.0 | PL/pgSQL procedural language |
+---------+---------+------------------------------+
Disabled Triggers:
+------+---------------+-----------+
| Name | Table | Procedure |
+------+---------------+-----------+
| tr1 | public.trtest | trigfn |
+------+---------------+-----------+
Tables and Indexes
For each table, the last vacuum and analyze information, as well as estimates of live and dead rows, are collected. Stats like the percentage of updates that are HOT updates, cache efficiency for the table and its indexes, size, bloat, the use of sequential and index scans etc. are included.
Stats for each index associated with the table, like cache efficiency and rows fetched per scan, are also listed.
The bloat figure calculated by pgmetrics uses the query taken from check_postgres.
Table #2 in "bench":
Name: bench.public.pgbench_tellers
Manual Vacuums: 2, last 40 minutes ago
Manual Analyze: 1, last 51 minutes ago
Auto Vacuums: 6, last 12 minutes ago
Auto Analyze: 11, last 12 minutes ago
Post-Analyze: 99.3% est. rows modified
Row Estimate: 0.7% live of total 14709
Rows Changed: ins 0.0%, upd 75.8%, del 0.0%
HOT Updates: 75.8% of all updates
Seq Scans: 152045, 100.0 rows/scan
Idx Scans: 332274, 1.0 rows/scan
Cache Hits: 100.0% (idx=100.0%)
Size: 4.3 MiB
Bloat: 5.6 MiB (131.6%)
+----------------------+------------+--------+----------------+-------------------+
| Index | Cache Hits | Scans | Rows Read/Scan | Rows Fetched/Scan |
+----------------------+------------+--------+----------------+-------------------+
| pgbench_tellers_pkey | 100.0% | 332274 | 107.2 | 1.0 |
+----------------------+------------+--------+----------------+-------------------+
System Information
Finally, pgmetrics also captures system-level information (if run locally, of course).
System Information:
Hostname: stretch
CPU Cores: 2 x Intel(R) Core(TM) i5-3450 CPU @ 3.10GHz
Load Average: 0.07
Memory: used=174 MiB, free=76 MiB, buff=40 MiB, cache=1.7 GiB
Swap: used=2.4 MiB, free=1020 MiB
+---------------------------------+-----------------+
| Setting | Value |
+---------------------------------+-----------------+
| shared_buffers | 16384 (128 MiB) |
| work_mem | 4096 (4.0 MiB) |
| maintenance_work_mem | 65536 (64 MiB) |
| temp_buffers | 1024 (8.0 MiB) |
| autovacuum_work_mem | -1 |
| temp_file_limit | -1 |
| max_worker_processes | 8 |
| autovacuum_max_workers | 3 |
| max_parallel_workers_per_gather | 2 |
| effective_io_concurrency | 1 |
+---------------------------------+-----------------+
Availability
pgmetrics is available as a single binary for Linux, Windows and macOS, for 64-bit platforms. You can download these from the GitHub releases page. You can also easily build pgmetrics for other platforms yourself using a Go development environment; read more here.
System metrics (disk space, memory usage etc) are currently supported only for Linux servers.
pgmetrics currently runs on PostgreSQL versions from 9.3 to 10. Patches to make it work on other versions are welcome.
Get Involved!
We’d be happy to hear your bug reports, suggestions and feedback, and to incorporate them to make pgmetrics even more useful. Find out more at the pgmetrics home page, pgmetrics.io.
- Join the pgmetrics-users mailing list (Google groups) for announcements and discussions.
- Report bugs and suggestions to the GitHub project’s issue tracker.
- Submit patches as pull requests via GitHub.
- Tell your friends about pgmetrics!