PostgreSQL, especially versions 9.4 and later, comes with many system views, functions and features that expose rich information about the internal workings of your PostgreSQL cluster.
Traditionally, there have been scripts – like the venerable check_postgres and others, usually passed on from DBA to DBA and shared on public and private wikis – that collect or check some of these metrics and information.
There hasn’t been a standard, easy way to collect all the information available from your PostgreSQL server, in a form that DBAs can use for troubleshooting and ops folks can use in scripting for automation and monitoring.
pgmetrics
pgmetrics aims to be the simplest way to collect comprehensive information and statistics from a running PostgreSQL cluster. It can display this information in a human-readable text format for troubleshooting and diagnosis, or export it as a JSON file for scripting, automation and monitoring.
We built pgmetrics based on our interactions with the customers of our server, database and service monitoring product OpsDash, as well as from our own needs arising from using PostgreSQL in production.
pgmetrics is open-source (Apache 2.0 licensed), is available now and lives at pgmetrics.io.
Usage
pgmetrics is a single, dependency-free, statically-linked executable that can be easily deployed to any server or machine by just copying it there. It can be invoked just like psql, and takes nearly the same set of command-line parameters and environment variables that you’re used to:
$ pgmetrics --help
pgmetrics collects PostgreSQL information and metrics.
Usage:
pgmetrics [OPTION]... [DBNAME]
General options:
-t, --timeout=SECS individual query timeout in seconds (default: 5)
-S, --no-sizes don't collect tablespace and relation sizes
-i, --input=FILE don't connect to db, instead read and display
this previously saved JSON file
-V, --version output version information, then exit
-?, --help[=options] show this help, then exit
--help=variables list environment variables, then exit
Output options:
-f, --format=FORMAT output format; "human", or "json" (default: "human")
-l, --toolong=SECS for human output, transactions running longer than
this are considered too long (default: 60)
-o, --output=FILE write output to the specified file
--no-pager do not invoke the pager for tty output
Connection options:
-h, --host=HOSTNAME database server host or socket directory (default: "/var/run/postgresql")
-p, --port=PORT database server port (default: 5432)
-U, --username=USERNAME database user name (default: "vagrant")
--no-password never prompt for password
For more information, visit <https://pgmetrics.io>.
For example, to collect metrics for the databases “main” and “reports” on a server, you might use:
$ pgmetrics -h mypgserver -U alice main reports
Password:
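Because the same report can be exported as JSON (-f json), it is easy to drive scripting, automation and monitoring from pgmetrics. Below is a minimal, illustrative Go sketch (not part of pgmetrics itself) that runs pgmetrics with the -f and -o options shown in the help above, decodes the report generically without assuming any particular field names, and lists the top-level sections:

// A minimal sketch: run pgmetrics, then pull the report into a script.
// Assumes pgmetrics is on the PATH and can connect with default settings;
// no particular JSON field names are assumed - the report is decoded
// generically and its top-level sections are listed.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"os/exec"
)

func main() {
	// Write a JSON report to a file, using only the documented -f and -o options.
	cmd := exec.Command("pgmetrics", "--no-password", "-f", "json", "-o", "report.json")
	if err := cmd.Run(); err != nil {
		log.Fatalf("pgmetrics failed: %v", err)
	}

	data, err := os.ReadFile("report.json")
	if err != nil {
		log.Fatal(err)
	}

	// Decode generically so no assumptions are made about the report's schema.
	var report map[string]interface{}
	if err := json.Unmarshal(data, &report); err != nil {
		log.Fatalf("bad JSON: %v", err)
	}

	// List the top-level sections available for scripting and monitoring.
	for key := range report {
		fmt.Println(key)
	}
}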
Have a look at the information and metrics pgmetrics can collect and display in the sections below:
Overall Status
Basic, overall information about the server includes the last checkpoint and last transaction times, checkpoint lag, active backend count and transaction ID age. The spread of transaction IDs in use must stay well within 2 billion (2^31) to avoid transaction ID wraparound issues.
PostgreSQL Cluster:
Name: staging1
Server Version: 10.2
Server Started: 5 Mar 2018 3:35:49 AM (50 minutes ago)
System Identifier: 6529298009807657133
Timeline: 1
Last Checkpoint: 5 Mar 2018 4:25:28 AM (49 seconds ago)
Prior LSN: 0/2808EF98
REDO LSN: 0/29882FC0 (24 MiB since Prior)
Checkpoint LSN: 0/2BDB92C0 (37 MiB since REDO)
Transaction IDs: 548 to 352506 (diff = 351958)
Last Transaction: 5 Mar 2018 4:26:17 AM (now)
Active Backends: 5 (max 100)
Recovery Mode? no
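To make the wraparound limit concrete: the report above shows transaction IDs ranging from 548 to 352506, a span of 351958, which is a tiny fraction of the roughly 2.1 billion (2^31) of headroom available before wraparound becomes a problem. A minimal, illustrative Go sketch of such a headroom check (the 75% alert threshold is an arbitrary example, not anything pgmetrics defines):

package main

import "fmt"

// xidHeadroom reports how much of the ~2^31 transaction ID headroom a
// cluster has consumed, given the oldest and newest transaction IDs from
// the pgmetrics report.
func xidHeadroom(oldest, newest uint64) float64 {
	const limit = 1 << 31 // wraparound trouble begins ~2.1 billion XIDs past the oldest unfrozen XID
	return float64(newest-oldest) / float64(limit)
}

func main() {
	// Values taken from the sample report above.
	used := xidHeadroom(548, 352506)
	fmt.Printf("%.4f%% of wraparound headroom used\n", used*100)
	if used > 0.75 { // arbitrary alert threshold, for illustration only
		fmt.Println("WARNING: transaction ID age is approaching wraparound")
	}
}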
Replication Status
When run on a primary with outgoing streaming replication, pgmetrics displays the state and progress of each replication connection – the lag, in bytes, between the current WAL position and the positions that have been written, flushed and replayed at the destination.
Outgoing Replication Stats:
Destination #1:
User: vagrant
Application: pg_receivewal
Client Address:
State: streaming
Started At: 5 Mar 2018 3:51:20 AM (34 minutes ago)
Sent LSN: 0/2CC2E000
Written Until: 0/2CB88000 (write lag = 664 KiB)
Flushed Until: 0/2C000000 (flush lag = 12 MiB)
Replayed Until:
Sync Priority: 0
Sync State: async
Destination #2:
User: vagrant
Application: pg_recvlogical
Client Address:
State: streaming
Started At: 5 Mar 2018 3:52:22 AM (33 minutes ago)
Sent LSN: 0/2CC2DF60
Written Until: 0/2CC11FF0 (write lag = 112 KiB)
Flushed Until: 0/2CC11FF0 (no flush lag)
Replayed Until:
Sync Priority: 0
Sync State: async
Destination #3:
User: vagrant
Application: walreceiver
Client Address:
State: streaming
Started At: 5 Mar 2018 3:55:26 AM (30 minutes ago)
Sent LSN: 0/2CC2E000
Written Until: 0/2CC2E000 (no write lag)
Flushed Until: 0/2CC2E000 (no flush lag)
Replayed Until: 0/2CC2DF60 (replay lag = 160 B)
Sync Priority: 0
Sync State: async
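The byte figures above (write, flush and replay lag) are simply differences between two LSNs; an LSN such as 0/2CC2E000 is a 64-bit WAL position printed as two hexadecimal halves. A small illustrative Go sketch of that arithmetic, using the sent and written LSNs of Destination #1 above:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts a PostgreSQL LSN of the form "X/Y" (two hex halves of a
// 64-bit WAL position) into a single integer byte offset.
func parseLSN(s string) (uint64, error) {
	parts := strings.Split(s, "/")
	if len(parts) != 2 {
		return 0, fmt.Errorf("bad LSN %q", s)
	}
	hi, err := strconv.ParseUint(parts[0], 16, 32)
	if err != nil {
		return 0, err
	}
	lo, err := strconv.ParseUint(parts[1], 16, 32)
	if err != nil {
		return 0, err
	}
	return hi<<32 | lo, nil
}

func main() {
	// Sent vs. written LSNs for Destination #1 in the sample output above.
	sent, _ := parseLSN("0/2CC2E000")
	written, _ := parseLSN("0/2CB88000")
	fmt.Printf("write lag = %d bytes (%.0f KiB)\n", sent-written, float64(sent-written)/1024)
}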
Replication Slots
On the primary side, a list of replication slots, along with their progress information, is captured.
Physical Replication Slots:
+------------+--------+---------------+-------------+-----------+
| Name | Active | Oldest Txn ID | Restart LSN | Temporary |
+------------+--------+---------------+-------------+-----------+
| slave_slot | yes | | 0/2CC2E000 | no |
| wal_arch | yes | | 0/2C000000 | no |
+------------+--------+---------------+-------------+-----------+
Logical Replication Slots:
+----------+---------------+----------+--------+---------------+-------------+---------------+-----------+
| Name | Plugin | Database | Active | Oldest Txn ID | Restart LSN | Flushed Until | Temporary |
+----------+---------------+----------+--------+---------------+-------------+---------------+-----------+
| logslot1 | test_decoding | bench | yes | | 0/28510670 | 0/2CC11FF0 | no |
+----------+---------------+----------+--------+---------------+-------------+---------------+-----------+
Standby Replication Status
When pgmetrics is run on a hot standby, it collects the recovery and replication status. This shows how much data has been received and replayed at the standby end. If replication slots are used, richer data is available as seen below:
Recovery Status:
Replay paused: no
Received LSN: 0/2EA4A000
Replayed LSN: 0/2EA49FB8 (lag = 72 B)
Last Replayed Txn: 5 Mar 2018 4:26:42 AM (now)
Incoming Replication Stats:
Status: streaming
Received LSN: 0/2EA4A000 (started at 0/11000000, 474 MiB)
Timeline: 1 (was 1 at start)
Latency: 11.505ms
Replication Slot: slave_slot
WAL Archiving
When WAL archiving is enabled, these stats are also collected and displayed. The number of WAL files in the pg_wal directory (pg_xlog in older PostgreSQL versions), as well as the number of “ready” files, is counted.
Relevant configuration settings are also displayed in this section. pgmetrics collects all settings, including any changes from their defaults.
WAL Files:
WAL Archiving? yes
WAL Files: 17
Ready Files: 0
Archive Rate: 0.91 per min
Last Archived: 5 Mar 2018 4:26:05 AM (12 seconds ago)
Last Failure:
Totals: 46 succeeded, 0 failed
Totals Since: 5 Mar 2018 3:35:50 AM (50 minutes ago)
+--------------------+---------------+
| Setting | Value |
+--------------------+---------------+
| wal_level | logical |
| archive_timeout | 120 |
| wal_compression | on |
| max_wal_size | 1024 (16 GiB) |
| min_wal_size | 80 (1.3 GiB) |
| checkpoint_timeout | 60 |
| full_page_writes | on |
| wal_keep_segments | 10 |
+--------------------+---------------+
BG Writer
The stats for the bgwriter process include the distribution of scheduled and requested checkpoints, write frequency and amounts, buffer writes segregated by cause, and other information.
BG Writer:
Checkpoint Rate: 1.05 per min
Average Write: 12 MiB per checkpoint
Total Checkpoints: 47 sched (88.7%) + 6 req (11.3%) = 53
Total Write: 1.6 GiB, @ 540 KiB per sec
Buffers Allocated: 116427 (910 MiB)
Buffers Written: 85061 chkpt (41.6%) + 72964 bgw (35.7%) + 46451 be (22.7%)
Clean Scan Stops: 174
BE fsyncs: 0
Counts Since: 5 Mar 2018 3:35:49 AM (50 minutes ago)
+------------------------------+--------------+
| Setting | Value |
+------------------------------+--------------+
| bgwriter_delay | 200 msec |
| bgwriter_flush_after | 64 (512 KiB) |
| bgwriter_lru_maxpages | 100 |
| bgwriter_lru_multiplier | 2 |
| block_size | 8192 |
| checkpoint_timeout | 60 sec |
| checkpoint_completion_target | 0.5 |
+------------------------------+--------------+
Backends
The report about active backends highlights common causes for concern, like transactions that have been open too long or are idling. Processes waiting for various reasons, including locks, are also called out:
Backends:
Total Backends: 4 (4.0% of max 100)
Problematic: 3 waiting, 2 xact too long, 2 idle in xact
Waiting:
+------+---------+------+-------------+----------+---------------------+-----------------------+
| PID | User | App | Client Addr | Database | Wait | Query Start |
+------+---------+------+-------------+----------+---------------------+-----------------------+
| 7024 | vagrant | psql | | postgres | Client / ClientRead | 5 Mar 2018 5:01:05 AM |
| 7210 | vagrant | psql | | postgres | Lock / relation | 5 Mar 2018 5:00:35 AM |
| 7213 | vagrant | psql | | postgres | Client / ClientRead | 5 Mar 2018 5:00:26 AM |
+------+---------+------+-------------+----------+---------------------+-----------------------+
Long Running (>60 sec) Transactions:
+------+---------+------+-------------+----------+--------------------------------------+
| PID | User | App | Client Addr | Database | Transaction Start |
+------+---------+------+-------------+----------+--------------------------------------+
| 7210 | vagrant | psql | | postgres | 5 Mar 2018 4:59:57 AM (1 minute ago) |
| 7213 | vagrant | psql | | postgres | 5 Mar 2018 5:00:10 AM (1 minute ago) |
+------+---------+------+-------------+----------+--------------------------------------+
Idling in Transaction:
+------+---------+------+-------------+----------+----------+-----------------------+
| PID | User | App | Client Addr | Database | Aborted? | State Change |
+------+---------+------+-------------+----------+----------+-----------------------+
| 7024 | vagrant | psql | | postgres | no | 5 Mar 2018 5:01:05 AM |
| 7213 | vagrant | psql | | postgres | no | 5 Mar 2018 5:00:26 AM |
+------+---------+------+-------------+----------+----------+-----------------------+
Vacuum Progress
Information about ongoing vacuum and autovacuum jobs is also captured by pgmetrics. This is helpful in diagnosing “stuck” vacuum jobs and in making an educated guess about when ongoing jobs will finish (one rough way to do that is sketched after the sample output below).
Vacuum Progress:
Vacuum Process #1:
Phase: scanning heap
Database: postgres
Table:
Scan Progress: 15369 of 21589 (71.2% complete)
Heap Blks Vac'ed: 15368 of 21589
Idx Vac Cycles: 0
Dead Tuples: 53
Dead Tuples Max: 291
+------------------------------+----------------+
| Setting | Value |
+------------------------------+----------------+
| maintenance_work_mem | 65536 (64 MiB) |
| autovacuum | on |
| autovacuum_analyze_threshold | 50 |
| autovacuum_vacuum_threshold | 50 |
| autovacuum_freeze_max_age | 200000000 |
| autovacuum_max_workers | 3 |
| autovacuum_naptime | 60 sec |
| vacuum_freeze_min_age | 50000000 |
| vacuum_freeze_table_age | 150000000 |
+------------------------------+----------------+
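One rough way to estimate when a vacuum will finish from the scan progress above: if the heap scan is a fraction p complete after running for a known elapsed time, the remaining time is about elapsed × (1 − p) / p, assuming a constant scan rate. An illustrative Go sketch (the 10-minute elapsed time is a made-up figure, not part of the pgmetrics output):

package main

import (
	"fmt"
	"time"
)

// estimateRemaining projects how much longer a vacuum scan will take,
// assuming a constant scan rate, from the number of blocks done so far,
// the total block count, and the elapsed running time.
func estimateRemaining(done, total float64, elapsed time.Duration) time.Duration {
	if done <= 0 {
		return 0 // no basis for an estimate yet
	}
	frac := done / total
	return time.Duration(float64(elapsed) * (1 - frac) / frac)
}

func main() {
	// Scan progress from the sample output above (15369 of 21589 blocks);
	// the 10-minute elapsed time is a hypothetical figure for illustration.
	remaining := estimateRemaining(15369, 21589, 10*time.Minute)
	fmt.Printf("estimated time remaining: %s\n", remaining.Round(time.Second))
}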
Roles
All the roles (users and groups) in the cluster, including group memberships and attributes like superuser and connection limit, are also captured by pgmetrics. The display is a bit too wide for this blog post, so we’re omitting it here.
Tablespaces
Each tablespace, its location and the size consumed (as reported by pg_tablespace_size) is collected by pgmetrics. If run locally, pgmetrics also examines the mounted filesystem where the tablespace is located and reports the disk and inode usage for that filesystem.
Tablespaces:
+------------+---------+-------------------------------+---------+----------------------------+-------------------------+
| Name | Owner | Location | Size | Disk Used | Inode Used |
+------------+---------+-------------------------------+---------+----------------------------+-------------------------+
| pg_default | vagrant | $PGDATA = /home/vagrant/data1 | 249 MiB | 3.4 GiB (39.3%) of 8.7 GiB | 59889 (10.3%) of 584064 |
| pg_global | vagrant | $PGDATA = /home/vagrant/data1 | 573 KiB | 3.4 GiB (39.3%) of 8.7 GiB | 59889 (10.3%) of 584064 |
| s1 | vagrant | /dev/shm/s93 | 3.5 MiB | 10 MiB (1.0%) of 1002 MiB | 21 (0.0%) of 256561 |
+------------+---------+-------------------------------+---------+----------------------------+-------------------------+
Databases
The stats for each database include the commit ratio, cache efficiency, the age of the oldest active transaction ID, the size in bytes and more. Also included are the list of installed extensions, the cache efficiency of sequence objects and the list of disabled triggers.
Database #1:
Name: postgres
Owner: vagrant
Tablespace: pg_default
Connections: 3 (no max limit)
Frozen Xid Age: 484454
Transactions: 159 (97.5%) commits, 4 (2.5%) rollbacks
Cache Hits: 99.2%
Rows Changed: ins 83.3%, upd 0.0%, del 16.7%
Total Temp: 1.3 MiB in 1 files
Problems: 0 deadlocks, 0 conflicts
Totals Since: 5 Mar 2018 3:36:08 AM (1 hour ago)
Size: 11 MiB
Sequences:
+---------------+------------+
| Sequence | Cache Hits |
+---------------+------------+
| seqtest_a_seq | 50.0% |
+---------------+------------+
Installed Extensions:
+---------+---------+------------------------------+
| Name | Version | Comment |
+---------+---------+------------------------------+
| plpgsql | 1.0 | PL/pgSQL procedural language |
+---------+---------+------------------------------+
Disabled Triggers:
+------+---------------+-----------+
| Name | Table | Procedure |
+------+---------------+-----------+
| tr1 | public.trtest | trigfn |
+------+---------------+-----------+
Tables and Indexes
For each table, the last vacuum and analyze information, as well as estimates of live and dead rows, are collected. Stats like the percentage of updates that are HOT updates, cache efficiency for the table and its indexes, size, bloat, the use of sequential and index scans etc. are included.
Stats for each index associated with the table, like cache efficiency and rows fetched per scan, are also listed.
The bloat figure calculated by pgmetrics uses the query taken from check_postgres.
Table #2 in "bench":
Name: bench.public.pgbench_tellers
Manual Vacuums: 2, last 40 minutes ago
Manual Analyze: 1, last 51 minutes ago
Auto Vacuums: 6, last 12 minutes ago
Auto Analyze: 11, last 12 minutes ago
Post-Analyze: 99.3% est. rows modified
Row Estimate: 0.7% live of total 14709
Rows Changed: ins 0.0%, upd 75.8%, del 0.0%
HOT Updates: 75.8% of all updates
Seq Scans: 152045, 100.0 rows/scan
Idx Scans: 332274, 1.0 rows/scan
Cache Hits: 100.0% (idx=100.0%)
Size: 4.3 MiB
Bloat: 5.6 MiB (131.6%)
+----------------------+------------+--------+----------------+-------------------+
| Index | Cache Hits | Scans | Rows Read/Scan | Rows Fetched/Scan |
+----------------------+------------+--------+----------------+-------------------+
| pgbench_tellers_pkey | 100.0% | 332274 | 107.2 | 1.0 |
+----------------------+------------+--------+----------------+-------------------+
System Information
Finally, pgmetrics also captures system-level information (if run locally, of course).
System Information:
Hostname: stretch
CPU Cores: 2 x Intel(R) Core(TM) i5-3450 CPU @ 3.10GHz
Load Average: 0.07
Memory: used=174 MiB, free=76 MiB, buff=40 MiB, cache=1.7 GiB
Swap: used=2.4 MiB, free=1020 MiB
+---------------------------------+-----------------+
| Setting | Value |
+---------------------------------+-----------------+
| shared_buffers | 16384 (128 MiB) |
| work_mem | 4096 (4.0 MiB) |
| maintenance_work_mem | 65536 (64 MiB) |
| temp_buffers | 1024 (8.0 MiB) |
| autovacuum_work_mem | -1 |
| temp_file_limit | -1 |
| max_worker_processes | 8 |
| autovacuum_max_workers | 3 |
| max_parallel_workers_per_gather | 2 |
| effective_io_concurrency | 1 |
+---------------------------------+-----------------+
Availability
pgmetrics is available as a single binary for Linux, Windows and macOS, for 64-bit platforms. You can download these from the GitHub releases page. You can also easily build pgmetrics for other platforms yourself using a Go development environment; read more here.
System metrics (disk space, memory usage etc) are currently supported only for Linux servers.
pgmetrics currently runs on PostgreSQL versions from 9.3 to 10. Patches to make it work on other versions are welcome.
Get Involved!
We’d be happy to hear your bug reports, suggestions and feedback, and to incorporate them to make pgmetrics even more useful. Find out more at the pgmetrics home page, pgmetrics.io.
- Join the pgmetrics-users mailing list (Google groups) for announcements and discussions.
- Report bugs and suggestions to the GitHub project’s issue tracker.
- Submit patches as pull requests via GitHub.
- Tell your friends about pgmetrics!