
Guide · Database monitoring

How to monitor a Postgres database without an APM

Your HTTP healthcheck is green, your TCP probe is green, and Postgres is quietly two hours behind on replication. Here's the minimum-viable Postgres health monitoring setup that catches the silent failures — without paying for Datadog DBM or installing an APM agent.

Published 2026-05-22 · ~10 min read · StatusPulse Team

"The app is up" doesn't mean Postgres is healthy

Most monitoring setups stop at one of two checks: an HTTP probe on /healthz that returns 200, or a TCP probe on port 5432 that confirms the socket is accepting connections. Both can be green for hours while Postgres is in a state nobody would call healthy.

The failure modes a TCP or HTTP probe misses, in production, every week:

  • Replication is broken. The primary is fine, the replica accepts connections, the WAL receiver died at 03:14 and the standby is now a frozen snapshot from before lunch. Reads still go through. They're just stale.
  • Autovacuum has fallen behind. A long-running transaction held a snapshot for six hours, dead tuples piled up, table bloat went from 4% to 60%, and the planner started picking sequential scans on what used to be index lookups.
  • Connection pool is saturated. Your app servers are returning 503 because PgBouncer or the in-process pool is queuing every request behind a leaked transaction. The database itself is idle — the pool is the bottleneck and your HTTP probe sometimes wins the queue race.
  • A slow query is piling up. A missing index on a hot path, or a planner regression after the last ANALYZE, and suddenly every checkout takes 8 seconds. The HTTP probe still loads the homepage in 80 ms, because the homepage doesn't hit that query.
  • The database is in read-only recovery. A managed-service failover happened, the new primary hasn't been promoted yet, and every write fails with cannot execute INSERT in a read-only transaction. The socket is open, the TCP probe is delighted, the app is on fire.

None of these show up as "Postgres is down." All of them show up as "the product is broken." If you wait for your customers to tell you on Twitter, you'll find out about every single one of them.

What "healthy Postgres" actually looks like

Forget the dashboard with 80 metrics for a moment. The signals that matter for "is Postgres healthy right now" boil down to six numbers:

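  1. Connection success. Can a real client open a fresh connection, authenticate, and run a query within a few seconds? This covers TCP, TLS, and auth in one check. It does not prove the server is writable on its own: a hot standby in recovery still answers read-only queries, so that case needs an explicit pg_is_in_recovery() check.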
  2. Latency on a sentinel query. A trivial SELECT 1 end-to-end gives you the floor latency. When it doubles, something is wrong — saturated WAL writer, full disk queue, runaway query starving everyone else.
  3. Replication lag. Either seconds behind the primary or bytes of unreplayed WAL. Either metric works; what you care about is lag that keeps growing instead of settling back toward zero.
  4. Active connection count. How many of your max_connections are in use, and how many are in idle in transaction (the bad kind). Crossing 80% of the cap is your warning shot.
  5. Long-running transactions. Anything older than a few minutes is suspicious. Anything older than an hour is blocking autovacuum and probably holding row locks.
  6. Autovacuum recency. When was the last autovacuum on your hottest tables? If the answer is "yesterday" and you write a million rows an hour, dead tuples are piling up, and if autoanalyze is just as stale, the planner statistics are already lying to you.

Notice that all six are answered by SQL queries against Postgres's own catalogs. You don't need an agent. You don't need an eBPF probe. You need a thing that runs a SELECT on a schedule and tells you when the answer changes.
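
As a rough illustration of how little machinery this takes, the sketch below pulls several of those numbers in one round trip. It assumes PostgreSQL 10 or later (for the backend_type column) and a probe user that can read other sessions' statistics; the column aliases are invented for this example and are not a StatusPulse format.

```sql
-- One-shot health snapshot. Aliases are illustrative only.
SELECT
  pg_is_in_recovery()                          AS in_recovery,
  current_setting('max_connections')::int      AS max_connections,
  (SELECT count(*)
     FROM pg_stat_activity
    WHERE backend_type = 'client backend')     AS client_backends,
  (SELECT count(*)
     FROM pg_stat_activity
    WHERE state = 'idle in transaction')       AS idle_in_transaction,
  -- Age of the oldest open client transaction, 0 when there is none.
  (SELECT COALESCE(max(EXTRACT(EPOCH FROM (now() - xact_start)))::int, 0)
     FROM pg_stat_activity
    WHERE backend_type = 'client backend'
      AND xact_start IS NOT NULL
      AND pid <> pg_backend_pid())             AS oldest_xact_seconds;
```

In practice it is usually easier to alert on one number per probe, which is why the probes later in this guide split these out.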

Why a generic APM is overkill for most teams

The textbook answer to "monitor Postgres" is to buy Datadog, install the agent, enable Database Monitoring (DBM), and let it ingest query samples, execution plans, and wait events. It's a great product. It's also wildly oversized for most Series A-B teams running one or two Postgres clusters.

What you actually pay for with a full APM:

  • An agent on every database host or sidecar. On managed services like RDS, Cloud SQL, Aurora, or Supabase, you don't get a host — you get an endpoint. Agent-based monitoring becomes a sidecar VM you run yourself, which is the opposite of why you went managed.
  • An observability stack you didn't ask for. APMs assume you also want distributed tracing, log ingestion, RUM, and synthetic checks. You'll pay per-host, per-GB, and per-million-spans whether you use them or not.
  • Pricing that punishes growth. Datadog DBM lists around $70 per database host per month before ingest, on top of your APM seat. For a team running three Postgres clusters across staging and prod, you're at $200-300/month for the DB layer alone before you've watched a single query.

Most teams don't need query-level introspection at that price. They need a confident answer to "is Postgres healthy?" with alerting on the failure modes that actually page humans. That's what StatusPulse's Database probe is built for: a real connection, a real query, a real assertion, on a schedule, with a status page in front of it. No agent. No host accounting. No span ingest.

StatusPulse's Database probe is on the Business tier — it's not free, but it's a flat add-on, not a per-host meter. If you've already outgrown free uptime monitors but you're not ready to feed the APM beast, this is the gap it fills.

The minimum-viable probe set

Three probes will catch the overwhelming majority of "the database is wrong" incidents. Each maps to one of the signals from above and runs an actual SQL query through a real Postgres connection.

Probe 1 — Connection + sentinel query

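The base case. Opens a fresh connection with your probe user, runs SELECT 1, captures the auth-handshake time and total round-trip separately. If the database is refusing connections during startup or crash recovery, has rotated its certificate, has revoked the user, or is just plain slow to accept connections, this probe goes Down or Degraded. A hot standby that is still in read-only recovery will answer SELECT 1 without complaint, so pair this probe with a pg_is_in_recovery() assertion on any endpoint that is supposed to accept writes.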

-- Probe query
SELECT 1;

On StatusPulse's Database probe this is the Level 2 auth + ping mode — no custom query needed, just host, port, database, credentials, and TLS. Treat the Degraded threshold as your sentinel-latency budget. 500 ms is a reasonable starting value for a same-region managed Postgres; tighten or loosen once you have a week of baseline data.
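
One case deserves its own check: a hot standby accepts read-only connections, so SELECT 1 can succeed on an endpoint that is still in recovery. A minimal sketch of a custom-query probe for endpoints that are meant to be writable, assuming the First col in [min..max] assertion is given an integer column (hence the cast):

```sql
-- Run against the endpoint your application writes to.
-- 0 = accepting writes, 1 = still in recovery (read-only).
SELECT pg_is_in_recovery()::int AS in_recovery;
```

Assert First col in 0..0. Do not point this at a replica, where 1 is the expected answer.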

Probe 2 — Replication lag

Point this at your replica, not the primary. The query asks the replica how far behind it is in seconds:

-- Run on the replica
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int
  AS lag_seconds;

Use the First col in [min..max] assertion with a range like 0..300. The replica is allowed to be up to five minutes behind; anything more flips the probe Down. If you have a stricter SLA — say, a read-after-write product where 30 seconds is already user-visible — narrow the range.
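
The seconds-based query only tells you how old the last replayed transaction is. To tell "the primary is quiet" apart from "the WAL receiver has died", a second replica-side probe can look at pg_stat_wal_receiver directly. This is a sketch that assumes streaming replication (not pure archive-based recovery), PostgreSQL 9.6 or later, and a probe user allowed to read the view's details:

```sql
-- Run on the replica. Returns one row while the WAL receiver is connected
-- and streaming, and zero rows when it has exited or is not streaming.
SELECT pid, status
  FROM pg_stat_wal_receiver
 WHERE status = 'streaming';
```

Paired with a Row count at least 1 assertion, this goes Down when the receiver process disappears, even if the lag number has not yet crossed its threshold.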

If you'd rather measure unreplayed WAL bytes instead of seconds (more useful when the primary is idle and timestamps don't move):

-- Run on the primary; bytes of WAL not yet replayed by the named replica
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::bigint AS lag_bytes
  FROM pg_stat_replication
 WHERE application_name = 'replica-1';
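
If the probe can only reach the replica, a replica-side variant measures WAL that has been received but not yet replayed. This is a sketch using the PostgreSQL 10+ function names; it cannot see WAL the replica never received, so treat it as a complement to the primary-side query rather than a replacement.

```sql
-- Run on the replica; bytes received from the primary but not yet replayed.
-- NULL if streaming replication has not started on this server.
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())::bigint
  AS replay_backlog_bytes;
```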

Probe 3 — Connection pool / activity

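Counts busy backends (active plus idle in transaction), which is the best proxy for "is the pool saturated" you can get without instrumenting your app. Idle sessions are excluded, so this measures work in flight rather than raw slot usage. Anything over 80% of max_connections is your warning.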

-- Active and idle-in-transaction backends, excluding the probe itself
SELECT count(*) AS active
  FROM pg_stat_activity
 WHERE state IN ('active', 'idle in transaction')
   AND pid <> pg_backend_pid();

Use First col in [min..max] with a ceiling at roughly 80% of your max_connections setting, or of your total pool size if that is smaller. With a limit of 200, set the range to 0..160. The probe flips Down the moment the count crosses the ceiling, and you have a clear signal to look at PgBouncer, the app, or whatever is leaking transactions before connection exhaustion takes the whole product down.
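
If you would rather not hard-code the ceiling, the same idea can be expressed as a percentage so the assertion survives a change to max_connections. A sketch, assuming PostgreSQL 10 or later for backend_type; unlike the query above it counts idle sessions too, because they also occupy connection slots:

```sql
-- Share of max_connections occupied by client sessions, idle ones included.
SELECT round(100.0 * count(*) / current_setting('max_connections')::int)::int
         AS pct_of_max_connections
  FROM pg_stat_activity
 WHERE backend_type = 'client backend';
```

Assert First col in 0..80. The figure is approximate: some slots are reserved for superusers, so the usable ceiling is slightly below 100%.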

That's it. Three probes cover connection health, replication health, and connection-pool health: the three failure modes that actually page people. The product spec for every assertion mode and the engine-version drift detection lives in the Database probe help section if you want the full reference.

Probes that catch each failure mode

A cheat sheet mapping each failure mode from the first section to a concrete probe, query, and threshold. Run these on a 1-5 minute cadence — Postgres can wait that long for you to notice.

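Failure mode | Query | Assertion
Auth broken / DB in recovery (writable endpoint) | SELECT pg_is_in_recovery()::int | First col in 0..0
Replication lag (run on the replica) | SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int | First col in 0..300
Connection pool saturation | SELECT count(*) FROM pg_stat_activity WHERE state IN ('active','idle in transaction') | First col in 0..160 (tune to 80% of max_connections)
Long-running transaction | SELECT pid FROM pg_stat_activity WHERE state <> 'idle' AND xact_start < now() - interval '10 minutes' | Row count exactly 0
Autovacuum behind on a hot table | SELECT EXTRACT(EPOCH FROM (now() - last_autovacuum))::int FROM pg_stat_user_tables WHERE relname = 'events' | First col in 0..3600
Standby fell off the primary (run on the primary) | SELECT pid FROM pg_stat_replication WHERE state = 'streaming' | Row count at least 1

Two notes on the queries. The first row asserts on pg_is_in_recovery() rather than a bare SELECT 1, because a hot standby answers SELECT 1 even while it is read-only. The two row-count rows select pid rather than count(*), because count(*) always returns exactly one row and would make a row-count assertion meaningless.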

Each row drops into a StatusPulse Database probe verbatim. The assertion modes are first-class on the probe form — pick the mode from a dropdown, paste the expected value, save. Internally the probe runs the SQL through a real Postgres connection on a schedule, evaluates the assertion, and reports Up / Degraded / Down with a sanitised error if the query itself fails.
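
The autovacuum row hard-codes a table named events, which is only a stand-in. To decide which tables deserve that probe in your own schema, a query along these lines lists the likeliest candidates. It is a sketch over the standard pg_stat_user_tables view, and the cutoff of ten rows is arbitrary:

```sql
-- Ten tables with the most dead tuples, with their last vacuum/analyze times.
SELECT schemaname,
       relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       last_autoanalyze
  FROM pg_stat_user_tables
 ORDER BY n_dead_tup DESC
 LIMIT 10;
```

A table that ranks high here and shows an old or NULL last_autovacuum is a reasonable first target.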

Worth flagging: every query above runs as the probe user. Create a dedicated statuspulse_probe account with the absolute minimum grants — usually pg_monitor role membership plus SELECT on whatever business tables you assert against — and rotate the password independently from your app's credentials.
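
A minimal sketch of that account, assuming PostgreSQL 10 or later (where the pg_monitor role exists). The password, the database name appdb, and the limits are placeholders to adapt, not recommended values:

```sql
-- Role name matches the text above; the password is a placeholder.
CREATE ROLE statuspulse_probe
  LOGIN
  PASSWORD 'change-me'
  CONNECTION LIMIT 3;

-- Read access to the statistics views the probes query.
GRANT pg_monitor TO statuspulse_probe;

-- Hypothetical database name; only needed if CONNECT was revoked from PUBLIC.
GRANT CONNECT ON DATABASE appdb TO statuspulse_probe;

-- Keep a stuck probe from becoming its own long-running transaction.
ALTER ROLE statuspulse_probe SET statement_timeout = '5s';
```

Add SELECT grants on individual business tables only for the probes that assert against them.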

What to alert on, what to ignore

Six probes will generate dozens of state transitions a week if you page on every flap. The fastest way to make on-call hate the database dashboard is to wire every Degraded signal straight to PagerDuty.

A defensible split, after running this setup across several production Postgres fleets:

  • Page (wake someone up): connection probe Down for 2 consecutive checks; replication lag Down (over 5 minutes behind) for 2 consecutive checks; long-running transaction count above zero for 15 minutes straight. These are real, customer-affecting problems that only get worse with time.
  • Notify (Slack or email, no page): connection pool above 80% for 10 minutes; sentinel-query latency over 1 second for 15 minutes; autovacuum recency breaching budget on a hot table. These need a human to look, but they don't need a human at 03:00.
  • Dashboard only: engine version drift, connection count between 60% and 80%, single-check Degraded transitions. Put them on the status page and the internal dashboard. Don't alert on them.
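
When the long-running-transaction page does fire, the count alone does not say what to look at. A triage query such as the following sketch (PostgreSQL 10+ for backend_type; the 80-character truncation is arbitrary) shows who is holding the transaction open:

```sql
-- Client transactions open for more than 10 minutes, oldest first.
SELECT pid,
       usename,
       application_name,
       state,
       now() - xact_start   AS xact_age,
       left(query, 80)      AS current_or_last_query
  FROM pg_stat_activity
 WHERE backend_type = 'client backend'
   AND xact_start < now() - interval '10 minutes'
 ORDER BY xact_start;
```

For sessions in the idle in transaction state, the query column shows the last statement that ran, which is usually enough to find the code path that forgot to commit.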

StatusPulse handles the consecutive-checks heuristic and the intermittent-failure suppression at the probe level, so you can wire the Slack channel and the on-call PagerDuty webhook to different incident severities without writing routing logic yourself. Service-level health (gRPC, HTTP, WebSocket) follows the same pattern — see the companion gRPC Health Check guide for the application-tier equivalent of this article.
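
If you are building the routing yourself instead, the "2 consecutive checks" rule is simple to express over a history of check results. The sketch below is not how StatusPulse implements it; probe_results, its columns, and the sample rows are hypothetical, inlined as a CTE so the query runs on its own.

```sql
-- Hypothetical check history; in practice a table your scheduler writes to.
WITH probe_results (probe_name, checked_at, status) AS (
  VALUES
    ('pg-connection',  timestamptz '2026-05-22 03:10+00', 'Up'),
    ('pg-connection',  timestamptz '2026-05-22 03:11+00', 'Down'),
    ('pg-connection',  timestamptz '2026-05-22 03:12+00', 'Down'),
    ('pg-replica-lag', timestamptz '2026-05-22 03:11+00', 'Down'),
    ('pg-replica-lag', timestamptz '2026-05-22 03:12+00', 'Up')
),
ranked AS (
  -- recency = 1 is the latest check for each probe.
  SELECT probe_name,
         status,
         row_number() OVER (PARTITION BY probe_name
                            ORDER BY checked_at DESC) AS recency
    FROM probe_results
)
-- Probes whose two most recent checks were both Down.
SELECT probe_name
  FROM ranked
 WHERE recency <= 2
 GROUP BY probe_name
HAVING count(*) = 2
   AND bool_and(status = 'Down');
```

With the sample rows, only pg-connection qualifies: pg-replica-lag recovered on its latest check, so a single Down does not page anyone.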

Wrap-up

Three things to take away:

  1. "The app is up" and "Postgres is healthy" are not the same statement. A TCP probe on 5432 can be green while replication is two hours behind, the pool is saturated, or the primary is in recovery.
  2. Six SQL queries against the system catalogs answer the practical questions. You don't need an APM agent or a per-host-priced DBM tier to run them on a schedule.
  3. Alert on the failures that page humans (connection, lag, long transactions). Notify on the rest. Dashboard the noise. The fastest way to lose trust in your monitoring is to alert on everything.

If you'd rather not build the cron-plus-SQL-plus-alert-routing yourself, that's exactly what StatusPulse's Database probe ships. Business tier, no agent, no per-host fee, runs in US or EU regions.
