Guide · Database monitoring
How to monitor a SQL Server database without an APM
Your HTTP healthcheck is green, your TCP probe on 1433 is green, and SQL Server is quietly stuck in a 40-minute blocking chain with tempdb at 98%. Here's the minimum-viable SQL Server health monitoring setup that catches the silent failures — without paying for Datadog DBM or installing a SQL Server Agent extension.
"The app is up" doesn't mean SQL Server is healthy
Most monitoring setups stop at one of two checks: an HTTP probe on
/healthz that returns 200, or a TCP probe on port
1433 that confirms the SQL endpoint is accepting connections. Both
can be green for hours while SQL Server is in a state nobody would
call healthy.
The failure modes a TCP or HTTP probe misses, in SQL Server production, every week:
- Availability Group replication is broken. The primary is fine, the secondary accepts connections, the log send queue blew past a gigabyte and the synchronisation_health column flipped to not_healthy at 03:14. Read-only workloads pointed at the listener are now reading minutes-old data — silently.
-
A blocking chain is growing. One session took a
row lock, didn't commit, fifty more sessions are now stacked
behind it waiting on
LCK_M_X. The database itself is idle, the TCP probe is delighted, and your app is timing out at exactly the rate new requests arrive. -
tempdb is full. A bad query spilled a sort to
tempdb, autogrow hit the disk ceiling, and now every query that
needs a worktable fails with
Could not allocate space for object 'dbo.SORT temporary
run storage'. Connection-only probes pass; anything more
complex than
SELECT 1craters. -
The database is SUSPECT or RECOVERY_PENDING.
A storage hiccup or a botched restore left a database in a
state where you can connect to the server but the database
itself refuses queries.
sys.databases.state_desctells you in one column;SELECT 1againstmasterwon't. - Statistics are stale and the plan flipped. A missing index, a parameter-sniff regression after a stats update, or a planner choosing a parallel scan on what used to be a seek — checkout suddenly takes 8 seconds. The HTTP probe still loads the homepage in 80 ms, because the homepage doesn't hit that query.
None of these show up as "SQL Server is down." All of them show up as "the product is broken." If you wait for your customers to tell you on a support ticket, you'll find out about every single one of them.
What "healthy SQL Server" actually looks like
Forget the dashboard with 80 metrics for a moment. The signals that matter for "is SQL Server healthy right now" boil down to six numbers:
- Connection success. Can a real client open a fresh connection, authenticate, and run a query within a few seconds? This subsumes TCP, TLS, login auth, and the database is not in recovery / suspect / restoring cases.
-
Latency on a sentinel query. A trivial
SELECT 1end-to-end gives you the floor latency. When it doubles, something is wrong — saturated worker threads, a tempdb that's spilled, a wait chain onSOS_SCHEDULER_YIELD. -
AG replication lag. Either the size of the log
send queue on the primary, or seconds behind on the secondary.
Either is fine; both catastrophically diverging is what you
care about. (If you're not on AlwaysOn, skip this; if you're on
log shipping, the equivalent is the last
restore logtimestamp on the standby.) - Active session count. How many sessions are actively running as opposed to sleeping. Crossing some fraction of your worker-thread budget (or the platform default — 576 for SQL Server on most modern boxes) is your warning shot.
-
Blocking chain depth. Anything with
blocking_session_id> 0 for more than a few minutes is suspicious. A blocking chain three or four deep that doesn't clear is the most common "the database is fine, the app is broken" pattern in SQL Server. -
Database state. Every database you care about
should be
ONLINE. Anything inSUSPECT,RECOVERY_PENDING,EMERGENCY, orOFFLINEmeans a human needs to look — fast.
Notice that all six are answered by queries against SQL Server's
own DMVs and system catalogues. You don't need an agent. You don't
need extended events. You need a thing that runs a
SELECT on a schedule and tells you when the answer
changes.
Why a generic APM is overkill for most teams
The textbook answer to "monitor SQL Server" is to buy Datadog or New Relic, install the agent, enable Database Monitoring (DBM), and let it ingest query samples, execution plans, and wait events. It's a great product. It's also wildly oversized for most Series A-B teams running one or two SQL Server instances.
What you actually pay for with a full APM on SQL Server:
- An agent on every database host or sidecar. On Azure SQL Database, Azure SQL Managed Instance, RDS for SQL Server, or Cloud SQL for SQL Server, you don't get a host — you get an endpoint. Agent-based monitoring becomes a sidecar VM you run yourself, which is the opposite of why you went managed.
- An observability stack you didn't ask for. APMs assume you also want distributed tracing, log ingestion, RUM, and synthetic checks. You'll pay per-host, per-GB, and per-million-spans whether you use them or not.
- Per-database-host pricing that punishes growth. Datadog DBM lists around $70 per database host per month before ingest, on top of your APM seat. For a team running staging + prod + a read replica, you're at $200+/month for the DB layer alone before you've watched a single query.
- Windows Authentication friction. If your SQL Server is on-prem with Windows Auth, the APM agent needs a Windows service account with the right delegation, and you end up debugging Kerberos every time someone rotates a password.
Most teams don't need query-level introspection at that price. They need a confident answer to "is SQL Server healthy?" with alerting on the failure modes that actually page humans. That's what StatusPulse's Database probe is built for: a real TDS connection, a real query, a real assertion, on a schedule, with a status page in front of it. No agent. No host accounting. No span ingest.
StatusPulse's Database probe is on the Business tier — it's not free, but it's a flat add-on, not a per-host meter. If you've already outgrown free uptime monitors but you're not ready to feed the APM beast, this is the gap it fills.
The minimum-viable probe set
Three probes will catch the overwhelming majority of "the database is wrong" incidents. Each maps to one of the signals from above and runs an actual T-SQL query through a real TDS connection.
Probe 1 — Connection + sentinel query
The base case. Opens a fresh connection with your probe login,
runs SELECT 1, captures the auth-handshake time and
total round-trip separately. If the server is unreachable, the
login is locked out, the TLS cert has rotated, or the instance
is just plain slow to accept connections, this probe goes Down or
Degraded.
-- Probe query
SELECT 1;
On StatusPulse's Database probe this is the Level 2 auth + ping mode — no custom query needed, just host, port (usually 1433), database, credentials, and TLS. Treat the Degraded threshold as your sentinel-latency budget. 500 ms is a reasonable starting value for a same-region managed SQL Server (Azure SQL Database or RDS); tighten or loosen once you have a week of baseline data.
Probe 2 — AlwaysOn replication lag
Point this at any node of the Availability Group — the DMV joins give you the same answer regardless. The query asks how much log data is stacked up waiting to ship to the secondary, in kilobytes:
-- Run on any AG replica
SELECT TOP 1 drs.log_send_queue_size AS lag_kb
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
ON drs.replica_id = ar.replica_id
WHERE ar.replica_server_name <> @SERVERNAME
ORDER BY drs.log_send_queue_size DESC;
Use the First col in [min..max] assertion with a range
like 0..51200 (50 MB of unshipped log). The secondary
is allowed to be a little behind; anything more flips the probe
Down. If your AG has a strict RPO — say, a financial system where
10 seconds of lag is already a contract breach — narrow the range.
If you'd rather measure seconds behind instead of log bytes (more intuitive for read-only secondary traffic):
-- Estimated seconds behind on the secondary
SELECT TOP 1 drs.secondary_lag_seconds AS lag_seconds
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
ON drs.replica_id = ar.replica_id
WHERE ar.replica_server_name <> @SERVERNAME
AND drs.secondary_lag_seconds IS NOT NULL
ORDER BY drs.secondary_lag_seconds DESC;
Not on AlwaysOn? If you're on log shipping, the equivalent probe asks how recent the last applied log restore is on the standby:
-- Run on the log-shipped secondary, against msdb
SELECT DATEDIFF(second, MAX(restore_date), GETUTCDATE()) AS seconds_since_restore
FROM msdb.dbo.restorehistory
WHERE restore_type = 'L';
Probe 3 — Active sessions and blocking
Counts active sessions, which is the best proxy for "is the server under load" you can get without instrumenting your app. And — more importantly — counts how many of those sessions are blocked by someone else.
-- Active and blocked sessions, excluding system + the probe itself
SELECT count(*) AS active
FROM sys.dm_exec_sessions s
WHERE s.is_user_process = 1
AND s.status = 'running'
AND s.session_id <> @SPID;
Use First col in [min..max] with a ceiling well below
your worker-thread budget. SQL Server's max_worker_threads
default is 0 (auto-calculated) which usually lands at 576 — set
the range to 0..400 as a safe ceiling. The probe
flips Down the moment the count crosses, and you have a clear
signal to look at blocking, parallelism, or whatever is starving
the scheduler — before worker-thread exhaustion takes the whole
instance down.
That's it. Three probes covers connection health, replication health, and session-pressure health — the three failure modes that actually page people. The product spec for every assertion mode and the engine-version drift detection lives in the Database probe help section if you want the full reference.
Probes that catch each failure mode
A cheat sheet mapping each failure mode from the first section to a concrete probe, query, and threshold. Run these on a 1-5 minute cadence — SQL Server can wait that long for you to notice.
| Failure mode | Query | Assertion |
|---|---|---|
| Login broken / instance unreachable | SELECT 1 |
None (query must succeed) |
| Database not ONLINE | SELECT count(*) FROM sys.databases WHERE name = N'YourDb' AND state_desc = 'ONLINE' |
Row count at least 1 |
| AG log send queue too large | SELECT TOP 1 log_send_queue_size FROM sys.dm_hadr_database_replica_states ORDER BY log_send_queue_size DESC |
First col in 0..51200 |
| AG sync health degraded | SELECT count(*) FROM sys.dm_hadr_database_replica_states WHERE synchronization_health_desc <> 'HEALTHY' |
Row count exactly 0 |
| Blocking chain not clearing | SELECT count(*) FROM sys.dm_exec_requests WHERE blocking_session_id > 0 AND DATEDIFF(second, start_time, GETDATE()) > 60 |
Row count exactly 0 |
| Long-running request | SELECT count(*) FROM sys.dm_exec_requests WHERE session_id > 50 AND DATEDIFF(minute, start_time, GETDATE()) > 10 |
Row count exactly 0 |
| tempdb close to full | SELECT 100.0 * SUM(unallocated_extent_page_count) / SUM(total_page_count) AS pct_free FROM sys.dm_db_file_space_usage |
First col in 10..100 (Down when below 10% free) |
| Last full backup too old | SELECT DATEDIFF(hour, MAX(backup_finish_date), GETDATE()) FROM msdb.dbo.backupset WHERE database_name = N'YourDb' AND type = 'D' |
First col in 0..36 |
| SQL Server Agent job failed | SELECT count(*) FROM msdb.dbo.sysjobhistory h JOIN msdb.dbo.sysjobs j ON h.job_id = j.job_id WHERE h.run_status = 0 AND h.step_id = 0 AND msdb.dbo.agent_datetime(h.run_date, h.run_time) > DATEADD(hour, -1, GETDATE()) |
Row count exactly 0 |
Each row drops into a StatusPulse Database probe verbatim. The assertion modes are first-class on the probe form — pick the mode from a dropdown, paste the expected value, save. Internally the probe runs the SQL through a real TDS connection on a schedule, evaluates the assertion, and reports Up / Degraded / Down with a sanitised error if the query itself fails.
A note on the backup and Agent-job probes: both query
msdb, which doesn't exist on Azure SQL Database
(single-database SKU). Use the
sys.dm_database_backups DMV on Azure SQL DB
instead, or just skip these probes if your backups are managed
by the platform.
A probe login with the right grants
Don't reuse your app login for the probe. Create a dedicated SQL login or contained user with the absolute minimum grants — every query in this guide works with the following set:
-- On SQL Server (Windows / Linux / RDS / Managed Instance)
CREATE LOGIN statuspulse_probe WITH PASSWORD = N'<strong-random>';
USE master;
GRANT VIEW SERVER STATE TO statuspulse_probe; -- DMVs at server scope
GRANT VIEW SERVER PERFORMANCE STATE TO statuspulse_probe; -- SQL 2022+ AG DMVs
GRANT VIEW ANY DATABASE TO statuspulse_probe; -- sys.databases visibility
-- In each database you probe, including msdb if you query backups
USE YourDb;
CREATE USER statuspulse_probe FOR LOGIN statuspulse_probe;
GRANT VIEW DATABASE STATE TO statuspulse_probe; -- db-scope DMVs
-- Optional: SELECT on specific tables if you assert against business data
-- GRANT SELECT ON dbo.orders TO statuspulse_probe;
On Azure SQL Database (single-DB SKU) the
GRANT VIEW SERVER STATE equivalent is to add the
user to ##MS_ServerStateReader## in
master. The script above won't apply verbatim;
adapt to the platform.
Rotate the probe password independently from the app's credentials, and store it encrypted at rest — StatusPulse's Database probe wraps it in AES-GCM before persisting, but the principle holds wherever you run probes.
What to alert on, what to ignore
Nine probes will generate dozens of state transitions a week if you page on every flap. The fastest way to make on-call hate the SQL Server dashboard is to wire every Degraded signal straight to PagerDuty.
A defensible split, after running this setup across several production SQL Server fleets:
-
Page (wake someone up):
connection probe Down for 2 consecutive checks; database
state not
ONLINE; AG sync health degraded for 2 consecutive checks; tempdb below 10% free; blocking chain older than 60 seconds for 5 consecutive minutes. These are real, customer-affecting problems that only get worse with time. - Notify (Slack or email, no page): AG log send queue above the threshold for 10 minutes; active session count above 80% of worker-thread budget for 10 minutes; long-running request count above zero for 15 minutes; last full backup older than your RPO budget. These need a human to look, but they don't need a human at 03:00.
- Dashboard only: engine version drift, Agent-job failures from the last hour (yes, even these — they self-clear), single-check Degraded transitions, sentinel latency between baseline and 2× baseline. Put them on the status page and the internal dashboard. Don't alert on them.
StatusPulse handles the consecutive-checks heuristic and the intermittent-failure suppression at the probe level, so you can wire the Slack channel and the on-call PagerDuty webhook to different incident severities without writing routing logic yourself. The Postgres companion to this article runs through the same pattern from the Postgres side; the grammar of the probes is identical, only the queries change.
Wrap-up
Three things to take away:
- "The app is up" and "SQL Server is healthy" are not the same statement. A TCP probe on 1433 can be green while tempdb is full, an AG secondary is hours behind, or a blocking chain is taking the application down.
- Six T-SQL queries against the DMVs answer the practical questions. You don't need an APM agent or a per-host-priced DBM tier to run them on a schedule.
- Alert on the failures that page humans (connection, database state, AG health, blocking, tempdb). Notify on the rest. Dashboard the noise. The fastest way to lose trust in your monitoring is to alert on everything.
If you'd rather not build the cron-plus-T-SQL-plus-alert-routing yourself, that's exactly what StatusPulse's Database probe ships. Business tier, no agent, no per-host fee, runs in US or EU regions.
Try StatusPulse's Database probe
5 probes, 1 status page, forever. No credit card. US or EU host — you choose.