The gRPC Health Check Protocol, explained
A working gRPC health check is the single line of defence between a
green dashboard and a silent outage. This is what
grpc.health.v1.Health actually does, why HTTP
/healthz isn't a substitute, and the four ways teams
wire it in production — kubelet, Envoy, cloud load balancers, and
external probes.
Why HTTP /healthz lies about gRPC
Most gRPC services we audit have an HTTP /healthz
endpoint bolted on the side, usually because the Kubernetes probe
templates everyone copies assume HTTP, and because curl is easier
than grpcurl. It feels safe.
It is not. An HTTP sidecar can return a cheerful
200 OK while the actual gRPC server next to it is
broken in any of the following ways:
- The gRPC port isn't accepting connections. The HTTP listener is on :8080, the gRPC listener is on :9090, and only one of them crashed. Plain TCP probes against :9090 will at least catch the refused connection, but won't tell you whether the HTTP/2 stack came up correctly.
- The HTTP/2 stack is up but the gRPC server is deadlocked. A pool of worker goroutines is starved, or the .NET ThreadPool is saturated, or a downstream blocking call is holding every server-side handler. The HTTP handler thread keeps responding; the gRPC dispatcher is unreachable.
- One service is NOT_SERVING inside a multi-service binary. Your api-gateway process hosts users, billing, and shipping. billing failed to load its database driver and is publishing NOT_SERVING to the Health service. HTTP /healthz has no idea.
- mTLS is failing for everyone except the localhost /healthz probe. The Istio sidecar is rejecting every external client because a CA rotation broke trust. The kubelet probe loops to 127.0.0.1:8080 and bypasses the mesh entirely.
The pattern in all four cases is the same: the HTTP probe is checking the wrong protocol, and often the wrong code path, from the wrong network position. A real gRPC health check has to speak HTTP/2, has to invoke an actual gRPC method, and — ideally — has to traverse the same network path that real callers traverse. That's what the gRPC Health Checking Protocol is for.
What grpc.health.v1.Health is
The gRPC project ships a canonical service definition at
grpc/grpc#doc/health-checking.md
called grpc.health.v1.Health. Every mainstream gRPC
implementation (grpc-go, grpc-java, grpc-dotnet, grpc-python,
grpc-rust via tonic) ships a pre-built implementation you can
register in one line. That's the surface you want every health
probe — internal or external — to talk to.
The service exposes two RPCs:
- Check: a unary RPC. Ask "what is the health right now?", get an immediate response. This is what kubelet calls, what every service mesh's built-in health filter calls, and what most external monitors should call. It is the simpler RPC and the one to default to.
- Watch: a server-streaming RPC. The server pushes the current state as the first message on the stream, then pushes every subsequent state change. Useful for long-lived clients that want to react instantaneously to a flip from SERVING to NOT_SERVING without polling, and for the rare servers that only implement Watch and return UNIMPLEMENTED on Check.
The response is a single enum field, status, with four legal values:
- SERVING: the server (or service) is healthy. Treat as Up.
- NOT_SERVING: the server (or service) is intentionally refusing traffic. Treat as Down.
- UNKNOWN: the server hasn't decided. In practice you should also treat this as Down for alerting; "I don't know" is not a green dashboard.
- SERVICE_UNKNOWN: only legal on a Watch response, returned when the operator asks about a service the server doesn't know about. Treat as Down: your config is wrong, and the operator wants to know.
One detail that catches people out: on a unary Check
call against an unknown service, the server doesn't return
SERVICE_UNKNOWN — it returns the gRPC status code
NOT_FOUND at the RPC layer. Same meaning, different
mechanism. A correct client treats both as "this service isn't
exposed".
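The status handling described above can be sketched as one small pure function. This is a minimal illustration under stated assumptions, not any library's API: `ProbeResult` and `classify` are hypothetical names, the enum value arrives as its name in a plain string, and the RPC-layer status code is likewise passed in as a string such as "NOT_FOUND".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    up: bool
    reason: str


def classify(serving_status: Optional[str], rpc_code: str = "OK") -> ProbeResult:
    """Map a health response (or an RPC error) to Up/Down plus a reason.

    serving_status: name of the ServingStatus value, or None if the RPC
    failed before a response arrived (assumed input shape).
    rpc_code: name of the gRPC status code for the call (assumed input shape).
    """
    # A unary Check against an unknown service fails at the RPC layer.
    if rpc_code == "NOT_FOUND":
        return ProbeResult(False, "service not exposed (NOT_FOUND on Check)")
    if rpc_code != "OK":
        return ProbeResult(False, f"RPC failed with {rpc_code}")
    if serving_status == "SERVING":
        return ProbeResult(True, "SERVING")
    if serving_status == "SERVICE_UNKNOWN":
        # Watch reports the same condition in-band instead.
        return ProbeResult(False, "service not exposed (SERVICE_UNKNOWN on Watch)")
    # NOT_SERVING, UNKNOWN, or anything unexpected is treated as Down.
    return ProbeResult(False, str(serving_status))
```

Keeping the reason string separate from the Up/Down flag lets a dashboard show one red light while the alert text still says which of the two mechanisms fired.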
The .proto contract
Here is the canonical service definition from the gRPC project,
trimmed to the service, messages, and enum (file-level options
omitted, comments shortened). Every server implementation generates
code from this .proto:
    syntax = "proto3";

    package grpc.health.v1;

    service Health {
      rpc Check(HealthCheckRequest) returns (HealthCheckResponse);
      rpc Watch(HealthCheckRequest) returns (stream HealthCheckResponse);
    }

    message HealthCheckRequest {
      string service = 1;  // Empty = overall server health
    }

    message HealthCheckResponse {
      enum ServingStatus {
        UNKNOWN = 0;
        SERVING = 1;
        NOT_SERVING = 2;
        SERVICE_UNKNOWN = 3;  // Only valid in Watch responses
      }
      ServingStatus status = 1;
    }
The service field is the lever every operator
misuses, so it is worth being explicit about:
- Empty string means "what is the overall server health?" The answer is a single combined status the server is free to compute however it wants. This is what kubelet readiness probes hit. It is the right default.
- A fully-qualified protobuf service name like myapp.users.v1.UserService means "what is the health of this specific service inside the server?" The answer comes from the server's internal status map, which each service registers itself in. This is how you alert on billing independently of shipping in a multi-service binary.
Fully-qualified means package.Service, not just
Service. UserService alone will not match
myapp.users.v1.UserService — the service-name string
is matched verbatim by the server, case-sensitive.
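To make the verbatim matching concrete, here is a toy model of the server-side status map. It is an illustration of the behaviour described above, not the code of any real gRPC server; `lookup_status` and `status_map` are hypothetical names, and the "NOT_FOUND" return value stands in for the RPC-layer error a unary Check would produce.

```python
def lookup_status(status_map: dict, service: str) -> str:
    """Return the stored status name, or "NOT_FOUND" for an unregistered name."""
    # Exact, case-sensitive key match: no prefix, suffix, or case folding.
    return status_map.get(service, "NOT_FOUND")


status_map = {
    "": "SERVING",  # empty string = overall server health
    "myapp.users.v1.UserService": "SERVING",
}

short_name = lookup_status(status_map, "UserService")
wrong_case = lookup_status(status_map, "myapp.users.v1.userservice")
full_name = lookup_status(status_map, "myapp.users.v1.UserService")
```

Only `full_name` resolves to a real status here; the short and wrong-case names both fall through to the unknown-service path, which is why a probe configured with the bare service name reports Down even when the service is healthy.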
The four ways to wire a real check
Once your server implements grpc.health.v1.Health,
you have four places to actually call it from. Most production
stacks use two or three of them in combination — kubelet for
in-cluster, an external monitor for outside-in, and Envoy or a
cloud LB to gate load balancing.
a) grpc_health_probe in Kubernetes liveness / readiness
The grpc-ecosystem maintains a small CLI binary called
grpc_health_probe
specifically so kubelet can invoke it as an exec
probe. Bake it into your container image, then wire the probes
as:
    livenessProbe:
      exec:
        command:
          - /bin/grpc_health_probe
          - -addr=localhost:9090
      periodSeconds: 10
    readinessProbe:
      exec:
        command:
          - /bin/grpc_health_probe
          - -addr=localhost:9090
          - -service=myapp.users.v1.UserService
      periodSeconds: 5
Since Kubernetes 1.24 (where gRPC probes became beta and enabled
by default; they reached GA in 1.27) you can also use the built-in
grpc: probe type, which removes the binary
dependency:
    livenessProbe:
      grpc:
        port: 9090
      periodSeconds: 10
    readinessProbe:
      grpc:
        port: 9090
        service: myapp.users.v1.UserService
That covers Kubernetes gRPC liveness for any
modern cluster: no sidecar, no shell-out, kubelet talks
grpc.health.v1.Health directly.
b) Envoy / Istio gRPC health check filter
Envoy can be configured to call Health/Check against
each upstream host as part of active health checking (which is
separate from passive outlier detection), and remove
unhealthy pods from the routing pool without waiting for kubelet
to mark them NotReady:
    health_checks:
      - timeout: 1s
        interval: 10s
        unhealthy_threshold: 2
        healthy_threshold: 1
        grpc_health_check:
          service_name: "myapp.users.v1.UserService"
          authority: "users.myapp.svc.cluster.local"
This is the Envoy gRPC health check that ships
inside every Istio install. Linkerd does the equivalent via its
Server + HealthCheckPolicy CRDs. The
point is identical: don't route requests to a pod whose Health
service reports NOT_SERVING, regardless of what
kubelet thinks.
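The unhealthy_threshold and healthy_threshold values in the config above are counts of consecutive results. A simplified model of that idea, assuming a host that starts healthy, looks like the sketch below. `HealthTracker` is a hypothetical name, and Envoy's real implementation has additional rules that are not modelled here.

```python
class HealthTracker:
    """Tracks host health from consecutive check results (illustrative model)."""

    def __init__(self, unhealthy_threshold: int = 2, healthy_threshold: int = 1):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True  # assumption: hosts start out healthy
        self._streak = 0     # consecutive results that disagree with current state

    def observe(self, check_passed: bool) -> bool:
        """Record one check result and return the resulting health state."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with the current state: reset
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if check_passed else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = check_passed  # enough consecutive results: flip
            self._streak = 0
        return self.healthy


tracker = HealthTracker(unhealthy_threshold=2, healthy_threshold=1)
states = [tracker.observe(r) for r in (False, False, True)]
```

With these thresholds a single failed check does not eject the host, two in a row do, and one passing check brings it back. That asymmetry is a routing trade-off: slow to eject, quick to readmit.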
c) Cloud load balancer gRPC health checks
GCP's external HTTPS load balancer supports gRPC health checks
natively (set the protocol field to GRPC
on the health-check resource; optionally set
grpcServiceName for per-service checks). On AWS,
gRPC-aware health checks live on Application Load Balancer target
groups that use the gRPC protocol version; Network Load Balancer
health checks are limited to TCP, HTTP, and HTTPS. Azure Application
Gateway currently does HTTP-only health probes against gRPC backends,
a known gap.
Use the LB health check when you need traffic-shaping decisions (drain a region, fail over a zone) tied to gRPC health. Don't use it as your only signal — LB health checks are tuned for fast routing, not for alerting. They typically have short timeouts and small histories, so they're noisy as an on-call paging source.
d) External monitoring with mTLS
The three above are all in-cluster or in-cloud. They tell you
whether the pod is healthy from the position of something that's
already inside your trust boundary. None of them tells you what
an external customer experiences. For that you need a probe that
sits outside the cluster, presents an mTLS client cert if your
mesh requires it, and calls Health/Check the same
way a real client would.
That's where a gRPC health monitoring SaaS like
StatusPulse's gRPC Health probe fits.
It speaks grpc.health.v1.Health over HTTP/2 (HTTP/3
opt-in), supports per-service Check, Watch-first-frame, mTLS with
an encrypted client certificate, a custom CA bundle, optional
Bearer / API-key metadata, and the standard SSRF protections you
expect from a public-internet monitor. It is available on the
Pro tier and above; the Free and Starter plans
do not include gRPC Health.
Start free if
you want to try the rest of the platform first; upgrade to Pro
when the gRPC probe is the one you need.
Datadog has a gRPC synthetic check too — but it ships as part of the full Synthetics suite, priced per "test" at Datadog volumes; see the StatusPulse vs Datadog Synthetics breakdown if the cost-per-probe math matters for your stack.
mTLS, per-service Check, streaming Watch
These are the three advanced moves that separate a real production gRPC health check from a smoke test.
mTLS for zero-trust meshes
Istio, Linkerd, and Consul terminate every internal RPC at an
mTLS proxy and require clients to present a certificate. A
health probe that only does TLS server-cert validation will be
rejected at the handshake — you'll see a
tls: client didn't provide a certificate or
Server rejected client certificate error before any
gRPC method gets invoked. Provision a dedicated probe identity
in your mesh (a SPIFFE ID, a service account, whatever the mesh
calls it), grant it Check / Watch permissions on the target
service, and feed the resulting cert + key into the probe as a
single PEM blob:
-----BEGIN CERTIFICATE-----
MIIB...
-----END CERTIFICATE-----
-----BEGIN PRIVATE KEY-----
MIIE...
-----END PRIVATE KEY-----
StatusPulse stores that blob AES-GCM-encrypted at rest with a Key Vault master key; the form never re-echoes the value back. Pair it with a plain-text custom CA bundle when the server's cert chains to a private root (the typical mesh setup).
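Many TLS and gRPC client APIs take the certificate chain and the private key as two separate inputs, so a probe that accepts a single PEM blob has to split it first. The helper below is a hypothetical sketch of that step using only the standard library; it is not StatusPulse's implementation, and it only separates the blocks by label without validating their contents.

```python
import re
from typing import Tuple

# Matches one PEM block; the END label must repeat the BEGIN label.
_PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----\r?\n.*?\r?\n-----END \1-----",
    re.DOTALL,
)


def split_pem_blob(blob: str) -> Tuple[str, str]:
    """Split a combined PEM blob into (certificate_chain, private_key)."""
    certs, keys = [], []
    for match in _PEM_BLOCK.finditer(blob):
        label = match.group(1)
        if label == "CERTIFICATE":
            certs.append(match.group(0))
        elif label.endswith("PRIVATE KEY"):
            keys.append(match.group(0))
    if not certs:
        raise ValueError("no CERTIFICATE block found")
    if len(keys) != 1:
        raise ValueError(f"expected exactly one private key, found {len(keys)}")
    # Certificates keep their original order (leaf first, then intermediates).
    return "\n".join(certs) + "\n", keys[0] + "\n"
```

Failing loudly on a missing certificate or on more than one key catches the most common paste errors before any TLS handshake is attempted.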
Per-service Check
If a single gRPC server hosts more than one service, default to
one probe per service rather than a single overall-server probe.
The core reason: when the server rolls every registered service
into its overall status, an overall-server probe goes Down as soon
as any one service goes NOT_SERVING,
which floods your on-call with noise about shipping
being down at 3am when the real story is the
billing Redis dependency. Per-service probes give
each team a dedicated signal.
Pseudocode for what a per-service Check looks like on the wire:
request = HealthCheckRequest { service: "myapp.users.v1.UserService" }
response = Health.Check(request)
assert response.status == SERVING
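To see why an overall-server probe loses information, consider one possible roll-up rule a server might use for the empty-string status. This is an assumed rule for illustration only (servers are free to compute the overall status differently), and `overall_status` and `failing_services` are hypothetical names.

```python
def overall_status(per_service: dict) -> str:
    """One possible roll-up: SERVING only if every service is SERVING."""
    if all(status == "SERVING" for status in per_service.values()):
        return "SERVING"
    return "NOT_SERVING"


def failing_services(per_service: dict) -> list:
    """What per-service probes would report: the names that are not SERVING."""
    return sorted(name for name, status in per_service.items() if status != "SERVING")


statuses = {
    "myapp.users.v1.UserService": "SERVING",
    "myapp.billing.v1.BillingService": "NOT_SERVING",
    "myapp.shipping.v1.ShippingService": "SERVING",
}
rolled_up = overall_status(statuses)
culprits = failing_services(statuses)
```

The rolled-up value says only that something is wrong; the per-service view names the billing service, which is the detail alert routing needs. The billing and shipping service names here are made up to match the running example.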
Streaming Watch — first frame only
Watch is server-streaming: the server pushes the
current state immediately as the first frame, then every
subsequent state change for as long as the stream stays open. A
periodic health probe doesn't need the stream to stay
open — reading the first frame and cancelling gives you the same
"what is the state right now?" answer as Check,
and it works against the rare Watch-only servers that
return UNIMPLEMENTED on Check.
If you need push-style notification on flips (alert
within seconds rather than within one polling interval), you
need a long-lived Watch client — a different shape
from a periodic probe, with stream re-open, backoff, and
dead-stream detection. That is out of scope for most external
monitors today, StatusPulse included; periodic polling fits the
SLA shape every other probe type uses.
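The first-frame pattern can be sketched independently of any gRPC library. The example below assumes the stream object is an iterator that also exposes a cancel() method; `FakeWatchStream` is a hypothetical stand-in used only so the sketch runs on its own, and `first_frame` is likewise an illustrative name.

```python
from typing import Iterator


class FakeWatchStream:
    """Stand-in for a server-streaming call object (hypothetical, demo only)."""

    def __init__(self, frames):
        self._frames = iter(frames)
        self.cancelled = False

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        if self.cancelled:
            raise StopIteration
        return next(self._frames)

    def cancel(self) -> None:
        self.cancelled = True


def first_frame(stream) -> str:
    """Read the first status from a Watch-style stream, then cancel it."""
    try:
        return next(stream)  # the server sends the current state first
    finally:
        stream.cancel()  # never hold the stream open between polls


stream = FakeWatchStream(["SERVING", "NOT_SERVING"])
status = first_frame(stream)
```

Cancelling in a finally block releases the stream even when reading the first frame raises. A real probe would also set a deadline on the call, so a server that never sends a first frame cannot hang the probe.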
Common mistakes
- Using HTTP for gRPC liveness. The default, and the most common. Every section of this article exists because of this one mistake. If your gRPC server crashes and your HTTP /healthz still answers, your alerts are theatre.
- Not implementing the protocol and relying on TCP. A TCP probe against :9090 tells you the socket is open. It does not tell you the gRPC dispatcher is alive, the HTTP/2 negotiation works, or any service is registered. Implement grpc.health.v1.Health; it is a single library call in every modern gRPC framework.
- Hardcoding Check("") instead of per-service. Fine for kubelet readiness in a single-service pod; wrong for a multi-service pod where the operator wants per-service alert routing. Configure one probe per critical service.
- mTLS auth issues at handshake time, not at RPC time. The probe error says Server rejected client certificate. That's not a gRPC-level failure, it's a TLS-level failure. Check the mesh's AuthorizationPolicy for the probe identity, confirm the cert chains to the right CA, confirm the server trusts that CA.
- Ignoring SERVICE_UNKNOWN vs NOT_SERVING. They mean very different things: "you misconfigured the probe, the service name is wrong" vs "the server says this service is down". Treat both as Down for the dashboard, but surface the distinction in the alert text. A misconfigured probe and a real outage produce the same red light otherwise.
What to actually monitor for
Once the probe is wired and reporting Up, four signals matter. They're the same four whether you're rolling your own probe or using the StatusPulse gRPC Health probe — the latter just captures them for you and graphs them out of the box.
- Status of Check (or Watch first frame). The headline. SERVING → Up, anything else → Down with the actual server-returned status spelled out in the alert.
- Per-service status, when you run multiple. Each critical service in a multi-service binary needs its own probe with its own threshold. Surface the service name in the alert body.
- Latency of the Check RPC itself. A server that returns SERVING in 5 ms versus 500 ms is telling you two very different things; the second is a smoke alarm for a backend dependency saturating. Set a Degraded threshold (~200 ms is a sane default for same-region targets) so latency creep pages the right team before users notice.
- mTLS handshake time. Track TCP + TLS + HTTP/2 init as a separate phase from RPC RTT. A slow handshake with fast RPC means TLS-side trouble (cert chain bloat, missing OCSP staple, cold-start handshake); a fast handshake with slow RPC means the application is slow. Mixing them masks both.
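The per-phase reasoning in the last two signals can be sketched as a small decision function. The 200 ms RPC threshold comes from the text above; the 300 ms handshake threshold and the function name `diagnose` are assumptions made for illustration and should be tuned per target.

```python
def diagnose(
    handshake_ms: float,
    rpc_ms: float,
    slow_handshake_ms: float = 300.0,  # assumed default, not from any spec
    slow_rpc_ms: float = 200.0,        # the Degraded threshold suggested above
) -> str:
    """Attribute slowness to the handshake phase, the RPC phase, or both."""
    slow_handshake = handshake_ms >= slow_handshake_ms
    slow_rpc = rpc_ms >= slow_rpc_ms
    if slow_handshake and slow_rpc:
        return "both phases slow"
    if slow_handshake:
        return "TLS-side trouble"
    if slow_rpc:
        return "application slow"
    return "ok"
```

For example, `diagnose(800, 5)` points at the TLS side while `diagnose(20, 500)` points at the application. A single combined 805 ms or 520 ms figure would not distinguish the two.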
For a deeper read on the StatusPulse-specific fields, thresholds, and the worked Istio example, see the gRPC Health probe help page. If you're also monitoring a database, the Postgres monitoring guide is the sister article on infrastructure-level probes — same philosophy, different protocol.
Wrap-up
HTTP /healthz against a gRPC server is the
monitoring equivalent of testing a door by knocking on the wall
next to it. The door is right there.
grpc.health.v1.Health is the canonical, shipped-in-every-library,
kubelet-and-Envoy-already-speak-it surface. Implement it on the
server, point a real gRPC health probe at it, and you've replaced
a synthetic green light with an actual signal.
The mechanical checklist:
- Register grpc.health.v1.Health in every gRPC server (one library call per language).
- Wire kubelet readiness + liveness with the built-in grpc: probe type.
- Wire Envoy / Istio / cloud LB health checks at the layer that gates traffic.
- Wire one external probe per critical service, with mTLS if your mesh requires it, and per-phase latency thresholds.
- Alert on the four signals above: status, per-service status, RPC latency, handshake latency.
Five layers of defence is overkill for a hobby project; the first three are table stakes for anything serving real traffic. The fourth — the outside-in probe — is the one that catches the mTLS-rotation-broke-everything class of outage that the in-cluster probes can't see.