Guide · gRPC monitoring

The gRPC Health Check Protocol, explained

A working gRPC health check is the single line of defence between a green dashboard and a silent outage. This is what grpc.health.v1.Health actually does, why HTTP /healthz isn't a substitute, and the four ways teams wire it in production — kubelet, Envoy, cloud load balancers, and external probes.

Published 2026-05-22 · ~10 min read · StatusPulse Team

Why HTTP /healthz lies about gRPC

Most gRPC services we audit have an HTTP /healthz endpoint bolted on the side, usually because the Kubernetes probe templates everyone copies assume HTTP and because curl is easier than grpcurl. It feels safe. It is not. An HTTP sidecar can return a cheerful 200 OK while the actual gRPC server next to it is broken in any of the following ways:

The gRPC port isn't accepting connections. The HTTP listener is on :8080, the gRPC listener is on :9090, and only one of them crashed. Plain TCP probes against :9090 will at least catch the refused connection, but won't tell you whether the HTTP/2 stack came up correctly.
The HTTP/2 stack is up but the gRPC server is deadlocked. A pool of worker goroutines is starved, or the .NET ThreadPool is saturated, or a downstream blocking call is holding every server-side handler. The HTTP handler thread keeps responding; the gRPC dispatcher is unreachable.
One service is NOT_SERVING inside a multi-service binary. Your api-gateway process hosts users, billing, and shipping. billing failed to load its database driver and is publishing NOT_SERVING to the Health service. HTTP /healthz has no idea.
mTLS is failing for everyone except the kubelet /healthz probe. The Istio sidecar is rejecting every external client because a CA rotation broke trust. The kubelet probe reaches the pod from its own node without presenting a mesh client certificate, so it never exercises the path that is failing.

The pattern in all four cases is the same: the HTTP probe is checking the wrong protocol, and often the wrong code path, from the wrong network position. A real gRPC health check has to speak HTTP/2, has to invoke an actual gRPC method, and — ideally — has to traverse the same network path that real callers traverse. That's what the gRPC Health Checking Protocol is for.

What grpc.health.v1.Health is

The gRPC project documents a canonical health-checking service, grpc.health.v1.Health, in doc/health-checking.md in the grpc/grpc repository. The mainstream gRPC implementations (grpc-go, grpc-java, grpc-dotnet, grpc-python, and tonic for Rust) provide a pre-built implementation that takes a line or two to register. That's the surface you want every health probe, internal or external, to talk to.

The service exposes two RPCs:

Check — a unary RPC. Ask "what is the health right now?", get an immediate response. This is what kubelet calls, what every service mesh's built-in health filter calls, and what most external monitors should call. It is the simpler RPC and the one to default to.
Watch — a server-streaming RPC. The server pushes the current state as the first message on the stream, then pushes every subsequent state change. Useful for long-lived clients that want to react instantaneously to a flip from SERVING to NOT_SERVING without polling, and for the rare servers that only implement Watch and return UNIMPLEMENTED on Check.

The response is a single enum field, status, with four legal values:

SERVING — the server (or service) is healthy. Treat as Up.
NOT_SERVING — the server (or service) is intentionally refusing traffic. Treat as Down.
UNKNOWN — the server hasn't decided. In practice you should also treat this as Down for alerting; "I don't know" is not a green dashboard.
SERVICE_UNKNOWN — only legal on a Watch response, returned when the operator asks about a service the server doesn't know about. Treat as Down — your config is wrong, the operator wants to know.

One detail that catches people out: on a unary Check call against an unknown service, the server doesn't return SERVICE_UNKNOWN — it returns the gRPC status code NOT_FOUND at the RPC layer. Same meaning, different mechanism. A correct client treats both as "this service isn't exposed".
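The status handling above can be sketched as a small classifier. This is a minimal illustration, not any library's API: the classify function, its parameters, and the reason strings are hypothetical, and the RPC status code is passed in as a plain string.

```python
from enum import Enum


class ServingStatus(Enum):
    # Values mirror the ServingStatus enum in grpc.health.v1.
    UNKNOWN = 0
    SERVING = 1
    NOT_SERVING = 2
    SERVICE_UNKNOWN = 3


def classify(status=None, rpc_code="OK"):
    """Return (state, reason) for one probe result.

    rpc_code is the gRPC status code name of the call itself;
    status is the ServingStatus carried in the response, if any.
    """
    if rpc_code == "NOT_FOUND":
        # Unary Check against a name the server does not know.
        return ("Down", "service not exposed (Check returned NOT_FOUND)")
    if rpc_code != "OK":
        return ("Down", f"RPC failed with {rpc_code}")
    if status is ServingStatus.SERVING:
        return ("Up", "SERVING")
    if status is ServingStatus.SERVICE_UNKNOWN:
        # Same meaning as NOT_FOUND, delivered on a Watch stream.
        return ("Down", "service not exposed (Watch sent SERVICE_UNKNOWN)")
    # NOT_SERVING, UNKNOWN, or a missing status all count as Down.
    return ("Down", status.name if status is not None else "no status")


print(classify(ServingStatus.SERVING))
print(classify(ServingStatus.UNKNOWN))
print(classify(rpc_code="NOT_FOUND"))
```

Keeping the reason separate from the Up/Down state lets the alert text say whether the probe is misconfigured or the service is really down.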

The .proto contract

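Here is the core of the canonical service definition from the gRPC project, abridged: the upstream file also carries license text, language-specific options, and longer comments. Server implementations generate their code from the upstream .proto: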

syntax = "proto3";

package grpc.health.v1;

service Health {
  rpc Check(HealthCheckRequest) returns (HealthCheckResponse);
  rpc Watch(HealthCheckRequest) returns (stream HealthCheckResponse);
}

message HealthCheckRequest {
  string service = 1;  // Empty = overall server health
}

message HealthCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
    SERVICE_UNKNOWN = 3;  // Only valid in Watch responses
  }
  ServingStatus status = 1;
}

The service field is the lever every operator misuses, so it is worth being explicit about:

Empty string means "what is the overall server health?" — the answer is a single combined status the server is free to compute however it wants. This is what kubelet readiness probes hit. It is the right default.
A fully-qualified protobuf service name like myapp.users.v1.UserService means "what is the health of this specific service inside the server?" — the answer comes from the server's internal status map, which each service registers itself in. This is how you alert on billing independently of shipping in a multi-service binary.

Fully-qualified means package.Service, not just Service. UserService alone will not match myapp.users.v1.UserService — the service-name string is matched verbatim by the server, case-sensitive.
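To make the matching rule concrete, here is a toy model of a server-side status map. It is a sketch under assumptions: the status_map contents, the check function, and the myapp.billing.v1.BillingService name are hypothetical, and real implementations keep this map inside their health service rather than in a module-level dict.

```python
# Hypothetical status map keyed by fully-qualified service name.
# The empty string holds the overall server status.
status_map = {
    "": "SERVING",
    "myapp.users.v1.UserService": "SERVING",
    "myapp.billing.v1.BillingService": "NOT_SERVING",
}


def check(service: str) -> str:
    """Return the stored status, or NOT_FOUND for an unregistered name."""
    # Plain dict lookup: exact and case-sensitive, with no prefix
    # or suffix matching on the service name.
    return status_map.get(service, "NOT_FOUND")


print(check(""))                            # overall server health
print(check("myapp.users.v1.UserService"))  # fully-qualified name matches
print(check("UserService"))                 # short name does not match
```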

The four ways to wire a real check

Once your server implements grpc.health.v1.Health, you have four places to actually call it from. Most production stacks use two or three of them in combination — kubelet for in-cluster, an external monitor for outside-in, and Envoy or a cloud LB to gate load balancing.

a) grpc_health_probe in Kubernetes liveness / readiness

The grpc-ecosystem organization maintains a small CLI binary called grpc_health_probe, built so that kubelet can invoke it as an exec probe. Bake it into your container image, then wire the probes as:

livenessProbe:
  exec:
    command:
      - /bin/grpc_health_probe
      - -addr=localhost:9090
  periodSeconds: 10
readinessProbe:
  exec:
    command:
      - /bin/grpc_health_probe
      - -addr=localhost:9090
      - -service=myapp.users.v1.UserService
  periodSeconds: 5

From Kubernetes 1.24 onward (where the feature became beta and enabled by default; it went stable in 1.27) you can also use the built-in grpc: probe type, which removes the binary dependency:

livenessProbe:
  grpc:
    port: 9090
  periodSeconds: 10
readinessProbe:
  grpc:
    port: 9090
    service: myapp.users.v1.UserService

That covers gRPC liveness and readiness on any modern Kubernetes cluster: no sidecar, no shell-out, kubelet talks grpc.health.v1.Health directly.

b) Envoy / Istio gRPC health check filter

Envoy can be configured to call Health/Check against each upstream host as part of active health checking (separate from outlier detection, which is passive and watches real traffic), and to remove unhealthy pods from the routing pool without waiting for kubelet to mark them NotReady:

health_checks:
  - timeout: 1s
    interval: 10s
    unhealthy_threshold: 2
    healthy_threshold: 1
    grpc_health_check:
      service_name: "myapp.users.v1.UserService"
      authority: "users.myapp.svc.cluster.local"

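This is Envoy's own gRPC health checker, so it is present in the Envoy sidecars that Istio deploys, although how you enable it depends on the mesh's configuration API. Other meshes, Linkerd included, have their own mechanisms; check the mesh documentation for the exact resource names. The point is identical: don't route requests to a pod whose Health service reports NOT_SERVING, regardless of what kubelet thinks.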
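The unhealthy_threshold and healthy_threshold settings in the snippet above can be read as a consecutive-result state machine. The sketch below is a simplified model of that idea, not Envoy's implementation: apply_thresholds is a hypothetical helper, and Envoy documents additional special cases (for example during startup) that it does not model.

```python
def apply_thresholds(results, unhealthy_threshold=2, healthy_threshold=1,
                     healthy=True):
    """Replay check results (True = pass) and yield host health after each."""
    passes = fails = 0
    for ok in results:
        if ok:
            passes, fails = passes + 1, 0
            # Enough consecutive passes bring an unhealthy host back.
            if not healthy and passes >= healthy_threshold:
                healthy = True
        else:
            fails, passes = fails + 1, 0
            # Enough consecutive failures eject a healthy host.
            if healthy and fails >= unhealthy_threshold:
                healthy = False
        yield healthy


# A single failure is absorbed; two in a row eject the host;
# one pass restores it.
print(list(apply_thresholds([True, False, True, False, False, True])))
# [True, True, True, True, False, True]
```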

c) Cloud load balancer gRPC health checks

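GCP's external HTTPS load balancer supports gRPC health checks natively (set the protocol field to GRPC on the health-check resource; optionally set grpcServiceName for per-service checks). On AWS, the gRPC-aware option is the Application Load Balancer: a target group that uses the gRPC protocol version health-checks its targets by calling a gRPC method and matching on the returned gRPC status code, whereas Network Load Balancer health checks are TCP, HTTP, or HTTPS only. Azure Application Gateway currently does HTTP-only health probes against gRPC backends, a known gap.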

Use the LB health check when you need traffic-shaping decisions (drain a region, fail over a zone) tied to gRPC health. Don't use it as your only signal — LB health checks are tuned for fast routing, not for alerting. They typically have short timeouts and small histories, so they're noisy as an on-call paging source.

d) External monitoring with mTLS

The three above are all in-cluster or in-cloud. They tell you whether the pod is healthy from the position of something that's already inside your trust boundary. None of them tells you what an external customer experiences. For that you need a probe that sits outside the cluster, presents an mTLS client cert if your mesh requires it, and calls Health/Check the same way a real client would.

That's where a hosted gRPC health monitor such as StatusPulse's gRPC Health probe fits. It speaks grpc.health.v1.Health over HTTP/2 (HTTP/3 opt-in), supports per-service Check, Watch-first-frame, mTLS with an encrypted client certificate, a custom CA bundle, optional Bearer / API-key metadata, and the standard SSRF protections you expect from a public-internet monitor. It is available on the Pro tier and above; the Free and Starter plans do not include gRPC Health. Start free if you want to try the rest of the platform first; upgrade to Pro when the gRPC probe is the one you need.

Datadog has a gRPC synthetic check too — but it ships as part of the full Synthetics suite, priced per "test" at Datadog volumes; see the StatusPulse vs Datadog Synthetics breakdown if the cost-per-probe math matters for your stack.

mTLS, per-service Check, streaming Watch

These are the three advanced moves that separate a real production gRPC health check from a smoke test.

mTLS for zero-trust meshes

Istio, Linkerd, and Consul terminate every internal RPC at an mTLS proxy and require clients to present a certificate. A health probe that only does TLS server-cert validation will be rejected at the handshake — you'll see a tls: client didn't provide a certificate or Server rejected client certificate error before any gRPC method gets invoked. Provision a dedicated probe identity in your mesh (a SPIFFE ID, a service account, whatever the mesh calls it), grant it Check / Watch permissions on the target service, and feed the resulting cert + key into the probe as a single PEM blob:

-----BEGIN CERTIFICATE-----
MIIB...
-----END CERTIFICATE-----
-----BEGIN PRIVATE KEY-----
MIIE...
-----END PRIVATE KEY-----

StatusPulse stores that blob AES-GCM-encrypted at rest with a Key Vault master key; the form never re-echoes the value back. Pair it with a plain-text custom CA bundle when the server's cert chains to a private root (the typical mesh setup).
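Before handing a combined blob like the one above to any probe, it can help to confirm that it really contains a certificate and a key. The following is a structural check only, offered as a sketch: split_pem and validate_probe_identity are hypothetical helpers, the sample bodies are the same placeholders used above, and nothing here verifies that the key matches the certificate.

```python
import re

# Matches one PEM block; the END label must equal the BEGIN label.
PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----\s*(.*?)\s*-----END \1-----",
    re.DOTALL,
)

SAMPLE_BLOB = """-----BEGIN CERTIFICATE-----
MIIB...
-----END CERTIFICATE-----
-----BEGIN PRIVATE KEY-----
MIIE...
-----END PRIVATE KEY-----
"""


def split_pem(blob: str) -> dict:
    """Group the PEM blocks found in blob by label, in order of appearance."""
    blocks = {}
    for match in PEM_BLOCK.finditer(blob):
        blocks.setdefault(match.group(1), []).append(match.group(0))
    return blocks


def validate_probe_identity(blob: str) -> None:
    """Raise ValueError unless blob has a certificate and exactly one key."""
    blocks = split_pem(blob)
    keys = [block for label, found in blocks.items()
            if label.endswith("PRIVATE KEY") for block in found]
    if not blocks.get("CERTIFICATE"):
        raise ValueError("no CERTIFICATE block found")
    if len(keys) != 1:
        raise ValueError(f"expected exactly one private key, found {len(keys)}")


validate_probe_identity(SAMPLE_BLOB)  # passes silently
print(list(split_pem(SAMPLE_BLOB)))
```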

Per-service Check

If a single gRPC server hosts more than one service, default to one probe per service rather than a single overall-server probe. The core reason: an overall-server probe that aggregates every registered service goes Down when any one of them goes NOT_SERVING, so the shipping on-call gets paged at 3am when the real story is billing's Redis dependency. Per-service probes give each team a dedicated signal.

Pseudocode for what a per-service Check looks like on the wire:

request  = HealthCheckRequest { service: "myapp.users.v1.UserService" }
response = Health.Check(request)
assert response.status == SERVING
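For readers curious about the actual bytes, here is a rough sketch of how that request and its response are encoded, assuming standard proto3 encoding and gRPC's length-prefixed message framing. The helper names are illustrative, the decoder handles only the simplest response shape, and the code does not open an HTTP/2 connection.

```python
import struct

# HTTP/2 :path for the call, in gRPC's /package.Service/Method form.
METHOD_PATH = "/grpc.health.v1.Health/Check"


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_health_check_request(service: str) -> bytes:
    """Serialize HealthCheckRequest: field 1, length-delimited string."""
    data = service.encode("utf-8")
    if not data:
        return b""  # proto3 omits an empty string entirely
    return b"\x0a" + encode_varint(len(data)) + data


def grpc_frame(message: bytes) -> bytes:
    """Prefix a message with a 1-byte compression flag and 4-byte length."""
    return struct.pack(">BI", 0, len(message)) + message


def decode_status(frame: bytes) -> int:
    """Read ServingStatus from an uncompressed, framed HealthCheckResponse."""
    compressed, length = struct.unpack(">BI", frame[:5])
    if compressed:
        raise ValueError("compressed responses are not handled in this sketch")
    body = frame[5:5 + length]
    if not body:
        return 0  # proto3 omits the zero value, which is UNKNOWN
    if body[0] != 0x08:
        raise ValueError("expected field 1 with varint wire type")
    return body[1]  # enum values 0 to 3 fit in one varint byte


frame = grpc_frame(encode_health_check_request("myapp.users.v1.UserService"))
print(METHOD_PATH, frame.hex())
```

An empty response body decoding to UNKNOWN is one more reason to treat UNKNOWN as Down: it is also what a server sends when it never set a status at all.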

Streaming Watch — first frame only

Watch is server-streaming: the server pushes the current state immediately as the first frame, then every subsequent state change for as long as the stream stays open. A periodic health probe doesn't need the stream to stay open — reading the first frame and cancelling gives you the same "what is the state right now?" answer as Check, and it works against the rare spec-only-Watch servers that return UNIMPLEMENTED on Check.
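The read-one-frame-then-cancel pattern can be sketched with a plain generator standing in for the server stream. Everything here is a stand-in: fake_watch_stream and first_frame are hypothetical, and with a real gRPC client the equivalent steps would be reading the first message from the streaming call and then cancelling that call.

```python
def fake_watch_stream(states, log):
    """Stand-in for a Watch stream: yields states, logs when it is closed."""
    try:
        for state in states:
            yield state
    finally:
        log.append("stream closed")


def first_frame(stream):
    """Read the first message from a stream, then cancel the stream."""
    try:
        return next(stream)
    finally:
        # Closing here is the analogue of cancelling the RPC, so the
        # server is not left holding an open stream per probe run.
        stream.close()


log = []
status = first_frame(fake_watch_stream(["SERVING", "NOT_SERVING"], log))
print(status, log)
```

Cancelling in a finally block matters for a periodic probe: without it, every polling interval would leave another open stream on the server.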

If you need push-style notification on flips (alert within seconds rather than within one polling interval), you need a long-lived Watch client — a different shape from a periodic probe, with stream re-open, backoff, and dead-stream detection. That is out of scope for most external monitors today, StatusPulse included; periodic polling fits the SLA shape every other probe type uses.

Common mistakes

Using HTTP for gRPC liveness. The default, and the most common. Every section of this article exists because of this one mistake. If your gRPC server crashes and your HTTP /healthz still answers, your alerts are theatre.
Not implementing the protocol and relying on TCP. A TCP probe against :9090 tells you the socket is open. It does not tell you the gRPC dispatcher is alive, the HTTP/2 negotiation works, or any service is registered. Implement grpc.health.v1.Health — it is a single library call in every modern gRPC framework.
Hardcoding Check("") instead of per-service. Fine for kubelet readiness in a single-service pod; wrong for a multi-service pod where the operator wants per-service alert routing. Configure one probe per critical service.
Treating mTLS handshake failures as RPC failures. When the probe error says Server rejected client certificate, that is a TLS-level failure, not a gRPC-level one: no method was ever invoked. Check the mesh's AuthorizationPolicy for the probe identity, confirm the cert chains to the right CA, and confirm the server trusts that CA.
Ignoring SERVICE_UNKNOWN vs NOT_SERVING. They mean very different things — "you misconfigured the probe, the service name is wrong" vs "the server says this service is down". Treat both as Down for the dashboard, but surface the distinction in the alert text. A misconfigured probe and a real outage produce the same red light otherwise.

What to actually monitor for

Once the probe is wired and reporting Up, four signals matter. They're the same four whether you're rolling your own probe or using the StatusPulse gRPC Health probe — the latter just captures them for you and graphs them out of the box.

Status of Check (or Watch first frame). The headline. SERVING → Up, anything else → Down with the actual server-returned status spelled out in the alert.
Per-service status, when you run multiple. Each critical service in a multi-service binary needs its own probe with its own threshold. Surface the service name in the alert body.
Latency of the Check RPC itself. A server that returns SERVING in 5 ms versus 500 ms is telling you two very different things — the second is a smoke alarm for a backend dependency saturating. Set a Degraded threshold (~200 ms is a sane default for same-region targets) so latency creep pages the right team before users notice.
mTLS handshake time. Track TCP + TLS + HTTP/2 init as a separate phase from RPC RTT. A slow handshake with fast RPC means TLS-side trouble (cert chain bloat, missing OCSP staple, cold-start handshake); a fast handshake with slow RPC means the application is slow. Mixing them masks both.
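The handshake-versus-RPC reasoning above can be written down as a small decision function. This is a sketch under assumptions: diagnose is a hypothetical helper, the 200 ms RPC limit is the default suggested above, and the 300 ms handshake limit is an arbitrary placeholder to tune for your own targets.

```python
def diagnose(handshake_ms, rpc_ms, status="SERVING",
             handshake_limit_ms=300.0, rpc_limit_ms=200.0):
    """Combine probe status and per-phase timings into one verdict."""
    if status != "SERVING":
        # Status is the headline signal; timings only matter when Up.
        return f"Down: {status}"
    slow_handshake = handshake_ms > handshake_limit_ms
    slow_rpc = rpc_ms > rpc_limit_ms
    if slow_handshake and slow_rpc:
        return "Degraded: handshake and RPC both slow"
    if slow_handshake:
        return "Degraded: TLS-side trouble (slow handshake, fast RPC)"
    if slow_rpc:
        return "Degraded: application slow (fast handshake, slow RPC)"
    return "Up"


print(diagnose(handshake_ms=40, rpc_ms=5))
print(diagnose(handshake_ms=40, rpc_ms=500))
print(diagnose(handshake_ms=900, rpc_ms=5))
print(diagnose(handshake_ms=40, rpc_ms=5, status="NOT_SERVING"))
```

Timing the two phases separately is what makes the middle two verdicts possible; a single end-to-end number would collapse them into one ambiguous "slow".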

For a deeper read on the StatusPulse-specific fields, thresholds, and the worked Istio example, see the gRPC Health probe help page. If you're also monitoring a database, the Postgres monitoring guide is the sister article on infrastructure-level probes — same philosophy, different protocol.

Wrap-up

HTTP /healthz against a gRPC server is the monitoring equivalent of testing a door by knocking on the wall next to it. The door is right there. grpc.health.v1.Health is the canonical, shipped-in-every-library, kubelet-and-Envoy-already-speak-it surface. Implement it on the server, point a real gRPC health probe at it, and you've replaced a synthetic green light with an actual signal.

The mechanical checklist:

Register grpc.health.v1.Health in every gRPC server (one library call per language).
Wire kubelet readiness + liveness with the built-in grpc: probe type.
Wire Envoy / Istio / cloud LB health checks at the layer that gates traffic.
Wire one external probe per critical service, with mTLS if your mesh requires it, and per-phase latency thresholds.
Alert on the four signals above — status, per-service status, RPC latency, handshake latency.

All five steps are overkill for a hobby project; the first three are table stakes for anything serving real traffic. The fourth, the outside-in probe, is the one that catches the mTLS-rotation-broke-everything class of outage that the in-cluster probes can't see.

Try StatusPulse's gRPC Health probe

grpc.health.v1.Health/Check + Watch first-frame, mTLS with encrypted client cert, per-service probing, HTTP/2 + HTTP/3. Pro tier and above. 5 probes, 1 status page, forever on Free if you want to evaluate the rest of the platform first. No credit card. US or EU host — you choose.
