GuideDevOps
Lesson 8 of 9

Observability with Service Mesh

Part of the Service Mesh tutorial series.

The Distributed Tracing Problem (Revisited)

In the Monitoring section of GuideDevOps, we discussed how incredibly complex it is to track a request as it bounces between 15 microservices.

Historically, to generate a trace, developers had to explicitly import tracing libraries (such as OpenTelemetry) into their application code. They had to read the TraceID out of the incoming HTTP headers and explicitly inject it into every outbound HTTP request.
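Concretely, that manual pattern looked something like the following sketch. The helper names are illustrative, not from any particular framework; only the `x-b3-traceid` header name reflects the real B3 convention:

```python
# Manual context propagation, pre-service-mesh: every service had to
# extract the trace context from the incoming request and re-inject it
# into every outgoing call by hand.

def extract_trace_id(incoming_headers):
    """Pull the trace identifier out of the incoming HTTP headers."""
    return incoming_headers.get("x-b3-traceid")

def build_outbound_headers(incoming_headers):
    """Copy the trace context into the headers for the next hop."""
    outbound = {"content-type": "application/json"}
    trace_id = extract_trace_id(incoming_headers)
    if trace_id is not None:
        outbound["x-b3-traceid"] = trace_id
    return outbound

incoming = {"x-b3-traceid": "463ac35c9f6413ad48485a3953bb6124"}
print(build_outbound_headers(incoming)["x-b3-traceid"])
# If any one service in the chain skipped this step, the trace broke there.
```

Every service in the call chain had to repeat this ritual, which is exactly why forgotten headers were so common.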

If developers forgot to pass the headers, the trace completely shattered.

A Service Mesh drastically reduces this burden by observing traffic at the lowest possible level: the network.


What the Proxy Sees

Because every network request must pass through the Sidecar proxy before entering or leaving a Pod, the proxy has an unparalleled view of the state of the system.

The proxy natively understands HTTP, gRPC, and TCP traffic.

Therefore, without the developer writing any specific code, the proxy instantly knows:

  • Rate: how many HTTP requests are hitting this pod per second.
  • Errors: what percentage of those requests are failing with HTTP 5xx responses.
  • Duration: how many milliseconds the application took to produce each HTTP response.

This trio is commonly referred to as the RED metrics (Rate, Errors, Duration).
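The arithmetic behind RED is simple. A toy sketch of how a proxy might aggregate one observation window of (status code, latency) pairs, with made-up data:

```python
# Toy RED computation over a single observation window, roughly as a
# sidecar would aggregate it. The request data below is invented.

def red_metrics(requests, window_seconds):
    """requests: list of (http_status, latency_ms) tuples."""
    total = len(requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    rate = total / window_seconds                              # requests/sec
    error_pct = 100.0 * errors / total if total else 0.0       # % of 5xx
    avg_duration = sum(ms for _, ms in requests) / total if total else 0.0
    return rate, error_pct, avg_duration

window = [(200, 12.0), (200, 8.0), (500, 40.0), (200, 10.0)]
rate, error_pct, avg_ms = red_metrics(window, window_seconds=2)
print(rate, error_pct, avg_ms)  # 2.0 req/s, 25.0% errors, 17.5 ms average
```

In a real mesh the proxy streams these aggregates to a metrics backend rather than computing averages in place, but the bookkeeping is the same.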


Instant Dashboards (Kiali)

Service Meshes export all of these metrics automatically.

If you are using Istio, you can install its native dashboard engine called Kiali.

Kiali consumes the metrics generated by the Istio sidecars and draws a live, animated topological map of your entire service architecture.

You no longer need to rely on stale architecture diagrams; Kiali draws the system exactly as it exists.

  • It displays a visual node for every running microservice.
  • It draws animated arrows between nodes showing the direction of traffic.
  • If the Database node begins returning errors, Kiali automatically turns the line pointing to it bright red and flashes a warning, allowing an SRE to pinpoint the broken microservice instantly without ever looking at the code.

The OpenTelemetry Partnership

While a Service Mesh generates the performance metrics automatically, there is a catch regarding Distributed Traces (Spans).

A sidecar proxy does automatically create a span when a request enters a Pod, and another when a request leaves it. However, the proxy is blind to what happens inside the application process itself.

If an HTTP request enters the Node.js application, the Node app might run some internal computation, spawn worker threads, and then query a database.

Because the Service Mesh proxy never saw that internal work, it has no way to connect the "incoming request" to the "outgoing database request."

The Golden Rule of Tracing with a Service Mesh: To generate cohesive, end-to-end traces (in a tool like Jaeger), the Service Mesh does 99% of the work. However, the developer must write a tiny piece of code that copies the B3 Trace Headers (including the TraceID) from the incoming HTTP request and passes them, unmodified, into the outgoing HTTP requests.

The Service Mesh handles the heavy lifting of reporting the spans to the tracing backend. The developer just acts as the postman handing off the identifier.
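That pass-through is small enough to sketch in full. The header names below are the standard B3 set that Istio's header-propagation guidance lists; the surrounding request handling is illustrative:

```python
# The developer's entire tracing obligation under a service mesh:
# copy these headers, unmodified, from the incoming request into
# every outgoing request the handler makes.

B3_HEADERS = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
]

def propagate_trace_headers(incoming_headers, outbound_headers):
    """Copy any B3 trace headers that are present onto the outbound request."""
    for name in B3_HEADERS:
        if name in incoming_headers:
            outbound_headers[name] = incoming_headers[name]
    return outbound_headers

incoming = {"x-b3-traceid": "80f198ee56343ba8", "x-b3-sampled": "1"}
outbound = propagate_trace_headers(incoming, {"accept": "application/json"})
print(sorted(outbound))  # the two B3 headers present, plus "accept"
```

Many tracing libraries and frameworks can do this copy for you, but however it happens, this hand-off is the one step the mesh cannot perform on the application's behalf.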