Lesson 2 of 13

The Three Pillars

Part of the Monitoring & Observability tutorial series.

Telemetry Data

To make a system observable, the system must emit data about its current state. We call this data Telemetry. In modern DevOps architectures, telemetry is divided into three distinct categories known as the "Three Pillars of Observability."

To truly understand what is happening inside a complex application, you need all three.


Pillar 1: Metrics

Metrics are quantifiable measurements of a system's state over time.

They answer the question: "What is the overall numeric health of the system right now?"

Metrics are incredibly cheap to store and transmit because each data point is just a number tagged with a timestamp and a few labels. They do not tell you what an individual user is doing; instead, they aggregate the behavior of all users.

Characteristics of Metrics

  • Used primarily for dashboards (graphs, charts) and triggering alerts.
  • Excellent at identifying trends (e.g., "Memory usage is climbing slowly by 1% per day").
  • Very low storage overhead.

Common Metric Examples

  • Node-level Metrics: CPU usage, Memory consumption, Disk I/O bytes.
  • Application-level Metrics: Number of active database connections, Garbage collection pause times.
  • Business-level Metrics: Failed logins per minute, checkouts completed per hour.
  • Network-level Metrics: HTTP request rate (requests per second), 5xx error rate.
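As a concrete sketch, here is a toy in-memory counter in Python that stores exactly what the description above implies: numeric samples tagged with a timestamp and a set of labels. The metric name `http_requests_total` follows the Prometheus naming convention; the class itself is illustrative only, not a real client library.

```python
import time
from collections import defaultdict

class CounterMetric:
    """A toy counter metric: numeric samples tagged with a timestamp and labels."""

    def __init__(self, name):
        self.name = name
        # Maps a sorted label tuple -> list of (timestamp, value) samples.
        self.samples = defaultdict(list)

    def inc(self, amount=1, **labels):
        key = tuple(sorted(labels.items()))
        self.samples[key].append((time.time(), amount))

    def total(self, **labels):
        key = tuple(sorted(labels.items()))
        return sum(value for _, value in self.samples[key])

# Aggregate behavior across all requests -- no individual user detail survives.
http_requests = CounterMetric("http_requests_total")
http_requests.inc(method="GET", status="200")
http_requests.inc(method="GET", status="200")
http_requests.inc(method="GET", status="500")
print(http_requests.total(method="GET", status="200"))  # 2
```

Note how cheap the storage is: thousands of requests collapse into a handful of label combinations, each holding only numbers.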

Popular Tools: Prometheus, Datadog (Metrics), StatsD.


Pillar 2: Logs

Logs are immutable, timestamped records of discrete events that happened over time.

They answer the question: "What exactly did the system say it was doing when the error occurred?"

While metrics show you a sudden spike in CPU, logs provide the context. You look at the logs to see exactly what triggered the CPU spike.

Characteristics of Logs

  • High granularity. A single log entry often corresponds to a specific user action or function execution.
  • Can be very expensive to store and index at massive scale (high volume).
  • Plain-text logs are hard to parse; modern systems rely on Structured Logging (usually JSON).

Structured vs Unstructured Logs

Unstructured (Hard for machines to search):

[2026-10-14 14:32:01] INFO - User jdoe123 failed to log in from IP 192.168.1.55. Reason: Invalid Password.

Structured (Easy to filter, sort, and search in a tool like Kibana):

{
  "timestamp": "2026-10-14T14:32:01Z",
  "level": "INFO",
  "event": "authentication_failure",
  "username": "jdoe123",
  "client_ip": "192.168.1.55",
  "reason": "invalid_password"
}
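The structured entry above can be produced with nothing more than Python's standard logging module. The JsonFormatter below is a minimal sketch whose field names mirror the example; production systems typically use a dedicated structured-logging library instead.

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    converter = time.gmtime  # emit UTC timestamps, matching the trailing "Z"

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Merge in any structured fields attached via the `extra` mechanism.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("auth")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("authentication_failure",
            extra={"fields": {"username": "jdoe123",
                              "client_ip": "192.168.1.55",
                              "reason": "invalid_password"}})
```

Because every field is a key/value pair, a tool like Kibana can now answer queries such as "all authentication_failure events for this client_ip" without regex gymnastics.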

Popular Tools: Elasticsearch (ELK Stack), Loki, Splunk, Datadog (Logs).


Pillar 3: Traces (Distributed Tracing)

A trace represents the entire journey of a single request as it travels across a distributed system.

Traces answer the question: "Where did this request spend its time, and in which microservice did it fail?"

In a microservices architecture, a user clicking "Buy" might touch the Frontend, the Cart Service, the Inventory API, the Payment Gateway, and the Shipping Service. If the request takes 4 seconds, whose fault is it? Logs won't tell you, because the logs are scattered across 5 different servers.

How Tracing Works

When an HTTP request enters the system, the API Gateway generates a unique Trace ID and injects it into the HTTP Headers. As the request is passed from microservice to microservice, every service logs how much time it spent processing the request, tagging that data with the same Trace ID.

These individual chunks of work are called Spans.

A Trace UI reconstructs all overlapping spans into a graphical waterfall chart.
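The propagation described above can be sketched in a few lines of Python. The `X-Trace-Id` header name here is a simplification for illustration (the W3C Trace Context standard uses a `traceparent` header), and the span list stands in for exporting to a collector such as Jaeger.

```python
import time
import uuid

spans = []  # in a real system, each service exports its spans to a collector

def traced(service_name, headers, work):
    """Run `work`, recording a span tagged with the request's Trace ID."""
    # The first hop (the API Gateway) generates the Trace ID; every service
    # downstream reuses the one it finds in the incoming headers.
    trace_id = headers.setdefault("X-Trace-Id", uuid.uuid4().hex)
    start = time.time()
    result = work(headers)  # downstream calls receive the same headers
    spans.append({"trace_id": trace_id,
                  "service": service_name,
                  "duration_ms": round((time.time() - start) * 1000, 2)})
    return result

# One request flows API Gateway -> Cart Service, sharing a single Trace ID.
incoming_headers = {}  # no trace context yet: this is the edge of the system
traced("api_gateway", incoming_headers,
       lambda h: traced("cart_service", h, lambda _: "ok"))

assert len({s["trace_id"] for s in spans}) == 1  # both spans share one Trace ID
```

Grouping spans by that shared Trace ID is exactly what lets the Trace UI rebuild the waterfall for one request out of millions.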

Characteristics of Traces

  • Crucial for identifying bottlenecks/latency in microservice architectures.
  • Displays request topology (how services talk to one another).
  • Often "sampled" (e.g., the system only records 1 out of every 100 requests) to save storage costs.
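Sampling is commonly done deterministically from the Trace ID rather than with a random roll, so that every service makes the same keep/drop decision and sampled traces stay complete. A sketch of that idea (the 1% rate matches the example above; the hashing scheme is illustrative):

```python
import hashlib

def should_sample(trace_id, rate=0.01):
    """Deterministic head sampling: keep roughly `rate` of all traces.

    Hashing the Trace ID (instead of calling random()) guarantees that
    every service in the request path reaches the same decision, so a
    sampled trace is never missing spans from some services.
    """
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < rate * 100

kept = sum(should_sample(f"trace-{i}") for i in range(10_000))
print(kept)  # roughly 1% of 10,000
```

The trade-off: you cut storage costs by ~99%, but a rare failing request may not have been recorded, which is why some systems use "tail sampling" that decides after seeing whether the request errored.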

Popular Tools: Jaeger, Zipkin, Honeycomb, AWS X-Ray.


Bringing the Pillars Together

The magic of observability happens when you correlate these three pillars. Here is how a DevOps engineer realistically troubleshoots an outage:

  1. Metrics (The Alert): You receive a Slack alert from Grafana because the metric http_errors_500_total suddenly spiked by 400%.
  2. Traces (The Location): You look at the tracing dashboard (Jaeger) for recent 500 errors. The waterfall graph shows the request succeeded in the API Gateway, succeeded in the Payments_Service, but stalled for 2 seconds and then failed in the Inventory_DB_Service. You found the bottleneck!
  3. Logs (The Root Cause): You switch to your logging platform (Kibana), filter logs specifically for the Inventory_DB_Service during that exact minute, and see the error: FATAL: Connection refused: Max connections (100) reached.
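Step 3 of that workflow boils down to a filter over structured log entries by service and time window. A sketch with hypothetical, hard-coded log data (service names are lowercased versions of those in the story):

```python
from datetime import datetime, timedelta

# Hypothetical structured log entries collected from many services.
logs = [
    {"timestamp": "2026-10-14T14:32:01Z", "service": "inventory_db_service",
     "level": "FATAL", "message": "Connection refused: Max connections (100) reached"},
    {"timestamp": "2026-10-14T14:32:02Z", "service": "payments_service",
     "level": "INFO", "message": "charge ok"},
]

def logs_for(service, around, window=timedelta(minutes=1)):
    """Narrow millions of log lines to one service during one time window."""
    return [entry for entry in logs
            if entry["service"] == service
            and abs(datetime.strptime(entry["timestamp"],
                                      "%Y-%m-%dT%H:%M:%SZ") - around) <= window]

incident = datetime(2026, 10, 14, 14, 32, 0)  # the minute the trace pointed at
for entry in logs_for("inventory_db_service", incident):
    print(entry["level"], entry["message"])
# FATAL Connection refused: Max connections (100) reached
```

Because the trace already told you which service and which minute to look at, the filter returns a handful of lines instead of a haystack.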

By chaining Metrics → Traces → Logs, you found the exact root cause of a complex distributed failure in a matter of minutes.