Prometheus

What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit initially built at SoundCloud. It is now a flagship project of the Cloud Native Computing Foundation (CNCF).

At its core, Prometheus is a highly optimized database designed for storing one thing: Time-Series Metrics, meaning numeric values recorded over time and identified by a metric name plus key-value labels.

It is the de facto standard for cloud-native metrics and integrates natively with Kubernetes.

The Prometheus Architecture: A "Pull" Model

Many monitoring systems (StatsD, for example) use a Push model: your applications, or agents running beside them, proactively send (push) metrics to the central monitoring server.

Prometheus flips this. It uses a Pull model (called scraping).

Your application exposes a simple HTTP endpoint (usually http://yourapp.com/metrics) that prints out raw text containing metric values.
The Prometheus server reaches out to your application at a configurable interval (15 seconds is a common setting) with an HTTP GET request and "scrapes" that text.
Prometheus stores that snapshot in its time-series database.
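
The three steps above can be sketched with nothing but the Python standard library. This is a minimal illustration, not how a production app or Prometheus itself is implemented: the metric name demo_requests_served_total, the REQUESTS_SERVED dictionary, and the helper names start_app and scrape are all assumptions made up for the example. A real application would normally use an official client library.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process metric; a real app would use a client library.
REQUESTS_SERVED = {"count": 0}


class MetricsHandler(BaseHTTPRequestHandler):
    """Step 1: the application passively serves its metrics as plain text."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# TYPE demo_requests_served_total counter\n"
            f"demo_requests_served_total {REQUESTS_SERVED['count']}\n"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def start_app(port=0):
    """Start the demo app on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(host, port, path="/metrics", timeout=2.0):
    """Step 2: the monitoring side issues one HTTP GET and reads the text."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read().decode("utf-8")
    finally:
        conn.close()
```

Calling start_app() and then scrape("127.0.0.1", server.server_address[1]) returns the status code and the raw text. Step 3, storing each snapshot with a timestamp, is what the Prometheus server adds on top.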

Why Pull?

Simplicity: The application doesn't need to know where Prometheus is, or handle retries if Prometheus goes down. It just passively hosts a text page.
Failure detection: If Prometheus tries to scrape your app and the HTTP request fails, it records the target as down (the built-in up metric becomes 0), so an unreachable app is noticed within one scrape interval.
High availability: You can run multiple Prometheus servers scraping the same endpoints, so losing one server does not leave you without metrics.
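
To make the failure-detection point concrete, here is a small sketch of a scraper that turns "the request failed" into a health value of 0 and "the request succeeded" into 1, which mirrors the up series Prometheus records for each target. The function names http_fetch and scrape_target and the dictionary shape are assumptions for illustration only.

```python
import http.client


def http_fetch(host, port, path="/metrics", timeout=2.0):
    """Fetch the exposition text from one target with a plain HTTP GET."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        if resp.status != 200:
            raise http.client.HTTPException(f"unexpected status {resp.status}")
        return resp.read().decode("utf-8")
    finally:
        conn.close()


def scrape_target(fetch):
    """Run one scrape and report target health as 1 (up) or 0 (down).

    `fetch` is any zero-argument callable that returns the scraped text or
    raises; passing it in keeps the sketch testable without a network.
    """
    try:
        body = fetch()
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, bad status, and so on.
        return {"up": 0, "body": ""}
    return {"up": 1, "body": body}
```

A caller would wrap a real target like this: scrape_target(lambda: http_fetch("127.0.0.1", 8000)). Note that the application does nothing special to report its own failure; the absence of a successful response is the signal.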

The `/metrics` Endpoint

If you visit a /metrics page exposed by a Prometheus client library, the raw, human-readable text looks like this:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3
 
# HELP node_memory_Active_bytes Memory information field Active_bytes.
# TYPE node_memory_Active_bytes gauge
node_memory_Active_bytes 4.238917632e+09
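
Because the format is line-oriented text, it is easy to read by machine. The sketch below parses the sample above into (name, labels, value) tuples. It is deliberately simplified and is not the official parser: it assumes no optional timestamps after the value and does not unescape special characters inside label values. The names SAMPLE_TEXT and parse_exposition are made up for this example.

```python
import re

SAMPLE_TEXT = """\
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3

# HELP node_memory_Active_bytes Memory information field Active_bytes.
# TYPE node_memory_Active_bytes gauge
node_memory_Active_bytes 4.238917632e+09
"""

# metric_name, optional {label block}, whitespace, value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# one label="value" pair inside the braces
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"unparseable sample line: {line!r}")
        name, label_blob, value = match.groups()
        labels = dict(LABEL_RE.findall(label_blob or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Each distinct combination of metric name and label values becomes its own time series, so the sample above describes three series, not two.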

Counters vs Gauges

Prometheus defines four metric types: counter, gauge, histogram, and summary. The two most common are:

Counters: A value that only ever goes UP (or resets to zero when the process restarts).
- Examples: http_requests_total, errors_total, bytes_sent_total.
Gauges: A value that can arbitrarily go up AND down.
- Examples: cpu_usage_percentage, active_memory_bytes, current_queue_size.
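
The difference is easiest to see as two tiny classes. This is an illustrative sketch, not the API of any particular client library; the class and method names are assumptions chosen for the example.

```python
class Counter:
    """A value that can only increase (it restarts at zero with the process)."""

    def __init__(self):
        self._value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self._value += amount

    @property
    def value(self):
        return self._value


class Gauge:
    """A value that can be set, raised, or lowered freely."""

    def __init__(self):
        self._value = 0.0

    def set(self, value):
        self._value = float(value)

    def inc(self, amount=1.0):
        self._value += amount

    def dec(self, amount=1.0):
        self._value -= amount

    @property
    def value(self):
        return self._value
```

A request handler would call inc() on a counter once per request, while a queue would call set() or inc()/dec() on a gauge as items arrive and leave.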

Exporters

What if you want to monitor a software system you didn't write, like a Linux server, a MySQL database, or an Nginx web server? You can't alter their source code to add a /metrics endpoint.

To solve this, the community built Exporters. An exporter is a tiny sidecar application that pulls metrics out of an existing system (using native APIs or reading files) and translates them into a Prometheus /metrics endpoint.

Node Exporter: Runs on Linux servers. Reads /proc and exposes CPU, memory, and disk usage as Prometheus metrics.
MySQL Exporter: Connects to MySQL, runs status queries such as SHOW GLOBAL STATUS;, and exposes the data as Prometheus metrics.
Blackbox Exporter: Probes endpoints from the outside (over HTTP, HTTPS, DNS, TCP, or ICMP), for example checking that a URL returns HTTP 200, and exports the result and latency.
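
The translation step an exporter performs can be sketched in a few lines. The example below converts text in the style of Linux's /proc/meminfo into the Prometheus text format. It is only an illustration of the idea, not the Node Exporter's actual code: the demo_memory prefix, the function name, and the sample input are assumptions, and a real exporter would read the live file and serve the result over HTTP.

```python
import re

# Hypothetical excerpt in the style of /proc/meminfo.
SAMPLE_MEMINFO = """\
MemTotal:           2048 kB
MemFree:            1024 kB
Active(anon):        512 kB
HugePages_Total:       0
"""


def meminfo_to_exposition(meminfo_text, prefix="demo_memory"):
    """Translate 'Key: value [kB]' lines into Prometheus gauge samples."""
    lines = []
    for raw in meminfo_text.splitlines():
        key, sep, rest = raw.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue  # not a 'Key: value' line
        value = int(fields[0])
        # Metric names may not contain characters such as parentheses.
        name = re.sub(r"[^a-zA-Z0-9_]", "_", key.strip()).strip("_")
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024  # expose base units (bytes) rather than kB
            name += "_bytes"
        metric = f"{prefix}_{name}"
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"
```

The monitored system is untouched; the exporter only reads what the system already publishes and re-presents it in a form Prometheus can scrape.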

PromQL (Prometheus Query Language)

Storing data is useless if you can't query it. Prometheus provides a powerful, functional query language called PromQL.

PromQL is not SQL. It is designed to evaluate multi-dimensional time series math.

Basic Selection

Select the most recent value of every time series with a given metric name:

http_requests_total

Label Filtering

Filter a metric exactly by its labels (key-value tags attached to the metric):

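http_requests_total{method="get", code="500"}

(This returns the current counter value of every series for GET requests that ended in an HTTP 500. Label values are matched exactly and case-sensitively, so "get" here follows the lowercase values the application exposes, as in the /metrics sample above.)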
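
Conceptually, an equality matcher is a filter over label sets. The sketch below models that behavior in plain Python; it illustrates the semantics only and says nothing about how Prometheus indexes labels internally. The SERIES data and the select helper are invented for the example.

```python
# Hypothetical latest samples: (labels, value). The metric name is stored
# under the reserved label __name__, as Prometheus does.
SERIES = [
    ({"__name__": "http_requests_total", "method": "get", "code": "200"}, 5120.0),
    ({"__name__": "http_requests_total", "method": "get", "code": "500"}, 12.0),
    ({"__name__": "http_requests_total", "method": "post", "code": "500"}, 3.0),
]


def select(series, name, **matchers):
    """Keep the series whose labels equal every requested value exactly."""
    wanted = dict(matchers, __name__=name)
    return [
        (labels, value)
        for labels, value in series
        # A missing label is treated like an empty string.
        if all(labels.get(key, "") == val for key, val in wanted.items())
    ]
```

With this data, select(SERIES, "http_requests_total", method="get", code="500") keeps only the second series, while method="GET" matches nothing because the comparison is case-sensitive.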

The `rate()` Function (The Most Important Function)

Because http_requests_total is a Counter (it only goes up), querying its raw number (e.g., 5,432,192) isn't helpful for a dashboard. You want to know "How fast are requests coming in right now?"

The rate() function calculates the per-second average increase of a time series over a given time window:

rate(http_requests_total{code="200"}[5m])

(This looks at the counter's growth over the last 5 minutes and returns the average Requests Per Second (RPS) across that window.)
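
The arithmetic behind rate() can be approximated as "total increase divided by elapsed time, treating any drop in the counter as a restart". The sketch below implements that simplified idea. It is not Prometheus's exact algorithm: the real function also extrapolates toward the edges of the window, so its results can differ slightly. The function name and sample data are assumptions.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    `samples` is a time-ordered list of (timestamp_seconds, counter_value)
    pairs that fall inside the query window.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two points
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset: the process restarted and began again at zero,
            # so everything counted since the restart is new growth.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed
```

For example, a counter that grows from 100 to 400 over 300 seconds gives 1.0 request per second, regardless of how large the raw counter value is.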

Math operations

Calculate the error percentage (the rate of HTTP 500 responses divided by the rate of all requests, multiplied by 100):

(
  sum(rate(http_requests_total{code="500"}[5m]))
  / 
  sum(rate(http_requests_total[5m]))
) * 100
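
The same calculation, written out in Python over already-computed per-second rates, may make the steps easier to follow. The input shape and the function name are assumptions for illustration.

```python
def error_percentage(rates):
    """Percentage of requests that returned HTTP 500.

    `rates` is a list of (labels, per_second_rate) pairs for one metric.
    """
    errors = sum(rate for labels, rate in rates if labels.get("code") == "500")
    total = sum(rate for _, rate in rates)
    if total == 0:
        return float("nan")  # no traffic, so the ratio is undefined
    return errors / total * 100


# Hypothetical rates: 49.5 successful and 0.5 failing requests per second.
EXAMPLE_RATES = [
    ({"code": "200"}, 49.5),
    ({"code": "500"}, 0.5),
]
```

Here the errors are 0.5 out of 50 requests per second, so the result is about 1 percent. Dividing rates rather than raw counter values matters: it describes the last 5 minutes instead of the whole lifetime of the process.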

Aggregation

If you are running 5 web servers, http_requests_total will return at least 5 separate time series, one per server for each combination of label values. You often want to sum them together into a single global number:

sum(rate(http_requests_total[5m])) by (method)

(This merges all servers together but breaks the total down by HTTP method, showing one line for GETs, one for POSTs, and so on.)
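
The grouping behavior of sum ... by (...) can be modeled as "drop every label except the ones listed, then add up whatever collides". The sketch below shows that idea in Python. The instance label, the sample rates, and the sum_by helper are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical per-series request rates from three servers.
RATES = [
    ({"instance": "web-1", "method": "get"}, 10.0),
    ({"instance": "web-1", "method": "post"}, 2.0),
    ({"instance": "web-2", "method": "get"}, 12.0),
    ({"instance": "web-2", "method": "post"}, 1.5),
    ({"instance": "web-3", "method": "get"}, 8.0),
]


def sum_by(series, by):
    """Sum values across series, keeping only the labels named in `by`."""
    totals = defaultdict(float)
    for labels, value in series:
        # The group key is the kept labels; all other labels are discarded.
        key = tuple((label, labels.get(label, "")) for label in by)
        totals[key] += value
    return dict(totals)
```

Calling sum_by(RATES, ["method"]) yields one total per HTTP method, and sum_by(RATES, []) collapses everything into a single global total, which corresponds to a plain sum(...) with no by clause.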

How Prometheus Fits in the Stack

Prometheus excels at scraping, storing, and evaluating metrics. However, its built-in graphing interface is extremely rudimentary, intended mostly for rapid debugging.

To visualize PromQL queries beautifully, Prometheus is almost universally paired with Grafana, which we will cover next.