The Distributed Logging Problem
Imagine your application runs on 10 separate servers. If a user encounters an error, which server processed their request?
If you have to SSH into server-1, run `grep "error" /var/log/app.log`, find nothing, exit, SSH into server-2, and repeat... you will go insane. This is the distributed logging problem, and it is a massive bottleneck for DevOps teams.
The solution is Centralized Logging. You must ship all logs from all 10 servers to a single, massive, searchable database.
For a decade, the undisputed king of centralized logging has been the ELK Stack.
What is ELK?
The ELK stack is a collection of three open-source products maintained by Elastic. They work together in a pipeline to parse, store, and visualize data.
1. (E) Elasticsearch: The Brain
Elasticsearch is a NoSQL, distributed search and analytics engine. When you throw a JSON log document at it, Elasticsearch indexes every single word and field, making it possible to search petabytes of log data in milliseconds. It is the storage backend.
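The data structure behind that speed is the inverted index: a map from each term to the set of documents containing it, so a search is a lookup rather than a scan of every log line. A toy Python sketch of the idea (illustrative only, not Elasticsearch's actual implementation):

```python
from collections import defaultdict

# Toy inverted index: term -> set of document IDs containing that term.
index = defaultdict(set)
docs = {
    1: "GET /api/users returned 404",
    2: "POST /api/login returned 500",
    3: "GET /api/users returned 200",
}

# "Indexing": tokenize each document and record which docs contain each term.
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# "Searching": a term lookup is now a dictionary hit, not a scan of every line.
print(sorted(index["404"]))          # doc IDs containing the term "404"
print(sorted(index["/api/users"]))   # doc IDs containing that path
```

Real Elasticsearch adds analyzers, scoring, and sharding on top, but the lookup-instead-of-scan principle is the same.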
2. (L) Logstash: The Pipeline
A raw Linux syslog looks very different from an Apache web access log, which looks very different from a custom Python application error log.
Logstash receives raw logs from your servers, structures them (turning raw text into neat JSON fields like client_ip or status_code), and then ships them into Elasticsearch.
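A minimal Logstash pipeline for that flow might look like the following sketch (the port, hosts, and index name are placeholders; `COMBINEDAPACHELOG` is one of Logstash's built-in Grok patterns):

```conf
input {
  beats {
    port => 5044            # receive events shipped by Filebeat
  }
}

filter {
  grok {
    # Parse standard Apache access logs into named JSON fields.
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]   # placeholder Elasticsearch endpoint
    index => "weblogs-%{+YYYY.MM.dd}"    # one index per day
  }
}
```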
3. (K) Kibana: The Interface
Kibana is the web UI that connects to Elasticsearch. It allows you to search your logs flexibly, create pie charts of the most common error codes, or build dashboards showing traffic over time.
Enter Beats (The "L" becomes complicated)
Originally, you had to install Logstash on every single application server. The problem? Logstash runs on the JVM and was notoriously heavy, sometimes consuming more CPU than the actual web application it was monitoring!
Elastic recognized this and introduced Beats.
Beats are lightweight, single-purpose agents written in Go. You install Beats on your application servers, where they consume very little memory or CPU. They do little to no processing; they simply read the log files and ship them directly to Elasticsearch (or to an intermediate Logstash server for heavy parsing).
- Filebeat: Reads text-based log files natively.
- Metricbeat: Collects system metrics (CPU, RAM).
- Packetbeat: Sniffs network traffic.
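A minimal filebeat.yml for this setup might look like the following sketch (paths and hosts are placeholders):

```yaml
# filebeat.yml -- minimal sketch; paths and hosts are placeholders.
filebeat.inputs:
  - type: filestream          # tail log files on disk
    paths:
      - /var/log/app/*.log

# Ship either directly to Elasticsearch...
# output.elasticsearch:
#   hosts: ["http://localhost:9200"]

# ...or to Logstash for heavy parsing (enable exactly one output).
output.logstash:
  hosts: ["logstash.internal:5044"]
```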
Consequently, the stack is sometimes referred to as the Elastic Stack rather than ELK.
```mermaid
graph LR
    A["App Server 1<br/>(Filebeat)"] --> B["Logstash<br/>Filtering"]
    C["App Server 2<br/>(Filebeat)"] --> B
    B --> D[(Elasticsearch)]
    E[Kibana UI] --> D
```

Log Parsing: Turning Text into Data
Centralized logging is fundamentally useless if you simply dump raw, unstructured text strings into Elasticsearch.
Consider this standard Apache access log:
```
192.168.1.55 - - [14/Oct/2026:14:32:01 -0700] "GET /api/users HTTP/1.1" 404 1234
```

If you search Kibana for "404", it will pull up this log. But what if you want to generate a pie chart showing which IP addresses generate the most 404 errors? You can't, because Elasticsearch doesn't know that 192.168.1.55 is an IP address; it just sees a giant string of text.
Grok Filters
Logstash uses a technology called Grok to parse text. Grok maps Regex patterns to assign labels to text strings.
A basic Grok filter for the Apache log looks like this:
```
%{IPORHOST:client_ip} - - \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:http_version}" %{NUMBER:response_code} %{NUMBER:bytes}
```

Once processed by this filter, the raw text is transformed into a beautifully structured JSON document inside Elasticsearch:
```json
{
  "client_ip": "192.168.1.55",
  "timestamp": "14/Oct/2026:14:32:01 -0700",
  "method": "GET",
  "request": "/api/users",
  "response_code": 404,
  "bytes": 1234
}
```

Now, Kibana can easily aggregate the response_code field and build your pie chart!
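Under the hood, Grok patterns compile down to named regular expressions. A rough Python equivalent of the filter above, using named capture groups (a sketch of the idea, not Logstash's actual engine):

```python
import re

# Named-group regex roughly mirroring the Grok pattern above.
APACHE_LOG = re.compile(
    r'(?P<client_ip>\S+) - - \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<request>\S+) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<response_code>\d+) (?P<bytes>\d+)'
)

line = '192.168.1.55 - - [14/Oct/2026:14:32:01 -0700] "GET /api/users HTTP/1.1" 404 1234'
doc = APACHE_LOG.match(line).groupdict()

# Numeric fields arrive as strings; convert them so they can be aggregated.
doc["response_code"] = int(doc["response_code"])
doc["bytes"] = int(doc["bytes"])
print(doc)
```

Grok's value is that you write `%{IPORHOST:client_ip}` instead of maintaining raw regexes like this by hand.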
Kibana Query Language (KQL)
Once your data is neatly structured in Elasticsearch, Kibana makes searching for incidents incredibly fast.
Instead of writing complex Regex via Linux grep, you use KQL in the Kibana search bar:
Find all 500 errors:
```
response_code >= 500
```

Find all POST requests to the login endpoint that took longer than 2 seconds:

```
method: POST and request: "/api/login" and duration_ms > 2000
```

Find all errors generated specifically by the user "jdoe" on the staging environment:

```
environment: "staging" and user_id: "jdoe" and level: "ERROR"
```

The Weakness of ELK
The ELK stack is incredibly powerful, but it has one massive downside: Cost and Complexity.
Elasticsearch is famously resource-hungry. Indexing every single word of 5 terabytes of logs generated daily by a large microservice architecture requires a massive cluster of high-CPU, high-RAM servers spanning multiple Availability Zones.
This massive infrastructure footprint birthed an alternative, much cheaper logging solution built strictly for Cloud-Native environments: Grafana Loki. We will explore Loki next.