Kubernetes is complex, and things will eventually break. Here is the standard workflow for diagnosing and fixing cluster issues.
1. The Pod is failing (CrashLoopBackOff / Error)
This is the most common issue.
Step 1: Check Pod Status
Action:
```shell
kubectl get pods
```
Result:
```
NAME        READY   STATUS             RESTARTS   AGE
web-app-1   0/1     CrashLoopBackOff   4          2m
```
Step 2: Check Events (The "Why")
Action:
```shell
kubectl describe pod web-app-1
```
Result: Look for the Events section at the bottom:
```
Events:
  Type     Reason   Age               From     Message
  ----     ------   ----              ----     -------
  Warning  Failed   30s (x3 over 1m)  kubelet  Error: ImagePullBackOff
```
(This tells you Kubernetes can't find your image or doesn't have permission to pull it.)
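For ImagePullBackOff, first check the image name and tag for typos. If the image lives in a private registry, the Pod spec also needs pull credentials. A minimal sketch (the secret name `regcred` and the image path are illustrative, not from your cluster):

```yaml
# Sketch: Pod pulling from a private registry.
# "regcred" is an assumed secret, created beforehand with e.g.:
#   kubectl create secret docker-registry regcred \
#     --docker-server=<registry> --docker-username=<user> --docker-password=<pass>
apiVersion: v1
kind: Pod
metadata:
  name: web-app-1
spec:
  imagePullSecrets:
    - name: regcred                                # must exist in the same namespace
  containers:
    - name: web-app
      image: registry.example.com/team/web-app:1.0 # hypothetical private image
```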
Step 3: Check Logs (Application Errors)
Action:
```shell
kubectl logs web-app-1
```
If the container has already restarted, add `--previous` to see the logs from the crashed instance. Result:
```
[FATAL] Database connection failed: 'db-service' not found.
```
(This tells you the issue is inside your code or configuration.)
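In this example the app resolves its database by the DNS name `db-service`, so a Service with exactly that name must exist in the Pod's namespace. A minimal sketch (the `app: db` selector and port 5432 are assumptions about how the database is deployed):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db-service    # must match the hostname the app uses
spec:
  selector:
    app: db           # assumed label on the database Pods
  ports:
    - port: 5432      # assumed database port
      targetPort: 5432
```

Verify it exists with `kubectl get svc db-service` from the same namespace as the failing Pod.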
2. Service is unreachable
Your Pod is running, but you can't reach it over the network.
Step 1: Verify Endpoints
Action:
```shell
kubectl get endpoints my-web-service
```
Result:
```
NAME             ENDPOINTS   AGE
my-web-service   <none>      5m
```
Interpretation: `<none>` means your Service selector doesn't match your Pod labels. Fix your YAML!
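The Service's `selector` must match the labels on the Pod template exactly, key for key. A minimal matching pair (the names, the `app: web` label, and the nginx image are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-web-service
spec:
  selector:
    app: web              # must match the Pod template labels below
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web          # what the Service selector matches
    spec:
      containers:
        - name: web
          image: nginx:1.25   # placeholder image
          ports:
            - containerPort: 80
```

Once the labels match, `kubectl get endpoints my-web-service` should list the Pod IPs instead of `<none>`.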
3. Node or Cluster Issues
Check Node Health
Action:
```shell
kubectl get nodes
```
Result:
```
NAME     STATUS     ROLES           AGE   VERSION
node-1   Ready      control-plane   10d   v1.26.1
node-2   NotReady   <none>          10d   v1.26.1
```
Interpretation: NotReady often means the node is out of disk space or memory, or the kubelet has crashed.
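To script a quick health check, filter out the NotReady rows. The snippet below runs against the sample output above; on a live cluster the same `awk` pipeline works on `kubectl get nodes --no-headers`:

```shell
# Print the name of every node that is not Ready.
# Live cluster equivalent:
#   kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1 }'
sample='node-1   Ready      control-plane   10d   v1.26.1
node-2   NotReady   <none>          10d   v1.26.1'

printf '%s\n' "$sample" | awk '$2 != "Ready" { print $1 }'
```

Then dig into the unhealthy node with `kubectl describe node node-2` and check its Conditions (e.g. DiskPressure, MemoryPressure) and recent events.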
Summary: Debugging Checklist
- kubectl get pods: Is it running?
- kubectl describe: What do the Events say?
- kubectl logs: What does the app code say?
- kubectl get events -A: Check global cluster errors.
- kubectl exec: Can you ping other services from inside?
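The first three checklist steps can be bundled into a small shell helper; a sketch under the assumption that `kubectl` is configured for the right cluster (`debug_pod` is a hypothetical function name, not a kubectl subcommand):

```shell
# Sketch: run the first three checklist steps for one pod.
debug_pod() {
  local pod="$1"
  kubectl get pod "$pod"                                # Is it running?
  kubectl describe pod "$pod" | sed -n '/^Events:/,$p'  # What do the Events say?
  kubectl logs "$pod" --tail=20                         # What does the app say?
}

# Usage: debug_pod web-app-1
```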