Service Discovery in Distributed Systems
Part of the Networking Basics tutorial series.
In distributed systems, services are dynamic — they start up, scale, and fail. Service discovery is how services find each other without hardcoded IP addresses.
The Problem
Monolithic Application
Single server: 192.168.1.100
All code runs in one process
Clients connect to that one IP
Simple!
Microservices Architecture
Service A: 10.0.1.50 (might restart at 10.0.1.51 tomorrow)
Service B: 10.0.1.51 (might scale to 10.0.1.52, 10.0.1.53)
Service C: 10.0.1.52 (might disappear entirely)
Question: How does Service A find Service B?
Answer: Service Discovery!
Service Discovery Approaches
1. Client-Side Discovery
The client finds services itself:
┌──────────────┐
│ Service A │
│ "I need B" │
└────────┬─────┘
│ Query
↓
┌─────────┐
│ Service │ "B is at 10.0.1.51:8080"
│Registry │ "Also at 10.0.1.52:8080"
└─────────┘
↑
│ Service B registers
│ Service B registers replica
Flow:
- Service B registers with registry: "I'm at 10.0.1.51:8080"
- Service B replica registers: "I'm also at 10.0.1.52:8080"
- Service A queries registry: "Where is B?"
- Registry replies: "Try 10.0.1.51 or 10.0.1.52"
- Service A connects directly
Examples: Consul, etcd, Eureka
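The flow above can be sketched in Go. The Consul endpoint named in the comment is real, but the addresses and the simple round-robin policy are illustrative assumptions:

```go
package main

import "fmt"

// pickInstance rotates through the instance list the client received from
// the registry (with Consul, GET /v1/health/service/<name>?passing returns
// only the healthy instances of a service).
func pickInstance(instances []string, attempt int) string {
	return instances[attempt%len(instances)]
}

func main() {
	// Addresses as returned by the registry in the flow above.
	instances := []string{"10.0.1.51:8080", "10.0.1.52:8080"}
	for attempt := 0; attempt < 3; attempt++ {
		fmt.Println("connecting to", pickInstance(instances, attempt))
	}
}
```

The key point of client-side discovery is visible here: the client holds the full instance list and makes the load-balancing decision itself.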
2. Server-Side Discovery
A server-side load balancer finds services on the client's behalf:
┌──────────────┐ ┌──────────────┐
│ Service A │ │ Load Balancer│
│ "I need B" │────request───> │ (queries │
│ │<─response──── │ registry) │
└──────────────┘ └──────────────┘
↑
│ Regular queries
┌────▼─────┐
│ Service │
│ Registry │
└──────────┘
Flow:
- Service A sends request to load balancer
- Load balancer queries registry for Service B
- Load balancer picks healthy instance
- Load balancer forwards request
- Response returns to Service A
Examples: AWS ELB, Kubernetes Service, HAProxy with dynamic config
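A minimal sketch of this pattern in Go, using the standard library's reverse proxy. The backend list is hardcoded here as an assumption; a real load balancer would keep it in sync with the registry:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// In a real load balancer this list is kept in sync with the registry;
// it is hardcoded here for illustration.
var backends = []string{"http://10.0.1.51:8080", "http://10.0.1.52:8080"}
var next uint64

// nextBackend rotates through the registered instances.
func nextBackend() *url.URL {
	n := atomic.AddUint64(&next, 1)
	u, _ := url.Parse(backends[int(n-1)%len(backends)])
	return u
}

func main() {
	// Each incoming request is rewritten to the chosen backend, so
	// Service A only ever needs the load balancer's address.
	proxy := &httputil.ReverseProxy{Director: func(r *http.Request) {
		target := nextBackend()
		r.URL.Scheme = target.Scheme
		r.URL.Host = target.Host
	}}
	_ = proxy // http.ListenAndServe(":8080", proxy) would serve it
	fmt.Println("first pick:", nextBackend().Host)
}
```

The design difference from client-side discovery: the rotation logic lives in the proxy, so clients stay simple.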
Service Registry
Central database of all services:
Service Registry Contents:
├── Service A
│ ├── Instance 1: 10.0.1.50:8000 (healthy)
│ ├── Instance 2: 10.0.1.51:8000 (healthy)
│ └── Instance 3: 10.0.1.52:8000 (unhealthy)
├── Service B
│ ├── Instance 1: 10.0.1.100:3000 (healthy)
│ └── Instance 2: 10.0.1.101:3000 (healthy)
└── Service C
└── Instance 1: 10.0.1.150:9000 (healthy)
Registry Data:
- Service name
- Instance ID
- Host
- Port
- Health status
- Metadata (version, region, etc.)
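The registry data above maps naturally onto a small data structure. This is an illustrative sketch, not any particular registry's schema:

```go
package main

import "fmt"

// Instance mirrors the registry fields listed above.
type Instance struct {
	Service  string
	ID       string
	Host     string
	Port     int
	Healthy  bool
	Metadata map[string]string
}

// healthyInstances filters out instances that failed their health checks,
// which is what a query like Consul's "?passing" filter does server-side.
func healthyInstances(all []Instance) []Instance {
	var out []Instance
	for _, in := range all {
		if in.Healthy {
			out = append(out, in)
		}
	}
	return out
}

func main() {
	reg := []Instance{
		{Service: "A", ID: "a-1", Host: "10.0.1.50", Port: 8000, Healthy: true},
		{Service: "A", ID: "a-3", Host: "10.0.1.52", Port: 8000, Healthy: false},
	}
	fmt.Println(len(healthyInstances(reg)), "healthy instance(s)")
}
```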
Service Registration
Automatic Registration
The platform (for example, Kubernetes) registers services automatically:
apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  selector:
    app: backend
  ports:
  - port: 8080
Kubernetes automatically:
- Creates service DNS name (backend.default.svc.cluster.local)
- Assigns virtual IP (for example, 10.96.0.1)
- Watches for backend pods
- Adds/removes pods from endpoints
- Updates kube-proxy routing rules
Manual Registration
The application registers itself on startup:
// Go example using the Consul API client (github.com/hashicorp/consul/api)
client, err := api.NewClient(api.DefaultConfig())
if err != nil {
    log.Fatal(err)
}
client.Agent().ServiceRegister(&api.AgentServiceRegistration{
    ID:      "backend-1",
    Name:    "backend",
    Port:    8080,
    Address: "10.0.1.50",
    Check: &api.AgentServiceCheck{
        HTTP:     "http://10.0.1.50:8080/health",
        Interval: "10s",
    },
})
The application deregisters on shutdown (or uses a TTL):
client.Agent().ServiceDeregister("backend-1")
Health Checks
Registry must know which services are healthy:
Health Check Types
1. HTTP Check
Service Registry
↓ (every 10 seconds)
HTTP GET http://10.0.1.50:8080/health
↓
Response: 200 OK? → Healthy
Response: 500? → Unhealthy
No response? → Unhealthy
2. TCP Check
Service Registry
↓
TCP connect to 10.0.1.50:8080
↓
Connection successful? → Assumed healthy
Connection refused? → Unhealthy
3. Script Check
Service Registry runs custom check script
Script exits 0? → Healthy
Script exits non-zero? → Unhealthy
4. TTL Check (Time To Live)
Service registers: "I'm healthy, TTL: 30 seconds"
Service must renew health every 30 seconds
Fail to renew? → Marked unhealthy
(Used when service explicitly heartbeats)
DNS-Based Service Discovery
Traditional DNS
Query: web.example.com
DNS Server returns: 203.0.113.50
Client connects to: 203.0.113.50
Problem: Single IP, no load balancing
Modern DNS Load Balancing
Query: backend.default.svc.cluster.local
DNS Server returns:
- 10.244.1.50
- 10.244.1.51
- 10.244.1.52
Client picks one (round-robin in resolver)
Example: Kubernetes CoreDNS
# Query service
nslookup backend.default.svc.cluster.local
# Output:
# Name: backend.default.svc.cluster.local
# Address: 10.96.0.1 (virtual IP)
# Or get all endpoints
nslookup backend-endpoints.default.svc.cluster.local
# Returns all pod IPs
Kubernetes Service Discovery
ClusterIP Service (cluster-internal, the default)
apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  type: ClusterIP
  selector:
    app: backend
  ports:
  - port: 8080
    targetPort: 8080
DNS Name: backend.default.svc.cluster.local
IP: 10.96.0.1 (virtual, load balanced by kube-proxy)
NodePort Service (for external access)
spec:
  type: NodePort
  ports:
  - port: 8080
    nodePort: 30080
Access from outside: node-ip:30080
LoadBalancer Service (cloud provider load balancer)
spec:
  type: LoadBalancer
The cloud provider assigns a public IP and routes traffic to the pods.
Service Mesh Discovery
Advanced service discovery in Istio/Linkerd:
┌─────────────┐ ┌──────────────┐
│ Service A │ │ Envoy Proxy │ ← Sidecar
│ │◄───────►│ │
└─────────────┘ └────────┬─────┘
│
Queries Service Mesh
│
↓
┌──────────────────┐
│ Service Registry │
│ (Workload │
│ discovery, │
│ health checks, │
│ load balancing) │
└──────────────────┘
Service mesh handles:
- Load balancing algorithms
- Retries and timeouts
- Circuit breaking
- Advanced routing rules
Implementation Patterns
Pattern 1: Database-Backed Registry (Consul)
┌─────────────────┐
│ Service Instance│ ← Registers
└────────┬────────┘
│
↓
┌────────────────┐
│ Consul Server │ ← HTTP API
└────────┬───────┘
│
       ┌───────┴────────┐
  Health checks      Watch for changes
  (mark instances    (notify clients of
  healthy/unhealthy)  added/removed instances)
Pattern 2: DNS-Based (traditional)
┌─────────────────┐
│ Service Instance│ ← Updates DNS recordsets
└────────┬────────┘
│
↓
┌────────────────┐
│ DNS Server │
└────────┬───────┘
│
↓ Clients query for name resolution
Pattern 3: Platform-Managed (Kubernetes)
┌─────────────────────┐
│ Kubernetes Service │ ← Watches pod endpoints
└────────┬────────────┘
│
├─ DNS (CoreDNS)
├─ Iptables/IPVS rules
└─ Envoy (in service mesh)
Handling Failures
Service Instance Fails
1. Health check fails
2. Registry marks as unhealthy
3. No new traffic sent
4. Existing connections may still use old instance
5. Client receives error, tries next instance
Result: Graceful degradation
Registry Itself Fails
1. Services can't register/deregister
2. Clients using cached entries still work
3. New service discovery breaks
4. Must have backup registry or failover
Result: System continues temporarily
Network Partition
Service in region A ↔ Partition ↔ Service in region B
Options:
- Clients notice failed connections, fail over
- Keep both instances running, accept inconsistency
- Use geo-aware routing
Best Practices
✓ Always use service names, never hardcoded IPs
✓ Implement health checks
✓ Cache registry lookups client-side (with a TTL)
✓ Handle registry unavailability gracefully
✓ Test failover scenarios
✓ Monitor service discovery health
✓ Use DNS where possible (it is the standard)
✓ Implement retry logic with exponential backoff
✓ Document service dependencies
Key Concepts
- Service Registry = database of available services
- Health Checks = verify service is working
- Client-side discovery = client queries registry
- Server-side discovery = load balancer queries registry
- DNS = standard service discovery mechanism
- Kubernetes Services = abstract pod endpoints
- Load balancing = distribute traffic across instances
- Service mesh = advanced discovery + routing
- Automated registration better than manual
- Always handle registry failures