Service Discovery in Distributed Systems
Part of the Networking Basics tutorial series.
In distributed systems, services are dynamic — they start up, scale, and fail. Service discovery is how services find each other without hardcoded IP addresses.
The Problem
Monolithic Application
Single server: 192.168.1.100
All code runs in one process
Clients connect to that one IP
Simple!
Microservices Architecture
Service A: 10.0.1.50 (might restart at 10.0.1.51 tomorrow)
Service B: 10.0.1.51 (might scale to 10.0.1.52, 10.0.1.53)
Service C: 10.0.1.52 (might disappear entirely)
Question: How does Service A find Service B?
Answer: Service Discovery!
Service Discovery Approaches
1. Client-Side Discovery
The client finds services itself:
┌──────────────┐
│ Service A │
│ "I need B" │
└────────┬─────┘
│ Query
↓
┌─────────┐
│ Service │ "B is at 10.0.1.51:8080"
│Registry │ "Also at 10.0.1.52:8080"
└─────────┘
↑
│ Service B registers
│ Service B registers replica
Flow:
- Service B registers with registry: "I'm at 10.0.1.51:8080"
- Service B replica registers: "I'm also at 10.0.1.52:8080"
- Service A queries registry: "Where is B?"
- Registry replies: "Try 10.0.1.51 or 10.0.1.52"
- Service A connects directly
Examples: Consul, etcd, Eureka
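The flow above can be sketched in Go. The Consul endpoint named in the comment is real, but the addresses and the simple round-robin policy are illustrative assumptions:

```go
package main

import "fmt"

// pickInstance rotates through the instance list the client received from
// the registry (with Consul, GET /v1/health/service/<name>?passing returns
// only the healthy instances of a service).
func pickInstance(instances []string, attempt int) string {
	return instances[attempt%len(instances)]
}

func main() {
	// Addresses as returned by the registry in the flow above.
	instances := []string{"10.0.1.51:8080", "10.0.1.52:8080"}
	for attempt := 0; attempt < 3; attempt++ {
		fmt.Println("connecting to", pickInstance(instances, attempt))
	}
}
```

The key point of client-side discovery is visible here: the client holds the full instance list and makes the load-balancing decision itself.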
2. Server-Side Discovery
A server-side load balancer finds services on the client's behalf:
┌──────────────┐ ┌──────────────┐
│ Service A │ │ Load Balancer│
│ "I need B" │────request───> │ (queries │
│ │<─response──── │ registry) │
└──────────────┘ └──────────────┘
↑
│ Regular queries
┌────▼─────┐
│ Service │
│ Registry │
└──────────┘
Flow:
- Service A sends request to load balancer
- Load balancer queries registry for Service B
- Load balancer picks healthy instance
- Load balancer forwards request
- Response returns to Service A
Examples: AWS ELB, Kubernetes Service, HAProxy with dynamic config
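A minimal sketch of this pattern in Go, using the standard library's reverse proxy. The backend list is hardcoded here as an assumption; a real load balancer would keep it in sync with the registry:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// In a real load balancer this list is kept in sync with the registry;
// it is hardcoded here for illustration.
var backends = []string{"http://10.0.1.51:8080", "http://10.0.1.52:8080"}
var next uint64

// nextBackend rotates through the registered instances.
func nextBackend() *url.URL {
	n := atomic.AddUint64(&next, 1)
	u, _ := url.Parse(backends[int(n-1)%len(backends)])
	return u
}

func main() {
	// Each incoming request is rewritten to the chosen backend, so
	// Service A only ever needs the load balancer's address.
	proxy := &httputil.ReverseProxy{Director: func(r *http.Request) {
		target := nextBackend()
		r.URL.Scheme = target.Scheme
		r.URL.Host = target.Host
	}}
	_ = proxy // http.ListenAndServe(":8080", proxy) would serve it
	fmt.Println("first pick:", nextBackend().Host)
}
```

The design difference from client-side discovery: the rotation logic lives in the proxy, so clients stay simple.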
Service Registry
Central database of all services:
Service Registry Contents:
├── Service A
│ ├── Instance 1: 10.0.1.50:8000 (healthy)
│ ├── Instance 2: 10.0.1.51:8000 (healthy)
│ └── Instance 3: 10.0.1.52:8000 (unhealthy)
├── Service B
│ ├── Instance 1: 10.0.1.100:3000 (healthy)
│ └── Instance 2: 10.0.1.101:3000 (healthy)
└── Service C
└── Instance 1: 10.0.1.150:9000 (healthy)
Registry Data:
- Service name
- Instance ID
- Host
- Port
- Health status
- Metadata (version, region, etc.)
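The registry data above maps naturally onto a small data structure. This is an illustrative sketch, not any particular registry's schema:

```go
package main

import "fmt"

// Instance mirrors the registry fields listed above.
type Instance struct {
	Service  string
	ID       string
	Host     string
	Port     int
	Healthy  bool
	Metadata map[string]string
}

// healthyInstances filters out instances that failed their health checks,
// which is what a query like Consul's "?passing" filter does server-side.
func healthyInstances(all []Instance) []Instance {
	var out []Instance
	for _, in := range all {
		if in.Healthy {
			out = append(out, in)
		}
	}
	return out
}

func main() {
	reg := []Instance{
		{Service: "A", ID: "a-1", Host: "10.0.1.50", Port: 8000, Healthy: true},
		{Service: "A", ID: "a-3", Host: "10.0.1.52", Port: 8000, Healthy: false},
	}
	fmt.Println(len(healthyInstances(reg)), "healthy instance(s)")
}
```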
Service Registration
Automatic Registration
The platform (for example, Kubernetes) registers services automatically:
apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  selector:
    app: backend
  ports:
  - port: 8080
Kubernetes automatically:
- Creates service DNS name (backend.default.svc.cluster.local)
- Assigns virtual IP (for example, 10.96.0.1)
- Watches for backend pods
- Adds/removes pods from endpoints
- Updates kube-proxy routing rules
Manual Registration
The application registers itself on startup:
// Go example using the Consul API client (github.com/hashicorp/consul/api)
client, err := api.NewClient(api.DefaultConfig())
if err != nil {
    log.Fatal(err)
}
client.Agent().ServiceRegister(&api.AgentServiceRegistration{
    ID:      "backend-1",
    Name:    "backend",
    Port:    8080,
    Address: "10.0.1.50",
    Check: &api.AgentServiceCheck{
        HTTP:     "http://10.0.1.50:8080/health",
        Interval: "10s",
    },
})
The application deregisters on shutdown (or uses a TTL):
client.Agent().ServiceDeregister("backend-1")
Health Checks
Registry must know which services are healthy:
Health Check Types
1. HTTP Check
Service Registry
↓ (every 10 seconds)
HTTP GET http://10.0.1.50:8080/health
↓
Response: 200 OK? → Healthy
Response: 500? → Unhealthy
No response? → Unhealthy
2. TCP Check
Service Registry
↓
TCP connect to 10.0.1.50:8080
↓
Connection successful? → Assumed healthy
Connection refused? → Unhealthy
3. Script Check
Service Registry runs custom check script
Script exits 0? → Healthy
Script exits non-zero? → Unhealthy
4. TTL Check (Time To Live)
Service registers: "I'm healthy, TTL: 30 seconds"
Service must renew health every 30 seconds
Fail to renew? → Marked unhealthy
(Used when service explicitly heartbeats)
DNS-Based Service Discovery
Traditional DNS
Query: web.example.com
DNS Server returns: 203.0.113.50
Client connects to: 203.0.113.50
Problem: Single IP, no load balancing
Modern DNS Load Balancing
Query: backend.default.svc.cluster.local
DNS Server returns:
- 10.244.1.50
- 10.244.1.51
- 10.244.1.52
Client picks one (round-robin in resolver)
Example: Kubernetes CoreDNS
# Query service
nslookup backend.default.svc.cluster.local
# Output:
# Name: backend.default.svc.cluster.local
# Address: 10.96.0.1 (virtual IP)
# Or get all endpoints
nslookup backend-endpoints.default.svc.cluster.local
# Returns all pod IPs
Kubernetes Service Discovery
ClusterIP Service (cluster-internal, the default)
apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  type: ClusterIP
  selector:
    app: backend
  ports:
  - port: 8080
    targetPort: 8080
DNS Name: backend.default.svc.cluster.local
IP: 10.96.0.1 (virtual, load balanced by kube-proxy)
NodePort Service (for external access)
spec:
  type: NodePort
  ports:
  - port: 8080
    nodePort: 30080
Access from outside: node-ip:30080
LoadBalancer Service (cloud provider load balancer)
spec:
  type: LoadBalancer
The cloud provider assigns a public IP and routes traffic to the pods.
Service Mesh Discovery
Advanced service discovery in Istio/Linkerd:
┌─────────────┐ ┌──────────────┐
│ Service A │ │ Envoy Proxy │ ← Sidecar
│ │◄───────►│ │
└─────────────┘ └────────┬─────┘
│
Queries Service Mesh
│
↓
┌──────────────────┐
│ Service Registry │
│ (Workload │
│ discovery, │
│ health checks, │
│ load balancing) │
└──────────────────┘
Service mesh handles:
- Load balancing algorithms
- Retries and timeouts
- Circuit breaking
- Advanced routing rules
Implementation Patterns
Pattern 1: Database-Backed Registry (Consul)
┌─────────────────┐
│ Service Instance│ ← Registers
└────────┬────────┘
│
↓
┌────────────────┐
│ Consul Server │ ← HTTP API
└────────┬───────┘
│
       ┌───────┴────────┐
  Health checks      Watch for changes
  (mark instances    (notify clients of
  healthy/unhealthy)  added/removed instances)
Pattern 2: DNS-Based (traditional)
┌─────────────────┐
│ Service Instance│ ← Updates DNS recordsets
└────────┬────────┘
│
↓
┌────────────────┐
│ DNS Server │
└────────┬───────┘
│
↓ Clients query for name resolution
Pattern 3: Platform-Managed (Kubernetes)
┌─────────────────────┐
│ Kubernetes Service │ ← Watches pod endpoints
└────────┬────────────┘
│
├─ DNS (CoreDNS)
├─ Iptables/IPVS rules
└─ Envoy (in service mesh)
Handling Failures
Service Instance Fails
1. Health check fails
2. Registry marks as unhealthy
3. No new traffic sent
4. Existing connections may still use old instance
5. Client receives error, tries next instance
Result: Graceful degradation
Registry Itself Fails
1. Services can't register/deregister
2. Clients using cached entries still work
3. New service discovery breaks
4. Must have backup registry or failover
Result: System continues temporarily
Network Partition
Service in region A ↔ Partition ↔ Service in region B
Options:
- Clients notice failed connections, fail over
- Keep both instances running, accept inconsistency
- Use geo-aware routing
Best Practices
✓ Always use service names, never hardcoded IPs
✓ Implement health checks
✓ Cache registry lookups client-side (with a TTL)
✓ Handle registry unavailability gracefully
✓ Test failover scenarios
✓ Monitor service discovery health
✓ Use DNS where possible (it is the standard)
✓ Implement retry logic with exponential backoff
✓ Document service dependencies
Key Concepts
- Service Registry = database of available services
- Health Checks = verify service is working
- Client-side discovery = client queries registry
- Server-side discovery = load balancer queries registry
- DNS = standard service discovery mechanism
- Kubernetes Services = abstract pod endpoints
- Load balancing = distribute traffic across instances
- Service mesh = advanced discovery + routing
- Automated registration better than manual
- Always handle registry failures