
Network Troubleshooting Tools

Part of the Networking Basics tutorial series.

Network problems impact everything. You need systematic debugging techniques to quickly identify root causes and resolve issues. This is where being methodical saves hours of frustration.

Troubleshooting Methodology

The OSI Model Approach:

Start at the bottom (Physical) and work up to the top (Application):

Layer 7: Application (is the service responding?)
         ↑ Check: telnet, curl, netcat
Layer 6: Presentation (correct format?)
Layer 5: Session (connection established?)
Layer 4: Transport (TCP/UDP working?)
         ↑ Check: netstat, ss, lsof
Layer 3: Network (IP routing working?)
         ↑ Check: ping, traceroute, ip route
Layer 2: Data Link (ARP, MAC addresses?)
         ↑ Check: arp, ip link
Layer 1: Physical (cables, up/down?)
         ↑ Check: ethtool, link status
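
The bottom-up order above can be sketched as a small helper that runs one check per layer and stops at the first failure. This is only an illustration: the function name `first_failure`, the `label|command` argument format, and the interface, gateway, and URL in the example are assumptions, not part of the original lesson.

```shell
# first_failure: run labelled checks in order and report the first one that fails.
# Each argument has the form "label|command"; the command is run with sh -c.
first_failure() {
  for entry in "$@"; do
    label=${entry%%|*}   # text before the first "|"
    cmd=${entry#*|}      # everything after the first "|" (may contain pipes)
    if ! sh -c "$cmd" >/dev/null 2>&1; then
      echo "$label"
      return 1
    fi
  done
  echo "all checks passed"
}

# Example, lowest layer first (eth0, 192.168.1.1 and the URL are placeholders):
# first_failure \
#   "L1/L2 link up|ip link show eth0 | grep -q 'state UP'" \
#   "L3 gateway reachable|ping -c 1 -W 1 192.168.1.1" \
#   "L7 HTTP responds|curl -fsS --max-time 5 http://example.com"
```

The label printed is the lowest layer whose check failed, which is where the investigation should start.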

Systematic Troubleshooting Framework

1. Define the Problem

  • ✓ Exactly what doesn't work:
    • HTTP requests timing out
    • DNS not resolving
    • High packet loss
    • Slow response times
  • ✗ Don't: "Network is broken"

2. Gather Information

  • ✓ When did it start?
  • ✓ What changed?
  • ✓ Who is affected (one user, everyone, one service)?
  • ✓ What does working look like?

3. Form Hypothesis

Given the problem + info:

  • My guess is: DNS server is restarting
  • or: Load balancer misconfigured
  • or: Firewall rule changed

4. Test Hypothesis

  • Is DNS working? → dig google.com
  • Is LB routing? → Check backend pool status
  • Is firewall up? → Check rules

5. Resolve

If the hypothesis is confirmed:

  • Fix the root cause
  • Document what changed
  • Implement monitoring

Essential Network Tools

ping — Test connectivity:

# Simple ping
ping google.com
 
# Count packets
ping -c 5 google.com
 
# Send one packet and wait at most 1 second for the reply (-W)
ping -W 1 -c 1 google.com

traceroute — Show route to destination:

# See all hops
traceroute google.com
 
# Numeric output (-n skips reverse DNS; hostnames are shown by default)
# with a 1-second wait per probe (-w)
traceroute -n -w 1 google.com
 
# ICMP traceroute
traceroute -I google.com
 
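# UDP traceroute to a fixed destination port (Linux; plain traceroute
# already uses UDP, but with a port that increases per probe)
traceroute -U google.com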

dig — DNS lookup:

# Simple lookup
dig google.com
 
# Query specific nameserver
dig @8.8.8.8 google.com
 
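# Request all record types (many servers now refuse or minimize ANY
# responses, see RFC 8482, so query specific types if this returns little)
dig google.com ANY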
 
# Short format (+short)
dig +short google.com
 
# Verbose (+trace shows delegation path)
dig +trace google.com

netstat — Connection statistics:

# All connections
netstat -a
 
# Listening ports
netstat -tln | head -20
# t=tcp, l=listening, n=numeric IPs
 
# Per-protocol statistics
netstat -s
 
# Process owning a listening port (root is needed to see other users' processes)
sudo netstat -tlnp | grep :8080

ss — Socket statistics (modern netstat):

# All listening sockets
ss -tln
 
# Established connections (plain "ss -tan" lists TCP sockets in every state)
ss -tn state established
 
# With process info
ss -tlnp
 
# Summary
ss -s

curl/wget — HTTP requests:

# Simple GET
curl http://example.com
 
# Show headers only
curl -I http://example.com
 
# Follow redirects
curl -L http://example.com
 
# Timeout
curl --connect-timeout 5 --max-time 10 http://example.com
 
# Verbose (show handshake, headers)
curl -v http://example.com
 
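# Send the request to a specific IP but with a chosen Host header
# (tests one backend or virtual host without touching DNS)
curl -H "Host: example.com" http://203.0.113.1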

nc (netcat) — Raw TCP/UDP:

# Test if port is open
nc -zv example.com 80
 
# Listen on port (traditional netcat needs: nc -l -p 8080)
nc -l 8080
 
# Send data
echo "hello" | nc example.com 9000
 
# UDP test
nc -u example.com 53

tcpdump — Packet capture:

# Capture all traffic
sudo tcpdump -i eth0
 
# Capture on port
sudo tcpdump -i eth0 port 80
 
# Capture to file
sudo tcpdump -i eth0 -w capture.pcap
 
# Read file
tcpdump -r capture.pcap
 
# Show MAC addresses
sudo tcpdump -i eth0 -e
 
# ASCII and hex
sudo tcpdump -i eth0 -X

Network Troubleshooting Scenarios

Scenario 1: "I can't connect to service"

Step 1: Is the service running?
ss -tlnp | grep 8080
  No  → Start the service
  Yes → Continue

Step 2: Can I reach it on localhost?
curl localhost:8080
  No  → Service crashed or hung, check logs
  Yes → Continue

Step 3: Can I reach it from another machine?
curl 192.168.1.100:8080
  No  → Firewall? Bind address? Routing?
        ss -tln: is it listening on all IPs (0.0.0.0) or only on 127.0.0.1?
        ufw status: is port 8080 allowed?
  Yes → The network path works; if the hostname still fails, check DNS

Step 4: Check DNS
dig service.example.com
  No answer → DNS misconfigured
  Answer    → Is it the correct IP?
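
Step 3 depends on which address the service is bound to. A minimal sketch of that check, assuming the usual `ss -tln` column layout (state in column 1, local address:port in column 4); the function name `bind_scope` and its output labels are made up for this example:

```shell
# bind_scope PORT: read `ss -tln` output on stdin and report where PORT is bound.
bind_scope() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")          # the port is the last ":"-separated field
      if (parts[n] != port) next
      addr = substr($4, 1, length($4) - length(parts[n]) - 1)
      found = 1
      if (addr == "0.0.0.0" || addr == "*" || addr == "[::]") all = 1
      else if (addr ~ /^127\./ || addr == "[::1]") loop = 1
      else other = 1
    }
    END {
      if (!found)     print "not-listening"
      else if (all)   print "all-interfaces"
      else if (other) print "specific-address"
      else            print "loopback-only"
    }'
}

# Example:
# ss -tln | bind_scope 8080
```

A result of `loopback-only` explains the "works on localhost, fails from another machine" case without any firewall being involved.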

Scenario 2: "High latency to database"

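Step 1: Confirm latency
ping -c 10 db.example.com
  High round-trip time? → Yes: continue

Step 2: Check route
traceroute db.example.com
  Which hop is slow?
  (A slow hop only matters if the hops after it are slow too;
   routers often answer probes at low priority.)

Step 3: Check TCP retransmissions
netstat -s | grep -i retrans
  Counters climbing? → Packet loss on the path

Step 4: Check interface
ethtool eth0
  Speed/duplex mismatched?

Step 5: Time a raw TCP connect to the database port
time nc -zv db.example.com 5432
  Connect is slow → Network problem
  Connect is fast but queries are slow → Application or database problem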

Scenario 3: "DNS not resolving"

Step 1: Check configured DNS
cat /etc/resolv.conf
  Shows nameservers?

Step 2: Test with public DNS
dig @8.8.8.8 google.com
  Fails → Network problem
  Works → Local DNS server problem

Step 3: Query local DNS directly
dig @192.168.1.1 example.com
  Fails → DNS server misconfigured
  Works → Resolver config wrong

Step 4: Check local DNS logs (the path depends on how the server is configured)
sudo tail -f /var/log/named/default.log
  See query errors?

Network Performance Testing

Check Response Time:

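# Show DNS + TCP + TLS + first-byte + total time
# (use a single -w: when -w is repeated, only the last one takes effect;
#  every value is cumulative seconds since the request started)
curl -s -o /dev/null \
     -w "  DNS: %{time_namelookup}\n  TCP: %{time_connect}\n  TLS: %{time_appconnect}\n  First Response: %{time_starttransfer}\n  Total: %{time_total}\n" \
     https://example.com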
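
Because curl's timers are cumulative, the time spent in each phase is the difference between neighbouring values. A small sketch of that arithmetic; the helper name `phase_deltas` and its output labels are illustrative only:

```shell
# phase_deltas NAMELOOKUP CONNECT APPCONNECT STARTTRANSFER TOTAL
# Turns curl's cumulative timers into per-phase durations (seconds).
phase_deltas() {
  awk -v dns="$1" -v tcp="$2" -v tls="$3" -v ttfb="$4" -v total="$5" 'BEGIN {
    # time_appconnect is 0 for plain HTTP; treat the TLS phase as absent then.
    if (tls == 0) tls = tcp
    printf "dns=%.3f tcp=%.3f tls=%.3f server=%.3f transfer=%.3f\n",
      dns, tcp - dns, tls - tcp, ttfb - tls, total - ttfb
  }'
}

# Example:
# phase_deltas $(curl -s -o /dev/null \
#   -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' \
#   https://example.com)
```

A large `dns` value points at the resolver, a large `tcp` or `tls` value at the network path, and a large `server` value at the application itself.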

Bandwidth Test:

# Download speed test (prints the average speed in bytes per second)
curl -s -o /dev/null -w "%{speed_download}\n" http://example.com/large-file
 
# Using iperf3
# On the server:
iperf3 -s
# On the client (10-second test):
iperf3 -c server-ip -t 10

Packet Loss Detection:

# Ping with loss % shown
ping -c 100 example.com | grep '% packet loss'
 
# Continuous ping
ping -i 0.5 example.com  # Send every 0.5 seconds
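
For scripted checks, the loss figure can be pulled out of the summary line and compared with a threshold. A minimal sketch, assuming the summary contains the text "N% packet loss" as Linux and macOS ping print it; the names `loss_percent` and `loss_exceeds` are made up for this example:

```shell
# loss_percent: read ping output on stdin, print the loss percentage.
loss_percent() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# loss_exceeds THRESHOLD: succeed when the loss in the ping output on stdin
# is greater than THRESHOLD (in percent).
loss_exceeds() {
  loss=$(loss_percent)
  [ -n "$loss" ] && awk -v l="$loss" -v t="$1" 'BEGIN { exit !(l > t) }'
}

# Example:
# ping -c 100 example.com | loss_exceeds 1 && echo "packet loss above 1%"
```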

Firewall Troubleshooting

"Connection refused"

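# "Refused" means the host answered with a TCP reset: usually nothing is
# listening on the port, or a firewall REJECT rule replied on its behalf

# Check if port is listening
ss -tln | grep :8080
  Not there → Service not running

# Check firewall rules
sudo ufw status | grep 8080
  Not allowed → sudo ufw allow 8080/tcp

# Check iptables (-n keeps ports numeric so the grep can match)
sudo iptables -L -n | grep 8080
  Blocked → Add allow rule

# Check if traffic reaches server
sudo tcpdump -i eth0 port 8080
  Packets not arriving → Blocked upstream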

"Connection times out"

# Most likely a firewall silently dropping packets
# (no "connection refused", the connection just hangs)

# Increase timeout to confirm
curl --connect-timeout 60 http://server:port

# Check if ICMP is blocked
ping -c 1 server
  No response, but SSH works? → ICMP is blocked

# Manually try the connection and watch what happens
nc -v server port
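
The difference between "refused" and "times out" also shows up in curl's exit status, which is handy in scripts. A small sketch that maps the documented curl exit codes 6, 7, and 28 to a likely cause; the function name and the wording of the messages are assumptions for illustration:

```shell
# explain_curl_exit CODE: translate a curl exit status into a likely cause.
explain_curl_exit() {
  case "$1" in
    0)  echo "connected" ;;
    6)  echo "could not resolve host: check DNS" ;;
    7)  echo "failed to connect: service down, wrong bind address, or REJECT rule" ;;
    28) echo "timed out: packets probably dropped by a firewall or lost in transit" ;;
    *)  echo "other curl error ($1)" ;;
  esac
}

# Example (server and port are placeholders):
# curl -s -o /dev/null --connect-timeout 5 http://server:port
# explain_curl_exit $?
```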

Routing Issues

"Can't reach subnet"

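# Check routing table
ip route show
  Is the destination subnet listed (or covered by the default route)?

# Add route (lost at reboot)
sudo ip route add 10.0.0.0/8 via 192.168.1.1

# Make permanent (netplan): edit /etc/netplan/01-netcfg.yaml and add the
# route under the interface it belongs to. Appending with echo would put
# the keys at the wrong YAML level and break the file.
#   network:
#     ethernets:
#       eth0:
#         routes:
#           - to: 10.0.0.0/8
#             via: 192.168.1.1
sudo netplan apply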
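
Whether a destination is covered by a route comes down to comparing the leading prefix bits of the address and the route. A minimal IPv4-only sketch of that comparison; the helper names `ip_to_int` and `ip_in_cidr` are hypothetical, and on a live system `ip route get <address>` asks the kernel directly which route it would use:

```shell
# ip_to_int A.B.C.D: print the address as a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the dotted quad into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# ip_in_cidr ADDRESS NETWORK/PREFIX: succeed when ADDRESS is inside the prefix.
ip_in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  shift_by=$((32 - bits))
  # Equal after dropping the host bits means the same network.
  [ $((ip >> shift_by)) -eq $((net >> shift_by)) ]
}

# Example:
# ip_in_cidr 10.1.2.3 10.0.0.0/8 && echo "covered by the 10.0.0.0/8 route"
```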

"Asymmetric routing"

Outbound: A → Router1 → B (fast)
Inbound: B → Router2 → A (slow)

Diagnose by running traceroute in both directions:
  On A: traceroute B   (shows outbound path)
  On B: traceroute A   (shows inbound path)

Connection Pool Problems

"Too many open connections"

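# Check socket totals
ss -s | grep TCP
 
# Find which processes hold the most connections (count per PID)
sudo ss -tanp | grep -o 'pid=[0-9]*' | sort | uniq -c | sort -rn | head
 
# Increase file descriptor limit (current shell and its children only)
ulimit -n 10000
 
# Make permanent (applies from the next login)
echo "* soft nofile 10000" | sudo tee -a /etc/security/limits.conf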
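
Before raising limits, it helps to see which TCP states the sockets are in: a pile of CLOSE-WAIT sockets usually means the application is not closing connections, which a higher limit will not fix. A minimal sketch, assuming the state is the first column of `ss -tan` output; the name `count_by_state` is made up for this example:

```shell
# count_by_state: read `ss -tan` output on stdin, print "<count> <state>"
# with the most common state first.
count_by_state() {
  awk 'NR > 1 { states[$1]++ } END { for (s in states) print states[s], s }' | sort -rn
}

# Example:
# ss -tan | count_by_state
```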

DNS Issues

"Wrong IP returned"

# Check local resolver cache statistics (systemd-resolved)
resolvectl statistics
 
# Clear cache (systemd)
sudo systemctl restart systemd-resolved
 
# Query authoritative nameserver directly
dig @ns1.example.com example.com
  Correct answer? If yes:
  → Your local resolver cached the old value
  → Its TTL probably hasn't expired yet
 
# Trace the delegation path to see at which point the IP changes
dig +trace example.com
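
To see how long a stale answer will survive, look at the TTL column of the answer: a caching resolver reports the remaining seconds, counting down, while the authoritative server reports the full TTL. A minimal sketch that extracts address and TTL from `dig +noall +answer` output; the function name `answer_ttls` is an assumption for this example:

```shell
# answer_ttls: read `dig +noall +answer NAME` output on stdin and print
# "<address> <ttl-seconds>" for every A record.
answer_ttls() {
  awk '$3 == "IN" && $4 == "A" { print $5, $2 }'
}

# Example: compare the local resolver with the authoritative server.
# dig +noall +answer example.com | answer_ttls
# dig +noall +answer @ns1.example.com example.com | answer_ttls
```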

Monitoring for Issues

Watch TCP connections:

# Refresh the TCP socket counts every second
watch -n 1 'ss -s | grep TCP'

Monitor packet loss:

# Continuous per-hop monitoring
mtr google.com
 
# Shows loss and latency for every hop, updated live,
# which a single traceroute run does not

Check interface errors:

ethtool eth0
ip -s link show eth0

Look for non-zero or growing counters:

  • RX errors
  • RX dropped
  • TX errors
  • TX dropped
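
The four counters can be extracted from `ip -s link` for logging or alerting. A minimal sketch, assuming the iproute2 layout in which each `RX:` / `TX:` header line is followed by a line of numbers with errors in the third column and dropped in the fourth; the function name `link_errors` is made up for this example:

```shell
# link_errors: read `ip -s link show IFACE` output on stdin and print the
# error and drop counters on one line.
link_errors() {
  awk '
    /^ *RX:/ { getline; printf "rx_errors=%s rx_dropped=%s ", $3, $4 }
    /^ *TX:/ { getline; printf "tx_errors=%s tx_dropped=%s\n", $3, $4 }
  '
}

# Example:
# ip -s link show eth0 | link_errors
```

Running it twice a few seconds apart shows whether the counters are still growing or are left over from an old incident.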

Key Concepts

  • Use OSI model — Troubleshoot from bottom up
  • Define problem clearly — Not "network is slow"
  • Ping tests reachability — IP routing working?
  • Traceroute shows path — Where does it fail?
  • tcpdump shows actual packets — Low-level visibility
  • DNS resolution is often the issue — Check first
  • Firewall is second most common — Check rules
  • Always check: service running, port listening, firewall allowing
  • Use curl/nc for application layer — Is app responding?
  • Document everything — Changes, errors, fixes