
Network Telemetry & Flow Analysis

Part of the Networking Basics tutorial series.

Network telemetry is the collection and analysis of network data. Understanding what's happening on your network is essential for troubleshooting, optimization, and security.

Types of Network Data

1. Flow Data: "What traffic is flowing where?"

Source IP: 10.0.1.50
Dest IP: 10.1.0.20
Source Port: 54321
Dest Port: 443
Protocol: TCP
Bytes: 50,000
Packets: 250
Start Time: 14:32:15
Duration: 30 seconds
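
As a minimal sketch, the sample flow above can be modeled as a small record type with a couple of derived values. The class name FlowRecord and its field names are illustrative assumptions, not part of any flow export standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One flow summary; field names are hypothetical, chosen for this example."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    byte_count: int
    packet_count: int
    duration_s: float

    def avg_bits_per_second(self) -> float:
        """Average rate over the flow's lifetime (0 for zero-length flows)."""
        if self.duration_s <= 0:
            return 0.0
        return self.byte_count * 8 / self.duration_s

    def avg_packet_size(self) -> float:
        """Average bytes per packet (0 if no packets were counted)."""
        if self.packet_count == 0:
            return 0.0
        return self.byte_count / self.packet_count


# Values taken from the sample flow above.
flow = FlowRecord("10.0.1.50", "10.1.0.20", 54321, 443, "TCP", 50_000, 250, 30.0)
print(f"average rate: {flow.avg_bits_per_second() / 1000:.1f} kbit/s")
print(f"average packet size: {flow.avg_packet_size():.0f} bytes")
```

For this flow the average works out to roughly 13.3 kbit/s with 200-byte packets, which is the kind of summary a flow record gives you without storing any payload.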

2. Packet Data: "What's in each packet?"

Ethernet Frame:
├─ Source MAC: aa:bb:cc:dd:ee:01
├─ Dest MAC: aa:bb:cc:dd:ee:02
├─ Protocol: IPv4
├─ IP Header...
├─ TCP Header...
└─ Payload: ...

3. Counters: "How many total packets/bytes?"

Interface stats:
├─ Total packets in: 1,000,000
├─ Total packets out: 950,000
├─ Dropped packets: 50,000
├─ Errors: 100
├─ Collisions: 0
└─ Timestamp: 14:32:15
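
Counters like these are cumulative, so a rate only makes sense as the difference between two polls. The sketch below assumes two snapshots stored as dictionaries; the key names and the earlier snapshot are hypothetical, and it assumes dropped packets are also counted in the inbound total.

```python
def drop_rate_pct(prev: dict, curr: dict) -> float:
    """Percentage of inbound packets dropped between two counter snapshots."""
    packets = curr["packets_in"] - prev["packets_in"]
    dropped = curr["dropped"] - prev["dropped"]
    if packets <= 0:
        return 0.0  # no new traffic in the interval (or the counter was reset)
    return dropped / packets * 100


earlier = {"packets_in": 900_000, "dropped": 45_000}    # hypothetical earlier poll
latest = {"packets_in": 1_000_000, "dropped": 50_000}   # values from the sample above
print(f"{drop_rate_pct(earlier, latest):.1f}% dropped in the last interval")
```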

4. Logs: "What happened?"

14:32:15 - BGP session down: 203.0.113.1
14:32:20 - New route learned: 10.0.0.0/8 via 203.0.113.2
14:32:25 - DDoS detected: 1M packets/sec from 203.0.113.202

Flow-Based Telemetry

NetFlow (Cisco): the de facto industry standard for flow data.

NetFlow v5 record:
├─ Source IP
├─ Destination IP
├─ Source Port
├─ Destination Port
├─ IP Protocol (TCP/UDP/ICMP)
├─ Input interface
├─ Output interface
├─ Next-hop IP, ToS, source/destination AS and prefix masks
├─ Packets
├─ Bytes
├─ Start/End timestamps
└─ TCP flags

Each flow is exported as one record.
A collector gathers the records for analysis.
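
To make the record layout concrete, here is a minimal sketch of how a collector might decode a NetFlow v5 export datagram using the fixed 24-byte header and 48-byte records. The function name and dictionary keys are assumptions made for this example, and the sample datagram is built by hand rather than captured from a real exporter.

```python
import socket
import struct

# NetFlow v5 uses fixed-size, big-endian structures.
HEADER = struct.Struct("!HHIIIIBBH")                  # 24 bytes
RECORD = struct.Struct("!4s4s4sHHIIIIHHxBBBHHBBxx")   # 48 bytes


def parse_netflow_v5(datagram: bytes) -> list[dict]:
    """Decode every flow record in one NetFlow v5 export datagram."""
    if len(datagram) < HEADER.size:
        raise ValueError("datagram shorter than a NetFlow v5 header")
    version, count, *_ = HEADER.unpack_from(datagram, 0)
    if version != 5:
        raise ValueError(f"not a NetFlow v5 datagram (version={version})")
    if len(datagram) < HEADER.size + count * RECORD.size:
        raise ValueError("datagram is truncated")

    flows = []
    for i in range(count):
        (src, dst, _nexthop, in_if, out_if, packets, octets, first, last,
         sport, dport, tcp_flags, proto, _tos, _src_as, _dst_as,
         _src_mask, _dst_mask) = RECORD.unpack_from(
            datagram, HEADER.size + i * RECORD.size)
        flows.append({
            "src_ip": socket.inet_ntoa(src),
            "dst_ip": socket.inet_ntoa(dst),
            "src_port": sport,
            "dst_port": dport,
            "protocol": proto,            # 6 = TCP, 17 = UDP, 1 = ICMP
            "input_if": in_if,
            "output_if": out_if,
            "packets": packets,
            "bytes": octets,
            "duration_ms": last - first,  # both are sysUptime values in ms
            "tcp_flags": tcp_flags,
        })
    return flows


# Hand-built datagram with one record, for illustration only.
sample = HEADER.pack(5, 1, 0, 0, 0, 0, 0, 0, 0) + RECORD.pack(
    socket.inet_aton("10.0.1.50"), socket.inet_aton("10.1.0.20"),
    socket.inet_aton("0.0.0.0"), 1, 2, 250, 50_000, 0, 30_000,
    54321, 443, 0x1B, 6, 0, 0, 0, 24, 24)
flows = parse_netflow_v5(sample)
print(flows[0])
```

In a real collector the datagrams would arrive on a UDP socket rather than from a hand-built byte string, and NetFlow v9 or IPFIX would need template handling that this sketch does not attempt.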

sFlow (sample-based): statistical packet sampling.

Sample 1 in 10,000 packets
Extrapolate statistics:
- 10,000 bytes seen in samples
- Actual traffic likely ~100 MB (10,000 bytes × 10,000)
- Lower overhead than NetFlow
- Less accuracy, especially for small flows
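
The extrapolation above is a single multiplication, sketched below. The error estimate uses the approximation commonly quoted in sFlow sampling guidance (about 196 × sqrt(1/c) percent at 95% confidence, where c is the number of samples in the traffic class); treat it as a rule of thumb, and note that the sample count of 100 is a made-up figure for illustration.

```python
import math


def estimate_total_bytes(sampled_bytes: int, sampling_rate: int) -> int:
    """Scale sampled bytes up by the 1-in-N sampling rate."""
    return sampled_bytes * sampling_rate


def relative_error_pct(sample_count: int) -> float:
    """Approximate 95% confidence error bound for a class with sample_count samples."""
    if sample_count <= 0:
        raise ValueError("need at least one sample")
    return 196 * math.sqrt(1 / sample_count)


estimate = estimate_total_bytes(10_000, 10_000)
print(f"estimated traffic: {estimate / 1_000_000:.0f} MB")
print(f"error bound with 100 samples: about {relative_error_pct(100):.1f}%")
```

The takeaway is that accuracy depends on how many samples a traffic class collects, not on the sampling rate alone, which is why small flows are estimated poorly.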

IPFIX (IP Flow Information Export): the IETF standard derived from NetFlow v9.

Extensible:
- Can add custom fields
- Flexible templates
- Internet standard (RFC 7011)

Flow Collection Architecture

┌───┐  NetFlow
│ A ├─────────┐
└───┘         │
              ↓
┌───┐  ┌─────────────┐     ┌────────────────┐
│ B ├─►│ Collector   │────►│ Analysis &     │
└───┘  │ (port 2055) │     │ Dashboard      │
       └─────────────┘     │ (Grafana, ELK) │
              ↑            └────────────────┘
┌───┐         │
│ C ├─────────┘
└───┘  NetFlow v9

Flow Collectors:

  • Cisco Prime, Cisco Tetration
  • Kentik Detect
  • SolarWinds NetFlow Traffic Analyzer
  • Open Source: ntopng, flow-tools

Packet Analysis (tcpdump, Wireshark)

tcpdump — Command-line packet capture:

# Capture all traffic on eth0 (usually requires root)
tcpdump -i eth0

# Capture and save to a file
tcpdump -i eth0 -w traffic.pcap

# Filter: only TCP traffic
tcpdump -i eth0 tcp

# Filter: only DNS traffic over UDP (port 53)
tcpdump -i eth0 udp port 53

# Capture full packets (-s 0) and rotate the output file about every 10 MB (-C 10)
tcpdump -i eth0 -s 0 -C 10 -w traffic.pcap

Wireshark — GUI packet analyzer:

Visualize pcap files:
├─ Protocol layers
├─ Packet-by-packet breakdown
├─ Flow view
└─ Statistics

Useful for:
- Debugging application issues
- Understanding protocol behavior
- Troubleshooting packet loss
- Analyzing network attacks

Metrics and KPIs

Key Metrics:

Metric        Meaning                      Example
------------  ---------------------------  --------
Throughput    Data transferred per second  100 Mbps
Packet Loss   % of packets lost            0.1%
Latency       Delay time                   50 ms
Jitter        Variation in latency         ±5 ms
Availability  Uptime %                     99.9%
Utilization   % of capacity used           65%
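
As a rough sketch, several of these metrics can be derived from raw measurements as shown below. The jitter calculation here is the mean absolute difference between consecutive latency samples, which is one simple definition among several (RTP, for example, uses a smoothed estimator). All sample values are hypothetical.

```python
from statistics import mean


def packet_loss_pct(sent: int, received: int) -> float:
    """Share of sent packets that never arrived."""
    return (sent - received) / sent * 100 if sent else 0.0


def jitter_ms(latencies_ms: list[float]) -> float:
    """Mean absolute change between consecutive latency samples."""
    if len(latencies_ms) < 2:
        return 0.0
    return mean(abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:]))


def utilization_pct(throughput_bps: float, capacity_bps: float) -> float:
    """Throughput as a percentage of link capacity."""
    return throughput_bps / capacity_bps * 100


latencies = [50.0, 52.0, 48.0, 55.0, 50.0]  # hypothetical ping results in ms
print(f"latency: {mean(latencies):.1f} ms, jitter: {jitter_ms(latencies):.1f} ms")
print(f"packet loss: {packet_loss_pct(1000, 999):.1f}%")
print(f"utilization: {utilization_pct(65e6, 100e6):.0f}%")
```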

Performance KPIs:

Response Time:
Goal: <100ms
Measurement: HTTP server response
Trend: Increasing → investigate slowness

Packet Loss:
Goal: <0.1%
Measurement: SNMP counters
Trend: Spikes → check congestion

Connection Success:
Goal: 99.99%
Measurement: TCP SYN → SYN-ACK success rate
Trend: Dropping → check availability

SNMP (Simple Network Management Protocol)

Purpose: Collect network device statistics

SNMP Versions:

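Version  Security                                        Use
-------  ----------------------------------------------  -------------------------------------
v1       Plain-text community string (effectively none)  Legacy, don't use
v2c      Plain-text community string                     Simple monitoring on trusted networks
v3       Authentication and encryption                   Recommended for production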

Common SNMP OIDs (Object Identifiers):

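1.3.6.1.2.1.1.3 — System uptime (sysUpTime)
1.3.6.1.2.1.2.2.1.2 — Interface description/name (ifDescr)
1.3.6.1.2.1.2.2.1.5 — Interface speed (ifSpeed)
1.3.6.1.2.1.2.2.1.10 — Octets in (ifInOctets)
1.3.6.1.2.1.2.2.1.16 — Octets out (ifOutOctets)
1.3.6.1.2.1.2.2.1.13 — Inbound packets discarded (ifInDiscards)
1.3.6.1.2.1.2.2.1.19 — Outbound packets discarded (ifOutDiscards)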

SNMP Walk (Collect Data):

# Walk the device's MIB tree (with no OID given, snmpwalk starts at mib-2)
snmpwalk -v 2c -c public 192.168.1.1

# Get a specific value
snmpget -v 2c -c public 192.168.1.1 \
  1.3.6.1.2.1.1.3.0
# Returns: System uptime (in hundredths of a second)
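
Octet counters polled over SNMP only become useful once two polls are turned into a rate. The sketch below assumes 32-bit counters such as ifInOctets, which wrap back to zero after 2^32, and computes link utilization from two hypothetical polls taken 60 seconds apart. The function names and poll values are illustrative.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_delta(prev: int, curr: int, modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two polls, tolerating a single counter wrap."""
    return (curr - prev) % modulus


def utilization_pct(prev_octets: int, curr_octets: int,
                    interval_s: float, if_speed_bps: int) -> float:
    """Average utilization of the link over the polling interval."""
    bits = counter_delta(prev_octets, curr_octets) * 8
    return bits / (interval_s * if_speed_bps) * 100


# Hypothetical polls: the counter wrapped between the two readings.
prev_poll = 4_000_000_000
curr_poll = 192_532_704
print(f"{utilization_pct(prev_poll, curr_poll, 60, 100_000_000):.0f}% utilized")
```

On fast links a 32-bit counter can wrap more than once between polls, which this approach cannot detect; the 64-bit high-capacity counters (for example ifHCInOctets) or a shorter polling interval avoid that.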

Application Performance Monitoring (APM)

Full-Stack Telemetry:

User Experience
     ↑
┌────────────────┐
│ Frontend       │ (JavaScript errors, page load time)
├────────────────┤
│ Network        │ (DNS time, TCP connection time)
├────────────────┤
│ Application    │ (HTTP response time, database queries)
├────────────────┤
│ Infrastructure │ (CPU, memory, disk, network)
└────────────────┘

Tools: Datadog, New Relic, Dynatrace, Elastic

Monitoring Tools Comparison

Tool        Type           For
----------  -------------  ----------------------------
Prometheus  Metrics        Infrastructure, applications
Grafana     Visualization  Dashboards, alerting
ELK Stack   Logs/Metrics   Centralized logging
Datadog     APM            Full-stack monitoring
Wireshark   Packet         Detailed analysis
ntopng      Flow           Network behavior
tcpdump     Packet         Quick capture

Real-Time Network Monitoring

Types of Monitoring:

Push-Based (Agent-based):

┌────────────────┐
│ Agent on host  │ (constantly sends data)
│ Sends metrics  │
└────────┬───────┘
         │ Push
         ↓
    ┌──────────┐
    │Collector │
    └──────────┘

Pros: Near real-time, detailed
Cons: Overhead on every host, harder to scale

Pull-Based (Scrape-based):

┌──────────────┐
│ Monitoring   │ Periodically requests metrics
│ System       │ (asks for data)
└──────┬───────┘
       │ Pull
       ↓
    ┌──────────────┐
    │ Host metrics │
    │ endpoint     │
    └──────────────┘

Pros: Easier to scale, host controls what it exposes
Cons: Can miss spikes that occur between scrapes
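
The trade-off for pull-based monitoring can be shown with a tiny simulation. It assumes a gauge that changes once per second and a poller that reads the instantaneous value every 15 seconds; the traffic series is invented for the example.

```python
def scrape(per_second_values: list[float], interval_s: int) -> list[float]:
    """Values a poller would see when reading the gauge every interval_s seconds."""
    return per_second_values[::interval_s]


# One minute of hypothetical traffic in Mbps with a one-second spike at t=37.
traffic_mbps = [100.0] * 60
traffic_mbps[37] = 500.0

seen = scrape(traffic_mbps, 15)  # polls at t=0, 15, 30, 45
print(f"true peak: {max(traffic_mbps)} Mbps, peak seen by poller: {max(seen)} Mbps")
```

The poller never observes the spike. Exposing cumulative counters instead of instantaneous gauges, or scraping more often, are the usual ways to reduce this blind spot.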

Network Telemetry Use Cases

Use Case 1: Anomaly Detection

Baseline: 99 Mbps ±5%
Normal traffic: 100 Mbps (within baseline)

Anomaly: Spike to 500 Mbps
Alert: "DDoS or traffic spike detected"
→ Investigate immediately
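
The baseline check above can be sketched as a tolerance band around the baseline. The function name and the fixed 5% tolerance are assumptions; production systems often use rolling statistics instead of a static band.

```python
def is_anomalous(value_mbps: float, baseline_mbps: float,
                 tolerance: float = 0.05) -> bool:
    """True when the value falls outside baseline ± tolerance."""
    return abs(value_mbps - baseline_mbps) > baseline_mbps * tolerance


for observed in (100.0, 500.0):
    status = "ALERT" if is_anomalous(observed, 99.0) else "ok"
    print(f"{observed:.0f} Mbps: {status}")
```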

Use Case 2: Capacity Planning

Current: 65% utilized, trending up 2 percentage points per week
Projection: Will hit 80% in about 7.5 weeks ((80 - 65) / 2)
Action: Provision more capacity before then
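
A projection like this is simple arithmetic if growth is assumed to be linear. The sketch below reads the trend as 2 percentage points per week, which is an assumption about the figures above; the function name is hypothetical.

```python
import math


def weeks_until(current_pct: float, threshold_pct: float,
                growth_pts_per_week: float) -> float:
    """Weeks until utilization reaches the threshold, assuming linear growth."""
    if current_pct >= threshold_pct:
        return 0.0
    if growth_pts_per_week <= 0:
        return math.inf  # never reached at the current trend
    return (threshold_pct - current_pct) / growth_pts_per_week


print(f"80% reached in {weeks_until(65, 80, 2):.1f} weeks")
```

With these inputs the linear estimate is 7.5 weeks. Real traffic rarely grows in a straight line, so it is worth recomputing the projection as new data arrives.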

Use Case 3: Troubleshooting

"Users report slow service"
Telemetry shows:
- High latency to database (300ms vs 50ms normally)
- Increased packet loss on database link

Likely root cause: Database server overwhelmed
Fix: Scale database or investigate queries

Use Case 4: Security

NetFlow shows:
- Sudden outbound traffic to unknown IP
- Large data transfer (1GB/min)
- Destination: 203.0.113.202 (malicious IP)

Response: Block outbound traffic to that IP and quarantine the affected server
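
A check like this can be approximated by summing outbound bytes per external destination over a time window and flagging large totals to unfamiliar addresses. Everything in the sketch is assumed for illustration: the internal range, the allowlist, the threshold, and the flow tuples, which stand in for one minute of collected flow records.

```python
import ipaddress
from collections import defaultdict

INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")   # assumed internal range
KNOWN_DESTINATIONS = {"198.51.100.10"}              # hypothetical allowlist
THRESHOLD_BYTES_PER_MIN = 500_000_000               # hypothetical threshold


def suspicious_destinations(flows: list[tuple[str, str, int]]) -> list[str]:
    """External destinations receiving more than the threshold in the window."""
    totals: dict[str, int] = defaultdict(int)
    for src, dst, byte_count in flows:
        outbound = (ipaddress.ip_address(src) in INTERNAL_NET
                    and ipaddress.ip_address(dst) not in INTERNAL_NET)
        if outbound:
            totals[dst] += byte_count
    return sorted(dst for dst, total in totals.items()
                  if total > THRESHOLD_BYTES_PER_MIN
                  and dst not in KNOWN_DESTINATIONS)


# (source IP, destination IP, bytes) for a one-minute window.
window = [
    ("10.0.1.50", "203.0.113.202", 600_000_000),
    ("10.0.1.50", "203.0.113.202", 400_000_000),
    ("10.0.1.51", "198.51.100.10", 800_000_000),   # large, but allowlisted
    ("10.0.1.52", "10.1.0.20", 900_000_000),       # internal, ignored
    ("10.0.1.53", "192.0.2.7", 1_000),             # external, but tiny
]
print(suspicious_destinations(window))
```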

Best Practices for Network Telemetry

✓ Collect baseline metrics before problems
✓ Set realistic alerting thresholds
✓ Monitor end-user experience, not just infrastructure
✓ Track trends over time (capacity planning)
✓ Archive data for post-incident analysis
✓ Use correlation (if latency high AND CPU high → bottleneck)
✓ Combine flow data with logs for context
✓ Securely store sensitive network data
✓ Test alerting (make sure notifications work)
✓ Regularly review and adjust metrics

Key Concepts

  • NetFlow/IPFIX = Standard flow export
  • Flow data = Source, destination, bytes, packets, duration
  • Packet capture = Detailed but resource-intensive
  • SNMP = Device statistics collection
  • Metrics = Quantitative measurements
  • APM = Application performance monitoring across the full stack, including end-user experience
  • Baseline = Normal behavior to detect anomalies
  • Correlation = Relate multiple signals for insight
  • Telemetry enables visibility into network behavior
  • Visibility is a prerequisite for optimization and security