Network Telemetry & Flow Analysis - Networking Basics

Network telemetry is the collection and analysis of network data. Understanding what's happening on your network is essential for troubleshooting, optimization, and security.

Types of Network Data

1. Flow Data: "What traffic is flowing where?"

Source IP: 10.0.1.50
Dest IP: 10.1.0.20
Source Port: 54321
Dest Port: 443
Protocol: TCP
Bytes: 50,000
Packets: 250
Start Time: 14:32:15
Duration: 30 seconds

2. Packet Data: "What's in each packet?"

Ethernet Frame:
├─ Source MAC: aa:bb:cc:dd:ee:01
├─ Dest MAC: aa:bb:cc:dd:ee:02
├─ Protocol: IPv4
├─ IP Header...
├─ TCP Header...
└─ Payload: ...

3. Counters: "How many total packets/bytes?"

Interface stats:
├─ Total packets in: 1,000,000
├─ Total packets out: 950,000
├─ Dropped packets: 50,000
├─ Errors: 100
├─ Collisions: 0
└─ Timestamp: 14:32:15

4. Logs: "What happened?"

14:32:15 - BGP session down: 203.0.113.1
14:32:20 - New route learned: 10.0.0.0/8 via 203.0.113.2
14:32:25 - DDoS detected: 1M packets/sec from 203.0.113.202

Flow-Based Telemetry

NetFlow (Cisco): Cisco's flow export protocol and the de facto industry standard for flow data.

NetFlow v5 record:
├─ Source IP
├─ Destination IP
├─ Source Port
├─ Destination Port
├─ IP Protocol (TCP/UDP/ICMP)
├─ Input interface
├─ Output interface
├─ Next hop, ToS, source/destination AS
├─ Packets
├─ Bytes
├─ Start/End timestamps
└─ TCP flags

Every 'flow' exported as one record
Collector gathers records for analysis
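
To make the idea of "one record per flow" concrete, here is a minimal sketch of how packets could be grouped into flows by their 5-tuple. The packet dictionaries and the names `FlowStats` and `aggregate_flows` are hypothetical, chosen for illustration; a real exporter also applies active/inactive timeouts before emitting a record.

```python
from dataclasses import dataclass


@dataclass
class FlowStats:
    """Running totals for one flow (hypothetical structure)."""
    first_seen: float
    last_seen: float
    packets: int = 0
    byte_count: int = 0

    @property
    def duration(self) -> float:
        return self.last_seen - self.first_seen


def aggregate_flows(packets):
    """Group packets into flows keyed by the 5-tuple.

    Each packet is assumed to be a dict with src_ip, dst_ip, src_port,
    dst_port, protocol, size (bytes) and ts (seconds).
    """
    flows = {}
    for pkt in packets:
        key = (pkt["src_ip"], pkt["dst_ip"],
               pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        stats = flows.get(key)
        if stats is None:
            stats = FlowStats(first_seen=pkt["ts"], last_seen=pkt["ts"])
            flows[key] = stats
        stats.packets += 1
        stats.byte_count += pkt["size"]
        stats.first_seen = min(stats.first_seen, pkt["ts"])
        stats.last_seen = max(stats.last_seen, pkt["ts"])
    return flows
```

Each value in the returned dictionary corresponds to what a collector would receive as a single flow record: packet count, byte count, and start/end timestamps for one 5-tuple.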

sFlow (sample-based): statistical packet sampling rather than per-flow accounting.

Sample 1 in 10,000 packets
Extrapolate statistics:
- 10,000 bytes seen in the sampled packets
- Estimated actual traffic: 10,000 bytes × 10,000 = ~100 MB
- Lower overhead than NetFlow
- Less accuracy
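
The extrapolation above is simple arithmetic, sketched below. The function names are hypothetical. The error bound uses a commonly cited rule of thumb from the sFlow sampling literature (relative error at 95% confidence is at most about 196 × sqrt(1/c) percent for c samples); treat it as an approximation, not a guarantee.

```python
import math


def estimate_total_bytes(sampled_bytes: int, sampling_rate: int) -> int:
    """Scale bytes seen in samples by the 1-in-N sampling rate."""
    return sampled_bytes * sampling_rate


def relative_error_pct(sample_count: int) -> float:
    """Approximate worst-case relative error (percent, 95% confidence)."""
    if sample_count <= 0:
        raise ValueError("need at least one sample")
    return 196 * math.sqrt(1 / sample_count)
```

With a 1-in-10,000 rate, `estimate_total_bytes(10_000, 10_000)` gives 100,000,000 bytes (about 100 MB). The error estimate shrinks as more samples are collected, which is why sampled data is more trustworthy for large traffic classes than for small ones.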

IPFIX (IP Flow Information Export): the IETF standard for flow export, derived from NetFlow v9.

Extensible:
- Can add custom fields
- Flexible templates
- Internet standard (RFC 7011)
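
The template idea can be illustrated with a simplified decoder: a template lists which fields appear in a record and how long each one is, and the collector uses it to slice the raw bytes. This sketch does not parse real IPFIX message or set headers; it assumes the template has already been reduced to `(element_id, length)` pairs. The element IDs shown are a small subset of the IANA IPFIX registry, and `decode_record` is a hypothetical helper name.

```python
import ipaddress

# Small subset of IANA IPFIX information element IDs.
ELEMENT_NAMES = {
    1: "octetDeltaCount",
    2: "packetDeltaCount",
    4: "protocolIdentifier",
    7: "sourceTransportPort",
    8: "sourceIPv4Address",
    11: "destinationTransportPort",
    12: "destinationIPv4Address",
}
IPV4_ELEMENTS = {8, 12}


def decode_record(template, payload: bytes) -> dict:
    """Decode one fixed-length data record using (element_id, length) pairs."""
    record = {}
    offset = 0
    for element_id, length in template:
        raw = payload[offset:offset + length]
        if len(raw) != length:
            raise ValueError("payload is shorter than the template describes")
        # Unknown elements still decode, under a generic name.
        name = ELEMENT_NAMES.get(element_id, f"element_{element_id}")
        if element_id in IPV4_ELEMENTS:
            record[name] = str(ipaddress.IPv4Address(raw))
        else:
            record[name] = int.from_bytes(raw, "big")
        offset += length
    return record
```

Because the layout comes from the template rather than being hard-coded, adding a custom field only requires a new template entry, which is the extensibility the list above refers to.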

Flow Collection Architecture

┌───┐  NetFlow
│ A ├───────────────┐
└───┘               │
                    ↓
┌───┐        ┌─────────────┐     ┌──────────────┐
│ B ├───────►│ Collector   │     │ Analysis &   │
└───┘        │ (Port 2055) │────►│ Dashboard    │
             └─────────────┘     │(Grafana, ELK)│
                    ↑            └──────────────┘
┌───┐               │
│ C ├───────────────┘
└───┘  NetFlow v9

Flow Collectors:

Cisco Prime, Cisco Tetration
Kentik Detect
SolarWinds NetFlow Traffic Analyzer
Open Source: ntopng, flow-tools

Packet Analysis (tcpdump, Wireshark)

tcpdump — Command-line packet capture:

# Capture all traffic on eth0
tcpdump -i eth0
 
# Capture and save to file
tcpdump -i eth0 -w traffic.pcap
 
# Filter: only TCP traffic
tcpdump -i eth0 tcp
 
# Filter: only DNS traffic (UDP port 53)
tcpdump -i eth0 udp port 53
 
# Capture full packets (-s 0) and rotate the output file roughly every 10 MB (-C 10)
tcpdump -i eth0 -s 0 -C 10 -w traffic
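
Files saved with -w can also be inspected programmatically. The sketch below walks the per-packet record headers of a capture, assuming the file is in the classic pcap format with microsecond timestamps (not pcapng). The function name `read_pcap_records` is hypothetical; for real analysis a dedicated library or Wireshark is a better choice.

```python
import struct


def read_pcap_records(data: bytes):
    """Yield (timestamp_seconds, captured_len, original_len) per packet."""
    magic = struct.unpack("<I", data[:4])[0]
    if magic == 0xA1B2C3D4:
        endian = "<"   # file written little-endian
    elif magic == 0xD4C3B2A1:
        endian = ">"   # file written big-endian
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")

    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16]
        )
        offset += 16
        yield ts_sec + ts_usec / 1_000_000, incl_len, orig_len
        offset += incl_len  # skip the packet bytes themselves
```

A captured length smaller than the original length means the packet was truncated by the snapshot length, which is the situation -s 0 is meant to avoid.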

Wireshark — GUI packet analyzer:

Visualize pcap files:
├─ Protocol layers
├─ Packet-by-packet breakdown
├─ Flow view
└─ Statistics

Useful for:
- Debugging application issues
- Understanding protocol behavior
- Troubleshooting packet loss
- Analyzing network attacks

Metrics and KPIs

Key Metrics:

Metric	Meaning	Example
Throughput	Bits/sec	100 Mbps
Packet Loss	% of packets lost	0.1%
Latency	Delay time	50ms
Jitter	Variation in latency	±5ms
Availability	Uptime %	99.9%
Utilization	% of capacity used	65%
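
As a rough illustration of how some of the metrics in the table can be derived from raw measurements, here is a small sketch. The function names are hypothetical, and jitter is computed as the mean absolute difference between consecutive latency samples, which is one of several definitions in use.

```python
from statistics import mean


def packet_loss_pct(sent: int, received: int) -> float:
    """Percentage of sent packets that never arrived."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return (sent - received) / sent * 100


def mean_jitter_ms(latencies_ms) -> float:
    """Mean absolute change between consecutive latency samples."""
    if len(latencies_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return mean(diffs)


def utilization_pct(throughput_mbps: float, capacity_mbps: float) -> float:
    """Share of link capacity currently in use."""
    return throughput_mbps / capacity_mbps * 100
```

For example, 999 packets received out of 1,000 sent is 0.1% loss, and latency samples alternating between 50 ms and 55 ms give a jitter of 5 ms.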

Performance KPIs:

Response Time:
Goal: <100ms
Measurement: HTTP server response
Trend: Increasing → investigate slowness

Packet Loss:
Goal: <0.1%
Measurement: SNMP counters
Trend: Spikes → check congestion

Connection Success:
Goal: 99.99%
Measurement: TCP SYN → SYN-ACK success rate
Trend: Dropping → check availability

SNMP (Simple Network Management Protocol)

Purpose: Collect network device statistics

SNMP Versions:

Version	Security	Use
v1	Community string (plain text)	Legacy, don't use
v2c	Community string (plain text)	Simple monitoring on trusted networks
v3	Authentication and encryption	Production recommended

Common SNMP OIDs (Object Identifiers):

1.3.6.1.2.1.1.3: System uptime (sysUpTime)
1.3.6.1.2.1.2.2.1.2: Interface description/name (ifDescr)
1.3.6.1.2.1.2.2.1.5: Interface speed (ifSpeed)
1.3.6.1.2.1.2.2.1.10: Octets in (ifInOctets)
1.3.6.1.2.1.2.2.1.16: Octets out (ifOutOctets)
1.3.6.1.2.1.2.2.1.13: Dropped inbound packets (ifInDiscards)

SNMP Walk (Collect Data):

# Walk the device's MIB tree (net-snmp starts at the mib-2 subtree by default)
snmpwalk -v 2c -c public 192.168.1.1
 
# Get specific value
snmpget -v 2c -c public 192.168.1.1 \
  1.3.6.1.2.1.1.3.0
# Returns: System uptime
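
Octet counters such as ifInOctets are cumulative, so throughput has to be computed from the difference between two polls. The sketch below shows that calculation, including the wrap-around of a 32-bit counter; `octets_to_mbps` is a hypothetical helper, and it assumes the counter wrapped at most once between polls.

```python
def octets_to_mbps(prev_octets: int, curr_octets: int,
                   interval_s: float, counter_bits: int = 32) -> float:
    """Convert two cumulative octet readings into average Mbps."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    modulus = 2 ** counter_bits
    # The modulo handles a single counter wrap between the two polls.
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_s / 1_000_000
```

For example, 750,000,000 octets over a 60-second polling interval works out to 100 Mbps. On fast links a 32-bit counter can wrap more than once per interval, which this calculation cannot detect; shorter polling intervals or 64-bit counters avoid that.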

Application Performance Monitoring (APM)

Full-Stack Telemetry:

User Experience
     ↑
┌────────────────┐
│ Frontend       │ (JavaScript errors, page load time)
├────────────────┤
│ Network        │ (DNS time, TCP connection time)
├────────────────┤
│ Application    │ (HTTP response time, database queries)
├────────────────┤
│ Infrastructure │ (CPU, memory, disk, network)
└────────────────┘

Tools: Datadog, New Relic, Dynatrace, Elastic

Monitoring Tools Comparison

Tool	Type	For
Prometheus	Metrics	Infrastructure, applications
Grafana	Visualization	Dashboard, alerting
ELK Stack	Logs/Metrics	Centralized logging
Datadog	APM	Full-stack monitoring
Wireshark	Packet	Detailed analysis
ntopng	Flow	Network behavior
tcpdump	Packet	Quick capture

Real-Time Network Monitoring

Types of Monitoring:

Push-Based (Agent-based):

┌────────────────┐
│ Agent on host  │ (constantly sends data)
│ Sends metrics  │
└────────┬───────┘
         │ Push
         ↓
    ┌──────────┐
    │Collector │
    └──────────┘

Pros: Near real-time, detailed
Cons: Overhead on every host, harder to scale

Pull-Based (Scrape-based):

┌──────────────┐
│ Monitoring   │ Periodically requests metrics
│ System       │ (asks for data)
└──────┬───────┘
       │ Pull
       ↓
    ┌──────────────┐
    │ Host metrics │
    │ endpoint     │
    └──────────────┘

Pros: Easier to scale, host controls what it exposes
Cons: Can miss spikes between scrapes
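
The "missed spike" problem is easy to see with a toy simulation. The sketch below assumes a gauge-style metric (an instantaneous value such as current throughput) recorded once per second, and a poller that only sees every Nth value; the data and the `scrape` helper are hypothetical.

```python
def scrape(per_second_values, interval_s: int):
    """Return only the values a poller sees when scraping every interval_s seconds."""
    return per_second_values[::interval_s]


# Hypothetical throughput in Mbps: steady 100 with a 2-second spike to 900.
traffic = [100] * 30
traffic[16] = traffic[17] = 900

seen_every_15s = scrape(traffic, 15)  # samples at t=0 and t=15 only
seen_every_1s = scrape(traffic, 1)    # sees every value, including the spike
```

Here the 15-second poller never observes the spike, while the 1-second poller does. Cumulative counters soften this problem, because the spike's bytes still show up in the next interval's average even though the peak itself is lost.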

Network Telemetry Use Cases

Use Case 1: Anomaly Detection

Baseline (normal traffic): 100 Mbps ±5%

Anomaly: Spike to 500 Mbps
Alert: "DDoS or traffic spike detected"
→ Investigate immediately
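
A baseline check like this can be sketched in a few lines. The example assumes the baseline is the mean of recent "normal" samples and that anything outside a ±5% band counts as anomalous; both choices, and the function names, are illustrative. Production systems usually account for time-of-day patterns as well.

```python
from statistics import mean


def build_baseline(samples_mbps) -> float:
    """Baseline as the mean of recent normal samples."""
    return mean(samples_mbps)


def is_anomalous(value_mbps: float, baseline_mbps: float,
                 tolerance: float = 0.05) -> bool:
    """True when the value falls outside baseline ± tolerance."""
    return abs(value_mbps - baseline_mbps) > baseline_mbps * tolerance
```

With a baseline of 100 Mbps, a reading of 103 Mbps stays inside the band, while a jump to 500 Mbps is flagged.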

Use Case 2: Capacity Planning

Current: 65% utilized, trending up 2 percentage points per week
Projection: Will hit 80% in about 7.5 weeks ((80 - 65) / 2)
Action: Provision more capacity
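
The projection is a linear extrapolation: weeks remaining = (threshold - current) / weekly growth. A minimal sketch, assuming growth is measured in percentage points per week and stays constant (the function name is hypothetical):

```python
def weeks_until_threshold(current_pct: float, growth_pts_per_week: float,
                          threshold_pct: float) -> float:
    """Weeks until utilization reaches the threshold at constant linear growth."""
    if current_pct >= threshold_pct:
        return 0.0
    if growth_pts_per_week <= 0:
        raise ValueError("utilization is not growing; threshold is never reached")
    return (threshold_pct - current_pct) / growth_pts_per_week
```

Starting at 65% with growth of 2 points per week, an 80% threshold is reached in `weeks_until_threshold(65, 2, 80)` = 7.5 weeks. Real traffic growth is rarely perfectly linear, so it is worth recomputing the projection as new data arrives.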

Use Case 3: Troubleshooting

"Users report slow service"
Telemetry shows:
- High latency to database (300ms vs 50ms normally)
- Increased packet loss on database link
- Root cause: Database server overwhelmed

Fix: Scale database or investigate queries

Use Case 4: Security

NetFlow shows:
- Sudden outbound traffic to unknown IP
- Large data transfer (1GB/min)
- Destination: 203.0.113.202 (malicious IP)

Response: Block outgoing traffic to that IP
Quarantine affected server
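
A detection rule for this scenario might look like the sketch below. It assumes flow records are dictionaries with src_ip, dst_ip, bytes and duration_s, that 10.0.0.0/8 is the internal address space, and that a set of known-good destinations is available; all of these names and values are assumptions for illustration.

```python
import ipaddress

INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")  # assumed internal range


def suspicious_flows(flows, known_destinations, max_bytes_per_min: float):
    """Return outbound flows to unknown external IPs above a byte-rate limit."""
    hits = []
    for flow in flows:
        src_internal = ipaddress.ip_address(flow["src_ip"]) in INTERNAL_NET
        dst_external = ipaddress.ip_address(flow["dst_ip"]) not in INTERNAL_NET
        if not (src_internal and dst_external):
            continue
        if flow["dst_ip"] in known_destinations:
            continue
        # Guard against zero-length flows when computing the rate.
        minutes = max(flow["duration_s"], 1) / 60
        if flow["bytes"] / minutes >= max_bytes_per_min:
            hits.append(flow)
    return hits
```

A flow moving 1 GB in 60 seconds to an unknown external address would be returned by this function with a 500 MB/min limit, giving the responder the source host to quarantine and the destination to block.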

Best Practices for Network Telemetry

✓ Collect baseline metrics before problems occur
✓ Set realistic alerting thresholds
✓ Monitor end-user experience, not just infrastructure
✓ Track trends over time (capacity planning)
✓ Archive data for post-incident analysis
✓ Use correlation (if latency high AND CPU high → bottleneck)
✓ Combine flow data with logs for context
✓ Securely store sensitive network data
✓ Test alerting (make sure notifications work)
✓ Regularly review and adjust metrics

Key Concepts

NetFlow/IPFIX = Standard flow export
Flow data = Source, destination, bytes, packets, duration
Packet capture = Detailed but resource-intensive
SNMP = Device statistics collection
Metrics = Quantitative measurements
APM = Full-stack application performance monitoring, from infrastructure to end-user experience
Baseline = Normal behavior to detect anomalies
Correlation = Relate multiple signals for insight
Telemetry enables visibility into network behavior
Visibility is a prerequisite for optimization and security