What is Gremlin?
Gremlin is a commercial chaos engineering platform that provides infrastructure, platform, and application-level chaos experiments. Unlike Litmus (Kubernetes-focused), Gremlin works across:
- Cloud infrastructure (AWS, Azure, GCP)
- Kubernetes clusters
- VMs and bare metal
- Application code
- Network infrastructure
Why Choose Gremlin?
Advantages:
- Multi-platform support: Single pane for VM, container, and cloud failures
- Enterprise features: API-first design, SSO, audit logging
- No agent modification: Uses system tools (iptables, tc, etc.) under the hood
- User-friendly UI: Dashboard for running, monitoring, and analyzing experiments
Considerations:
- Commercial product (free tier available)
- Requires agent installation on all target systems
Gremlin Architecture
Core Components
┌─────────────────────────────────────────────┐
│         Gremlin SaaS Control Plane          │
│ - Experiment scheduling and reporting       │
│ - Results aggregation                       │
│ - Team management and RBAC                  │
└──────────────────────┬──────────────────────┘
                       │ API
       ┌───────────────┼───────────────┐
       │               │               │
       ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│   Gremlin   │ │   Gremlin   │ │   Gremlin   │
│    Agent    │ │    Agent    │ │    Agent    │
│   (Linux)   │ │(Kubernetes) │ │  (Windows)  │
└─────────────┘ └─────────────┘ └─────────────┘
       │               │               │
       └───────────────┴───────────────┘
          Injects failure on systems
Installing Gremlin Agent
Linux Installation
# 1. Download and install
curl -O https://downloads.gremlin.com/gremlin/downloads/client/latest/linux/gremlin-latest.linux_amd64.rpm
sudo rpm -i gremlin-latest.linux_amd64.rpm
# Or for Debian/Ubuntu (Gremlin's apt repository must be configured first):
sudo apt install gremlin
# 2. Authenticate
# Option A: Team ID + Private Key
sudo gremlin config set -c <TEAM_ID> -p <PRIVATE_KEY>
# Option B: OAuth token
sudo gremlin config set -c <TEAM_ID> -a <AUTH_TOKEN>
# 3. Start Gremlin
sudo systemctl enable gremlin
sudo systemctl start gremlin
# 4. Verify
gremlin check
Kubernetes Installation
# Add Gremlin Helm repository
helm repo add gremlin https://helm.gremlin.com
helm repo update
# Install Gremlin agent
helm install gremlin gremlin/gremlin \
  --namespace gremlin \
  --create-namespace \
  --set gremlin.teamID=<TEAM_ID> \
  --set gremlin.privKey=<PRIVATE_KEY>
# Verify
kubectl get pods -n gremlin
Gremlin Experiments by Layer
1. Infrastructure Attacks
Target the physical/virtual infrastructure layer.
CPU Attack
# Consume CPU cores
gremlin attack cpu \
  --cores 4 \
  --length 300
What happens:
- Specified number of CPU cores maxed out
- Application performance degrades
- Tests if system can handle resource constraints
- Tests autoscaling triggers
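When sizing a CPU attack, it can help to derive the core count from the size of the host rather than hard-coding it. The sketch below is a hypothetical helper (`cores_for_percent` is not part of the Gremlin CLI) that converts a target load percentage into a core count and prints the resulting command instead of running it. It reuses the `--cores` and `--length` flags exactly as they appear in the example above.

```shell
#!/bin/sh
# cores_for_percent TOTAL PCT
# Hypothetical helper (not a Gremlin command): number of cores to load so that
# about PCT percent of a TOTAL-core host is saturated. Rounds up, minimum 1.
cores_for_percent() {
  total=$1
  pct=$2
  cores=$(( (total * pct + 99) / 100 ))
  if [ "$cores" -lt 1 ]; then
    cores=1
  fi
  echo "$cores"
}

# Dry run: print the command for a 50% load on an 8-core host.
echo "gremlin attack cpu --cores $(cores_for_percent 8 50) --length 300"
```

On a live host, `nproc` could supply the total in place of the literal 8.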
Memory Attack
# Consume RAM
gremlin attack memory \
  --megabytes 4096 \
  --length 300 \
  --percent-consumed 80 # Stop when 80% of memory is consumed
What happens:
- Memory pressure increases
- Swap usage increases
- Application may be OOM-killed
- Tests memory limits and caching strategies
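Because a memory attack can get the application OOM-killed, a pre-flight check on available memory is a reasonable safeguard. The following is a minimal sketch, assuming a Linux host that exposes `/proc/meminfo`; the function names and the 512 MB reserve are illustrative choices, not Gremlin features.

```shell
#!/bin/sh
# available_mb: read /proc/meminfo-style text on stdin, print MemAvailable in MB.
available_mb() {
  awk '/^MemAvailable:/ { print int($2 / 1024) }'
}

# safe_to_consume AVAILABLE_MB WANT_MB RESERVE_MB
# Succeeds only if consuming WANT_MB still leaves at least RESERVE_MB free.
safe_to_consume() {
  [ $(( $1 - $2 )) -ge "$3" ]
}

# Usage on a Linux host (the 512 MB reserve is an arbitrary example value):
#   avail=$(available_mb < /proc/meminfo)
#   safe_to_consume "$avail" 4096 512 && echo "ok to run" || echo "skip attack"
```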
Disk Attack
# Fill disk space
gremlin attack disk \
  --size 50GB \
  --path /tmp \
  --length 300
What happens:
- Target path fills with temporary files
- Applications fail to write logs (critical!)
- Tests error handling for disk-full scenarios
Process Kill
# Kill a specific process
gremlin attack process-kill \
  --process-name nginx \
  --interval 30 # Kill every 30 seconds
What happens:
- Process is terminated
- If the process is managed by a supervisor (such as systemd), it respawns
- Tests process restart mechanisms
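To check the restart mechanism rather than assume it, one option is to poll for the process after the kill. The sketch below defines a hypothetical `wait_until` helper that retries any command once per second until it succeeds or a time limit is reached. Pairing it with `pgrep -x nginx` is an assumption about how the target process is named on the host.

```shell
#!/bin/sh
# wait_until LIMIT_SECONDS COMMAND [ARGS...]
# Retries COMMAND about once per second. Returns 0 as soon as it succeeds,
# or 1 once LIMIT_SECONDS attempts have failed.
wait_until() {
  limit=$1
  shift
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    elapsed=$((elapsed + 1))
    if [ "$elapsed" -ge "$limit" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Usage after a process-kill experiment (30 seconds is an example budget):
#   wait_until 30 pgrep -x nginx && echo "nginx respawned" || echo "nginx is still down"
```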
2. Network Attacks
Target network communication and latency.
Latency Attack
# Add network latency
gremlin attack latency \
  --latency 1000 \
  --target-host 10.0.1.50 \
  --length 300
What happens:
- All packets to 10.0.1.50 are delayed by 1 second (1000 ms)
- Requests see significant latency increase
- Tests timeout configurations
- Tests circuit breaker behavior
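To make the circuit breaker point concrete, here is a deliberately simplified model: the breaker opens after a fixed number of consecutive failed calls and then stays open. Real circuit breaker libraries also have a half-open state and reset timers, which this sketch omits. The `breaker_state` name and the one-result-per-line input format are assumptions made for illustration.

```shell
#!/bin/sh
# breaker_state THRESHOLD
# Reads one call result per line on stdin ("ok" or "fail", e.g. a timed-out
# request during a latency attack counts as "fail"). Prints "open" if
# THRESHOLD consecutive failures were seen at any point, otherwise "closed".
breaker_state() {
  awk -v threshold="$1" '
    $1 == "fail" { streak++ }
    $1 != "fail" { streak = 0 }
    streak >= threshold { opened = 1 }
    END { print (opened ? "open" : "closed") }'
}

# Three timeouts in a row trip a breaker with threshold 3.
printf 'ok\nfail\nfail\nfail\n' | breaker_state 3
```

If the service under test reports its breaker as closed while the injected latency exceeds its timeout for several consecutive calls, that is a finding worth recording.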
Packet Loss Attack
# Lose network packets
gremlin attack packet-loss \
  --percentage 50 \
  --target-host database.example.com \
  --length 300
What happens:
- 50% of packets to target are dropped
- TCP retransmissions kick in
- Significant latency and potential timeouts
- Tests resilience to poor network conditions
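A rough back-of-the-envelope model shows why retransmissions turn packet loss into latency rather than outright failure. Assuming each transmission attempt is lost independently with probability p (a simplification, since real losses are often correlated and acknowledgements can be lost too), the chance that at least one of n attempts gets through is 1 - p^n. The helper name below is hypothetical.

```shell
#!/bin/sh
# delivery_probability LOSS_PERCENT ATTEMPTS
# Probability that at least one of ATTEMPTS independent tries survives
# a LOSS_PERCENT drop rate: 1 - (loss/100)^attempts.
delivery_probability() {
  awk -v loss="$1" -v n="$2" 'BEGIN { printf "%.3f\n", 1 - (loss / 100) ^ n }'
}

delivery_probability 50 1   # single attempt at 50% loss
delivery_probability 50 3   # original send plus two retransmissions
```

Under this model, three attempts at 50% loss raise the delivery chance from 0.500 to 0.875, at the cost of waiting for each retransmission timer.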
Blackhole Attack
# Drop all packets to/from a target
gremlin attack blackhole \
  --target-host 10.0.2.0/24 \
  --length 300
What happens:
- Complete network isolation
- Similar to an availability zone network partition
- Tests failover to backup systems
DNS Attack
# Corrupt DNS responses
gremlin attack dns \
  --target-host api.example.com \
  --corrupt-response true \
  --length 300
What happens:
- DNS queries return corrupted data
- Services fail to resolve names
- Tests DNS failover and retry logic
3. Application Attacks
Target application-level behavior (requires code instrumentation).
Exception Throwing
# Gremlin for Java with exception injection
gremlin attack exception \
  --service payment-service \
  --exception NullPointerException \
  --percent-affected 10 # Affect 10% of requests
Latency Injection
# Add latency to specific methods
gremlin attack latency \
  --service order-service \
  --method calculateTotal \
  --latency 2000 \
  --percent-affected 20
Running Experiments Through the UI
Step 1: Log Into Gremlin
Access the Gremlin web dashboard at https://app.gremlin.com
Step 2: Create an Experiment
- Click Scenarios → Create Scenario
- Choose Infrastructure Attack or Application Attack
- Select experiment type (CPU, Memory, Latency, etc.)
- Configure parameters
- Select target hosts (by tag, region, or host name)
Step 3: Set Blast Radius
Target Selection Options:
- By tag: app=payment-service
- By region: us-east-1
- By type: Database servers
- Percentage: Randomly select 20% of matching hosts
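To illustrate what a percentage-based blast radius means for a concrete host list, the sketch below picks a random subset of hosts from standard input. Gremlin performs this selection itself, so this is only a model of the idea. `pick_percent` is a hypothetical name, and rounding up with a minimum of one host is an assumption, not documented Gremlin behavior.

```shell
#!/bin/sh
# pick_percent PCT
# Reads one host name per line on stdin and prints a random PCT percent of
# them (rounded up, at least one host when the list is non-empty).
pick_percent() {
  awk -v pct="$1" '
    BEGIN { srand() }   # seeded from the clock, so runs in the same second match
    { hosts[NR] = $0 }
    END {
      want = int((NR * pct + 99) / 100)
      if (want < 1 && NR > 0) want = 1
      # Partial Fisher-Yates shuffle: swap a random remaining host into slot i.
      for (i = 1; i <= want; i++) {
        j = i + int(rand() * (NR - i + 1))
        tmp = hosts[i]; hosts[i] = hosts[j]; hosts[j] = tmp
        print hosts[i]
      }
    }'
}

# Usage: choose 20% of the hosts listed in a file (file name is a placeholder).
#   pick_percent 20 < payment-service-hosts.txt
```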
Step 4: Monitor Execution
- Watch real-time metrics during the experiment
- See which hosts are affected
- Monitor application-level impact (through integrations with Datadog, New Relic, etc.)
Step 5: View Results
Gremlin provides:
- Timeline of events
- Affected hosts and their metrics
- Application impact summary
- Pass/fail verdict based on monitoring
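The text above does not say how the verdict is computed, so the following is only an illustration of the general idea: compare an observed error rate against a threshold agreed on before the experiment. The `verdict` function, its arguments, and the 1% threshold in the usage line are all hypothetical. In practice the counts would come from the monitoring tool.

```shell
#!/bin/sh
# verdict ERRORS TOTAL MAX_ERROR_PCT
# Prints "pass" if ERRORS/TOTAL is at most MAX_ERROR_PCT percent, "fail" if it
# is higher, and "inconclusive" if no requests were observed at all.
verdict() {
  errors=$1
  total=$2
  max_pct=$3
  if [ "$total" -eq 0 ]; then
    echo "inconclusive"
    return 0
  fi
  # Integer form of errors/total <= max_pct/100, avoiding floating point.
  if [ $(( errors * 100 )) -le $(( max_pct * total )) ]; then
    echo "pass"
  else
    echo "fail"
  fi
}

# 5 errors in 1000 requests against a 1% budget.
verdict 5 1000 1
```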
Integration with Monitoring Tools
Datadog Integration
# Configure Gremlin to report to Datadog
gremlin config set --datadog-api-key <YOUR_API_KEY> \
  --datadog-app-key <YOUR_APP_KEY>
Datadog will then:
- Show Gremlin events on metric graphs
- Correlate with application errors
- Show experiment timeline
Prometheus Integration
# Gremlin exposes metrics via Prometheus
scrape_configs:
  - job_name: 'gremlin'
    static_configs:
      - targets: ['localhost:9000']
API Usage
List Available Agents
curl -H "Authorization: Bearer <API_KEY>" \
  https://api.gremlin.com/v1/agents
Trigger Experiment via API
curl -X POST https://api.gremlin.com/v1/scenarios \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "CPU Attack",
    "description": "Test high CPU",
    "definition": {
      "attacks": [{
        "type": "cpu",
        "parameters": {
          "cores": 4,
          "length": 300
        }
      }],
      "targets": {
        "filters": {
          "names": ["prod-server-1"]
        }
      }
    }
  }'
Best Practices
- Start with Staging: Never experiment on production first
- Use Tags: Organize infrastructure with meaningful tags
- Document Hypotheses: Record what you expect before running experiments
- Gradual Rollout: Start with 1-2 hosts, then expand to 10%, then 25%
- Team Notifications: Alert team before running experiments
- Automated Testing: Integrate Gremlin into CD pipelines
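For the pipeline practice, one possible shape is a gate that wraps an experiment between two steady-state checks and fails the pipeline step if the system is unhealthy afterwards. This is a minimal sketch, not a Gremlin feature: `chaos_gate` is a hypothetical function, and the health URL and attack parameters in the usage comment are placeholders.

```shell
#!/bin/sh
# chaos_gate CHECK EXPERIMENT
# Runs CHECK, then EXPERIMENT, then CHECK again (each a shell command string).
# Exit status: 0 = pass, 1 = failed during or after the experiment,
# 2 = aborted because the system was unhealthy before starting.
chaos_gate() {
  check=$1
  experiment=$2
  if ! sh -c "$check"; then
    echo "abort: steady state not met before experiment"
    return 2
  fi
  if ! sh -c "$experiment"; then
    echo "fail: experiment command failed"
    return 1
  fi
  if ! sh -c "$check"; then
    echo "fail: steady state not restored after experiment"
    return 1
  fi
  echo "pass"
}

# Example pipeline step (URL and attack parameters are placeholders):
#   chaos_gate "curl -fsS --max-time 5 https://staging.example.com/health" \
#              "gremlin attack cpu --cores 1 --length 60" || exit 1
```

Checking health before the experiment keeps a pre-existing outage from being blamed on the experiment, which fits the advice to document hypotheses first.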
Gremlin vs Litmus vs Chaos Monkey
| Feature | Gremlin | Litmus | Chaos Monkey |
|---|---|---|---|
| Platform | Multi-cloud/VM/K8s | Kubernetes | AWS only |
| Agent | Required | Optional | Built-in |
| Cost | Commercial | Open-source | Open-source |
| Ease of Use | Dashboard-first | CRD-based | Imperative |
| API | Yes | Yes | Limited |
| Enterprise Features | RBAC, SSO, Audit | Community-driven | Basic |
Key Takeaways
- Gremlin covers all layers: Infrastructure, network, and application
- Enterprise-ready: Designed for large organizations
- Easy to use: Dashboard and API for different preferences
- Multi-platform: Works wherever your infrastructure is
- Integrates with monitoring: Fire experiments and correlate with metrics