Tools for troubleshooting IPsec integrity failures (SA-icv-failure / ICV failures) on OpenShift clusters.
ICV (Integrity Check Value) failures occur when IPsec encrypted packets fail integrity verification on the receiving end. This typically indicates:
- Data corruption in transit (network hardware, drivers)
- Packet modification between sender and receiver
- Crypto/key synchronization issues
These tools help capture synchronized packet data from both sides to identify where corruption occurs.
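Before capturing, it can help to confirm ICV failures are actually being reported on the receiving node. A minimal sketch, assuming the kernel's XFRM audit hook logs the literal op string `SA-icv-failure` (the same string this tooling tracks); the sample log lines are illustrative, not real cluster output:

```shell
# Count ICV-failure audit events from a log stream on stdin.
# Run against /var/log/audit/audit.log on the receiving node,
# e.g. via `oc debug node/<node> -- chroot /host`.
count_icv_events() {
    grep -c 'SA-icv-failure' || true
}

# Illustrative sample lines:
printf '%s\n' \
  'op=SA-icv-failure src=10.0.0.1 dst=10.0.0.2 spi=0x1a2b(6699)' \
  'type=SYSCALL arch=c000003e success=yes' \
  'op=SA-icv-failure src=10.0.0.1 dst=10.0.0.2 spi=0x1a2b(6699)' \
  | count_icv_events
# → 2
```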
| Script | Purpose |
|---|---|
| `run-ipsec-diagnostics.sh` | Full ICV failure diagnostics: xfrm + tcpdump + Retis (recommended) |
| `verify-capture-timestamps.sh` | Verify timestamp alignment across captures (post-capture analysis) |
| `run-dual-capture.sh` | ESP packet capture on two nodes |
| `run-retis-capture.sh` | Retis dropped-packet capture |
| `xfrm-dump.sh` | Dump XFRM state and policy (local Linux only) |
| `capture-config.env` | Configuration for all scripts |
| `simulate-ipsec-failure.sh` | Simulate packet corruption for testing |
| `ipsec-capture-commands.sh` | Generate manual capture commands (fallback) |
Run all 3 tools in one command with ICV failure tracking:
```shell
# Clone and run
git clone https://github.com/lalan7/openshift-ipsec-network-diagnostics-tools.git
cd openshift-ipsec-network-diagnostics-tools

# Basic run with ICV monitoring
./run-ipsec-diagnostics.sh --monitor-icv --duration 60

# Full options for ICV failure investigation
./run-ipsec-diagnostics.sh \
  --node1 worker1.example.com \
  --node2 worker2.example.com \
  --retis-node worker2.example.com \
  --monitor-icv \
  --icv-threshold 3 \
  --duration 120 \
  --no-packet-limit

# Skip Retis (faster, tcpdump + xfrm only)
./run-ipsec-diagnostics.sh --duration 30 --skip-retis

# Simple ESP filter (capture all ESP traffic)
./run-ipsec-diagnostics.sh --filter "esp" --duration 30
```

Output:
```text
~/ipsec-captures/diag-YYYYMMDD-HHMMSS/
├── xfrm-<node1>-start.txt   # XFRM state/policy BEFORE capture
├── xfrm-<node1>-end.txt     # XFRM state/policy AFTER capture
├── xfrm-<node2>-start.txt
├── xfrm-<node2>-end.txt
├── node1-esp.pcap           # tcpdump ESP from sender
├── node1-timing.txt         # Capture timing info
├── node2-esp.pcap           # tcpdump ESP from receiver
├── node2-timing.txt
├── retis_icv.data           # ICV failure tracking data
├── retis-timing.txt
├── retis-output.log
└── sync-time.txt            # Capture start timestamp
```
Tested and working on RHEL 9.6:
```shell
# SSH to bastion host
ssh user@bastion.example.com
# Clone repo (or copy scripts)
git clone https://github.com/lalan7/openshift-ipsec-network-diagnostics-tools.git
cd openshift-ipsec-network-diagnostics-tools
# Run diagnostics
./run-ipsec-diagnostics.sh --duration 30
```

```shell
# Use defaults from capture-config.env
./run-dual-capture.sh
# Override via CLI
./run-dual-capture.sh --node1 worker1.example.com --node2 worker2.example.com --duration 60
# With custom filter
./run-dual-capture.sh --filter "esp" --duration 30
```

```shell
# Use defaults from capture-config.env
./run-retis-capture.sh
# Override via CLI
./run-retis-capture.sh --node worker1.example.com --duration 60
# With filter
./run-retis-capture.sh --filter "src host 10.0.0.1"
```

```shell
# Local Linux only
./xfrm-dump.sh /tmp/xfrm-output
# On OpenShift node (via oc debug)
oc debug node/worker1.example.com -- chroot /host bash -c "ip xfrm state show; ip xfrm policy show"
```

All parameters can be set via:
- Config file (`capture-config.env`): default values
- Environment variables: override the config file
- CLI arguments: highest priority
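That precedence amounts to a simple fall-through per setting. An illustrative sketch of the rule, not the scripts' actual implementation:

```shell
# Resolve one setting with the documented precedence:
# CLI argument > environment variable > config-file default.
# $1 = CLI value (may be empty), $2 = env value (may be empty), $3 = default.
resolve() {
    if [ -n "$1" ]; then printf '%s\n' "$1"
    elif [ -n "$2" ]; then printf '%s\n' "$2"
    else printf '%s\n' "$3"
    fi
}

resolve ""    ""   "30"   # → 30  (config default)
resolve ""    "60" "30"   # → 60  (env var overrides config)
resolve "120" "60" "30"   # → 120 (CLI wins)
```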
```shell
# Node names (set to your actual OpenShift worker nodes)
NODE1_NAME="worker1.example.com"
NODE2_NAME="worker2.example.com"

# Network interface
INTERFACE="br-ex"

# Capture settings
DURATION="30"
PACKET_COUNT="1000"   # Max packets (ignored with --no-packet-limit)
LOCAL_OUTPUT="${HOME}/ipsec-captures"

# tcpdump filter (use {NODE1_IP} and {NODE2_IP} as placeholders)
FILTER="host {NODE1_IP} and host {NODE2_IP} and esp"
TCPDUMP_EXTRA=""

# Retis settings
RETIS_IMAGE="quay.io/retis/retis"
RETIS_FILTER=""
```

```shell
./run-ipsec-diagnostics.sh --help
```
```text
Options:
  --node1             First node - sender (default: from config)
  --node2             Second node - receiver (default: from config)
  --interface         Network interface (default: br-ex)
  --duration          Capture duration in seconds (default: 30)
  --output            Local output directory (default: ~/ipsec-captures)
  --filter            tcpdump filter (default: ESP between nodes)
  --skip-retis        Skip Retis capture
  --retis-node        Node where Retis runs (dropping side, default: node2)
  --monitor-icv       Monitor for ICV failures and auto-stop
  --icv-threshold     Number of ICV failures before stopping (default: 3)
  --no-packet-limit   Run tcpdump for full duration (ignore packet count)
```
By default, tcpdump stops after 1000 packets or when the duration elapses, whichever comes first.
Use --no-packet-limit to capture for the full duration regardless of packet count.
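The two modes boil down to whether a `-c` packet cap accompanies the time bound. A sketch of how the invocation might be built (`build_tcpdump_cmd` is illustrative, not the script's own function; `timeout`, `-s0`, `-c`, and `-w` are standard tcpdump/coreutils usage):

```shell
# Build a tcpdump command line for either mode: `timeout` bounds the run
# by duration, and -c (when a count is given) additionally caps the packets.
build_tcpdump_cmd() {
    local duration="$1" iface="$2" out="$3" filter="$4" count="${5:-}"
    local cmd="timeout ${duration} tcpdump -i ${iface} -s0 -w ${out}"
    if [ -n "$count" ]; then
        cmd="${cmd} -c ${count}"
    fi
    printf '%s %s\n' "$cmd" "$filter"
}

build_tcpdump_cmd 30 br-ex /tmp/esp.pcap esp 1000
# → timeout 30 tcpdump -i br-ex -s0 -w /tmp/esp.pcap -c 1000 esp
build_tcpdump_cmd 30 br-ex /tmp/esp.pcap esp     # --no-packet-limit behaviour
# → timeout 30 tcpdump -i br-ex -s0 -w /tmp/esp.pcap esp
```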
Captures include:
- XFRM state/policy at START and END (for comparison)
- tcpdump ESP packets on both nodes (synchronized, full packet via `-s0`)
- Retis with an `xfrm_audit_state_icvfail`/stack probe on the dropping node

```shell
# Compare XFRM state before and after capture
diff ~/ipsec-captures/diag-*/xfrm-*-start.txt ~/ipsec-captures/diag-*/xfrm-*-end.txt
# View IPsec Security Associations at start
cat ~/ipsec-captures/diag-*/xfrm-*-start.txt
# Look for:
# - src/dst IPs
# - ESP SPI values
# - aead rfc4106(gcm(aes)) encryption
# - replay-window settings
```
```shell
# Basic read
tcpdump -r ~/ipsec-captures/diag-*/node1-esp.pcap -nn
tcpdump -r ~/ipsec-captures/diag-*/node2-esp.pcap -nn
# Show ESP details
tcpdump -r ~/ipsec-captures/diag-*/node1-esp.pcap -nn -v esp
# Count packets on both nodes (should match for synchronized captures)
echo "Node1: $(tcpdump -r ~/ipsec-captures/diag-*/node1-esp.pcap -nn esp 2>/dev/null | wc -l)"
echo "Node2: $(tcpdump -r ~/ipsec-captures/diag-*/node2-esp.pcap -nn esp 2>/dev/null | wc -l)"
```

```shell
# Full decode
tshark -r ~/ipsec-captures/diag-*/node1-esp.pcap -V -Y "esp"
# Show ESP SPIs and sequence numbers (critical for ICV failure correlation)
tshark -r ~/ipsec-captures/diag-*/node1-esp.pcap -Y "esp" -T fields \
  -e frame.time -e ip.src -e ip.dst -e esp.spi -e esp.sequence
# Compare packets between nodes (find the same ESP sequence in both captures)
tshark -r ~/ipsec-captures/diag-*/node2-esp.pcap -Y "esp" -T fields \
  -e frame.time -e ip.src -e ip.dst -e esp.spi -e esp.sequence
# Statistics
tshark -r ~/ipsec-captures/diag-*/node1-esp.pcap -q -z io,stat,1
```
```shell
# On Linux
retis print ~/ipsec-captures/diag-*/retis_icv.data
retis sort ~/ipsec-captures/diag-*/retis_icv.data
# On macOS (via Podman)
podman run --rm -v ~/ipsec-captures:/data:ro quay.io/retis/retis print /data/diag-*/retis_icv.data
podman run --rm -v ~/ipsec-captures:/data:ro quay.io/retis/retis sort /data/diag-*/retis_icv.data
# View Retis output log
cat ~/ipsec-captures/diag-*/retis-output.log
```

To find corrupted packets:

- Find the dropped packet in the Retis output: look for `xfrm_audit_state_icvfail` events
- Get the ESP sequence number from the dropped packet
- Search both pcap files for the same sequence number
- Compare packet data between sender and receiver to identify where corruption occurs
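Once both pcaps have been reduced to `spi sequence` pairs with the tshark field commands above, the cross-capture search can be sketched as a set difference (file names illustrative). Note that a packet corrupted in transit usually still appears in *both* captures and must be compared byte-for-byte; this only flags sequences absent from the receiver capture:

```shell
# List ESP sequence numbers present in the sender capture but absent from the
# receiver capture. Input files: one "spi seq" pair per line, as produced by
# `tshark ... -T fields -e esp.spi -e esp.sequence`.
missing_at_receiver() {
    comm -23 <(sort -u "$1") <(sort -u "$2")
}

# Illustrative data:
printf '0x1a2b 101\n0x1a2b 102\n0x1a2b 103\n' > /tmp/node1-espseq.txt
printf '0x1a2b 101\n0x1a2b 103\n'             > /tmp/node2-espseq.txt
missing_at_receiver /tmp/node1-espseq.txt /tmp/node2-espseq.txt
# → 0x1a2b 102
```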
After capturing, verify that timestamps are synchronized across all captures:
```shell
# Run verification on capture output directory
./verify-capture-timestamps.sh ~/ipsec-captures/diag-20241205-143022
```

Output:
```text
╔════════════════════════════════════════════════════════════════╗
║              Capture Timestamp Verification                    ║
╚════════════════════════════════════════════════════════════════╝
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Verification Summary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Capture Files Present:    PASS
  Timing Data Available:    PASS
  Capture Start Alignment:  PASS
  Packet Count Match:       PASS
  Retis Data Available:     PASS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ✓ ALL CHECKS PASSED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
The verification script checks:
- Capture Files Present: All required pcap and timing files exist
- Timing Data Available: START/END timestamps recorded for each node
- Capture Start Alignment: Time difference between when captures started on each node
- Packet Count Match: Same number of ESP packets on both nodes
- Retis Data Available: ICV failure tracking data captured
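The first of these checks amounts to an existence test over the expected artifacts. A minimal sketch (the file list mirrors the output layout shown earlier; the function name is illustrative, not the script's own):

```shell
# Fail fast if any expected capture artifact is missing from the output dir.
check_files_present() {
    local dir="$1" f missing=0
    for f in node1-esp.pcap node2-esp.pcap node1-timing.txt node2-timing.txt sync-time.txt; do
        if [ ! -e "${dir}/${f}" ]; then
            echo "MISSING: ${f}"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ] && echo "Capture Files Present: PASS"
}
```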
Capture Start Alignment:
| Status | Difference | Meaning |
|---|---|---|
| ✓ Good | <1s | Captures well aligned |
| ✓ Acceptable | <5s | Normal variance from oc debug startup |
| ⚠ Large offset | <10s | Captures may have limited overlap |
| ✗ Very large | >10s | Captures may not overlap - rerun |
Important: this check measures when each capture started, NOT NTP/chrony clock accuracy. A difference of 500 ms-2 s is normal due to `oc debug` pod startup timing variance. It does NOT indicate clock skew; verify actual NTP sync with `chronyc tracking` on each node.
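The alignment check itself is just the gap between the two recorded start times. A sketch at whole-second precision (GNU `date` assumed, as on RHEL; the real script uses `bc` for sub-second math):

```shell
# Absolute difference, in whole seconds, between two capture start timestamps.
start_offset_seconds() {
    local t1 t2 d
    t1=$(date -d "$1" +%s)   # GNU date; parses ISO-8601 timestamps
    t2=$(date -d "$2" +%s)
    d=$(( t2 - t1 ))
    printf '%s\n' "${d#-}"   # strip the sign for an absolute value
}

start_offset_seconds "2025-12-03T13:12:18-05:00" "2025-12-03T13:12:19-05:00"
# → 1
```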
Requirements: `tshark` and `bc`

```shell
# macOS
brew install wireshark
# RHEL 9 / Fedora
sudo dnf install wireshark-cli bc
# Debian / Ubuntu
sudo apt install tshark bc
```
- Check IPsec is enabled:

  ```shell
  oc get pods -n openshift-ovn-kubernetes -l app=ovn-ipsec
  ```

- Generate traffic: ping or run workloads between the nodes during the capture

- Try a broader filter:

  ```shell
  ./run-ipsec-diagnostics.sh --filter "esp" --duration 30
  ```

- Check the capture logs:

  ```shell
  cat /tmp/capture-<nodename>.log
  cat /tmp/tcpdump-<nodename>.log
  ```

Use simple filters:

```shell
# Good
--filter "esp"
--filter "host 10.0.0.1"
# Avoid complex filters with special chars
```

Use a home-directory path (not /tmp) for Podman volume mounts:

```shell
# Wrong (won't work on macOS)
podman run -v /tmp/captures:/data ...
# Correct
podman run -v ~/ipsec-captures:/data ...
```

The filter is passed straight to tcpdump. Common issues:

- Use `esp`, not `proto esp`
- Ensure the IPs are correct
- Check quoting in the config file
| Requirement | Purpose |
|---|---|
| `oc` CLI | Cluster access |
| Cluster admin | `oc debug node` permission |
| RHCOS nodes | toolbox with tcpdump |
| For analysis | tcpdump, tshark, or Podman |
| Platform | Status |
|---|---|
| macOS | ✓ Tested |
| RHEL 9 | ✓ Tested |
| Linux | ✓ Should work |
Example output from a full diagnostics run:

```text
╔════════════════════════════════════════════════════════════════╗
║               IPsec ICV Failure Diagnostics                    ║
╚════════════════════════════════════════════════════════════════╝
Cluster: https://api.cluster.example.com:6443
Node 1 (sender):   worker1.example.com
Node 2 (receiver): worker2.example.com
Retis node (dropping side): worker2.example.com
Interface: br-ex
Duration: 120s
Capture mode: duration-only (no packet limit)
Monitor ICV: true (threshold: 3)
Output: ~/ipsec-captures/diag-20251203-131212

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Phase 1: XFRM State & Policy Dump (START - before capture)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Dumping XFRM from worker1 (start)...
  Saved: xfrm-worker1-start.txt
Dumping XFRM from worker2 (start)...
  Saved: xfrm-worker2-start.txt

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Phase 2: Synchronized Capture - tcpdump + Retis (120s)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
tcpdump filter: host 10.0.0.1 and host 10.0.0.2 and esp
Retis filter:   src host 10.0.0.1 and dst host 10.0.0.2
Retis running on: worker2.example.com (dropping side)
Starting synchronized captures...
Capture start time: 2025-12-03T13:12:18-05:00
Starting tcpdump on worker1.example.com...
Starting tcpdump on worker2.example.com...
Starting Retis on worker2.example.com (ICV failure tracking)...
Waiting for captures to complete (120s)...
  Time remaining: 10 seconds | Running: 3 | ICV failures: 0
All captures completed

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Phase 3: XFRM State & Policy Dump (END - after capture)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Dumping XFRM from worker1 (end)...
  Saved: xfrm-worker1-end.txt
Dumping XFRM from worker2 (end)...
  Saved: xfrm-worker2-end.txt

╔════════════════════════════════════════════════════════════════╗
║                      Results Summary                           ║
╚════════════════════════════════════════════════════════════════╝
Output directory: ~/ipsec-captures/diag-20251203-131212
total 1.2M
-rw-r--r--. 1 user user 4.0K Dec  3 13:13 node1-esp.pcap
-rw-r--r--. 1 user user  512 Dec  3 13:13 node1-timing.txt
-rw-r--r--. 1 user user 811K Dec  3 13:13 node2-esp.pcap
-rw-r--r--. 1 user user  512 Dec  3 13:13 node2-timing.txt
-rw-r--r--. 1 user user  64K Dec  3 13:13 retis_icv.data
-rw-r--r--. 1 user user  256 Dec  3 13:13 retis-timing.txt
-rw-r--r--. 1 user user 2.0K Dec  3 13:13 retis-output.log
-rw-r--r--. 1 user user   32 Dec  3 13:12 sync-time.txt
-rw-r--r--. 1 user user 1.5K Dec  3 13:12 xfrm-worker1-start.txt
-rw-r--r--. 1 user user 1.5K Dec  3 13:13 xfrm-worker1-end.txt
-rw-r--r--. 1 user user  26K Dec  3 13:12 xfrm-worker2-start.txt
-rw-r--r--. 1 user user  26K Dec  3 13:13 xfrm-worker2-end.txt
```
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/improvement`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/improvement`)
- Open a Pull Request
This project follows KISS (Keep It Simple, Stupid) principles and 12-Factor App methodology. All code contributions must adhere to security best practices and maintainability standards.
Key Principles:
- Security: Input validation, secure coding practices, no hardcoded secrets, proper error handling
- KISS: Simple, readable solutions over complex optimizations
- 12-Factor: Environment-based configuration, stateless processes, proper logging
For Cursor IDE users: This repository includes .cursorrules that automatically enforce these standards. The rules cover:
- Input validation and sanitization
- Secure bash scripting (`set -euo pipefail`, proper quoting)
- Container security (Podman/Buildah only)
- Configuration via environment variables
- Structured logging without sensitive data exposure
Code Requirements:
- All bash scripts must use `set -euo pipefail` and quote all variables
- Never hardcode credentials or secrets (use environment variables)
- Validate all inputs and file paths
- Use Podman/Buildah for containers (never Docker)
- Follow 12-factor config management (env vars > config files > hardcoded values)
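A minimal skeleton satisfying the requirements above might look like this (variable names are illustrative, not taken from the repo's scripts):

```shell
#!/usr/bin/env bash
# Skeleton following the stated rules: strict mode, quoted variables,
# environment-based configuration with safe defaults, no hardcoded secrets.
set -euo pipefail

# 12-factor style: env vars override the built-in defaults.
NODE1_NAME="${NODE1_NAME:-worker1.example.com}"
DURATION="${DURATION:-30}"

# Basic input validation before doing any work.
case "$DURATION" in
    ''|*[!0-9]*) echo "DURATION must be a positive integer" >&2; exit 1 ;;
esac

main() {
    printf 'capturing on %s for %ss\n' "$NODE1_NAME" "$DURATION"
}

main "$@"
```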
See .cursorrules for complete development guidelines and security rules.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.