Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
121 changes: 121 additions & 0 deletions docs/src/05_developers/03_e2e.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
# E2E Test Failure Investigation Guide

This guide provides a structured approach to investigating end-to-end (e2e) test failures in the cluster-api-addon-provider-fleet project.

## Understanding E2E Tests

Our CI pipeline runs several e2e tests to validate functionality across different Kubernetes versions:

- **Cluster Class Import Tests**: Validate the cluster class import functionality
- **Import Tests**: Validate the general import functionality
- **Import RKE2 Tests**: Validate import functionality specific to RKE2 clusters

Each test runs on multiple Kubernetes versions (stable and latest) to ensure compatibility.

## Accessing Test Artifacts

When e2e tests fail, the CI pipeline automatically collects and uploads artifacts containing valuable debugging information. These artifacts are created using [crust-gather](https://github.com/crust-gather/crust-gather), a tool that captures the state of Kubernetes clusters.

### Finding the Artifact URL

1. Navigate to the failed GitHub Actions workflow run
2. Scroll down to the "Artifacts" section
3. Find the artifact corresponding to the failed test (e.g., `artifacts-cluster-class-import-stable`)
4. Copy the artifact URL (right-click on the artifact link and copy the URL)

## Using the serve-artifact.sh Script

The `serve-artifact.sh` script allows you to download and serve the test artifacts locally, providing access to the Kubernetes contexts from the test environment.

### Prerequisites

- A GitHub token with `repo` read permissions (set as `GITHUB_TOKEN` environment variable)
- `kubectl` installed, `krew` installed.
- [crust-gather](https://github.com/crust-gather/crust-gather) installed. Can be replicated with nix, if available.

### Serving Artifacts

Fetch the `serve-artifact.sh` script from the [crust-gather GitHub repository](https://github.com/crust-gather/crust-gather):

```bash
curl -L https://raw.githubusercontent.com/crust-gather/crust-gather/refs/heads/main/serve-artifact.sh -o serve-artifact.sh && chmod +x serve-artifact.sh
```

```bash
# Using the full artifact URL
./serve-artifact.sh -u https://github.com/rancher/cluster-api-addon-provider-fleet/actions/runs/15737662078/artifacts/3356068059 -s 0.0.0.0:9095

# OR using individual components
./serve-artifact.sh -o rancher -r cluster-api-addon-provider-fleet -a 3356068059 -s 0.0.0.0:9095
```

This will:
1. Download the artifact from GitHub
2. Extract its contents
3. Start a local server that provides access to the Kubernetes contexts from the test environment

## Investigating Failures

Once the artifact server is running, you can use various tools to investigate the failure:

### Using k9s

[k9s](https://k9scli.io/) provides a terminal UI to interact with Kubernetes clusters:

1. Open a new terminal
2. Run `k9s`
3. Press `:` to open the command prompt
4. Type `ctx` and press Enter
5. Select the context from the test environment (there may be multiple contexts). `dev` for the e2e tests.
6. Navigate through resources to identify issues:
- Check pods for crash loops or errors
- Examine events for warnings or errors
- Review logs from relevant components

### Common Investigation Paths

1. **Check Fleet Resources**:
- `FleetAddonConfig` resources
- Fleet `Cluster` resource
- CAPI `ClusterGroup` resources
- Ensure all relevant labels are present on above.
- Check for created `Fleet` namespace `cluster-<ns>-<cluster name>-<random-prefix>` that it is consitent with the NS in the Cluster `.status.namespace`.
- Check for `ClusterRegistrationToken` in the cluster namespace.
- Check for `BundleNamespaceMapping` in the `ClusterClass` namespace if a cluster references a `ClusterClass` in a different namespace

2. **Check CAPI Resources**:
- Cluster resource
- Check for `ControlPlaneInitialized` condition to be `true`
- ClusterClass resources, these are present and have `status.observedGeneration` consistent with the `metadata.generation`
- Continue on a per-cluster basis

3. **Check Controller Logs**:
- Look for error messages or warnings in the controller logs in the `caapf-system` namespace.
- Check for reconciliation failures in `manager` container. In case of upstream installation, check for `helm-manager` container logs.

4. **Check Kubernetes Events**:
- Events often contain information about failures, otherwise `CAAPF` publishes events for each resource apply from CAPI `Cluster`, including Fleet `Cluster` in the cluster namespace, `ClusterGroup` and `BundleNamespaceMapping` in the `ClusterClass` namespace. These events are created by `caapf-controller` component.

## Common Failure Patterns

### Import Failures

- **Symptom**: Fleet `Cluster` not created or in error state
- **Investigation**: Check the controller logs in the `cattle-fleet-system` namespace for errors during import processing. Check for errors in the `CAAPF` logs for missing cluster definition.
- **Common causes**:
- Fleet cluster import process is serial, and hot loop in other cluster import blocks further cluster imports. Fleet issue.
- CAPI `Cluster` is not ready and does not have `ControlPlaneInitialized` condition. Issue with CAPI or requires more time to be ready.
- Otherwise `CAAPF` issue.

### Cluster Class Failures

- **Symptom**: ClusterClass not properly imported or is not evaluated as a target.
- **Investigation**: Check for the `BundleNamespaceMapping` in the `ClusterClass` namespace named after the `Cluster` resource. Check the controller logs in the `caapf-system` namespace for errors during ClusterClass processing. Check `ClusterGroup` resource in the `Cluster` namespace.
- **Common causes**:
- Check for `Cluster` referencing `ClusterClass` in a different namespace.
- In the event of missing resources, `CAAPF` related error.

## Reference

- [crust-gather GitHub repository](https://github.com/crust-gather/crust-gather)
- [k9s documentation](https://k9scli.io/topics/commands/)
2 changes: 1 addition & 1 deletion src/metrics.rs
Original file line number Diff line number Diff line change
Expand Up @@ -93,7 +93,7 @@ impl Default for Diagnostics {
fn default() -> Self {
Self {
last_event: Utc::now(),
reporter: "doc-controller".into(),
reporter: "caapf-controller".into(),
}
}
}
Expand Down