
Conversation


@Damien-DAGORN commented on Dec 29, 2025

Provider SQL: Default OCI packaging + stabilized E2E on Kind

Summary

This PR standardizes provider installation via OCI package (tag latest) and hardens local end-to-end tests on a Kind cluster. It removes the temporary MariaDB skip option and ensures both MariaDB (no-TLS & TLS) and PostgreSQL suites run reliably with clear diagnostics and robust Helm-based setup.

Key Changes

  • Default OCI packaging flow:
    • Build and push the .xpkg to an in-cluster registry.
    • Install the Provider from the OCI reference using the :latest tag.
  • Retained local cache fallback (PVC + .gz) when USE_OCI=false.
  • Hardened integration scripts with stricter shell behavior and clearer logs:
    • set -euo pipefail, step-by-step logging, extended waits, and better error messages.
    • helm upgrade --install --wait --timeout and repo add/update for deterministic installs (see the sketch after this list).
  • MariaDB:
    • Fixed Kubernetes secret creation syntax and improved install flow.
    • Always run both no-TLS and TLS tests (removed SKIP_MARIADB).
    • Using Bitnami chart version 24.0.2 (MariaDB 12.1.2).
  • PostgreSQL:
    • Switched chart values to auth.postgresPassword, disabled persistence for Kind, and added readiness checks.
    • Port-forward readiness loop and password generation hardened under pipefail.
    • Using Bitnami chart version 18.2.0 (PostgreSQL 18.1.0).
  • Cleanup reliability:
    • Guarded deletion where variables may not be defined (e.g., ProviderConfig during teardown).
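
A minimal sketch of the deterministic Helm flow mentioned above. The release name postgresdb and the chart version come from this PR; the exact values files and the persistence key (primary.persistence.enabled) are assumptions, not copied from integration_tests.sh:

# Sketch only: deterministic install of the PostgreSQL test database.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm upgrade --install postgresdb bitnami/postgresql \
  --version 18.2.0 \
  --set auth.postgresPassword="${postgres_root_pw}" \
  --set primary.persistence.enabled=false \
  --wait --timeout 10m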

Implementation Details

  • OCI mode is the default path: an in-cluster registry is launched, the provider .xpkg is pushed with both version and latest tags, and Crossplane installs the Provider from the OCI reference (a sketch follows this list).
  • Local cache mode remains supported for offline/debug scenarios by setting USE_OCI=false.
  • Bitnami Helm charts are used for database deployments:
    • MariaDB: version 24.0.2 (MariaDB 12.1.2)
    • PostgreSQL: version 18.2.0 (PostgreSQL 18.1.0)
    • Persistence disabled for Kind cluster compatibility
  • Provider installation includes additional health checks and timeouts to avoid flakiness.
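
As a rough illustration of the OCI install path described above, the Provider object applied by the script might look like the following. The registry hostname and the provider-sql repository path are assumptions, not copied from integration_tests.sh, and a plain-HTTP in-cluster registry typically needs extra configuration on the Kind nodes:

# Sketch only: install the provider from the in-cluster registry using :latest.
cat <<EOF | kubectl apply -f -
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-sql
spec:
  package: registry.crossplane-system.svc.cluster.local:5000/provider-sql:latest
  packagePullPolicy: Always
EOF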

Relevant scripts:

  • cluster/local/integration_tests.sh
  • cluster/local/postgresdb_functions.sh

Test Plan

Run the full local end-to-end suite on Kind.

Prerequisites:

  • Docker, kind, kubectl, and helm available on your machine.

Examples:

# Default: OCI mode (in-cluster registry, provider installed via :latest)
bash cluster/local/integration_tests.sh

# Explicit OCI mode
env USE_OCI=true bash cluster/local/integration_tests.sh

# Local cache mode (PVC + .gz) for offline/debugging
env USE_OCI=false bash cluster/local/integration_tests.sh

Expected outcome:

  • Kind cluster is created, Crossplane installed via Helm.
  • Provider package is pushed to the local registry and installed.
  • MariaDB no-TLS & TLS tests run and pass.
  • PostgreSQL tests run and pass.
  • Cleanup completes without errors.

Flags & Configuration

  • USE_OCI (default: true), illustrated by the sketch after this list:
    • true: in-cluster registry + OCI install with :latest tag.
    • false: local cache path via PVC referencing the .gz package.
  • Optional environment defaults exist for Kind and node image (e.g., KIND_VERSION, KIND_NODE_IMAGE_TAG).
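
Purely illustrative sketch of the USE_OCI switch; the helper function names below are hypothetical and not taken from integration_tests.sh:

# Hypothetical sketch of how the script branches on USE_OCI.
USE_OCI="${USE_OCI:-true}"
if [ "${USE_OCI}" = "true" ]; then
  start_in_cluster_registry_and_push_xpkg   # hypothetical helper: registry + push, install via :latest
else
  install_provider_from_local_cache         # hypothetical helper: PVC + .gz fallback
fi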

CI Impact

  • Aligns CI with the default OCI packaging flow for consistency.
  • Reduces transient failures via deterministic Helm operations and clearer readiness checks.

Breaking Changes

  • None. The temporary SKIP_MARIADB flag was removed; the suite now always runs both MariaDB and PostgreSQL as intended.

Checklist

  • Provider installed via OCI :latest by default
  • Local cache fallback supported (USE_OCI=false)
  • MariaDB (no-TLS & TLS) tests passing
  • PostgreSQL tests passing
  • Cleanup is robust

Notes

If you encounter chart/image pull issues on Kind, ensure the Bitnami repos are added/updated and avoid pinning to stale chart versions. The integration tests use the following verified chart versions:

  • MariaDB: 24.0.2 (latest stable as of December 2025)
  • PostgreSQL: 18.2.0 (latest stable as of December 2025)
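
If a chart version cannot be pulled, refreshing the repo and listing the versions that are actually available usually makes a stale pin obvious, for example:

helm repo update
helm search repo bitnami/mariadb --versions | head -n 5
helm search repo bitnami/postgresql --versions | head -n 5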

Increased waits and explicit readiness checks are in place, but logs from the integration script will guide further troubleshooting.

Copilot AI review requested due to automatic review settings on December 29, 2025 at 17:18

Copilot AI left a comment


Pull request overview

This PR refactors the end-to-end testing infrastructure to improve reliability and debugging capabilities. The changes introduce OCI-based package delivery as the default mode, add a flag to skip MariaDB tests for focused debugging, and harden PostgreSQL test setup with better error handling and diagnostics.

Key Changes:

  • Switched to OCI mode by default for provider package delivery, using an in-cluster registry instead of mounted cache volumes
  • Added SKIP_MARIADB flag to enable PostgreSQL-only test runs
  • Improved PostgreSQL installation robustness with helm upgrade pattern, disabled persistence, and enhanced readiness checks

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 15 comments.

  • cmd/provider/main.go: Added configurable cache-sync-timeout flag for controller startup
  • cluster/local/postgresdb_functions.sh: Hardened PostgreSQL setup with pipefail-safe password generation, helm upgrade pattern, port-forward readiness checks, and safer cleanup
  • cluster/local/integration_tests.sh: Implemented OCI mode with in-cluster registry, added SKIP_MARIADB flag, improved provider health checks, and removed hardcoded Helm chart versions
  • .github/workflows/ci.yml: Added tmate debugging session for interactive troubleshooting


"${KUBECTL}" get all -A || true
"${KUBECTL}" get events --sort-by=.lastTimestamp | tail -n 200 || true
"${HELM}" status postgresdb || true
echo_error "Helm install/upgrade of PostgresDB failed"

Copilot AI Dec 29, 2025


The error message uses echo_error but then doesn't exit. After checking for helm install failure and dumping diagnostics, the script should exit with a non-zero status code since the installation failed. Currently, the script continues executing even after the Helm install has failed.

Suggested change:

- echo_error "Helm install/upgrade of PostgresDB failed"
+ echo_error "Helm install/upgrade of PostgresDB failed"
+ return $rc


# Start port-forward; ignore output filtering failures under pipefail
echo_sub_step "Start port-forward to svc/postgresdb-postgresql:5432"
( "${KUBECTL}" port-forward --namespace default svc/postgresdb-postgresql 5432:5432 | grep -v "Handling connection for" || true ) &

Copilot AI Dec 29, 2025


The port-forward command is being launched in a subshell with error suppression that could hide issues. The grep filter failure is silenced with '|| true', and any kubectl errors are redirected but not logged. If the port-forward fails to start, the script will proceed to the readiness check loop which will eventually timeout without clear indication of what went wrong. Consider capturing and logging stderr from kubectl separately before discarding it.
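
A possible shape of what this comment asks for, sketched here rather than taken from the PR; the log path under ${projectdir}/.work is an assumption:

# Sketch only: keep kubectl's stderr in a log file so a failed port-forward
# can be diagnosed later, while still filtering the noisy connection messages.
( "${KUBECTL}" port-forward --namespace default svc/postgresdb-postgresql 5432:5432 \
    2>"${projectdir}/.work/postgres-pf.log" \
  | grep -v "Handling connection for" || true ) &
PORT_FORWARD_PID=$!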

Comment on lines +49 to +58
# Wait for local port to be ready
for i in {1..20}; do
if nc -z 127.0.0.1 5432 >/dev/null 2>&1; then
echo_sub_step "Port-forward ready (PID: ${PORT_FORWARD_PID})"
echo_step_completed
break
fi
sleep 0.5
if [ $i -eq 20 ]; then

Copilot AI Dec 29, 2025


The hardcoded timeout of 10 seconds (20 iterations × 0.5 seconds) may be too short for slower environments or loaded systems. The port-forward can take longer to establish in resource-constrained CI environments. Consider making this configurable or increasing the timeout to a more conservative value like 30 seconds.

Suggested change:

- # Wait for local port to be ready
- for i in {1..20}; do
-   if nc -z 127.0.0.1 5432 >/dev/null 2>&1; then
-     echo_sub_step "Port-forward ready (PID: ${PORT_FORWARD_PID})"
-     echo_step_completed
-     break
-   fi
-   sleep 0.5
-   if [ $i -eq 20 ]; then
+ # Wait for local port to be ready; allow overriding max attempts via env var
+ PORT_FORWARD_MAX_ATTEMPTS="${POSTGRES_PORT_FORWARD_MAX_ATTEMPTS:-60}"
+ for ((i=1; i<=PORT_FORWARD_MAX_ATTEMPTS; i++)); do
+   if nc -z 127.0.0.1 5432 >/dev/null 2>&1; then
+     echo_sub_step "Port-forward ready (PID: ${PORT_FORWARD_PID})"
+     echo_step_completed
+     break
+   fi
+   sleep 0.5
+   if [ "$i" -eq "$PORT_FORWARD_MAX_ATTEMPTS" ]; then

mkdir -p "${projectdir}/.work"
( kubectl -n crossplane-system port-forward svc/registry 5000:5000 >/dev/null 2>&1 & echo $! >"${projectdir}/.work/registry-pf.pid" )
for i in {1..20}; do nc -z localhost 5000 && break || sleep 0.5; done
}

Copilot AI Dec 29, 2025


The registry port-forward process is launched in a background subshell with all output redirected to /dev/null, making it difficult to diagnose failures. The PID is captured, but if the port-forward fails immediately after starting, the subsequent nc check loop will timeout without useful diagnostic information about why the port-forward failed.

Suggested change:

- }
+ ( kubectl -n crossplane-system port-forward svc/registry 5000:5000 >"${projectdir}/.work/registry-pf.log" 2>&1 & echo $! >"${projectdir}/.work/registry-pf.pid" )

Comment on lines +7 to +9
set +o pipefail
postgres_root_pw=$(LC_ALL=C tr -cd "A-Za-z0-9" </dev/urandom | head -c 32 || true)
set -o pipefail

Copilot AI Dec 29, 2025


The pipefail handling for password generation is overly complex. Temporarily disabling pipefail with 'set +o pipefail' and then re-enabling it just to handle the 'head' command's SIGPIPE is a workaround that adds complexity. A cleaner approach would be to check if the password was generated successfully after the command, or use a different method that doesn't trigger SIGPIPE.

Suggested change:

- set +o pipefail
- postgres_root_pw=$(LC_ALL=C tr -cd "A-Za-z0-9" </dev/urandom | head -c 32 || true)
- set -o pipefail
+ postgres_root_pw=$(LC_ALL=C tr -cd "A-Za-z0-9" </dev/urandom | head -c 32 || true)
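
One way to sidestep SIGPIPE entirely, sketched as an example of the "different method" the comment alludes to (not what the PR implements): read a bounded amount of randomness first, then trim in the shell so no pipe is closed early.

# Sketch only: head finishes on its own and base64 reads to EOF, so nothing
# in the pipeline is killed by SIGPIPE even with "set -o pipefail" active.
raw=$(head -c 256 /dev/urandom | base64)
postgres_root_pw=$(printf '%s' "$raw" | LC_ALL=C tr -cd 'A-Za-z0-9')
postgres_root_pw=${postgres_root_pw:0:32}   # trim to 32 characters in the shell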

@Damien-DAGORN force-pushed the fix/end-to-end-tests branch 2 times, most recently from f938334 to f8266a6, on December 29, 2025 at 17:57
@Damien-DAGORN merged commit ad0b78c into master on Dec 30, 2025. 6 checks passed.
