@ycombinator
Contributor

What does this PR do?

This PR reconfigures the FIPS Buildkite pipeline to a) point it to the Staging GovCloud/FRH ESS environment and b) use an ESS API key from that environment for spinning up deployments.

Why is it important?

To run FIPS-related tests against the officially configured FedRAMP High (FRH) ESS environment.

@mergify
Contributor

mergify bot commented Jul 30, 2025

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-\d.\d is the label that automatically backports to the 8.\d branch. \d is the digit.
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@ycombinator ycombinator added skip-changelog backport-8.19 Automated backport to the 8.19 branch backport-9.1 Automated backport to the 9.1 branch labels Jul 30, 2025
@mergify
Contributor

mergify bot commented Aug 7, 2025

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b bk-fips-redirect upstream/bk-fips-redirect
git merge upstream/main
git push upstream bk-fips-redirect

@ycombinator ycombinator marked this pull request as ready for review August 14, 2025 00:03
@ycombinator ycombinator requested a review from pchila August 14, 2025 00:03
@ycombinator ycombinator requested review from pkoutsovasilis and removed request for pchila August 14, 2025 01:29
@ycombinator
Contributor Author

ycombinator commented Aug 14, 2025

@pkoutsovasilis I could use your input here on how best to resolve the CI failure on this PR, since it pertains to the contents of the .package-version file.

The "Start ESS stack for FIPS integration tests" step is failing in CI like so:

│ Error: failed creating deployment
│
│   with ec_deployment.integration-testing,
│   on deployment.tf line 100, in resource "ec_deployment" "integration-testing":
│  100: resource "ec_deployment" "integration-testing" {
│
│ api error: 1 error occurred:
│ 	* clusters.cluster_plan_change_prohibited: The requested cluster configuration change is not permitted. Value of field 'elasticsearch.docker_image' cannot
│ be changed to [docker.elastic.co/cloud-release/elasticsearch-cloud-ess-fips:9.2.0-0c2a1cac-SNAPSHOT]. (resources.elasticsearch[0].elasticsearch.docker_image)
│

This is happening because there is no Docker image with this tag (9.2.0-0c2a1cac-SNAPSHOT) in the Staging GovCloud ESS environment, which is the environment we run our FIPS tests against. The tag comes from the stack_build_id field in the .package-version file:

"stack_build_id": "9.2.0-0c2a1cac-SNAPSHOT"

That field is read by the Terraform file used to create the ESS deployment:

images_version = coalesce(var.stack_build_id, var.stack_version)

Just for testing, I temporarily hardcoded the tag as 9.2.0-SNAPSHOT in the Terraform file, and it worked. So I know the issue is isolated to the tag not existing in the environment, and everything else in our configuration is OK.
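Terraform's coalesce returns the first argument that is neither null nor an empty string, which is why the image tag follows stack_build_id whenever that field is set, and why an empty stack_build_id falls back to stack_version. A minimal bash sketch of that selection rule; tf_coalesce is a hypothetical helper for illustration, not part of the repository:

```bash
#!/usr/bin/env bash
# Hypothetical helper that mimics Terraform's coalesce() for strings:
# print the first argument that is not an empty string.
tf_coalesce() {
  local arg
  for arg in "$@"; do
    if [[ -n "$arg" ]]; then
      printf '%s\n' "$arg"
      return 0
    fi
  done
  # Terraform's coalesce() fails when no argument qualifies; mirror that.
  echo "tf_coalesce: no non-empty argument" >&2
  return 1
}

stack_build_id="9.2.0-0c2a1cac-SNAPSHOT"
stack_version="9.2.0-SNAPSHOT"

# Mirrors: images_version = coalesce(var.stack_build_id, var.stack_version)
images_version="$(tf_coalesce "$stack_build_id" "$stack_version")"
echo "$images_version"               # 9.2.0-0c2a1cac-SNAPSHOT

# With an empty build ID, the SHA-less version wins.
tf_coalesce "" "$stack_version"      # 9.2.0-SNAPSHOT
```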

Do you have any thoughts on how we could accommodate the Staging GovCloud environment in our testing configuration?

Here is one idea I have; let me know if you have a better suggestion:

  1. In https://github.com/elastic/elastic-agent/blob/main/.package-version, create a new field, stack_version, which will have the same value as stack_build_id but without the SHA. So, today, the value of this new field would be 9.2.0-SNAPSHOT. The automation that keeps that file updated will set this new field as well.
  2. We pass this new field value to the Terraform file in addition to passing stack_build_id today.
  3. In the Terraform file, we introduce a new boolean variable fips. The FIPS Buildkite pipeline will pass this value as true to the Terraform file.
  4. In the Terraform file, if fips is set to true, we use the value of stack_version to create the image tag; otherwise we use the value of stack_build_id.
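Steps 1 and 4 of this proposal could look roughly like the bash sketch below. It assumes the build ID always has the shape <major.minor.patch>-<hex sha>[-SNAPSHOT], as in 9.2.0-0c2a1cac-SNAPSHOT; strip_build_sha and select_image_tag are hypothetical names, and the real change would live in the .package-version automation and the Terraform file rather than in a standalone script:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: derive a SHA-less stack_version from stack_build_id,
# then pick the image tag based on a fips flag.

# Strip the build SHA from IDs shaped like <major.minor.patch>-<hex sha>[-SNAPSHOT].
# Anything that does not match that assumed shape is returned unchanged.
strip_build_sha() {
  local id="$1"
  local re='^([0-9]+\.[0-9]+\.[0-9]+)-[0-9a-f]+(-SNAPSHOT)?$'
  if [[ "$id" =~ $re ]]; then
    printf '%s%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    printf '%s\n' "$id"
  fi
}

# FIPS runs use the SHA-less version; everything else keeps the build ID.
select_image_tag() {
  local fips="$1" stack_build_id="$2" stack_version="$3"
  if [[ "$fips" == "true" ]]; then
    printf '%s\n' "$stack_version"
  else
    printf '%s\n' "$stack_build_id"
  fi
}

stack_build_id="9.2.0-0c2a1cac-SNAPSHOT"
stack_version="$(strip_build_sha "$stack_build_id")"
select_image_tag true  "$stack_build_id" "$stack_version"   # 9.2.0-SNAPSHOT
select_image_tag false "$stack_build_id" "$stack_version"   # 9.2.0-0c2a1cac-SNAPSHOT
```

Returning unrecognised IDs unchanged keeps the sketch safe for values such as 9.2.0-SNAPSHOT that already carry no SHA.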

Another solution, similar to the above but needing more automation, would be to store something like a frh_stack_build_id field in the .package-version file and have automation set it to the stack build ID actually available in the ESS Staging GovCloud environment that most closely matches stack_build_id.

Keep in mind that we have a deadline to finish FIPS testing by the end of this week so we may need to go with a less-ideal solution for now and then revisit it to make it better.

@pkoutsovasilis
Contributor

pkoutsovasilis commented Aug 14, 2025

Hey @ycombinator 👋 I do like the following proposal

  1. In https://github.com/elastic/elastic-agent/blob/main/.package-version, create a new field, stack_version, which will have the same value as stack_build_id but without the SHA. So, today, the value of this new field would be 9.2.0-SNAPSHOT. The automation that keeps that file updated will set this new field as well.
  2. We pass this new field value to the Terraform file in addition to passing stack_build_id today.
  3. In the Terraform file, we introduce a new boolean variable fips. The FIPS Buildkite pipeline will pass this value as true to the Terraform file.
  4. In the Terraform file, if fips is set to true, we use the value of stack_version otherwise we use the value of stack_build_id, to create the image tag.

but let's move fast to get this up and running, and figure out the details of how .package-version can be utilised here afterwards. What do you think of the following?

  1. here, make a change along the lines of: if FIPS=TRUE then STACK_BUILD_ID=""
  2. similarly, here
  3. you don't need to care about Windows at the moment, but the change is similar, here

this will result in using only STACK_VERSION and not STACK_BUILD_ID
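The short-term suggestion can be sketched in bash as below. The assumption (not confirmed by the thread) is that the pipeline scripts hand these values to Terraform as TF_VAR_stack_build_id and TF_VAR_stack_version; configure_stack_vars is a hypothetical function name:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the pipeline-side override: when FIPS=true, blank out
# the build ID so coalesce(var.stack_build_id, var.stack_version) in the
# Terraform file falls back to the stack version.
# Assumption: the values reach Terraform as TF_VAR_* environment variables.
configure_stack_vars() {
  local build_id="${STACK_BUILD_ID:-}"
  if [[ "${FIPS:-false}" == "true" ]]; then
    build_id=""
  fi
  export TF_VAR_stack_version="${STACK_VERSION:-}"
  export TF_VAR_stack_build_id="$build_id"
}

FIPS=true
STACK_VERSION="9.2.0-SNAPSHOT"
STACK_BUILD_ID="9.2.0-0c2a1cac-SNAPSHOT"
configure_stack_vars
echo "stack_version=${TF_VAR_stack_version} stack_build_id='${TF_VAR_stack_build_id}'"
# prints: stack_version=9.2.0-SNAPSHOT stack_build_id=''
```

Non-FIPS runs are unaffected: with FIPS unset or false, the build ID passes through as before.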

@ycombinator
Copy link
Contributor Author

ycombinator commented Aug 14, 2025

The "Start ESS stack for FIPS integration tests" step in CI is failing with the error:

ec_deployment.integration-testing: Creating...
╷
│ Error: failed creating deployment
│
│   with ec_deployment.integration-testing,
│   on deployment.tf line 100, in resource "ec_deployment" "integration-testing":
│  100: resource "ec_deployment" "integration-testing" {
│
│ api error: 1 error occurred:
│ 	* clusters.cluster_plan_change_prohibited: The requested cluster configuration change is not permitted. Value of field 'elasticsearch.docker_image' cannot
│ be changed to [docker.elastic.co/cloud-release/elasticsearch-cloud-ess-fips:9.2.0-SNAPSHOT]. (resources.elasticsearch[0].elasticsearch.docker_image)
│
│
╵
╷
│ Error: failed creating deployment
│
│   with ec_deployment.integration-testing,
│   on deployment.tf line 100, in resource "ec_deployment" "integration-testing":
│  100: resource "ec_deployment" "integration-testing" {

I'm confused because the same image tag succeeded in the previous CI build: https://buildkite.com/elastic/elastic-agent/builds/25276#0198a930-20fb-4c9e-93c3-80392012a236/136-180

[EDIT] I've compared the Terraform execution plans from both CI builds (previous - successful vs. current - failed) and they are identical except for the docker.elastic.co/beats-ci/elastic-agent-cloud-fips image's SHA (which is expected to be different since we're building different commits) and the buildkite_id tag (which is also expected to be different as it's different builds).
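A plan comparison like this one can be scripted so the expected differences are filtered out before diffing. The bash sketch below assumes both plans were saved with terraform plan -out and rendered with terraform show -no-color; compare_plans is a hypothetical helper, and the ignore pattern is an assumption based on the two fields named above:

```bash
#!/usr/bin/env bash
# Hypothetical helper: diff two human-readable Terraform plans while ignoring
# lines expected to differ between builds (the buildkite_id tag and the
# per-commit beats-ci elastic-agent-cloud-fips image reference).
compare_plans() {
  local previous="$1" current="$2"
  local expected_diffs='buildkite_id|beats-ci/elastic-agent-cloud-fips'
  # grep -Ev drops matching lines; diff exits 0 only if the rest is identical.
  diff <(grep -Ev "$expected_diffs" "$previous") \
       <(grep -Ev "$expected_diffs" "$current")
}

# Example usage:
#   terraform show -no-color previous.tfplan > previous.txt
#   terraform show -no-color current.tfplan  > current.txt
#   compare_plans previous.txt current.txt && echo "plans match"
```

Because whole lines are dropped, any other change on a filtered line would also be hidden, so the pattern is kept as narrow as the two expected differences allow.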

[EDIT] I've also run terraform apply manually with the same terraform configuration that's being used in CI like so:

TF_VAR_docker_images_name_suffix=-fips \
TF_VAR_stack_version=9.2.0-SNAPSHOT \
TF_VAR_integration_server_docker_image="docker.elastic.co/cloud-release/elastic-agent-cloud-fips:9.2.0-SNAPSHOT" \
TF_VAR_deployment_template_id="aws-general-purpose" \
TF_VAR_ess_region="us-gov-east-1" \
EC_API_KEY="XXXXXXX" \
EC_ENDPOINT="https://api.staging.elastic-gov.com" \
terraform apply

I've run this three times so far, one after the other (with a terraform destroy in between). The first attempt succeeded, the second attempt failed with the same error that we're seeing in CI, and the third attempt succeeded as well. So I'm starting to think something about the Staging GovCloud environment might be flaky?
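The repeated apply/destroy cycle described above can be scripted to gather more samples of the suspected flakiness. A bash sketch; count_apply_failures is a hypothetical helper, and it assumes the TF_VAR_* and EC_* variables from the command above are already exported:

```bash
#!/usr/bin/env bash
# Hypothetical helper: repeat an apply/destroy cycle N times and print how
# many applies failed, to check whether a failure is intermittent.
count_apply_failures() {
  local attempts="$1" apply_cmd="$2" destroy_cmd="$3"
  local i failures=0
  for ((i = 1; i <= attempts; i++)); do
    if bash -c "$apply_cmd"; then
      echo "attempt $i: apply succeeded" >&2
    else
      echo "attempt $i: apply failed" >&2
      failures=$((failures + 1))
    fi
    # Always tear down, since a failed apply can still leave resources in state.
    if ! bash -c "$destroy_cmd"; then
      echo "attempt $i: destroy failed" >&2
    fi
  done
  echo "$failures"
}

# Example usage:
#   count_apply_failures 3 "terraform apply -auto-approve" "terraform destroy -auto-approve"
```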

@ycombinator ycombinator enabled auto-merge (squash) August 14, 2025 22:03
@elastic-sonarqube

@elasticmachine
Collaborator

💚 Build Succeeded

cc @ycombinator

Contributor

@oakrizan oakrizan left a comment

lgtm for CI part

@ebeahan ebeahan disabled auto-merge August 15, 2025 12:25
@ebeahan ebeahan merged commit a555980 into elastic:main Aug 15, 2025
19 checks passed
mergify bot pushed a commit that referenced this pull request Aug 15, 2025
…onment (#9198)

* Get Staging FRH/GovCloud API key from Vault

* Point Elastic Cloud API endpoint for TF provider to Staging FRH/GovCloud environment

* Updating Vault path for API key

* [Test commit] No-op change to trigger FIPS tests

* Set ESS region to Staging GovCloud region

* Check if variable is set before using it

* Use default value

* Parameterize deployment template ID

* Debugging terraform

* Forgot to pass deployment template ID

* Use TF_VAR_* instead of adding new env vars

* Fix deployment template ID

* Set ES and Kibana Docker image URLs

* Removing unrelated changes

* Set -fips suffix on default docker images

* Remove commented out lines

* Bring back ESS_REGION env var

* Define default ESS_REGION so it's not unbound

* Restoring file from main

* Try to set STACK_BUILD_ID to ""

* [Testing] Hardcoding stack build ID

* Only use STACK_VERSION when FIPS=true

* Revert "[Testing] Hardcoding stack build ID"

This reverts commit fb275fb.

* Remove unnecessary line

(cherry picked from commit a555980)
mergify bot pushed a commit that referenced this pull request Aug 15, 2025
…onment (#9198)

(cherry picked from commit a555980)
ycombinator added a commit that referenced this pull request Aug 15, 2025
…onment (#9198) (#9385)

(cherry picked from commit a555980)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
ycombinator added a commit that referenced this pull request Aug 15, 2025
…onment (#9198) (#9384)

(cherry picked from commit a555980)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
@ycombinator ycombinator deleted the bk-fips-redirect branch August 18, 2025 15:51
kaanyalti pushed a commit to kaanyalti/elastic-agent that referenced this pull request Sep 4, 2025
…onment (elastic#9198)

Labels

backport-8.19 Automated backport to the 8.19 branch backport-9.1 Automated backport to the 9.1 branch skip-changelog


6 participants