Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding upgrade test for unhealthy upgraded Agent #3274

Merged
merged 16 commits into from
Sep 8, 2023

Conversation

ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Aug 21, 2023

What does this PR do?

This PR adds an integration test that ensures that an unhealthy Agent is rolled back.

Why is it important?

To ensure upgrade rollbacks work as expected.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

How to test this PR locally

Check out this PR's branch and then:

DEV=true PLATFORMS=linux/arm64,linux/amd64 PACKAGES=tar.gz mage package
mage integration:single TestStandaloneUpgradeFailsStatus

Related issues

@mergify
Copy link
Contributor

mergify bot commented Aug 21, 2023

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@elasticmachine
Copy link
Contributor

elasticmachine commented Aug 21, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-09-08T14:48:54.210+0000

  • Duration: 26 min 2 sec

Test stats 🧪

Test Results
Failed 0
Passed 6277
Skipped 55
Total 6332

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Contributor

elasticmachine commented Aug 21, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.78% (81/82) 👍
Files 66.212% (194/293) 👍
Classes 65.562% (356/543) 👍
Methods 52.612% (1118/2125) 👍
Lines 38.185% (12731/33340) 👍 0.013
Conditionals 100.0% (0/0) 💚

@ycombinator ycombinator force-pushed the integ-test-upgrade-fails-status branch 2 times, most recently from 4c9a949 to f386baf Compare August 22, 2023 22:36
@ycombinator ycombinator marked this pull request as ready for review August 23, 2023 16:48
@ycombinator ycombinator requested a review from a team as a code owner August 23, 2023 16:48
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@ycombinator ycombinator force-pushed the integ-test-upgrade-fails-status branch from a72ba62 to d4350a3 Compare August 24, 2023 21:11
@pierrehilbert pierrehilbert requested a review from pchila August 25, 2023 06:23
Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of comments but overall it seems fine as it is a variation on other upgrade tests we already have and run...

testing/integration/upgrade_test.go Outdated Show resolved Hide resolved
// input when the Agent version matches the upgraded Agent version. This way
// the pre-upgrade version of the Agent runs healthy, but the post-upgrade
// version doesn't.
invalidInputPolicy := fastWatcherCfg + fmt.Sprintf(`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we guarantee that the fastWatcherCfg is supported? is it even needed here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't guarantee it but the settings in fastWatcherCfg should just be ignored by any versions of Agent that don't support it. Also, checkUpgradeWatcherRan will just skip checking if the upgrade watcher finished running for Agents that don't support the fastWatcherCfg settings:

func checkUpgradeWatcherRan(t *testing.T, agentFixture *atesting.Fixture, fromVersion *version.ParsedSemVer) {
t.Helper()
if fromVersion.Less(*version_8_9_0_SNAPSHOT) {
t.Logf("Version %q is too old for a quick update marker check, skipping...", fromVersion)
return
}

So I think we are okay here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the checkUpgradeWatcherRan skips here, doesn't that make the test run pointless?
If I understood correctly, the point of the test is to verify that the watcher rollback the upgrade because of the unhealthy state of the new agent (please correct me if this is not the correct assumption). In such a case wouldn't the upgrade watcher terminate before the full 10 mins default grace period because of the amount of errors ?

Copy link
Contributor Author

@ycombinator ycombinator Aug 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently on the main branch, the fromVersion is 8.11.0. Which means checkUpgradeWatcherRan won't be skipped. If we backport this test to 8.10 and even 8.9, checkUpgradeWatcherRan won't be skipped.

If we were to backport this test to versions before 8.9, then checkUpgradeWatcherRan would be skipped and the rest of the test will fail because the assertion for checking if Agent has rolled back to the pre-upgrade version will timeout. But we don't plan to backport this test to any versions before 8.9 so I think we're good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On an unrelated note, I just noticed that checkUpgradeWatcherRan skips if the fromVersion is < 8.9.0. That made sense when Agent was spawning the Upgrade Watcher from the old (pre-upgrade) Agent's binary. However, starting from 8.10.0 onwards, Agent will spawn the Upgrade Watcher from the new (post-upgrade) Agent's binary. So I think we need an additional check here to skip if the fromVersion is >= 8.10.0 && toVersion < 8.9.0. WDYT?

Copy link
Member

@pchila pchila Aug 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the check is correct as it refers to the earliest version that understand settings for a shorter watch period... The fact that the watcher is spawned from the old or the new shouldn't matter as the outcome in the case of an unhealthy upgraded agent should stay the same...
It actually provides an interesting additional test: that even if the upgraded agent process itself is unhealthy, the agent process stays alive and take the correct action ;)

@pierrehilbert pierrehilbert linked an issue Aug 27, 2023 that may be closed by this pull request
15 tasks
@ycombinator ycombinator force-pushed the integ-test-upgrade-fails-status branch from c33192b to 789f69b Compare August 28, 2023 13:41
@ycombinator ycombinator requested a review from pchila August 28, 2023 14:10
@ycombinator ycombinator force-pushed the integ-test-upgrade-fails-status branch from 789f69b to fcb09a4 Compare August 29, 2023 12:50
@ycombinator ycombinator added backport-v8.10.0 Automated backport with mergify and removed backport-skip labels Aug 29, 2023
@ycombinator ycombinator enabled auto-merge (squash) August 29, 2023 15:13
@ycombinator ycombinator force-pushed the integ-test-upgrade-fails-status branch from fcb09a4 to 9aeade4 Compare September 5, 2023 16:46
@ycombinator
Copy link
Contributor Author

This test is failing on Ubuntu arm64 but not on Ubuntu amd64. The failure is bizzare too, in that it happens in the very first step of the test, where we try to install the Agent fixture and check that it's healthy, and the failure is evidently due to /opt/Elastic/Agent/elastic-agent not existing!

=== RUN   TestStandaloneUpgradeFailsStatus
    upgrade_test.go:944: Testing Elastic Agent upgrade from 8.11.0-SNAPSHOT to 8.10.0-SNAPSHOT...
    fetcher.go:90: Using existing artifact elastic-agent-8.11.0-SNAPSHOT-linux-arm64.tar.gz
    fixture.go:221: Extracting artifact elastic-agent-8.11.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestStandaloneUpgradeFailsStatus2517823418/001
    fixture.go:234: Completed extraction of artifact elastic-agent-8.11.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestStandaloneUpgradeFailsStatus2517823418/001
    fixture.go:540: Components were not modified from the fetched artifact
    upgrade_test.go:971: Install the built Agent
    fixture.go:365: >> running agent with: [/tmp/TestStandaloneUpgradeFailsStatus2517823418/001/elastic-agent-8.11.0-SNAPSHOT-linux-arm64/elastic-agent install --force --non-interactive]
    upgrade_test.go:973: Installing in non-interactive mode.Elastic Agent has been successfully installed.
        
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    upgrade_test.go:977: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_test.go:977
        	Error:      	Condition never satisfied
        	Test:       	TestStandaloneUpgradeFailsStatus
        	Messages:   	Agent never became healthy
    fixture_install.go:124: collecting diagnostics; test failed
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent diagnostics -f /home/ubuntu/agent/build/diagnostics/TestStandaloneUpgradeFailsStatus-diagnostics-2023-09-05T17:40:27Z.zip]
    fixture_install.go:228: failed to collect diagnostics to /home/ubuntu/agent/build/diagnostics/TestStandaloneUpgradeFailsStatus-diagnostics-2023-09-05T17:40:27Z.zip (fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory): 
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent uninstall --force]
--- FAIL: TestStandaloneUpgradeFailsStatus (150.71s)

If I run just just this test locally on an ARM64 Ubuntu machine, it passes. I'm going to re-run the Buildkite build and see if the test fails again in the same way or in a different way, to perhaps give us some ideas into what might be going on.

@jlind23
Copy link
Contributor

jlind23 commented Sep 6, 2023

@pazone any reason why in this sonar notification #3274 (comment) we don't get any coverage metric?

@cmacknz
Copy link
Member

cmacknz commented Sep 6, 2023

The test failure here appears similar to https://buildkiteartifacts.com/e0f3970e-3a75-4621-919f-e6c773e2bb12/0187faac-2a90-4642-bc78-c33cefa167ba/018a667e-d0ba-4c91-908f-63a0d2c6aed0/018a667f-9c04-4961-b012-33d27279e3e7/build/TEST-report.html?response-content-type=text%2Fhtml&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAQPCP3C7LUJWWWKNY%2F20230906%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230906T123542Z&X-Amz-Expires=600&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEJj%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJGMEQCIG4GMwSJbrlL46bN51XihMmWxfUSFJUuBoR9hS%2FHZh6rAiB%2FwtX73xzbCLFdQ%2BJYDxvJPqBRNuDoe34VtKaLtOc9GyrxAwhwEAAaDDAzMjM3OTcwNTMwMyIMT8Tk9L8wlAORa2SBKs4DXAnvnfYgB9flWasmNmfEM1to2pYULoMDPz6tXgx3JMIr2ll%2FTARSmqupwAkM6KcD1pFZ1%2BDSJ1YjMRR6a0PCvFEFOS%2FZc3wT0zSzlFb6IsuiRnpg15MPQ95YL22%2Fw0wa61BTBDELCDo1TpDX0HGPpUpFwltJweP2No%2Fu%2FIrzp2V11xueoJ3mrJdtric89GviJT06ufpiqAAIO1CKswUfOD4g5wdKKSZtL0RQb6wZ4h0l%2BXa%2B2bUBzR3%2F0EEYT5EoH0crn3vddghjy3Wo2lUtxqdQE0H9hVfuB5Tzt02nq9DNCBB7Yx7wtkAb6Ue1tGTk8SRKi4r%2FywQGjfwbted8vVdrMzV747Mcj4PXlC7ySzlCzM%2BA6Vry9yBuG9XxtjQL5u4cAz3kGLnaMGrrRYoKheVE6aSEh3xm%2B5qIecWuLNfAWvKWkLVQWbfzbT%2FCngNqn%2BxuwHZWHg9UWpzkyFunrpA0aPf%2Bw0Tgpli%2FdiQoS%2BRZOKM%2BSKuHiHdRZ5y%2FZAG6q9V7%2BM7S9oKQjXKv9p98Sk7dqc4%2FkcWlq2Vir8r%2BzLRTQEtFj53vbtbOzUjdfSTwvi%2BerMalccfhPkHuPbE0eKVYeO7YzEWxrYwISKsWMOPQ4KcGOqYBygdRoJZM%2B9QUizmT2OEUBqqG%2FEX%2BSfgWJc9N5fFtNz52IHFSONdamz9qu26KRzNLH0UEOBoC4lJPlhq0VCiBh4hvLga%2FKWMYcQWZ9cVow7SiBwhphpHGEPOBdSDw%2FG6YPNyU0FlfO7sd3NFvvoQKh361LxfCNWnUUZgHU14E7mHk8ytK9gS5vOrKESiSyeqGn%2FIYVndgtGWPniF9qjQxNu2ibCSf6w%3D%3D&X-Amz-SignedHeaders=host&X-Amz-Signature=b3631f059508567bf67d11fbdf5c959d3e0f3f6cb2917c7a870618607a219f6c#github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204)(sudo).github.com/elastic/elastic-agent/testing/integration.TestFleetManagedUpgrade/Upgrade_managed_agent_from_7.17.13_to_8.10.0-SNAPSHOT on main.

I did notice it is more common on arm but also happens (or very similar problems happen) on amd64.

@ycombinator
Copy link
Contributor Author

ycombinator commented Sep 7, 2023

This time TestStandaloneUpgradeFailsStatus did not fail in Buildkite. Re-running and hoping it fails so we can see what might be uninstalling Agent in the middle of the test...

@ycombinator
Copy link
Contributor Author

buildkite test this

@ycombinator
Copy link
Contributor Author

ycombinator commented Sep 7, 2023

This time TestStandaloneUpgradeFailsStatus did not fail in Buildkite. Re-running and hoping it fails so we can see what might be uninstalling Agent in the middle of the test...

Re-running caused TestStandaloneUpgradeFailsStatus to fail again with the same error but the logging I added in dbfc886 didn't reveal much. This type of failure — where Agent randomly disappears during integration tests in CI — is happening on main as well. We think the root cause of this is #3371.

This PR here is, therefore, blocked on #3371 for the moment.

@ycombinator
Copy link
Contributor Author

Workaround for #3371 was just merged: #3370. Re-basing on main...

@elastic-sonarqube
Copy link

SonarQube Quality Gate

Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@ycombinator ycombinator merged commit 5531ad3 into elastic:main Sep 8, 2023
mergify bot pushed a commit that referenced this pull request Sep 8, 2023
* [WIP] Adding upgrade test for unhealthy upgraded Agent

* Fixing up logging

* WIP: fixing some issues with fake package creation

* Use fast watcher config

* [WIP] Different approach to making upgraded Agent unhealthy

* WIP: Adding TODO

* Flesh out rest of the test

* Adding logging

* Make invalid input try to run in upgraded Agent version

* Removing OS restriction

* Tuning status check frequency

* Use artifacts API

* Clarify comment

(cherry picked from commit 5531ad3)
@ycombinator ycombinator deleted the integ-test-upgrade-fails-status branch September 11, 2023 16:46
ycombinator added a commit that referenced this pull request Sep 19, 2023
* [WIP] Adding upgrade test for unhealthy upgraded Agent

* Fixing up logging

* WIP: fixing some issues with fake package creation

* Use fast watcher config

* [WIP] Different approach to making upgraded Agent unhealthy

* WIP: Adding TODO

* Flesh out rest of the test

* Adding logging

* Make invalid input try to run in upgraded Agent version

* Removing OS restriction

* Tuning status check frequency

* Use artifacts API

* Clarify comment

(cherry picked from commit 5531ad3)
ycombinator added a commit that referenced this pull request Oct 2, 2023
* [WIP] Adding upgrade test for unhealthy upgraded Agent

* Fixing up logging

* WIP: fixing some issues with fake package creation

* Use fast watcher config

* [WIP] Different approach to making upgraded Agent unhealthy

* WIP: Adding TODO

* Flesh out rest of the test

* Adding logging

* Make invalid input try to run in upgraded Agent version

* Removing OS restriction

* Tuning status check frequency

* Use artifacts API

* Clarify comment

(cherry picked from commit 5531ad3)
ycombinator added a commit that referenced this pull request Oct 2, 2023
* [WIP] Adding upgrade test for unhealthy upgraded Agent

* Fixing up logging

* WIP: fixing some issues with fake package creation

* Use fast watcher config

* [WIP] Different approach to making upgraded Agent unhealthy

* WIP: Adding TODO

* Flesh out rest of the test

* Adding logging

* Make invalid input try to run in upgraded Agent version

* Removing OS restriction

* Tuning status check frequency

* Use artifacts API

* Clarify comment

(cherry picked from commit 5531ad3)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.10.0 Automated backport with mergify skip-changelog Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create tests to ensure that Elastic Agent's upgrade are working as expected
5 participants