Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[internal/otelarrow] Resolve test flakes; skip one still-flaky test #34794

Merged
merged 15 commits into from
Aug 27, 2024

Conversation

jmacd
Copy link
Contributor

@jmacd jmacd commented Aug 21, 2024

Description: Fixes the causes of flakiness in most cases by using a callback to terminate the test without resorting to sleep statements. There is still one flaky test that for reasons not understood, does not pass. Fortunately, it fails in a repeatable way, and I will debug as part of #34719.

Link to tracking Issue: #34719

@jmacd jmacd added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Aug 21, 2024
@jmacd jmacd requested review from a team and andrzej-stencel August 21, 2024 17:59
@jmacd jmacd requested a review from pjanotti August 21, 2024 17:59
Copy link
Member

@mwear mwear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we leave the test active in non-windows environments. Something like:

if runtime.GOOS == "windows" {
		t.Skip("skipping flaky test on Windows, see https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/34719")
	}

@crobert-1 crobert-1 added the Run Windows Enable running windows test on a PR label Aug 21, 2024
@jmacd jmacd changed the title Disable the TestIntegrationSelfTracing test Disable the OTel-Arrow TestIntegrationSelfTracing test Aug 21, 2024
@jmacd jmacd changed the title Disable the OTel-Arrow TestIntegrationSelfTracing test Reduce flake potential in the OTel-Arrow TestIntegrationSelfTracing test Aug 21, 2024
@jmacd
Copy link
Contributor Author

jmacd commented Aug 21, 2024

This test is failing for other reasons in my local environment. The local failure was an expectation that more than one stream was used, and I find the test wasn't controlling for that variable. I introduced retries for thoroughness and a tiny sleep call to ensure the expected >1 streams takes place. It can be no worse for the Windows problem, and I see it has passed at least once on Windows (from the history above, though see other test flakes on Windows).

Copy link
Member

@mwear mwear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failing test is an instance of: #34792

Copy link
Contributor

@pjanotti pjanotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ignoramus about the internals of this component, but, when testing in Windows, adding a sleep can be advisable given the differences in scheduling and time tick resolution in relation *nix where most devs run their tests. LGTM.

internal/otelarrow/test/e2e_test.go Outdated Show resolved Hide resolved
@jmacd jmacd changed the title Reduce flake potential in the OTel-Arrow TestIntegrationSelfTracing test Fix test flakes in the internal/otelarrow Aug 22, 2024
@jmacd jmacd marked this pull request as draft August 22, 2024 18:57
@jmacd jmacd changed the title Fix test flakes in the internal/otelarrow Fix test flakes in internal/otelarrow Aug 22, 2024
@crobert-1
Copy link
Member

I've submitted #34823 for the receiver/datadog test failures, unrelated to this change.

@jmacd
Copy link
Contributor Author

jmacd commented Aug 22, 2024

There is still a flaky test, the problem appears to be a real problem combined with a slower test start than on the non-windows host. I have added a Skip() and will leave #34719 open to cover fixing it.

@jmacd jmacd changed the title Fix test flakes in internal/otelarrow Resolve test flakes in internal/otelarrow; skip one still-flaky test Aug 22, 2024
@jmacd jmacd marked this pull request as ready for review August 23, 2024 00:53
@jmacd
Copy link
Contributor Author

jmacd commented Aug 23, 2024

The windows/internal test group passes with my changes. The other tests are still preventing a ✅. Hoping to unblock the release and CI process, I left one Skip() in here, which I will investigate next week.

@jmacd
Copy link
Contributor Author

jmacd commented Aug 26, 2024

I recommend we merge this as-is, because (a) it will take some time to properly fix the one skipped test, (b) I have four PRs open in this repository and it becomes difficult to move anything in this situation.

@jmacd jmacd requested a review from jpkrohling August 26, 2024 21:10
Copy link
Member

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks!

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@github-actions github-actions bot added the cmd/otelcontribcol otelcontribcol command label Aug 27, 2024
Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@jpkrohling jpkrohling changed the title Resolve test flakes in internal/otelarrow; skip one still-flaky test [internal/otelarrow] Resolve test flakes; skip one still-flaky test Aug 27, 2024
@jpkrohling jpkrohling merged commit 5c9325e into open-telemetry:main Aug 27, 2024
172 checks passed
@github-actions github-actions bot added this to the next release milestone Aug 27, 2024
codeboten added a commit that referenced this pull request Sep 6, 2024
**Description:** Restore a skipped test, after understanding the nature
of the problem.

The problem was mostly addressed in
#34794,
which left the test disabled. The test had been flaky because while
testing for an out-of-memory condition, the test could fail for timeout
or other reason. To make the test more reliable, this now waits until at
least one ArrowTraces span has been received by both components. After
one span is available, it checks that the expected log messages are
present on both sides.

**Link to tracking Issue:** 
Fixes #34719.

**Testing:** ✅

---------

Co-authored-by: Curtis Robert <crobert@splunk.com>
Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com>
f7o pushed a commit to f7o/opentelemetry-collector-contrib that referenced this pull request Sep 12, 2024
…pen-telemetry#34794)

**Description:** Fixes the causes of flakiness in most cases by using a
callback to terminate the test without resorting to sleep statements.
There is still one flaky test that for reasons not understood, does not
pass. Fortunately, it fails in a repeatable way, and I will debug as
part of open-telemetry#34719.

**Link to tracking Issue:**
open-telemetry#34719

---------

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
Co-authored-by: Juraci Paixão Kröhling <juraci@kroehling.de>
f7o pushed a commit to f7o/opentelemetry-collector-contrib that referenced this pull request Sep 12, 2024
**Description:** Restore a skipped test, after understanding the nature
of the problem.

The problem was mostly addressed in
open-telemetry#34794,
which left the test disabled. The test had been flaky because while
testing for an out-of-memory condition, the test could fail for timeout
or other reason. To make the test more reliable, this now waits until at
least one ArrowTraces span has been received by both components. After
one span is available, it checks that the expected log messages are
present on both sides.

**Link to tracking Issue:** 
Fixes open-telemetry#34719.

**Testing:** ✅

---------

Co-authored-by: Curtis Robert <crobert@splunk.com>
Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com>
jriguera pushed a commit to springernature/opentelemetry-collector-contrib that referenced this pull request Oct 4, 2024
**Description:** Restore a skipped test, after understanding the nature
of the problem.

The problem was mostly addressed in
open-telemetry#34794,
which left the test disabled. The test had been flaky because while
testing for an out-of-memory condition, the test could fail for timeout
or other reason. To make the test more reliable, this now waits until at
least one ArrowTraces span has been received by both components. After
one span is available, it checks that the expected log messages are
present on both sides.

**Link to tracking Issue:** 
Fixes open-telemetry#34719.

**Testing:** ✅

---------

Co-authored-by: Curtis Robert <crobert@splunk.com>
Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cmd/otelcontribcol otelcontribcol command exporter/otelarrow internal/otelarrow receiver/otelarrow Run Windows Enable running windows test on a PR Skip Changelog PRs that do not require a CHANGELOG.md entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants