NETOBSERV-1564: do not force flushing maps when rb is triggered #348

Merged
1 commit merged on Jun 26, 2024

Conversation

jotak
Member

@jotak jotak commented Jun 14, 2024

Description

Flushing (without throttling) has a harmful effect in high-stress scenarios: it generates a lot of evictions from maps, which results in many more flows being generated.

High-stress scenarios should instead rely on rb+accounter, which handles the number of generated flows better than trying to force the use of maps this way.

Also, use errno as the reason for the metric

With this change, in a high-stress scenario, I'm seeing better CPU usage but slightly increased memory usage:

Screenshot from 2024-06-14 17-14-26
Patch applied at 16:55

Overall, -50% CPU and +10% memory.

This should be tested against cluster-density-v2

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Jun 14, 2024

@jotak: This pull request references NETOBSERV-1564 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.17.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Flushing (without throttling) has a harmful effect in high-stress
scenarios, generating a lot of evictions from maps, resulting in many
more flows generated.

High-stress scenarios should rather rely on rb+accounter, which better
handles the number of generated flows.

Also, use errno as the reason for the metric
@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 17, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:81e0239

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=81e0239 make set-agent-image

codecov bot commented Jun 17, 2024

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@88136bd). Learn more about missing BASE report.
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #348   +/-   ##
=======================================
  Coverage        ?   33.33%           
=======================================
  Files           ?       48           
  Lines           ?     3489           
  Branches        ?        0           
=======================================
  Hits            ?     1163           
  Misses          ?     2229           
  Partials        ?       97           
Flag Coverage Δ
unittests 33.33% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
pkg/flow/tracer_ringbuf.go 25.00% <0.00%> (ø)

@@ -91,15 +91,9 @@ func (m *RingBufTracer) listenAndForwardRingBuffer(debugging bool, forwardCh cha
    if debugging {
        m.stats.logRingBufferFlows(mapFullError)
    }
    // if the flow was received due to lack of space in the eBPF map

Contributor

@msherif1234 msherif1234 Jun 24, 2024

Wouldn't removing this logic mean we will now stay in the hmap-full condition for longer, possibly dropping flows? If the concern here is doing many map flushes because the system is under stress, can we trigger a flush every 1s or so, hoping to free up some hmap space?

Member Author

I agree that having a fixed-rate flush could be a better solution, but there's actually already a flush every 5s (the cacheActiveTimeout setting); users can set it to 1s if that works better for them.

Member Author

Or we could introduce a new "stress timer" (it could be automatically set to cacheActiveTimeout / something) that only starts when stress is detected (i.e. maps are full)? But that's a bit complex; it has to be worth it.

Member Author

By the way: when you said that removing this logic means we would stay in the hmap-full condition for longer and possibly drop flows, I don't think that is true. When maps are full, flows are not dropped; they're moved to the ring buffer. Flows are only dropped when a map is busy and no new flow is being created, so that is not something that can lead to the map being full (in that case we're updating an entry, not adding a new one, so it has no impact on the hmap size).

Contributor

We use the rb for new flows and it sends 1 pkt at a time to userspace, so it can't handle bursts of traffic either.

Member Author

It still handles that much better than an uncontrolled hot loop of flush events. Look at the -50% CPU that I'm showing in the PR description.

Member Author

I'll run perf-scale tests to check the diff

Contributor

/LGTM if perf scale shows improvements

Member Author

ingress-perf shows slightly improved stats: -4% memory, -8% CPU
(https://docs.google.com/spreadsheets/d/1EN12dogz-0_H_5tSoV24T4iHa9a3wSgoF396YyOqavA/edit?gid=93930639#gid=93930639 / diff: https://docs.google.com/spreadsheets/d/1q_XCJ48h2Q78JapxgNB37nJe-4lTYVlMx7xVrn7ztck/edit?gid=696044201#gid=696044201)

Now running cluster-density. But I'm not sure I'll see anything very different here, as the RB isn't involved much anyway.

The problem is that none of those tests really stress the agents, so I may not see anything better than this -8% here. The tests that I did above DO stress the agents much more, and there I had -50%.

@Amoghrd

Amoghrd commented Jun 25, 2024

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved QE has approved this pull request label Jun 25, 2024
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Jun 25, 2024

@jotak: This pull request references NETOBSERV-1564 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jotak jotak requested a review from msherif1234 June 26, 2024 06:25
@msherif1234
Contributor

/lgtm

@jotak
Member Author

jotak commented Jun 26, 2024

thanks @msherif1234
/approve

openshift-ci bot commented Jun 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit fdebe3f into netobserv:main Jun 26, 2024
13 checks passed
Labels
approved, jira/valid-reference, lgtm, ok-to-test (To set manually when a PR is safe to test. Triggers image build on PR.), qe-approved (QE has approved this pull request)
4 participants