Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added timeout for mount command in goss #877

Merged
merged 1 commit into from
Mar 17, 2024

Conversation

melvilgit
Copy link
Contributor

Checklist

  • documentation is changed or added
  • make test passed

Description of change

  1. We have noticed that mount tests hung and became unresponsive in some of our production VM.
  2. The fix is to add a timeout to mount command similar to addr/command.
  3. Default timeout is 1000 ms

@melvilgit melvilgit changed the title Added timeout for mount command in survey tool Added timeout for mount command in goss Feb 13, 2024
@melvilgit melvilgit closed this Feb 13, 2024
@melvilgit melvilgit reopened this Feb 13, 2024
@melvilgit
Copy link
Contributor Author

@aelsabbahy Could you check this ?
Some of our jenkins builds that run goss just hung since mount was not timing out.

@aelsabbahy
Copy link
Member

OSS build credits ran out with Travis CI. I'll open a support ticket for them.

Once the credits are available, I'll close and reopen this PR to trigger the build.

@aelsabbahy aelsabbahy closed this Feb 19, 2024
@aelsabbahy aelsabbahy reopened this Feb 19, 2024
@aelsabbahy
Copy link
Member

CI is back up and running. Once the build is passed, I can review the code changes.

@melvilgit melvilgit closed this Feb 20, 2024
@melvilgit melvilgit reopened this Feb 20, 2024
@melvilgit
Copy link
Contributor Author

CI is back up and running. Once the build is passed, I can review the code changes.

@aelsabbahy The CI passed. Please have a look.

@aelsabbahy
Copy link
Member

aelsabbahy commented Feb 24, 2024

Eyeball check, this looks good. I assume this PR works locally for you for your needs. It doesn't seem to break any of the CI tests which adds confidence.

A few minor questions:

  1. What was the behavior before, did it hang forever or did it eventually fail. Trying to understand the change to the default behavior for existing users.

  2. One second seems like a good default to me, I'm wondering how many existing users have working tests that take longer than a second, if any.

  3. Related to the top two points: What causes this issue? perhaps understanding that will lead to a good assessment on a default timeout, or if it's a misuse of the mountinfo package, or something else.

Many thanks for submitting this btw. Just trying to understand the problem better before merging this solution.

@melvilgit
Copy link
Contributor Author

melvilgit commented Feb 24, 2024

Thanks @aelsabbahy for the review.

A few minor questions:

  1. What was the behavior before, did it hang forever or did it eventually fail. Trying to understand the change to the default behavior for existing users.
    Yes, It hung forever in some of our VM. We use goss periodically from our jenkins node and the test just hung on the VM since the mount system call wasn't responding. We eventually had to kill the goss process.
  2. One second seems like a good default to me, I'm wondering how many existing users have working tests that take longer than a second, if any.
    The issue is intermittent in some of the vms. We run the test in more than 20k vms and only a handful of them have the issue, meaning it is not widespread.
  3. Related to the top two points: What causes this issue? perhaps understanding that will lead to a good assessment on a default timeout, or if it's a misuse of the mountinfo package, or something else.
    The issue happens in vms with faulty hard drives. We did strace and it was just hung with select call
    strace on the process is looping at:
    select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)

Many thanks for submitting this btw. Just trying to understand the problem better before merging this solution.

@aelsabbahy
Copy link
Member

Once this branch is updated it looks good to me assuming it correctly solved your issue.

I'm not sure how to reproduce it locally to test the failure situation. 😅

@melvilgit melvilgit force-pushed the mount-timeout branch 2 times, most recently from cb578d5 to 91dc353 Compare February 29, 2024 02:00
Added goss timeout in integration tests

Added timeout to goss integration tests

added goss timeout

added goss timeout

added mount to goss

added timeout to goss
@melvilgit
Copy link
Contributor Author

@aelsabbahy please take a look and merge the pr if good. Thanks :)

@aelsabbahy
Copy link
Member

I'm on travel currently, I will review, merge, and release this in ~11 days or so. I'll make sure no other PRs are merged before this to avoid any more merges/rebases.

Thank you for the contribution, it's nice to see niche problems at scale 😄

@melvilgit
Copy link
Contributor Author

@aelsabbahy Could you please review this once ?

@aelsabbahy
Copy link
Member

Yup, I'll review it once I'm back from travel. I'm flying today, so sometime before the end of this weekend should be ample time for me to take a look at this.

Copy link
Member

@aelsabbahy aelsabbahy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging, thank you for the contribution! This is planned to be in the next goss release.

@aelsabbahy aelsabbahy merged commit 857aeb1 into goss-org:master Mar 17, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants