Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use throttle: 1 for localhost unarchive #498

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

travisdowns
Copy link
Contributor

During the common install tasks, we download and unpack the binary to install on localhost and eventually upload it to each host (among many other things). Currently the "unpack" step, which extracts the gzipped tar archive, is perform N times if there are N hosts in the inventory, but the target directory (something like
/tmp/node_exporter-linux-arm64/1.8.2) is the same for every host with the same architecture.

This means that all the unarchive tasks are extracting in parallel in an unsynchronized manner to the same directory. It a miracle that this works, but at least usually it like it does, but sometimes tar will complain about a missing file, failing the install. I've confirmed locally that this is due to the race described above.

To fix this, we use throttle: 1 on the unpack task. This means that the first task (for each architecture) will do the unpack and the other tasks are effectively no-ops, they are reported as "ok" rather than changed in the output.

Fixes #457.

@travisdowns
Copy link
Contributor Author

travisdowns commented Dec 20, 2024

I tested this locally with a file that is more likely to trigger the race: a tar of 10,000 small files all at the top level of the tar: this triggers the error 100% of the time w/o the fix and passes 1,000 iterations locally with no failures.

I also verified that deploying prom to a 4 host inventory works, 100 iterations without failure. This inventory has 2 hosts each for 2 different architectures (amd64 and arm64).

Copy link
Member

@gardar gardar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a clean and simple solution - I like it! Thanks!

@travisdowns
Copy link
Contributor Author

The test failures look like this:

TASK [prometheus.prometheus._common : Get checksum list for fail2ban_exporter_0.10.2_linux_amd64.tar.gz] ***
fatal: [almalinux-9]: FAILED! => {"msg": "An unhandled exception occurred while running the lookup plugin 'url'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Received HTTP error for https://gitlab.com/hectorjsmith/fail2ban-prometheus-exporter/-/releases/v0.10.2/downloads/fail2ban_exporter_0.10.2_checksums.txt : HTTP Error 308: Permanent Redirect"}

I believe this means the external resource has changed. It looks like the user changed their name and the correct link is:

https://gitlab.com/hctrdev/fail2ban-prometheus-exporter/-/releases/v0.10.2/downloads/fail2ban_exporter_0.10.2_checksums.txt

@SuperQ
Copy link
Contributor

SuperQ commented Jan 10, 2025

I think the 308 error is related to #507

@gardar
Copy link
Member

gardar commented Jan 11, 2025

Please rebase now that the 308 error should be resolved

@travisdowns
Copy link
Contributor Author

travisdowns commented Jan 24, 2025

@gardar I just rebased on tip of main as of today, let's see how things go.

During the common install tasks, we download and unpack the binary to
install on localhost and eventually upload it to each host (among many
other things). Currently the "unpack" step, which extracts the gzipped
tar archive, is perform N times if there are N hosts in the inventory,
but the target directory (something like
/tmp/node_exporter-linux-arm64/1.8.2) is the same for every host with
the same architecture.

This means that all the unarchive tasks are extracting in parallel in an
unsynchronized manner to the same directory. It a miracle that this
works, but at least usually it like it does, but sometimes tar will
complain about a missing file, failing the install. I've confirmed
locally that this is due to the race described above.

To fix this, we use `throttle: 1` on the unpack task. This means that
the first task (for each architecture) will do the unpack and the other
tasks are effectively no-ops, they are reported as "ok" rather than
changed in the output.

Fixes prometheus-community#457.

Signed-off-by: Travis Downs <travis.downs@gmail.com>
@travisdowns
Copy link
Contributor Author

Force push to 341d90f was a formatting fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment