Disk full on Jenkins CI server #3747

Open
targos opened this issue Jun 1, 2024 · 6 comments

Comments

@targos
Member

targos commented Jun 1, 2024

No description provided.

@targos
Member Author

targos commented Jun 1, 2024

I'm looking into it

@targos
Member Author

targos commented Jun 1, 2024

Similar to #3288

I logged into the backup server and ran /root/backup_scripts/remove_old.sh ci.nodejs.org.
It freed 100GB.

@targos targos closed this as completed Jun 1, 2024
@targos targos reopened this Jul 29, 2024
@targos
Member Author

targos commented Jul 29, 2024

It happened again.

@ryanaslett Maybe the new backup server is not set up to run the cleanup script regularly?

@ryanaslett
Contributor

Hmm. It's set up in the crontab:
40 23 * * 6 /usr/bin/rsnapshot -c /usr/local/etc/rsnapshot.conf weekly && /root/backup_scripts/remove_old.sh ci-release.nodejs.org && /root/backup_scripts/remove_old.sh ci.nodejs.org

It should be clearing it out once a week.

The backup server lacks any kind of monitoring or alerting for when those tasks do not succeed, so we should probably come up with a strategy to be notified if those crons fail.
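
One possible shape for such a notification is a small wrapper that runs a task and reports any non-zero exit status. This is only a sketch: `run_and_report` and `notify_failure` are hypothetical names, and the notifier here just writes to stderr where a real setup would send mail or post to a chat channel.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the existing backup scripts):
# run a command and report it if it exits with a non-zero status.

notify_failure() {
    # $1 = exit status, remaining arguments = the command that failed.
    # Placeholder: a real notifier would send mail or a chat message.
    nf_status=$1
    shift
    echo "cron task failed (exit $nf_status): $*" >&2
}

run_and_report() {
    "$@"
    rr_status=$?
    if [ "$rr_status" -ne 0 ]; then
        notify_failure "$rr_status" "$@"
    fi
    # Preserve the original exit status for the caller.
    return "$rr_status"
}
```

In the crontab, each task could then be invoked as, for example, `run_and_report /root/backup_scripts/remove_old.sh ci.nodejs.org` once the functions are sourced or placed in a script.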

@targos
Member Author

targos commented Aug 1, 2024

In that case, I think the problem is clear.

Running remove_old.sh ci-release.nodejs.org ends up with an error:

# /root/backup_scripts/remove_old.sh ci-release.nodejs.org
curl: (92) HTTP/2 stream 1 was not closed cleanly before end of the underlying stream
# echo $?
92

Since the crontab entry chains the commands with &&, the script never gets a chance to be executed for ci.nodejs.org.
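
To keep one host's failure from blocking the next, the cleanup could be run per host in a loop that records failures instead of stopping at the first one. A minimal sketch, assuming a hypothetical helper named `cleanup_each` (nothing like it exists in the current scripts):

```shell
#!/bin/sh
# Hypothetical helper: run a cleanup command once per host and keep going
# after a failure, unlike an && chain.
# Usage: cleanup_each CMD HOST...
cleanup_each() {
    cmd=$1
    shift
    failed=0
    for host in "$@"; do
        if ! "$cmd" "$host"; then
            echo "cleanup failed for $host" >&2
            failed=1
        fi
    done
    # Non-zero if any host failed, so cron or a wrapper can still notice.
    return "$failed"
}
```

It would be called as `cleanup_each /root/backup_scripts/remove_old.sh ci-release.nodejs.org ci.nodejs.org`.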

@ryanaslett
Contributor

remove_old.sh SSHes into ci and ci-release, blows away any jobs older than 22 days, then triggers a Jenkins reload so that Jenkins recognizes the jobs are missing.
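
The script itself is not shown in this thread. Purely as an illustration of the "older than 22 days" step, an age-based deletion is often written with `find`; the function name and the directory argument below are assumptions, not the real script's contents.

```shell
#!/bin/sh
# Hypothetical sketch: delete the direct children of a directory that were
# last modified more than N days ago. Not the actual remove_old.sh.
# Usage: remove_older_than DIR [DAYS]
remove_older_than() {
    dir=$1
    days=${2:-22}
    # -mindepth 1 / -maxdepth 1: only direct children of DIR, never DIR itself.
    # -mtime +N: modified more than N days ago (find counts whole 24-hour periods).
    find "$dir" -mindepth 1 -maxdepth 1 -mtime +"$days" -exec rm -rf -- {} +
}
```

Replacing `-exec rm -rf -- {} +` with `-print` gives a dry run that only lists what would be removed.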

The credentials for Jenkins were for a Jenkins user, jbergstroem.

The error given is: jbergstroem is missing the Overall/Read permission

Not sure when they were removed from the Nodejs/build GitHub team, but that's probably the last time this script executed successfully.

I've replaced the credentials with an API token for my account for now and have run it for ci, but I'm not sure how jbergstroem had one API key that worked with both ci and ci-release (maybe moved it over to release from ci somehow?)

The cron will currently delete the jobs on ci and refresh, then delete the jobs on ci-release and fail to refresh because it's using the same API token.

This should probably be a service account with proper permissions to access the /reload path.
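
For reference, a reload request with a service account's API token could look roughly like the sketch below. The function name, the `CURL` override, and the example user are assumptions for illustration; the `/reload` path is the one mentioned above, which Jenkins expects as a POST.

```shell
#!/bin/sh
# Hypothetical sketch: ask Jenkins to reload its configuration from disk
# using a user name and API token.
# Usage: jenkins_reload BASE_URL USER API_TOKEN
jenkins_reload() {
    base_url=${1%/}   # strip one trailing slash, if any
    user=$2
    token=$3
    # --fail makes curl exit non-zero on HTTP errors such as 403.
    # CURL can be overridden (e.g. for testing); it defaults to curl.
    "${CURL:-curl}" --fail --silent --show-error \
        --request POST --user "$user:$token" "$base_url/reload"
}
```

Each server would need its own token, for example `jenkins_reload https://ci.nodejs.org svc-user "$CI_TOKEN"`, where `svc-user` and `CI_TOKEN` are placeholders.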

OTOH, this seems like a brittle way to avoid using Jenkins' own job cleanup mechanism:

image

My recommendation is that we change the jobs on the release server first (since there's only a handful) and remove this cleanup mechanism from ci-release, and then modify the jobs on ci.nodejs.org to also clean up after themselves.
