Add locking in the primary r10k binary. This serializes runs of r10k … #610
Conversation
…and prevents it from needlessly clobbering changes made by another running r10k install. It will currently block and wait indefinitely for the lock to be released.
$stderr.puts "\nError while running: #{e.inspect}".red | ||
$stderr.puts e.backtrace.join("\n").red if ARGV.include? '--trace' | ||
exit(1) | ||
File.open("/tmp/r10k.lock", File::RDWR|File::CREAT, 0644) do |f| |
I might use a variable here instead of a hard-coded file name, and put the variable above it, for the optimistic timeline where the puppet master comes to Windows and that path needs to change.
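For illustration, a minimal sketch of that suggestion, assuming a constant named LOCK_PATH (the name is hypothetical and not part of r10k):

LOCK_PATH = '/tmp/r10k.lock'  # hypothetical constant; could be swapped per platform

File.open(LOCK_PATH, File::RDWR | File::CREAT, 0644) do |f|
  f.flock(File::LOCK_EX)  # block until the exclusive lock is free
  # ... existing r10k run goes here ...
end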
Hello and thanks for your contribution to r10k! At this point, I'm a little reluctant to move forward on this particular feature for a couple of reasons. First and foremost, the solution implemented here is already available on most Linux distributions to any command via the flock(1) utility, and I'd be concerned about re-inventing the wheel to some extent. Second, if we were going to incorporate an exclusive locking feature into r10k itself, it would need to be controllable by a command line flag, since it is possible to successfully manage the concurrency issues presented by multiple running r10k processes. (Specifically by using per-process cachedirs and environment-level locking, which is roughly the approach the Puppet Enterprise Code Manager currently takes.) Perhaps we can just add some additional documentation about the concurrency issues and provide an example of a wrapper script that uses flock(1) to enforce serialized execution?
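As a rough illustration of that wrapper idea, here is a sketch in Ruby that serializes runs behind an exclusive advisory lock, in the same spirit as wrapping the command with flock(1); the lock path and the r10k arguments are assumptions and would need adjusting per deployment:

#!/usr/bin/env ruby
# Sketch only: serialize r10k runs behind an exclusive advisory lock.
File.open('/tmp/r10k-deploy.lock', File::RDWR | File::CREAT, 0644) do |f|
  f.flock(File::LOCK_EX)  # wait here until any other wrapped run finishes
  exit(1) unless system('r10k', 'deploy', 'environment', '--puppetfile')
end

Triggers (webhooks, cron jobs, orchestration tools) would then call this wrapper instead of r10k directly, so concurrent invocations queue up rather than racing.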
The reason why it should be in the binary itself and opt-out is POLA. r10k in its default setting should /not/ trash the git repos or prevent the environments from being deployed, whether it's run one at a time or not. There are many ways to run the binary: you can have custom webhooks, use the mco r10k module, or just run the command by hand. It should be as safe as possible to run in its default, out-of-the-box configuration.
My primary reason to implement this in r10k is to keep every other solution from having to re-implement the same fix (distributed DRY). See acidprime/r10k#268 as an example of where this is already happening. Certainly, there may be some more preferable way to do this, but it should still be done :)
Yeah, I don't disagree that we should try to make things safe by default. I think that, at least with respect to Git repos, rugged/libgit2 does a better job of this than the shellgit provider right now. Ideally the locking would be implemented around the actually unsafe operations inside of r10k, since something read-only like …
@scotje Do you have a suggestion for the correct insertion point? I'm not that familiar with the guts of r10k, but I believe @heliocentric and I are motivated to act if we know where to hack away with the best chance of an accepted PR.
@rnelson0 I think most of the "unsafe" git operations occur in …
@heliocentric is your Git repo corruption issue easily reproducible? I'm curious as to what may have happened there; I could see operations failing due to contention, but actual corruption seems like it should be very rare, so there may be a different bug in play there.
I can't speak to helio's issues, but ours occur when we do a feature->production merge-and-delete in one fell swoop. It fires two webhook events, which run two r10k instances. One starts managing the git files and deploys or deletes the environment (it's a race to see which makes it through first). The other tries to manage the git files and either errors out entirely, or more likely, inspects the files before making its changes, then errors because the index is corrupt (from its POV) and ends the r10k run. You can see more detail in the PE support ticket 12465, including logs. The end result is either A) production is updated but the feature branch is still present (not bad) or B) the feature branch is deleted but the production branch is stale (bad) and someone has to re-run r10k manually.
We're getting the same problem on our puppet masters, which have more than one trigger for r10k (a ‘push’ trigger, e.g. a webhook or AWS SSM command, plus a ‘pull’ trigger, e.g. a cron job). We've had to add a grimy home-rolled r10k locking wrapper script to our puppet masters and tell our triggers to use this instead of calling r10k directly. If a second invocation comes in while the first is still running, it waits for the first to complete and then runs (for us this is preferable to flock's other behaviour, which is to make the second invocation fail instantly). I bet many other users of r10k have done the same. We can share the script if you want, but it's nothing special.
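For reference, the two behaviours described above map onto blocking versus non-blocking locks. A sketch in Ruby, assuming the same /tmp lock path used earlier in this thread:

File.open('/tmp/r10k.lock', File::RDWR | File::CREAT, 0644) do |f|
  f.flock(File::LOCK_EX)  # blocking: a second invocation waits, then runs
  # Fail-fast alternative (the flock -n style behaviour): bail out immediately
  # if another process already holds the lock.
  #   abort 'r10k already running' unless f.flock(File::LOCK_EX | File::LOCK_NB)
end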
Any update on this? We too have a custom wrapper script using flock to provide this functionality.
Please accept this. Due to newer versions being slower and slower, I'm constantly logging into my server and seeing dozens of overlapping r10k processes.
This PR has been marked stale because it has had no activity for 60 days. If you are still interested in getting this merged, please comment and we'll try to move it forward. Otherwise, it will be closed in 7 days.