
Add locking in the primary r10k binary. This serializes runs of r10k … #610

Closed

Conversation

heliocentric

…and prevents it from needlessly clobbering changes made by another running r10k install.

It will currently block and wait indefinitely for the lock to be released.

$stderr.puts "\nError while running: #{e.inspect}".red
$stderr.puts e.backtrace.join("\n").red if ARGV.include? '--trace'
exit(1)
File.open("/tmp/r10k.lock", File::RDWR|File::CREAT, 0644) do |f|
Contributor


I might use a variable here instead of a hard-coded file name, and put the variable above it, for the optimistic timeline where puppet masters come to Windows and that path needs to change.
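
For illustration, the suggested change might look roughly like this (the constant name and the environment-variable override are hypothetical, not part of the PR):

LOCK_PATH = ENV.fetch('R10K_LOCK_PATH', '/tmp/r10k.lock') # overridable on platforms where /tmp is wrong

File.open(LOCK_PATH, File::RDWR | File::CREAT, 0644) do |f|
  f.flock(File::LOCK_EX) # block until any other r10k run releases the lock
  # ... existing deploy logic ...
end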

@scotje
Contributor

scotje commented Jun 20, 2016

Hello and thanks for your contribution to r10k! At this point, I'm a little reluctant to move forward on this particular feature for a couple reasons.

First and foremost, the solution implemented here is already available on most Linux distributions to any command via the flock(1) utility and I'd be concerned about re-inventing the wheel to some extent.

Second, if we were going to incorporate an exclusive locking feature into r10k itself, it would need to be controllable by a command line flag since it is possible to successfully manage the concurrency issues presented by multiple running r10k processes. (Specifically by using per-process cachedirs and environment level locking, which is roughly the approach the Puppet Enterprise Code Manager currently takes.)

Perhaps we can just add some additional documentation about the concurrency issues and provide an example of a wrapper script that uses flock(1) to enforce serialized execution?
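
For example, such a wrapper could be as small as the sketch below (Ruby is used here to match the project; a one-line flock(1) call in a shell script gives the same blocking behaviour on Linux, and the lock path is only an example):

#!/usr/bin/env ruby
# Hypothetical wrapper: take one exclusive lock, then run r10k with the caller's arguments.
ok = File.open('/var/lock/r10k-wrapper.lock', File::RDWR | File::CREAT, 0644) do |f|
  f.flock(File::LOCK_EX)  # wait for any in-flight r10k run to finish
  system('r10k', *ARGV)   # true on success, false/nil otherwise
end
exit(ok ? 0 : 1)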

@heliocentric
Author

  1. This isn't reinventing the wheel; flock is a convenience wrapper for programs that don't implement locking internally. Both flock and Ruby use the same syscall on Linux, but this will also work the same way on Windows because Ruby uses Windows discretionary locking there.
  2. A flag to disable locking would make sense, since locking would then be opt-out.

The reason it should be in the binary itself and opt-out is POLA. r10k in its default setting should /not/ trash the Git repos or prevent the environments from being deployed, whether it's run one at a time or not.

There are many ways to run the binary: custom webhooks, the mco r10k module, or just running the command by hand. It should be as safe as possible in its default, out-of-the-box configuration.

@rnelson0
Contributor

My primary reason to implement this in r10k is to keep every other solution from having to re-implement the same fix (Distributed DRY). See acidprime/r10k#268 as an example of where this is already happening.

Certainly there may be a preferable way to do this, but it should still be done :)

@scotje
Contributor

scotje commented Jun 20, 2016

Yeah, I don't disagree that we should try to make things safe by default. I think that, at least with respect to Git repos, rugged/libgit2 does a better job of this than the shellgit provider right now. Ideally the locking would be implemented around the actually unsafe operations inside of r10k, since something read-only like r10k deploy display (without --fetch) should be fine to run concurrently.
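
One hedged sketch of that shape: a helper that holds a lock only around the write path, so read-only actions stay lock-free (names and paths are illustrative, not existing r10k internals):

# Hypothetical helper: hold an exclusive lock only for the duration of a write.
def with_exclusive_lock(path)
  File.open(path, File::RDWR | File::CREAT, 0644) do |f|
    f.flock(File::LOCK_EX)
    yield
  end
end

# A writing action takes the lock; something read-only like `deploy display`
# without --fetch would simply not use the wrapper.
with_exclusive_lock('/tmp/r10k-deploy.lock') do
  puts 'placeholder for the unsafe deploy work'
end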

@rnelson0
Contributor

@scotje Do you have a suggestion for the correct insertion point? I'm not that familiar with the guts of r10k but I believe @heliocentric and I are motivated to act, if we know where to hack away with the best chance of an accepted PR.

@scotje
Contributor

scotje commented Jun 20, 2016

@rnelson0 I think most of the "unsafe" git operations occur in lib/r10k/git/(shellgit|rugged)/working_repository.rb, but I also think that the current providers should have their own locking mechanisms in place. As far as a lock around writing environments, lib/r10k/action/deploy/environment.rb seems like a good place to start.
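
For example, a per-environment guard in that file might look something like this (the lock file name and the variables shown are assumptions for illustration, not r10k's current API):

# Hypothetical: give each environment its own lock file so unrelated
# environments can still be deployed in parallel.
lock_path = File.join(environment_dir, '.r10k-deploy.lock')
File.open(lock_path, File::RDWR | File::CREAT, 0644) do |f|
  f.flock(File::LOCK_EX)
  # ... existing sync/deploy steps for this one environment ...
end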

@heliocentric is your Git repo corruption issue easily reproducible? I'm curious as to what may have happened there; I could see operations failing due to contention but actual corruption seems like it should be very rare so there may be a different bug in play there.

@rnelson0
Contributor

I can't speak to helio's issues, but ours occur when we do a feature->production merge-and-delete in one fell swoop. It fires two webhook events, which start two r10k instances. One starts managing the git files and deploys or deletes the environment (it's a race to see which makes it through first). The other tries to manage the git files and either errors out entirely or, more likely, inspects the files before making changes, then errors because the index is corrupt (from its POV) and ends the r10k run. You can see more detail in PE support ticket 12465, including logs.

The end result is either A) production updated but the feature branch is still present (not bad) or B) feature branch is deleted but the production branch is stale (bad) and someone has to re-run r10k manually.

@chriskilding

We're getting the same problem on our puppet masters, which have more than one trigger for r10k (a ‘push’ trigger, e.g. a webhook or AWS SSM command, plus a ‘pull’ trigger, e.g. a cron job).

We've had to add a grimy, home-rolled r10k locking wrapper script to our puppet masters and tell our triggers to use it instead of calling r10k directly. If a second invocation comes in while the first is still running, it waits for the first to complete and then runs (for us this is preferable to flock's other behaviour, which is to make the second invocation fail instantly). I bet many other users of r10k have done the same.

We can share the script if you want, but it’s nothing special.

@baurmatt
Contributor

Any update on this? We too have a custom wrapper script using flock to provide this functionality.

@daveseff

daveseff commented Aug 7, 2018

Please accept this. Due to newer versions being slower and slower, I'm constantly logging into my server and seeing dozens of r10k processes overlapping.

@Sharpie
Member

Sharpie commented Mar 17, 2021

Making r10k safe to run concurrently without external logic is a good idea. However, there are a couple items in this implementation that I think should be changed:

  • The current proposal uses a global lockfile. Locking should be done per-cachedir to protect git sources being updated and per-environment in the basedir to protect content being checked out. This would avoid introducing conflicts with existing r10k wrappers that already handle concurrency by separating the caches and ensuring only one process is updating an environment at a time.
  • The timeout for acquiring the lock should not be indefinite. After some period of time, any attempt to acquire the lock should fail with an error so that stalled processes do not stack up and operators are alerted to the existence of a hung lock.
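
A hedged sketch combining both points above, using one lock file per cachedir or per environment, acquired with a bounded wait instead of blocking forever (the helper name and timeout value are assumptions):

# Hypothetical: try a non-blocking lock, retry until a deadline, then fail loudly
# rather than letting stalled processes queue up behind a hung lock.
def acquire_lock(path, timeout: 300)
  deadline = Time.now + timeout
  f = File.open(path, File::RDWR | File::CREAT, 0644)
  until f.flock(File::LOCK_EX | File::LOCK_NB)
    if Time.now > deadline
      f.close
      raise "timed out after #{timeout}s waiting for lock #{path}"
    end
    sleep 1
  end
  f # caller keeps this handle open for the duration of the protected work
end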

@github-actions

This PR has been marked stale because it has had no activity for 60 days. If you are still interested in getting this merged, please comment and we'll try to move it forward. Otherwise, it will be closed in 7 days.

@github-actions github-actions bot added the stale label May 26, 2021
@github-actions github-actions bot closed this Jun 2, 2021