Skip to content

Conversation

@jpvrenen
Copy link
Contributor

No description provided.

@vnitinv
Copy link
Contributor

vnitinv commented May 25, 2016

@stacywsmith @dgarros Can you please review this pull request.

@dgarros
Copy link
Contributor

dgarros commented May 26, 2016

@jpvrenen I understand we are facing an issue with Dual RE system but it looks like this patch will affect all type of junos devices and increase the commit time by 10s,
Is my understanding correct ?

Set to yes in a Dual RE scenario when 'commit sync' is enabled in the
config.
(set system commit synchronize)
@jpvrenen
Copy link
Contributor Author

@dgarros Yes from my point of view this seem to happen in a DualRE scenario, when a 'valid' lock can exist that needs time to 'end' before 'commit'. but only when 'commit sync' is enabled in the configuration, pushing to single RE systems don't have this issue.

I have added a 'dual_re' option/parameter.

@dgarros
Copy link
Contributor

dgarros commented Jun 6, 2016

Hi @jpvrenen,

Sorry it's taking so long but to be honest I'm not sure if this is the right approach to solve the problem.
What if in some cases even a 2 RE system require more than 10s or less than 10s.

I was thinking maybe we should just have an option to add a delay before doing the commit
Something like

wait_before_commit:
        description:
             - Indicate how many seconds the module should wait before commiting the changes
         required: false
         default: 0

Any other ideas are welcome

@vnitinv @stacywsmith any thoughts ?

@jpvrenen
Copy link
Contributor Author

Hi @dgarros ,
Yes, thought about that as well as another possible solution only thing that it introduces is 'how' would one determine how many seconds delay should be used for this option to be effective. I must admit it is more flexible to add the 'wait_before_commit' as an option.

Ideally it would be best to 'know' if some (valid) lock condition still exists because of the 'commit check' command before running the actual 'commit'.

@stacywsmith
Copy link
Contributor

@jpvrenen @dgarros @vnitinv

I've been looking into this. I feel it's a Junos bug, and I'll continue to look into it further and file a Junos PR. That said, this problem exists on deployed code and it would be best to have the Ansible module deal with it more gracefully.

I suggest that we wrap the "cu.commit(**opts)" in a try/except block. If an error is raised and the error message contains "remote lock-configuration failed on" then set "opts['force_sync'] = True" and retry the "cu.commit(**opts)". If the commit still fails with sync_force, then just raise the exception.

Any thoughts on this approach? If others agree with the general approach, I'll get this coded and submit a pull request.

@jpvrenen
Copy link
Contributor Author

jpvrenen commented Jun 13, 2016

@dgarros @vnitinv @stacywsmith

Hi,

Using try/except sounds like a good idea, however instead of using the force option why not use the timeout option instead?

Basically you will be 'forcing' a configuration that has a lock condition which exitsts becaue of the 'commit check' issued in the first place. This condition will resolve automatically after the commit check has completed normally on the second RE. Issueing the 'force' option will also cause any lock conditions (that don't resolve automatically) to be ignored.

Please see below from the Juniper documentation about the 'force' option.

http://www.juniper.net/documentation/en_US/junos15.1/topics/concept/junos-software-configuration-routing-engines-synchronization-overview.html
" When you issue the commit synchronize command with the force option from one Routing Engine, the configuration sessions on the other Routing Engine will be terminated and its configuration synchronized with that on the Routing Engine from which you issued the command."

&&

" Note: We recommend that you use the force option only if you are unable to resolve the issues that caused the commit synchronize command to fail."

@stacywsmith
Copy link
Contributor

@jpvrenen

It appears the time it takes this lock condition to resolve is related to the size of the configuration being checked/committed. I don't think a single static value can cover a very large configuration without sleeping for some large amount of time that is wasteful in most other conditions.

Maybe we can do this in a loop that uses an exponential back off until the commit succeeds or some timeout occurs (but even then the ultimate timeout will be somewhat arbitrary).

@jpvrenen
Copy link
Contributor Author

@stacywsmith
Well now you mention it I was looking more closely at the responses see below:

May 23 10:07:16 [4601] Outgoing:
May 23 10:07:16 [4601] Outgoing:
May 23 10:07:16 [4601] Outgoing:
May 23 10:07:16 [4601] Outgoing: ]]>]]>

There appears to be only 1 'ok' after both RE config checks are already completed. Then right after the 'ok' the 'commit' is given:
May 23 10:07:16 [4601] Incoming: <?xml etc etc /nc:rpc]]>]]>

practically no delay:
May 23 10:07:17 [4601] Outgoing: remote lock-configuration failed on re1

What I am getting at here is that theoretically the actual commit is only given 'after' the commit check has completed on both route engines, but it would apparently take just a second or so longer for the DB lock on the second RE to be released. Following this reasoning only 1 'small' timeout could potentially suffice independent of configuration size, since it will not be related.

I just did a small test to wait for 1s instead of 10s before issuing 'commit' and it went right through no lock error!

new insight, commit check is completed on both route engines when commit
is given (1 ok is given for both RE), apparently a short amount of time
is needed for the lock to clear on the second RE.
@vnitinv
Copy link
Contributor

vnitinv commented Jul 19, 2016

@dgarros @stacywsmith are we ok with 1 sec sleep? If yes, then for this 1 sec sleep should we have 1 more params dual_re (as added by @jpvrenen )

@dgarros
Copy link
Contributor

dgarros commented Jul 22, 2016

@vnitinv

I think we should follow @stacywsmith suggestions but instead of using "force" option, let's use the the delay option with let's say 3 tries. First 1s, second 2s, third 4s ... if still not, raise exception.

Part 1

I suggest that we wrap the "cu.commit(opts)" in a try/except block. If an error is raised and the error message contains "remote lock-configuration failed on" then set "opts['force_sync'] = True" and retry the "cu.commit(opts)". If the commit still fails with sync_force, then just raise the exception.

Part 2

Maybe we can do this in a loop that uses an exponential back off until the commit succeeds or some timeout occurs (but even then the ultimate timeout will be somewhat arbitrary).

@jpvrenen
Copy link
Contributor Author

@vnitinv @dgarros
Using try/except will throw an warning/error every time a configuration is pushed, I cant recall (at this point) if a rollback is then performed so I am not sure if configuration would be retained between the delayed commit(s). Also I don't think it is a good thing performing an action when you already know it will cause a warning.

I have thought about some other solutions:
1 find out (dynamically) if the system contains more that 1 route engine and add a delay based on that, this will potentially add some overall time to the program.
2 add a 'fixed' delay between check and commit
3 add a time_between_check_and_commit parameter in seconds (between 0 - 4s), optional so the user can have some control over this.

Instead of adding a default value of 1s the 'check_commit_wait'
parameter indicates a timeout is needed between check and commit and can
be set between 1 and 4s.
Only check if parameter is used.
@jpvrenen
Copy link
Contributor Author

Changed the 'dual_re' parameter into a 'check_commit_wait' parameter value can be between 1 - 4 s.

No default value
@jpvrenen
Copy link
Contributor Author

@vnitinv @dgarros
'dual_re' parameter has been replaced by 'check_commit_wait' parameter, which gives an optional 1 - 4 s delay between the 'commit check' and 'commit' command.

@stacywsmith
Copy link
Contributor

@vnitinv I believe I'm now OK with this pull request. If you agree, can we go ahead and get this merged?

@stacywsmith
Copy link
Contributor

@jpvrenen It looks like you ran into this issue on Junos 14.2R4. Can you confirm that's the release you are using?

I've also been tracking this as a Junos bug. We believe we may have identified the root cause, but I'd like to confirm the issue we've found is truly the issue you're hitting.

@jpvrenen
Copy link
Contributor Author

@stacywsmith Yes I confirm that we are using Junos 14.2R4.9

@vnitinv vnitinv merged commit b3f85f5 into Juniper:master Aug 25, 2016
opts['confirm'] = args['confirm']

if args['check_commit_wait']:
check_commit_wait = int(args['check_commit_wait'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are tabs here breaking the script

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jNicker I merged pull request #156 to fix this issue. Thanks for the submission!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants