-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Watchdog/polling to make sure AlarmServer, Envisalink Connection still working. #44
Comments
Tornado should see the TCP disconnect, I did test this but I will test it again. |
There are a number of cases where you won't see the TCP close, or won't see it close in a timely manner if you aren't writing to the connection. Enabling TCP keepalives on the socket can help, but the interval is 2 hours on most systems before it will detect a broken connection. So usually it's best to have some application level keep alive that periodically writes data and checks for a response within a given period of time. Problems could occur:
|
(I clearly need more coffee. sorry for the noise.) For testing, it seems one could use AlarmServer's built-in proxy, to connect two Alarmserver instances, and simulate/inject a few errors. |
As an example issue #45, an unknown TPI code generated a keyerror exception, leaving AlarmServer running but not fully functional. |
Currently tornado branch doesn't notice if the envisalink closes the connection. Trying to write something (through the proxy at this point) causes a write to fail which then correctly triggers a reconnect |
Is there any command we could use as a 'ping' to keep the connection alive? We could use the whole zone/partition dump but that is kind of noisy, is there anything else? |
@rct I assume this never got fixed? |
@rct added keep alive to envisalink client using command 000 (had to break out the TPI docs again!). Default is 60 seconds but can be overridden by config. This solves 2 of the 3 problems you mentioned. In the latter problem where exceptions in the parsing can cause issues, that is my crappy coding from when I wrote this many years ago. Will need to revisit the code that parses the messages and ensure we handle all possible exceptions. |
@juggie, I never got back to working on this. I needed to learn more about Tornado first. I looked over the implementation, looks good. From what I can tell, not knowing tornadio, it should accomplish forcing the connection to break if something happens, like the envisalink got unexpectedly rebooted. The thing that I think needs to be done is the timer check to verify that something has actually been received in the last interval. If not the connection should be broken. |
@juggie - My thinking at the time was that it would be good to implement a send queue, and track state. Every command should get an immediate response from the envisalink. We should be able to figure out whether each command sent got a successful response or not. States: I think only one command should be outstanding at a time. The envisalink and the panel are fairly small microcontrollers, and the protocols aren't good enough to have tagging, etc. IIRC, there are some command responses that indicate a command should be retried. Clearly much of the time we can get away without adding all this, but the most stuff that uses the alarmserver the more some of those edge conditions could crop up. |
@rct As I thought about this, I realized the timer to send the ping to the envisalink could interfere with other commands. And more so, while sending custom, multiline commands. |
@juggie - to get the desired behavior, the timer should be set (and reset) after data is received from the envisalink. The interval starts after last receive, if more data comes in the current timer gets cancelled and reset. This should cause the ping to only be sent after there has been no activity for the timer interval. It would still be possible to interrupt a multi-command key press sequence, but only if the gaps between the commands were longer than the timer interval. I'm curious now whether the NOP/ping command would actually break a multi-command sequence -- One guess at implementation is that the NOP/ping would only be processed by the envisalink and wouldn't involve any keybus transactions. |
@rct Had some spare time (i'm at a conference) and took another stab at this. I wanted to do it a little more elegantly, but remove_timer does not work properly. |
Minor detail noticed during attempts at some testing. This keeps firing even after the connection is lost and reconnects are attempted, so it will keep generating AttributeError: 'NoneType' object has no attribute 'write'. |
Tested with Tornado branch. There doesn't appear to be any sort of watchdog/polling to catch if alarmserver still has a working connection to the envisalink.
Turning on Time Stamp broadcasts would make sure there are some periodic transmissions from the envisalink that could be used as a sort of heartbeat.
The text was updated successfully, but these errors were encountered: