Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Watchdog/polling to make sure AlarmServer, Envisalink Connection still working. #44

Open
rct opened this issue Jun 25, 2016 · 14 comments
Open

Comments

@rct
Copy link
Collaborator

rct commented Jun 25, 2016

Tested with Tornado branch. There doesn't appear to be any sort of watchdog/polling to catch if alarmserver still has a working connection to the envisalink.

Turning on Time Stamp broadcasts would make sure there are some periodic transmissions from the envisalink that could be used as a sort of heartbeat.

@juggie
Copy link
Owner

juggie commented Jun 25, 2016

Tornado should see the TCP disconnect, I did test this but I will test it again.

@rct rct changed the title Watchdog/poolin make sure AlarmServer, Envisalink Connection still working. Watchdog/pooling to make sure AlarmServer, Envisalink Connection still working. Jun 25, 2016
@rct rct changed the title Watchdog/pooling to make sure AlarmServer, Envisalink Connection still working. Watchdog/polling to make sure AlarmServer, Envisalink Connection still working. Jun 25, 2016
@rct
Copy link
Collaborator Author

rct commented Jun 25, 2016

There are a number of cases where you won't see the TCP close, or won't see it close in a timely manner if you aren't writing to the connection.

Enabling TCP keepalives on the socket can help, but the interval is 2 hours on most systems before it will detect a broken connection.

So usually it's best to have some application level keep alive that periodically writes data and checks for a response within a given period of time.

Problems could occur:

  • in the network
  • the envisalink could conceivably have a bug and become unresponsive or restart. What happens in the case of an envisalink reboot/firmware upgrade/restart/crash, etc.?
  • within alarmserver itself - the case I've seen so far, there is a parse error that causes an exception.

@rct rct closed this as completed Jun 25, 2016
@rct rct reopened this Jun 25, 2016
@rct
Copy link
Collaborator Author

rct commented Jun 25, 2016

(I clearly need more coffee. sorry for the noise.)

For testing, it seems one could use AlarmServer's built-in proxy, to connect two Alarmserver instances, and simulate/inject a few errors.

@rct
Copy link
Collaborator Author

rct commented Jul 1, 2016

As an example issue #45, an unknown TPI code generated a keyerror exception, leaving AlarmServer running but not fully functional.

@rct
Copy link
Collaborator Author

rct commented Jul 6, 2016

Currently tornado branch doesn't notice if the envisalink closes the connection. Trying to write something (through the proxy at this point) causes a write to fail which then correctly triggers a reconnect

@juggie
Copy link
Owner

juggie commented Jul 8, 2016

Is there any command we could use as a 'ping' to keep the connection alive? We could use the whole zone/partition dump but that is kind of noisy, is there anything else?

@juggie
Copy link
Owner

juggie commented Apr 25, 2017

@rct I assume this never got fixed?

@juggie
Copy link
Owner

juggie commented Apr 25, 2017

@rct added keep alive to envisalink client using command 000 (had to break out the TPI docs again!). Default is 60 seconds but can be overridden by config. This solves 2 of the 3 problems you mentioned.

In the latter problem where exceptions in the parsing can cause issues, that is my crappy coding from when I wrote this many years ago. Will need to revisit the code that parses the messages and ensure we handle all possible exceptions.

@rct
Copy link
Collaborator Author

rct commented May 5, 2017

@juggie, I never got back to working on this. I needed to learn more about Tornado first. I looked over the implementation, looks good. From what I can tell, not knowing tornadio, it should accomplish forcing the connection to break if something happens, like the envisalink got unexpectedly rebooted.

The thing that I think needs to be done is the timer check to verify that something has actually been received in the last interval. If not the connection should be broken.

@rct
Copy link
Collaborator Author

rct commented May 5, 2017

@juggie - My thinking at the time was that it would be good to implement a send queue, and track state. Every command should get an immediate response from the envisalink. We should be able to figure out whether each command sent got a successful response or not.

States:
idle -> commands to send -> sent/waiting for command response -> brief wait to drain additional responses -> back to either commands to send state or idle state.

I think only one command should be outstanding at a time. The envisalink and the panel are fairly small microcontrollers, and the protocols aren't good enough to have tagging, etc.

IIRC, there are some command responses that indicate a command should be retried.

Clearly much of the time we can get away without adding all this, but the most stuff that uses the alarmserver the more some of those edge conditions could crop up.

@juggie
Copy link
Owner

juggie commented May 7, 2017

@rct As I thought about this, I realized the timer to send the ping to the envisalink could interfere with other commands. And more so, while sending custom, multiline commands.

@rct
Copy link
Collaborator Author

rct commented May 9, 2017

@juggie - to get the desired behavior, the timer should be set (and reset) after data is received from the envisalink. The interval starts after last receive, if more data comes in the current timer gets cancelled and reset. This should cause the ping to only be sent after there has been no activity for the timer interval.

It would still be possible to interrupt a multi-command key press sequence, but only if the gaps between the commands were longer than the timer interval.

I'm curious now whether the NOP/ping command would actually break a multi-command sequence -- One guess at implementation is that the NOP/ping would only be processed by the envisalink and wouldn't involve any keybus transactions.

@juggie
Copy link
Owner

juggie commented May 18, 2017

@rct Had some spare time (i'm at a conference) and took another stab at this. I wanted to do it a little more elegantly, but remove_timer does not work properly.

@rct
Copy link
Collaborator Author

rct commented Aug 22, 2017

Minor detail noticed during attempts at some testing. This keeps firing even after the connection is lost and reconnects are attempted, so it will keep generating AttributeError: 'NoneType' object has no attribute 'write'.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants