IoT: connection drop after around 24 hours #2603

Closed
kilabyte opened this issue Dec 11, 2019 · 20 comments
Assignees
Labels
triage me I really want to be triaged. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@kilabyte

Our Raspberry Pi connection to Google Cloud IoT via MQTT fails to persist past the 24-hour mark. I have the JWT refreshing about every 20 minutes or so.

Running on a Raspberry Pi 3 with MQTT. The code is identical to the examples posted here; I just changed my device ID, project ID, and other variables to match my instance.

@busunkim96 busunkim96 assigned gguuss and unassigned kurtisvg Dec 11, 2019
@busunkim96 busunkim96 changed the title connection drop after around 24 hours IoT: connection drop after around 24 hours Dec 11, 2019
@gguuss
Contributor

gguuss commented Dec 11, 2019

Thanks for reporting this issue, I'll see if I can figure out what's going on.

@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Dec 12, 2019
@kilabyte
Author

If I use the Pub/Sub client library, as also shown in your documentation, this works flawlessly, but then I get no information in the IoT console for things like last seen, heartbeat, config sent, etc.

@gguuss
Contributor

gguuss commented Dec 12, 2019

There are a lot of potential causes for a disconnect. If you go to the console and look up your device are you seeing any error details?

@kilabyte
Author

I get either "The connection was closed or broken by the client" or "The authorization token expired", neither of which makes sense. The Raspberry Pi is hardwired to my router and I'm refreshing the token every 20 minutes.

@kilabyte
Author

Or, if I use the client library instead, is there any way to specify the device ID so I can maintain the status of each IoT device?

@gguuss
Contributor

gguuss commented Dec 13, 2019

I'm able to reproduce the console error issue now. There's a chance the device is not correctly refreshing the JWT, but it looks to me like it should work, since the JWT is refreshed in the call to get_client.

There's a chance that on the reconnect we need to call:

client.disconnect()
client.loop()

to cleanly disconnect after this line and avoid the error. But I believe I have tested this before and it worked, despite the console errors. If you're doing a lot of calculations between the calls to client.loop(), the client loses its connection, and behavior after that could get inconsistent.
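
A minimal sketch of the refresh-and-reconnect step discussed above, not the sample's actual code. It assumes `client` is a paho-mqtt style object exposing `disconnect()` and `loop()`; `make_client` is a hypothetical factory standing in for `get_client`, and `JWT_LIFETIME` is an assumed value rather than anything stated in this thread.

```python
import datetime

# Assumed token lifetime; use whatever expiry your JWT is actually signed with.
JWT_LIFETIME = datetime.timedelta(minutes=20)


def token_expired(issued_at, now, lifetime=JWT_LIFETIME):
    """Return True once the token issued at `issued_at` has outlived `lifetime`."""
    return now - issued_at >= lifetime


def refresh_connection(client, make_client):
    """Close the old connection, then build a client with a fresh JWT.

    `client` only needs disconnect() and loop(), as on a paho-mqtt client.
    `make_client` is a hypothetical stand-in for the sample's get_client.
    """
    client.disconnect()  # ask for a clean disconnect from the bridge
    client.loop()        # run one network-loop pass, as suggested above
    return make_client()
```

In a publish loop this would be called as `client = refresh_connection(client, make_client)` whenever `token_expired(...)` returns True, resetting the issue time alongside it.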

@kilabyte
Author

OK, I can update and try that. What is the difference between client.loop(), client.loop_stop(), and client.loop_start()?

@gguuss
Contributor

gguuss commented Dec 13, 2019

client.loop() is synchronous; client.loop_start() and client.loop_stop() are asynchronous helpers that start and stop a background thread. The asynchronous methods run smoother in practice but can be difficult to test.

Don't mix client.loop() with client.loop_stop() and client.loop_start().
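
To make the two styles concrete, here is a rough sketch that keeps them in separate functions. It assumes `client` behaves like a paho-mqtt client (`loop(timeout)`, `loop_start()`, `loop_stop()`); the function names and the `work` callable are hypothetical and exist only for illustration.

```python
def pump_synchronously(client, iterations, timeout=1.0):
    """Drive the network loop by hand: the caller must call loop() regularly."""
    for _ in range(iterations):
        client.loop(timeout)  # one pass of network processing per call


def pump_in_background(client, work):
    """Let a background thread service the network while `work` runs."""
    client.loop_start()  # start the background network thread
    try:
        return work()
    finally:
        client.loop_stop()  # always stop the thread, even if work() raises
```

Pairing `loop_start()` with `loop_stop()` in a `try`/`finally` is one way to make sure the background thread is never left running.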

@gguuss
Contributor

gguuss commented Dec 13, 2019

Of note, adding the disconnect seems to be the right thing to do, I'm working on a PR to set it up.

@busunkim96 busunkim96 removed the triage me I really want to be triaged. label Dec 13, 2019
@kilabyte
Author

kilabyte commented Dec 14, 2019

Also, can every IoT device use the same public key, or should each device have a unique one? And can client.loop() be called without calling client.loop_stop()?

@kilabyte
Author

kilabyte commented Dec 14, 2019

To cleanly disconnect after this line to avoid the error.

OK, so I added the disconnect here and I get this just repeating:
on_disconnect 1: Out of memory.
Connection: Connection Accepted.
on_disconnect 1: Out of memory.
Connection: Connection Accepted.

I'm calling client.loop_start() every time the loop is entered, as I get better performance with the async call. Should I be calling client.loop_stop() somewhere?

If I just use client.loop(), the token refresh looks like it works, but I see performance issues: my 30-second loop now takes 60 seconds.

@kilabyte
Author

kilabyte commented Dec 14, 2019

OK, I think I figured it out. I kept the async loop calls. I put a client.loop_start() in my getClient method right before the return. Then, when the JWT refresh is triggered, I call client.loop_stop(). Once the connection is recreated, it starts the loop again, since I have it abstracted into a connect method. I'm going to run this for a bit and see what I get.
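
The arrangement described in this comment might look roughly like the sketch below. It is a guess at the shape, not the poster's code: `ManagedConnection` and `make_client` are hypothetical names, and the client is assumed to expose paho-mqtt's `loop_start()` and `loop_stop()`.

```python
class ManagedConnection:
    """Owns one client at a time and keeps its background loop running."""

    def __init__(self, make_client):
        # Hypothetical factory: returns a connected client with a fresh JWT.
        self._make_client = make_client
        self.client = None

    def connect(self):
        """Create a client and start its loop right before handing it back."""
        client = self._make_client()
        client.loop_start()
        self.client = client
        return client

    def refresh(self):
        """Call when the JWT is due for renewal."""
        if self.client is not None:
            self.client.loop_stop()  # stop the old background thread first
        return self.connect()        # new client, new loop
```

Whether the old client should also be disconnected explicitly before it is replaced is not stated in the comment, so the sketch leaves that step out.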

@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Dec 14, 2019
@leahecole leahecole added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Dec 16, 2019
@gguuss
Contributor

gguuss commented Dec 16, 2019

Sounds good, @kilabyte - I have created a new branch where we cleanly disconnect the MQTT sample. Feel free to reopen the issue if you're still seeing the client disconnecting or crashing.

@gguuss gguuss closed this as completed Dec 16, 2019
@kilabyte
Author

kilabyte commented Dec 17, 2019

@gguuss can you explain why you used .loop() and .loop_stop() in the same file? Also, I think for performance your PR would be better off using the async calls; that way you don't have to explicitly disconnect, as I have found this has issues with the backoff in the disconnect callback.

@gguuss
Contributor

gguuss commented Dec 17, 2019

@kilabyte Which file are you referring to? That would be a bug if it's happening. In my anecdotal tests, there is no harm in mixing calls to loop() with loop_start() and loop_stop(), but if you call loop_start() without ever calling loop_stop() later on, the program will hang.

If you're referring to the removed lines from manager_test.py, those were a correction of the issue I'm referring to. In the cloud_iot_mqtt.py example I'm using the synchronous / explicit methods because they are easier to debug. Replacing them with the loop_start() / loop_stop() methods would be the first thing I would do in real-world code.

@kilabyte
Author

Gotcha! Thanks for the explanation :) Much appreciated. For handling things like the network going offline and disconnects, that is where the backoff comes in, right? And then any offline message that was "published" is stored locally until the network is back, at which point the messages all get sent? (Sorry, I know this is off topic for the issue.)

@gguuss
Contributor

gguuss commented Dec 18, 2019

@kilabyte Yes, you got it. The backoff is there to ensure devices behave well under a number of circumstances. The sample/demo/example code itself does not buffer messages or do anything intelligent in terms of online / offline handling. There are other solutions built on top of Cloud IoT Core that are more robust.
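
For readers wondering what backoff plus local buffering could look like, here is a small illustrative sketch. None of it comes from the sample: the 1-second start, the 32-second cap, the jitter, the `Outbox` class, and the `send` callable (for example a client's publish method) are all assumptions made for the example.

```python
import collections
import random

MAXIMUM_BACKOFF = 32.0  # assumed cap, in seconds


def backoff_delays(minimum=1.0, maximum=MAXIMUM_BACKOFF, rng=random.random):
    """Yield reconnect delays: a doubling base plus up to 1 s of jitter."""
    base = minimum
    while True:
        yield base + rng()
        base = min(base * 2, maximum)


class Outbox:
    """Hold messages while offline and flush them in order once reconnected."""

    def __init__(self, maxlen=1000):
        # Once `maxlen` is reached, the oldest buffered message is dropped.
        self._pending = collections.deque(maxlen=maxlen)

    def publish(self, send, online, topic, payload):
        """Send now if online, otherwise keep the message for later."""
        if online:
            send(topic, payload)
        else:
            self._pending.append((topic, payload))

    def flush(self, send):
        """Send everything buffered, oldest first; return how many were sent."""
        sent = 0
        while self._pending:
            topic, payload = self._pending.popleft()
            send(topic, payload)
            sent += 1
        return sent
```

A disconnect handler would sleep for `next(delays)` before each reconnect attempt, and a connect handler would call `flush(...)` once the connection is accepted. The bounded deque is a deliberate trade-off: it caps memory on a small device at the cost of losing the oldest messages during a long outage.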

@kilabyte
Author

Do you have links to said examples?

@gguuss
Contributor

gguuss commented Dec 19, 2019

A few come to mind:

There's also the device SDK written in C.

@kilabyte
Author

thank you!


6 participants