Bug 16617 - Controller (SqueezePlay) may never reconnect fully to MySB.com after connection loss
: Controller (SqueezePlay) may never reconnect fully to MySB.com after connecti...
Status: CLOSED FIXED
Product: SqueezePlay
Classification: Unclassified
Component: SqueezeNetwork
: unspecified
: All Debian Linux
: P3 major (vote)
: 7.6.0
Assigned To: Bradley D. Wall
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-10-18 07:08 UTC by Alan Young
Modified: 2011-05-18 13:37 UTC (History)
6 users (show)

See Also:
Category: Bug


Attachments
Proposed patch (1.09 KB, patch)
2010-10-18 07:10 UTC, Alan Young
Details | Diff
Log where SB Radio gets into this state (155.16 KB, text/plain)
2010-10-20 00:53 UTC, Alan Young
Details
Proposed patch (2.92 KB, text/plain)
2010-10-28 03:30 UTC, Alan Young
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alan Young 2010-10-18 07:08:28 UTC
If SqueezeNetwork responds with a reconnect=none advice to a Comet request, then the Comet module will go into state DISCONNECTED but the SlimServer module, which manages connections, will think it is in state CONNECTING. Not even a new user request, such as "Try again" from the UI, will force a reconnection attempt.
Comment 1 Alan Young 2010-10-18 07:10:47 UTC
Created attachment 7011 [details]
Proposed patch
Comment 2 Mickey Gee 2010-10-18 09:29:24 UTC
Fix in special firmware to try out
Comment 3 Alan Young 2010-10-20 00:51:20 UTC
Note, this bug is specifically about the blue-icon problem when connected (connecting) to mysqueezebox.com and when the player network-level connection remains ok. That is, it is possible to ping the player and to connect to it over ssh. Bug 16575 covers the case where the network connection no longer works.
Comment 4 Alan Young 2010-10-20 00:53:28 UTC
Created attachment 7013 [details]
Log where SB Radio gets into this state
Comment 5 Alan Young 2010-10-20 01:32:04 UTC
Some observations from the log just attached:

1. SlimProto reconnected at 17:53:

Oct 18 17:53:37 squeezeplay: INFO   net.slimproto - SlimProto.lua:754 connection error: closed, reconnecting in 2.17 seconds
Oct 18 17:53:39 squeezeplay: INFO   net.slimproto - SlimProto.lua:577 connect to baby.squeezenetwork.com (79.125.21.137)
Oct 18 17:53:39 squeezeplay: INFO   net.slimproto - SlimProto.lua:135 Send HELO: reconnect-bit=true bytesReceived(H,L)=0,75755591

2. 18:01:51 was probably the next time I initiated any user action (Pause). I do not recall observing the state of the wireless icon. At that time we had:

Oct 18 18:01:51 squeezeplay: INFO   applet.SlimMenus - SlimMenusApplet.lua:441 _menuSink(1) nil menuDirective: add isCurrentServer:true
Oct 18 18:01:51 squeezeplay: INFO   applet.SlimMenus - SlimMenusApplet.lua:713 hiding any 'connecting to server' popup after menu response from current server, SlimServer {mysqueezebox.com}
Oct 18 18:01:51 squeezeplay: INFO   applet.ChooseMusicSource - ChooseMusicSourceApplet.lua:543 Hiding popup, exists?: nil
Oct 18 18:02:01 squeezeplay: INFO   applet.ScreenSavers - ScreenSaversApplet.lua:292 activating Clock screensaver
Oct 18 18:02:51 squeezeplay: WARN   net.thread - NetworkThread.lua:146 network thread timeout for Task(SocketHttp {mysqueezebox.com_Request}(R))
Oct 18 18:02:53 squeezeplay: WARN   net.thread - NetworkThread.lua:146 network thread timeout for Task(SocketHttp {baby.squeezenetwork.com}(R))
Oct 18 18:04:58 squeezeplay: INFO   applet.ScreenSavers - ScreenSaversApplet.lua:292 activating Clock screensaver

I think that this suggests that the Comet connection has been lost.

3. Over the next 2h 45m various message which are probably consistent with not being connected are logged.

4. At 20:51 we get:

Oct 18 20:51:22 squeezeplay: WARN   net.comet - Comet.lua:876 Comet {mysqueezebox.com}: _response, /meta/reconnect id=nil failed: invalid clientId

That is the first "invalid clientId" message.

5. At 21:04 SlimProto reconnects again:

Oct 18 21:04:36 squeezeplay: INFO   net.slimproto - SlimProto.lua:754 connection error: closed, reconnecting in 4.743 seconds
Oct 18 21:04:41 squeezeplay: INFO   net.slimproto - SlimProto.lua:577 connect to baby.squeezenetwork.com (79.125.21.137)
Oct 18 21:04:41 squeezeplay: INFO   net.slimproto - SlimProto.lua:135 Send HELO: reconnect-bit=0 bytesReceived(H,L)=0,86593044

6. There is little activity for several hours until another "/meta/reconnect id=nil failed: invalid clientId", and then another a few hours later. Then there is:

Oct 19 20:27:03 squeezeplay: WARN   net.comet - Comet.lua:876 Comet {mysqueezebox.com}: _response, /meta/reconnect id=nil failed: invalid clientId
Oct 19 20:27:06 squeezeplay: WARN   net.comet - Comet.lua:876 Comet {mysqueezebox.com}: _response, /slim/subscribe id=nil failed: invalid clientId
Oct 19 20:28:06 squeezeplay: WARN   net.thread - NetworkThread.lua:146 network thread timeout for Task(SocketHttp {mysqueezebox.com_Request}(R))

7. From then on, the following recurs with 60s between each sequence:

Oct 19 20:28:06 squeezeplay: WARN   net.thread - NetworkThread.lua:146 network thread timeout for Task(SocketHttp {mysqueezebox.com_Request}(R))
Oct 19 20:28:13 squeezeplay: WARN   net.thread - NetworkThread.lua:146 network thread timeout for Task(SocketHttp {mysqueezebox.com_Chunked}(R))
Oct 19 20:28:14 squeezeplay: INFO   applet.SlimMenus - SlimMenusApplet.lua:441 _menuSink(50) SlimServer {oz} menuDirective: nil isCurrentServer:nil
Oct 19 20:28:14 squeezeplay: INFO   applet.SlimMenus - SlimMenusApplet.lua:556 Eliminated myMusic node from server, since now handled locally
Oct 19 20:28:14 squeezeplay: WARN   net.comet - Comet.lua:876 Comet {mysqueezebox.com}: _response, /slim/subscribe id=nil failed: invalid clientId

8. Finally, we get:

Oct 19 21:14:31 squeezeplay: WARN   net.comet - Comet.lua:876 Comet {mysqueezebox.com}: _response, nil id=nil failed: POST body too large (max=51200)
Oct 19 21:14:31 squeezeplay: WARN   net.comet - Comet.lua:956 Comet {mysqueezebox.com}: _response, unknown error: POST body too large (max=51200)

I think that this results in a reconnect=none message from SN and so no more retry attempts occur.
Comment 6 Mickey Gee 2010-10-25 09:14:14 UTC
Related to bug 15919
Comment 7 Alan Young 2010-10-28 02:58:48 UTC
*** Bug 15919 has been marked as a duplicate of this bug. ***
Comment 8 Alan Young 2010-10-28 03:27:01 UTC
In addition to the earlier error of the possibility of getting into the advice reconnect=none state, with no way out, several others have been found, including one that will provoke that state.

1. If the packet containing the /meta/connect and /meta/subscribe requests is lost (presumably due to a network error or server process failure)then the following /meta/reconnect request does not include the /meta/subscribe. As a consequence, none of the /slim/subscribe subscriptions are honoured, and the connection repeatedly times out due to no network traffic. 

2.1 If the packet containing the /slim/subscribe subscription requests that follows on after a successful handshake, or their acknowledgements, are lost (presumably due to a server process failure) and a new handshake is then required, then the set of active subscriptions are added as new /slim/subscribe requests to the set of  outstanding requests (which already contains the old /slim/subscribe requests) to be sent when the new connection is established. That is, the /slim/subscribe requests are duplicated.

2.2. After 2.1, when the new connection is established, the clientId strings in the outstanding requests are replaced with the new clientId from the successful handshake. The regular expression used to do this is insufficient to handle the syntax of clientId strings used by SqueezeNetwork (SN, mysqueezebox.com) and so, in this case (there is no such problem with Squeezebox Server), the requests are sent with invalid clientIds. This causes SN to reject the requests and force a new handshake. This leads to further duplicates /slim/subscribe requests in the next connection attempt.

2.3 The build-up of duplicate /slim/subscribe requests caused by 2.1 and 2.2 results in a request that is too large for the server (the limit is 5 KiB) which in turn results in it sending a error contain advice for reconnect=none, the originally reported problem.
Comment 9 Alan Young 2010-10-28 03:30:57 UTC
Created attachment 7018 [details]
Proposed patch

Fix incorrect use of self.state (should be oldstate) in setEndpoint(). Not actually provoking a bug at present because the only call to setEndpoint() is bracketed by disconnect()/connect() calls.

Server may send reconnection advice of reconnect=none under certain error conditions. This causes Comet.lua to enter UNCONNECTED state and not attempt a retry, but SlimServer.lua will assume that it is CONNECTING state (retrying). In order for it to be possible to escape from this situation, in the case that whatever provoked the reconnect=none was actually a transient condition, we allow a new request to force a reconnection attempt.

When re-enabling subscriptions that were previously active as part of a new handshake, also remove them from the set of sent requests that have not yet been acknowledged, and that will get resent when the new connection is established; otherwise duplicate subscription requests can be sent.

The RE used to match clientId strings in the response channels of subscription requests in the queue was insufficient. It needs also to match 'X' as well as hexadecimal characters. 

Include /meta/subscribe along with /meta/reconnect in case this reconnect is happening after a successful handshake but failed /meta/connect.
Comment 10 SVN Bot 2010-10-28 12:51:53 UTC
 == Auto-comment from SVN commit #9464 to the network repo by agrundman ==
 == http://svn.slimdevices.com/network?view=revision&revision=9464 ==

Bug 16617, handle the case where a /meta/reconnect comes in without an existing /meta/subscribe having been setup
Comment 11 SVN Bot 2010-10-29 06:42:59 UTC
 == Auto-comment from SVN commit #9470 to the network repo by agrundman ==
 == http://svn.slimdevices.com/network?view=revision&revision=9470 ==

Bug 16617, workaround for bad /slim/subscribe packet issue
Comment 12 Mickey Gee 2010-11-02 10:40:39 UTC
*** Bug 16645 has been marked as a duplicate of this bug. ***
Comment 13 SVN Bot 2010-11-03 02:04:09 UTC
 == Auto-comment from SVN commit #9209 to the jive repo by ayoung ==
 == http://svn.slimdevices.com/jive?view=revision&revision=9209 ==

Fixed bug 16617: Controller (SqueezePlay) may never reconnect fully to MySB.com after connection loss 

Fix incorrect use of self.state (should be oldstate) in setEndpoint(). Not actually provoking a bug at present because the only call to setEndpoint() is bracketed by disconnect()/connect() calls.

Server may send reconnection advice of reconnect=none under certain error conditions. This causes Comet.lua to enter UNCONNECTED state and not attempt a retry, but SlimServer.lua will assume that it is CONNECTING state (retrying). In order for it to be possible to escape from this situation, in the case that whatever provoked the reconnect=none was actually a transient condition, we allow a new request to force a reconnection attempt.

When re-enabling subscriptions that were previously active as part of a new handshake, also remove them from the set of sent requests that have not yet been acknowledged, and that will get resent when the new connection is established; otherwise duplicate subscription requests can be sent.

The RE used to match clientId strings in the response channels of subscription requests in the queue was insufficient. It needs also to match 'X' as well as hexadecimal characters. 

Include /meta/subscribe along with /meta/reconnect in case this reconnect is happening after a successful handshake but failed /meta/connect.
Comment 14 Alan Young 2010-11-17 23:02:33 UTC
*** Bug 16645 has been marked as a duplicate of this bug. ***
Comment 15 Bradley D. Wall 2011-05-06 15:21:24 UTC
Update?
Comment 16 Bradley D. Wall 2011-05-11 16:07:17 UTC
Not seeing messages in server log 7.6.0 32398.  Going to close.