Bug 7376 - SBC is sometimes unresponsive
: SBC is sometimes unresponsive
Status: CLOSED FIXED
Product: SB Controller
Classification: Unclassified
Component: UI
: unspecified
: PC Windows XP
: P1 critical (vote)
: 7.0.1
Assigned To: Richard Titmuss
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-03-02 09:21 UTC by sbjaerum
Modified: 2009-09-08 09:20 UTC (History)
4 users (show)

See Also:
Category: ---


Attachments
Wireshark capture during error situation (41.66 KB, application/octet-stream)
2008-03-27 13:54 UTC, sbjaerum
Details
wireshark capture during successful pause (47.68 KB, application/octet-stream)
2008-03-27 14:19 UTC, sbjaerum
Details
networ.http log (444.46 KB, text/plain)
2008-03-31 14:22 UTC, sbjaerum
Details
cometd + http log (655.69 KB, text/plain)
2008-03-31 14:59 UTC, sbjaerum
Details
Squeezecenter log (775.01 KB, text/plain)
2008-04-01 09:40 UTC, sbjaerum
Details
Jive log (23.29 KB, text/plain)
2008-04-01 09:41 UTC, sbjaerum
Details
Wireshark dump (23.68 KB, application/octet-stream)
2008-04-01 09:45 UTC, sbjaerum
Details
SBC logging (Jeff Flowerday) (152.68 KB, text/plain)
2008-04-03 07:52 UTC, Jeff Flowerday
Details
Another dump (696.85 KB, text/plain)
2008-04-03 08:32 UTC, Jeff Flowerday
Details

Note You need to log in before you can comment on or make changes to this bug.
Description sbjaerum 2008-03-02 09:21:08 UTC
Steps to reproduce:
1. Set SBC to control a SBR, and start play music on SBR.
2. Let SBC go to sleep.
3. Press pause on SBC, the music does not pause. Can't browse library. Seems like SBC is in some congested state.
4. Eventually SBC becomes responsive again.

It might not happen all the time, but after some iterations of the above sequence, this problem always shows up on my SBC.
SBC version is r2085.
Comment 1 Blackketter Dean 2008-03-02 19:55:02 UTC
My darling wife has been complaining about this too.  Sometimes takes 10 seconds to get all connected up.
Comment 2 sbjaerum 2008-03-02 23:38:02 UTC
Dean, I don't think we are talking about the same thing. What I am observing is not related to the connection time when coming out of suspend. I am seeing the problem when SBC is in sleep mode, i.e. screen turned off and lowered CPU clock (happens after one minute of inactivity). The congested state can last much longer than 10 seconds. The SBC just consumes pause presses with no effect on the playback on SBR. In this state, I can navigate the menus just fine. But when entering a menu that requires communication with server, for instance browse Albums, it just hangs at the right arrow animation stage.

Looks like SBC is missing a response on the first button pres, and all subsequent SBC <-> SC communication is blocked as a result of this.

I believe it is easier to reproduce this if the SBC lies flat on a table, with the pause button event waking it up from sleep, i.e. motion sensor is not triggered first.

Comment 3 sbjaerum 2008-03-03 09:11:30 UTC
I see this in the log:
Mar  3 18:08:46 (none) user.err jive: (NetworkThread.lua:136) - network thread timeout for Task(SocketHttp {DELL_Request}(R))
Mar  3 18:08:46 (none) user.err jive: (SocketHttp.lua:443) - SocketHttp {DELL_Request}:t_rcvHeaders.pump:nil

Don't know if it is related.
Comment 4 sbjaerum 2008-03-07 09:54:55 UTC
When the SBC suddenly starts to respond, it sometimes executed all the play button presses being blocked somewhere.

This must be a high priority bug. A common user scenario is to have music playing, then grab the controller (not used for a couple minutes) to pause the music. Everyone would expect that the music pauses immediately. I can't imagine that I am the only one experiencing this...
Comment 5 sbjaerum 2008-03-11 12:00:22 UTC
The analog clock screensaver sometimes seems to kick in too early in this situation.
Comment 6 sbjaerum 2008-03-27 02:30:19 UTC
Complaints about this problem is starting to pop up in the forum:
http://forums.slimdevices.com/showthread.php?t=45328

Also I think bug 7603 is a duplicate of this bug.
Comment 7 Richard Titmuss 2008-03-27 02:56:16 UTC
Steinar, can you try editing the /etc/init.d/wlan script and commenting out the 'iwconfig eth0 power on' line. Your battery life will take a serious dent, but I am interested if this fixes the unresponsiveness problem.
Comment 8 sbjaerum 2008-03-27 05:26:05 UTC
According to http://wiki.slimdevices.com/index.cgi?JivePowerModes the power to wlan is not touched when going into sleep mode.
Still worthwhile to do the test?
Comment 9 Richard Titmuss 2008-03-27 06:12:25 UTC
wlan power save mode is enabled all the time, this change disables it. yes it is worth doing the test. it is likely i will be making a change that actively enables/disable wlan power save, so i would be interested to know if this will solve your problem.

Comment 10 sbjaerum 2008-03-27 08:03:21 UTC
I will try to squeeze in time to do the test tonight.
Comment 11 sbjaerum 2008-03-27 10:07:05 UTC
I uncommented the requested line in /etc/init.d/wlan
Unfortunately no effect on this issue.
Comment 12 Richard Titmuss 2008-03-27 10:10:28 UTC
Do you 'uncommented'? I requested that the line was commented. It should look like this:

    # Enable power save
    #/usr/sbin/iwconfig eth0 power on

And you'll need to reboot after the change.
Comment 13 sbjaerum 2008-03-27 10:13:39 UTC
Typo by me.
I commented the line, so that it looks like the line specified.
Then rebooted.
Bug shows up immediately.
Comment 14 sbjaerum 2008-03-27 13:54:38 UTC
Created attachment 3137 [details]
Wireshark capture during error situation

Wireshark capture during error situation.
Comment 15 sbjaerum 2008-03-27 14:19:51 UTC
Created attachment 3138 [details]
wireshark capture during successful pause

wireshark capture during successful pause
Comment 16 sbjaerum 2008-03-27 16:01:55 UTC
What is a SYN packet?
When pause is successful, I see such packets just before the pause command is sent.
When failing, those SYN packets are missing.
When in the unresponsive state, it remains unresponsive until I see those SYN packets in wireshark. After the SYN packets, Jive behaves normal.
Comment 17 Richard Titmuss 2008-03-31 03:39:31 UTC
*** Bug 7603 has been marked as a duplicate of this bug. ***
Comment 18 Richard Titmuss 2008-03-31 04:43:08 UTC
The SYN packets are sent as part of the TCP handshake, it indicates that a new TCP connection has been established before the successful pause command. I don't think the lack of the SYN packets are the problem.

Steinar, what operating system do you use?

From the packet captures in both cases the POST from jive is seen. In the failure case no HTTP response is seen from SC. I assume that jive must be timing out the connection, and re-establishing a new connection which is why it starts to work.

So I think that either:
1. SC is dropping this HTTP request
2. Something is going wrong before this that means Jive has an invalid Comet clientId. What does SC do in this case?

Andy, do you have any ideas?
Comment 19 Andy Grundman 2008-03-31 06:58:41 UTC
Can you get a network.http debug?  This would show if SC gets the HTTP connection at all.
Comment 20 sbjaerum 2008-03-31 11:35:12 UTC
In response to comment #18:
I observe a consistent one to one correlation between the presence/nonpresence of the SYN packets and the responsive/nonresponsive situation. Might be just a coincidence.

SC is running on WinXP.
Comment 21 sbjaerum 2008-03-31 13:52:47 UTC
I enabled network.http logging.

On successful pause requests, I see messages like this in the log:

[08-03-31 22:33:12.8260] Slim::Web::HTTP::acceptHTTP (259) Accepted connection 5 from 10.0.0.6:60113
[08-03-31 22:33:12.8324] Slim::Web::HTTP::processHTTP (360) Reading request...
[08-03-31 22:33:12.8346] Slim::Web::HTTP::processHTTP (377) HTTP request: from 10.0.0.6:60113 (Slim::Web::HTTP::ClientConn=GLOB(0x66d848c)) for POST HTTP/1.1 /cometd
[08-03-31 22:33:12.8361] Slim::Web::HTTP::processHTTP (384) Raw request headers: [
POST /cometd HTTP/1.1
Accept-Language: en
Host: 10.0.0.4:9000
User-Agent: Jive/7.0.1 r2158
Content-Length: 133
Content-Type: text/json

[{"id":40,"data":{"request":["22:91:92:10:11:5b",["pause","0"]],"response":"\/77378f4d\/slim\/request"},"channel":"\/slim\/request"}]
]
[08-03-31 22:33:12.8379] Slim::Web::HTTP::processHTTP (484) Raw path is [/cometd]
[08-03-31 22:33:12.8394] Slim::Web::HTTP::processHTTP (492) Handling [/cometd] using raw function

On failing pause requests, I see nothing like this in the network.http log.
Comment 22 Richard Titmuss 2008-03-31 13:54:56 UTC
I think it may be important to see what's earlier in the log, leading up to the failure. Can you post that?
Comment 23 sbjaerum 2008-03-31 14:22:32 UTC
Created attachment 3164 [details]
networ.http log

No signs of the pause command a bit more than one minute later than the first.
Comment 24 sbjaerum 2008-03-31 14:59:05 UTC
Created attachment 3167 [details]
cometd + http log

Log starts when music is unpaused.
Somewhere in there, pause was pressed without response.
Log ends when music was paused with remote control.
Comment 25 sbjaerum 2008-04-01 09:40:48 UTC
Created attachment 3180 [details]
Squeezecenter log
Comment 26 sbjaerum 2008-04-01 09:41:18 UTC
Created attachment 3181 [details]
Jive log
Comment 27 sbjaerum 2008-04-01 09:45:33 UTC
Created attachment 3182 [details]
Wireshark dump

The three latest attachments should be approximately synchronized in time.
Starts when Jive is used to unpause music.
Ends when remote control is used to pause music.
In between there is a Jive pause button pushed that is not responded to.
Comment 28 Richard Titmuss 2008-04-01 14:18:20 UTC
From campfire:


Richard T.  	
steinar, the logs look good. i still don't understand what's wrong but they capture the problem well.
Richard T. 	
andy, do you have a few minutes to look at them. it all seems very strange.
Richard T. 	
my reading of the logs is:
Richard T. 	
in the SC log you need to look for port 43917.
Richard T. 	
you can see the connection at 18:10:52 and a pause request is sent on that connection.
Richard T. 	
this request is handled and the response sent during 18:10:52.
Richard T. 	
the next time this port is logged is on a keep-alive timeout, at 18:12:07.
Richard T. 	
so this is approx. 75 seconds later.
Richard T. 	
moving to the network capture.
Richard T. 	
the connection to port 43917 starts at 1.05, and the first request is ACK'd by jive at 1.23.
Apr 1 	
12:30 PM
Richard T. 	
the next time port 43917 appears is at 85.72 when jive posts another request. you can see the tcp ACK, but no http response.
Richard T. 	
i reckon we should see a tcp connection close approx 75 seconds after 1.23, so around 76.23. it's missing.
Richard T. 	
then on the jive log:
Richard T. 	
the first post appears to be missing, but the second request is queued at 18:12:12. the http state machine for DELL_Request gets to t_rcvHeaders (18:12:14) and moves no further. so jive is waiting for the http response.
Richard T. 	
-
Richard T. 	
to me this clearly looks like SC/XP has not fully closed the connection on port 43917. and even more worrying is the XP is still sending tcp ack's for this connection.
Apr 1 	
12:35 PM
Andy G. 	
so is it only a problem under windows?
Richard T. 	
not much in closeHTTPSocket (where the keep-alive is logged) looks like it can go wrong. unless the the close handler (if any) is fails, what would happen then?
Richard T. 	
well it's hard to say for sure. but i've never seen it running on linux.
Richard T. 	
but i could not recreate on xp last night, i'll try again.
Andy G. 	
hmm I guess if closeHTTPSocket crashed it would be bad, but the log should show that
Apr 1 	
1:10 PM
Richard T. 	
ok i really can't recreate this.
Richard T. 	
watching the SC logs and wireshark i see the tcp FIN packets every time SC closes the connection.
Richard T. 	
(tail for win32 is nice, it colours the display based on keywords. i've often wanted that!)
Richard T. 	
so Steinar, what version/patch level of XP are you running. and any anti-virus software?
Richard T. 	
or firewalls.
Apr 1 	
1:15 PM
Steinar B. 	
XP service pack 2. Norton Internet Security.
Richard T. 	
ok, what happens if you disable NIS?
Steinar B. 	
Let me try.
Apr 1 	
1:30 PM
Steinar B. 	
has left the room
Steinar B. 	
has entered the room
Steinar B. 	
Disabled firewall and virus protection. Was still able to recreate problem.
Richard T. 	
hmm.
Apr 1 	
1:40 PM
Richard T. 	
all i can suggest is putting a log statement at the end of closeHTTPSocket to make completely sure that the function is not crashing.
Apr 1 	
1:55 PM
Steinar B. 	
I put a print statement at start of closeHTTPSocket, and one at the end. The entire function is always executed. Are you sure that closeHTTPSocket is actually called when it should?
Richard T. 	
yes, it's not a problem with when it's called.
Richard T. 	
basically SC is closing the connection as it has been idle for several seconds.
Richard T. 	
this should be fine as jive will see the connection close, and will reopen next time it sends a command. the problem is the connection close is getting lost.
Richard T. 	
so can you try with the $httpClient->shutdown(2) again, and double check that this is actually called (add some extra debug).
Apr 1 	
2:00 PM
Richard T. 	
certainly the docs on the M$ site suggest that calling shutdown() is the right thing to do on windows.
Apr 1 	
2:05 PM
Comment 29 Richard Titmuss 2008-04-01 14:18:48 UTC
I didn't mean to change the assignment.
Comment 30 sbjaerum 2008-04-02 13:15:26 UTC
I am pleased to inform that the patch below (suggested by Richard) has cured the problem:

Index: server/Slim/Web/HTTP.pm
===================================================================
--- server/Slim/Web/HTTP.pm     (revision 18292)
+++ server/Slim/Web/HTTP.pm     (working copy)
@@ -2541,7 +2541,7 @@
        # little more assertive about closing the socket. Windows-only
        # for now, but could be considered for other platforms and
        # non-streaming connections.
-       if ($streaming && Slim::Utils::OSDetect::OS() eq 'win') {
+       if (Slim::Utils::OSDetect::OS() eq 'win') {
                $httpClient->shutdown(2);
        }
Comment 31 Andy Grundman 2008-04-02 14:48:44 UTC
Great, applied as 7.0.1 change 18309.
Comment 32 Richard Titmuss 2008-04-02 14:49:33 UTC
*** Bug 7697 has been marked as a duplicate of this bug. ***
Comment 33 Jeff Flowerday 2008-04-02 15:03:41 UTC
How long before we can get at this build?
Comment 34 Richard Titmuss 2008-04-02 15:04:46 UTC
Tomorrow, it's built overnight.
Comment 35 Jeff Flowerday 2008-04-02 17:21:54 UTC
If the fix is changing the one line in server/Slim/Web/HTTP.pm as indicated above and restarting my service, then I can confirm that the change doesn't fix the problem.
Comment 36 Andy Grundman 2008-04-02 17:26:34 UTC
Changing the file doesn't work in the Windows build unless you run slimserver.pl with ActivePerl.
Comment 37 Jeff Flowerday 2008-04-03 07:10:35 UTC
I just downloaded 18321 and installed it, updated the remote when prompted.

It didn't fix it.  I browse to a song, start it playing, sit the remote down for 20-30 seconds, pick it up hit back till I'm back at my artists, try entering an artist and all I get is the spinning >

I then hit back, wait a minute or two and the remote then allows me into that artist. If I don't hit back the spinning > will go on forever.
Comment 38 Richard Titmuss 2008-04-03 07:16:57 UTC
Jeff could capture some debug logs from the SBC on an SD card (if you have one available). To do this create a /log folder on the SD card, but it into your SBC and reboot. The in Settings > Advanced Settings > Logging, change net.http and net.comet debugging from INFO to DEBUG. Then recreate the problem a two or three times and power off the SBC. If you could upload the logs from the SD card that would be great.

Thanks.
Comment 39 Jeff Flowerday 2008-04-03 07:52:01 UTC
Created attachment 3195 [details]
SBC logging (Jeff Flowerday)

I got it to happen once, but then the SBC controller paniced because of FAT issues on my SD card.  This is what it captured up to the point where SBC rebooted.
Comment 40 Jeff Flowerday 2008-04-03 08:32:44 UTC
Created attachment 3196 [details]
Another dump

Since my SBC crashed and rebooted it's been pretty hard to reproduce the problem.  It happened once in this log.
Comment 41 Fredrik Storm 2008-04-03 12:46:29 UTC
Downloaded latest 7.0.1-18321
Reinstalled and manually (meny,settings,advanced....)uploaded the Jive.
Now tried at least 30-40 times tonight to replicate the bug 7376.
This was something I had major problems with before.
It seem to be working now. I will try even more during weekend.
GREAT
Running XP SP2, Comodo Firewall, AVG Antivirus + spyware, and an Speedtouch 100mbit+Router
Comment 42 Jeff Flowerday 2008-04-03 22:00:37 UTC
I had problems initially this morning after installing the update on the controller.  After another forced reboot of the controller because of the FAT filesystem panic when trying to log, it's working much much better.  I'm not sure if the logs told you guys anything or maybe they were legit wireless network issues, I don't know.

I have been completely unable to reproduce the problem tonight.  I've since switched my network back to WPA2 and it's working well in WPA2 mode as well.

Unless you saw something in the logs I sent you, I have no issue with you closing this again.  If it happens again I'll catch more logs for you and post here.

Thanks,
Jeff
Comment 43 Richard Titmuss 2008-04-04 01:47:54 UTC
Thanks Jeff. I have just has a quick look at your logs and nothing obvious stands out as a connection problem. If you do get problems again in addition to the jive logs a wireshark capture for tcp port 9000 on the server would probably be most useful to.
Comment 44 James Richardson 2008-05-08 09:29:50 UTC
Verified fixed in 7.0.1 - 19325  r2409 - r23
Comment 45 James Richardson 2008-05-15 12:28:42 UTC
This bug has recently been fixed in the latest release of SqueezeCenter 7.0.1

Please try that version, if you still see the error, then reopen this bug.

To download this version, please navigate to: http://www.slimdevices.com/su_downloads.html