Bugzilla – Bug 7376
SBC is sometimes unresponsive
Last modified: 2009-09-08 09:20:03 UTC
Steps to reproduce: 1. Set SBC to control a SBR, and start play music on SBR. 2. Let SBC go to sleep. 3. Press pause on SBC, the music does not pause. Can't browse library. Seems like SBC is in some congested state. 4. Eventually SBC becomes responsive again. It might not happen all the time, but after some iterations of the above sequence, this problem always shows up on my SBC. SBC version is r2085.
My darling wife has been complaining about this too. Sometimes takes 10 seconds to get all connected up.
Dean, I don't think we are talking about the same thing. What I am observing is not related to the connection time when coming out of suspend. I am seeing the problem when SBC is in sleep mode, i.e. screen turned off and lowered CPU clock (happens after one minute of inactivity). The congested state can last much longer than 10 seconds. The SBC just consumes pause presses with no effect on the playback on SBR. In this state, I can navigate the menus just fine. But when entering a menu that requires communication with server, for instance browse Albums, it just hangs at the right arrow animation stage. Looks like SBC is missing a response on the first button pres, and all subsequent SBC <-> SC communication is blocked as a result of this. I believe it is easier to reproduce this if the SBC lies flat on a table, with the pause button event waking it up from sleep, i.e. motion sensor is not triggered first.
I see this in the log: Mar 3 18:08:46 (none) user.err jive: (NetworkThread.lua:136) - network thread timeout for Task(SocketHttp {DELL_Request}(R)) Mar 3 18:08:46 (none) user.err jive: (SocketHttp.lua:443) - SocketHttp {DELL_Request}:t_rcvHeaders.pump:nil Don't know if it is related.
When the SBC suddenly starts to respond, it sometimes executed all the play button presses being blocked somewhere. This must be a high priority bug. A common user scenario is to have music playing, then grab the controller (not used for a couple minutes) to pause the music. Everyone would expect that the music pauses immediately. I can't imagine that I am the only one experiencing this...
The analog clock screensaver sometimes seems to kick in too early in this situation.
Complaints about this problem is starting to pop up in the forum: http://forums.slimdevices.com/showthread.php?t=45328 Also I think bug 7603 is a duplicate of this bug.
Steinar, can you try editing the /etc/init.d/wlan script and commenting out the 'iwconfig eth0 power on' line. Your battery life will take a serious dent, but I am interested if this fixes the unresponsiveness problem.
According to http://wiki.slimdevices.com/index.cgi?JivePowerModes the power to wlan is not touched when going into sleep mode. Still worthwhile to do the test?
wlan power save mode is enabled all the time, this change disables it. yes it is worth doing the test. it is likely i will be making a change that actively enables/disable wlan power save, so i would be interested to know if this will solve your problem.
I will try to squeeze in time to do the test tonight.
I uncommented the requested line in /etc/init.d/wlan Unfortunately no effect on this issue.
Do you 'uncommented'? I requested that the line was commented. It should look like this: # Enable power save #/usr/sbin/iwconfig eth0 power on And you'll need to reboot after the change.
Typo by me. I commented the line, so that it looks like the line specified. Then rebooted. Bug shows up immediately.
Created attachment 3137 [details] Wireshark capture during error situation Wireshark capture during error situation.
Created attachment 3138 [details] wireshark capture during successful pause wireshark capture during successful pause
What is a SYN packet? When pause is successful, I see such packets just before the pause command is sent. When failing, those SYN packets are missing. When in the unresponsive state, it remains unresponsive until I see those SYN packets in wireshark. After the SYN packets, Jive behaves normal.
*** Bug 7603 has been marked as a duplicate of this bug. ***
The SYN packets are sent as part of the TCP handshake, it indicates that a new TCP connection has been established before the successful pause command. I don't think the lack of the SYN packets are the problem. Steinar, what operating system do you use? From the packet captures in both cases the POST from jive is seen. In the failure case no HTTP response is seen from SC. I assume that jive must be timing out the connection, and re-establishing a new connection which is why it starts to work. So I think that either: 1. SC is dropping this HTTP request 2. Something is going wrong before this that means Jive has an invalid Comet clientId. What does SC do in this case? Andy, do you have any ideas?
Can you get a network.http debug? This would show if SC gets the HTTP connection at all.
In response to comment #18: I observe a consistent one to one correlation between the presence/nonpresence of the SYN packets and the responsive/nonresponsive situation. Might be just a coincidence. SC is running on WinXP.
I enabled network.http logging. On successful pause requests, I see messages like this in the log: [08-03-31 22:33:12.8260] Slim::Web::HTTP::acceptHTTP (259) Accepted connection 5 from 10.0.0.6:60113 [08-03-31 22:33:12.8324] Slim::Web::HTTP::processHTTP (360) Reading request... [08-03-31 22:33:12.8346] Slim::Web::HTTP::processHTTP (377) HTTP request: from 10.0.0.6:60113 (Slim::Web::HTTP::ClientConn=GLOB(0x66d848c)) for POST HTTP/1.1 /cometd [08-03-31 22:33:12.8361] Slim::Web::HTTP::processHTTP (384) Raw request headers: [ POST /cometd HTTP/1.1 Accept-Language: en Host: 10.0.0.4:9000 User-Agent: Jive/7.0.1 r2158 Content-Length: 133 Content-Type: text/json [{"id":40,"data":{"request":["22:91:92:10:11:5b",["pause","0"]],"response":"\/77378f4d\/slim\/request"},"channel":"\/slim\/request"}] ] [08-03-31 22:33:12.8379] Slim::Web::HTTP::processHTTP (484) Raw path is [/cometd] [08-03-31 22:33:12.8394] Slim::Web::HTTP::processHTTP (492) Handling [/cometd] using raw function On failing pause requests, I see nothing like this in the network.http log.
I think it may be important to see what's earlier in the log, leading up to the failure. Can you post that?
Created attachment 3164 [details] networ.http log No signs of the pause command a bit more than one minute later than the first.
Created attachment 3167 [details] cometd + http log Log starts when music is unpaused. Somewhere in there, pause was pressed without response. Log ends when music was paused with remote control.
Created attachment 3180 [details] Squeezecenter log
Created attachment 3181 [details] Jive log
Created attachment 3182 [details] Wireshark dump The three latest attachments should be approximately synchronized in time. Starts when Jive is used to unpause music. Ends when remote control is used to pause music. In between there is a Jive pause button pushed that is not responded to.
From campfire: Richard T. steinar, the logs look good. i still don't understand what's wrong but they capture the problem well. Richard T. andy, do you have a few minutes to look at them. it all seems very strange. Richard T. my reading of the logs is: Richard T. in the SC log you need to look for port 43917. Richard T. you can see the connection at 18:10:52 and a pause request is sent on that connection. Richard T. this request is handled and the response sent during 18:10:52. Richard T. the next time this port is logged is on a keep-alive timeout, at 18:12:07. Richard T. so this is approx. 75 seconds later. Richard T. moving to the network capture. Richard T. the connection to port 43917 starts at 1.05, and the first request is ACK'd by jive at 1.23. Apr 1 12:30 PM Richard T. the next time port 43917 appears is at 85.72 when jive posts another request. you can see the tcp ACK, but no http response. Richard T. i reckon we should see a tcp connection close approx 75 seconds after 1.23, so around 76.23. it's missing. Richard T. then on the jive log: Richard T. the first post appears to be missing, but the second request is queued at 18:12:12. the http state machine for DELL_Request gets to t_rcvHeaders (18:12:14) and moves no further. so jive is waiting for the http response. Richard T. - Richard T. to me this clearly looks like SC/XP has not fully closed the connection on port 43917. and even more worrying is the XP is still sending tcp ack's for this connection. Apr 1 12:35 PM Andy G. so is it only a problem under windows? Richard T. not much in closeHTTPSocket (where the keep-alive is logged) looks like it can go wrong. unless the the close handler (if any) is fails, what would happen then? Richard T. well it's hard to say for sure. but i've never seen it running on linux. Richard T. but i could not recreate on xp last night, i'll try again. Andy G. hmm I guess if closeHTTPSocket crashed it would be bad, but the log should show that Apr 1 1:10 PM Richard T. ok i really can't recreate this. Richard T. watching the SC logs and wireshark i see the tcp FIN packets every time SC closes the connection. Richard T. (tail for win32 is nice, it colours the display based on keywords. i've often wanted that!) Richard T. so Steinar, what version/patch level of XP are you running. and any anti-virus software? Richard T. or firewalls. Apr 1 1:15 PM Steinar B. XP service pack 2. Norton Internet Security. Richard T. ok, what happens if you disable NIS? Steinar B. Let me try. Apr 1 1:30 PM Steinar B. has left the room Steinar B. has entered the room Steinar B. Disabled firewall and virus protection. Was still able to recreate problem. Richard T. hmm. Apr 1 1:40 PM Richard T. all i can suggest is putting a log statement at the end of closeHTTPSocket to make completely sure that the function is not crashing. Apr 1 1:55 PM Steinar B. I put a print statement at start of closeHTTPSocket, and one at the end. The entire function is always executed. Are you sure that closeHTTPSocket is actually called when it should? Richard T. yes, it's not a problem with when it's called. Richard T. basically SC is closing the connection as it has been idle for several seconds. Richard T. this should be fine as jive will see the connection close, and will reopen next time it sends a command. the problem is the connection close is getting lost. Richard T. so can you try with the $httpClient->shutdown(2) again, and double check that this is actually called (add some extra debug). Apr 1 2:00 PM Richard T. certainly the docs on the M$ site suggest that calling shutdown() is the right thing to do on windows. Apr 1 2:05 PM
I didn't mean to change the assignment.
I am pleased to inform that the patch below (suggested by Richard) has cured the problem: Index: server/Slim/Web/HTTP.pm =================================================================== --- server/Slim/Web/HTTP.pm (revision 18292) +++ server/Slim/Web/HTTP.pm (working copy) @@ -2541,7 +2541,7 @@ # little more assertive about closing the socket. Windows-only # for now, but could be considered for other platforms and # non-streaming connections. - if ($streaming && Slim::Utils::OSDetect::OS() eq 'win') { + if (Slim::Utils::OSDetect::OS() eq 'win') { $httpClient->shutdown(2); }
Great, applied as 7.0.1 change 18309.
*** Bug 7697 has been marked as a duplicate of this bug. ***
How long before we can get at this build?
Tomorrow, it's built overnight.
If the fix is changing the one line in server/Slim/Web/HTTP.pm as indicated above and restarting my service, then I can confirm that the change doesn't fix the problem.
Changing the file doesn't work in the Windows build unless you run slimserver.pl with ActivePerl.
I just downloaded 18321 and installed it, updated the remote when prompted. It didn't fix it. I browse to a song, start it playing, sit the remote down for 20-30 seconds, pick it up hit back till I'm back at my artists, try entering an artist and all I get is the spinning > I then hit back, wait a minute or two and the remote then allows me into that artist. If I don't hit back the spinning > will go on forever.
Jeff could capture some debug logs from the SBC on an SD card (if you have one available). To do this create a /log folder on the SD card, but it into your SBC and reboot. The in Settings > Advanced Settings > Logging, change net.http and net.comet debugging from INFO to DEBUG. Then recreate the problem a two or three times and power off the SBC. If you could upload the logs from the SD card that would be great. Thanks.
Created attachment 3195 [details] SBC logging (Jeff Flowerday) I got it to happen once, but then the SBC controller paniced because of FAT issues on my SD card. This is what it captured up to the point where SBC rebooted.
Created attachment 3196 [details] Another dump Since my SBC crashed and rebooted it's been pretty hard to reproduce the problem. It happened once in this log.
Downloaded latest 7.0.1-18321 Reinstalled and manually (meny,settings,advanced....)uploaded the Jive. Now tried at least 30-40 times tonight to replicate the bug 7376. This was something I had major problems with before. It seem to be working now. I will try even more during weekend. GREAT Running XP SP2, Comodo Firewall, AVG Antivirus + spyware, and an Speedtouch 100mbit+Router
I had problems initially this morning after installing the update on the controller. After another forced reboot of the controller because of the FAT filesystem panic when trying to log, it's working much much better. I'm not sure if the logs told you guys anything or maybe they were legit wireless network issues, I don't know. I have been completely unable to reproduce the problem tonight. I've since switched my network back to WPA2 and it's working well in WPA2 mode as well. Unless you saw something in the logs I sent you, I have no issue with you closing this again. If it happens again I'll catch more logs for you and post here. Thanks, Jeff
Thanks Jeff. I have just has a quick look at your logs and nothing obvious stands out as a connection problem. If you do get problems again in addition to the jive logs a wireshark capture for tcp port 9000 on the server would probably be most useful to.
Verified fixed in 7.0.1 - 19325 r2409 - r23
This bug has recently been fixed in the latest release of SqueezeCenter 7.0.1 Please try that version, if you still see the error, then reopen this bug. To download this version, please navigate to: http://www.slimdevices.com/su_downloads.html