Bugzilla – Bug 323
TCP streams randomly break
Last modified: 2009-01-14 10:13:17 UTC
The symptoms are that now and again the TCP stream will cut out for anything between a few seconds and 30 minutes or so. During this period the player displays the standard error about being unable to contact the server, even though the squeezebox is still pingable fine from the server. Sometimes the player recovers, sometimes not. When it does, it can recover within seconds or can take half an hour. The SlimServer runs on a Red Hat Linux machine connected to the wireless network via a D-Link DWL-2000AP+ in client mode. The access point is a Netgear DG834G. The Squeezebox shows a rock steady 92% signal strength. N.B. This is almost certainly NOT a wireless-related issue. The wireless network works perfectly for everything else. Please see the analysis of the network traffic below: http://adamspiers.org/upload/Squeezebox-TCP-bug/ which shows a session where the music plays great for over 700 seconds before the stream falters with TCP retransmissions, then breaks and ICMP Port Unreachables are seen originating from the squeezebox. Towards the end of the session I hit 'play' on the remote and it started up ok again. This problem occurs with all releases/firmware I have tried. Currently on 2004/05/15 nightly and firmware 23 (please note that your Bugzilla setup only goes up to version 22!) I have lots of experience with both Perl and networking and am very keen to work with you guys to debug this. Thanks!
I've just upgraded to slimserver-2004_05_23-1.noarch.rpm and am now seeing multiple duplicate ACKs from the squeezebox for most packets coming from the server! This is in addition to the TCP retransmissions and ICMP Port Unreachables which were seen before. Something very odd is surely going on... New packet dump uploaded: http://adamspiers.org/upload/Squeezebox-TCP-bug/dup-ACKs.pcap
Yet more progress ... I was playing around with the debugging settings, and as luck would have it, only seconds after I enabled d_slimproto, the stream cut out again (it's actually happening every few minutes at the moment). http://adamspiers.org/upload/Squeezebox-TCP-bug/slimserver.log.cut-out If I'm interpreting the debug log correctly, the buffers suddenly start emptying despite a 92% signal strength, and then a short while later the player seems to give up on the current stream and re-establish a new socket pair (see "hello from existing client" in the log). This seems to agree with the attached graph and first packet capture, which shows that the high (unprivileged) TCP port on the Squeezebox gets incremented by one every time there's a glitch.
Didn't mention this before - I get *lots* of both of these types of error in /tmp/slimserver.log: 2004-05-23 20:45:51.3427 Metadata byte not read, trying again: Resource temporarily unavailable 2004-05-23 20:45:51.3430 Metadata byte not read, trying again: Resource temporarily unavailable 2004-05-23 20:45:51.3433 Metadata byte not read, trying again: Resource temporarily unavailable 2004-05-23 20:45:51.3435 Metadata byte not read, trying again: Resource temporarily unavailable 2004-05-23 20:45:51.3438 Metadata byte not read, trying again: Resource temporarily unavailable getpeername() on closed socket GEN1538 at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206. getpeername() on closed socket GEN1539 at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206. getpeername() on closed socket GEN1556 at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206. getpeername() on closed socket GEN1557 at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206. getpeername() on closed socket GEN1568 at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206. getpeername() on closed socket GEN1577 at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206. getpeername() on closed socket GEN1580 at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206.
I also have experienced the same problems. I see massive DUP TCP ACK, increasing the TCP window size each time. Once that has settled, I see loads of TCP retransmissions. My audio drops out, the s'box loses coms and displays the standard error messages. Also, during playback of long audio files I get box crashses and reboots. I cannot play AAC unprotected at all for more than a few seconds, audio repeatedly drops out. The log for my ethereal capture (which shows a crash) can be found at : www.findsimon.com/slimcrash
Adam, I am looking at stopped.pcap (the data connection on port 9000) and the thing that immediately struck me is that the window size from the client is always 16000, but the server is not sending full-size packets. This leads me to believe that what we're seeing in ethereal is not really what's on the wire, and that there may be flow control and/or re-packetizing going on in the ethernet card. It reminds me of similar things I saw when trying to track down the problem with "TCP segmentation offloading" on Windows XP. Also interesting is that whereas the data connection (port 9000) "stalls out" at 694 seconds, it seems to resume again briefly at 722 seconds even though the control connection has been dropped and subsequently reestablished during this time period. So it doesn't actually die until after the control connection is reestablished (which is expected actually - the server doesn't try to keep streaming in this case). So aside from the weird window size while streaming, what it looks like is just that the network "went dead" completely for a little while.... the control channel times out after 10 sec and then it gets reestablished. I would suggest the first step here is to test over ethernet instead of wireless, to determine if there's an intermittent outage in the signal that is causing the squeezebox to time out on the control connection. As a separate thing I would see if you have any way to do a capture from another machine using a hub off of this server. It's not clear if there really is a flow control problem but the server sending small (<MSS) segments against a large (16KB) window leads me to believe that this trace isn't telling the whole story.
OK I'll give that a try. Further to your comment that it's strange that the server only sends small packets, the largest I can see it sending is 1500 bytes (not including the ethernet headers) which is what the MTU of eth0 on the server is set to. So surely that's not surprising? One thing which does concern me however is some dubious settings I have on my DLink AP (which is acting as an ethernet/wirless bridge): RTS threshold set to 4096 Fragmentation set to 4096 I now quote from the AP's online help: RTS Threshold "Request to Send" threshold is measured in bytes. This value should remain at its default setting of 2432 bytes. Any packet size exceeding 2432 bytes will trigger the use of RTS/CTS (Request to Send/Clear to Send) mechanism for wireless transmission. If you encounter inconsistent data flow, only minor modifications to the value range between 256 and 2432 are recommended. Fragmentation This value should remain at its default setting of 2346 bytes. The fragmentation threshold determines whether packets will be fragmented before wireless transmission. If you experience a high packet error rate, you may slightly increase your "Fragmentation" value within the value range of 256 to 2346. Setting the Fragmentation value too low may result in poor performance. I don't know if the RTS threshold being too high could be a problem. I doubt the fragmentation one is, given previous MTU comments. Anyway, I'll test with more sensible settings. I'll also give firmware 24 a go, and try hooking the Squeezebox directly into the router wirefully, and report back.
I'm still seeing cutouts with firmware 25, albeit with quicker recovery in between each (at first impressions, anyway). http://adamspiers.org/upload/Squeezebox-TCP-bug/cutout-firmware25-cropped.pcap I'll try out firmware 24 soon.
Hey Adam: did you have a chance to look at version 24? Also, does this behavior happen when connected via ethernet or is it a wireless only problem?
Adam, one more thing... Please update to the latest nightly and see if it continues happening. The most recent builds contain a fix in some memory timing that might explain the problem that you are seeing.
I haven't tried wired, firmwares 24 or 26 yet. Will try this weekend and let you know. I'm beginning to suspect that it's some strange interaction between the player's TCP stack and my PC's wireless ethernet bridge (actually a DWL-2000AP+ WAP in client mode). But I am very keen to get to the bottom of what precisely is going wrong with the stream. Still haven't heard any conjectures from anyone as to why I saw ICMP port unreachables for instance, although I'm not sure they're occurring with the newer firmware.
It still happened regularly with firmwares up to 28, although at least the stream always seems to recover after about 30 seconds. I have just upgraded to 5.3.0b1 and firmware 31 and it still happens although on first impressions maybe less regularly? But I am still seeing the dup ACKs for every packet! Surely this could be related. Please can you at least confirm or deny that these dup ACKs imply a semi-broken TCP stack in the player firmware?
Sample capture of dup ACKs from firmware 31 here: http://www.adamspiers.org/upload/Squeezebox-TCP-bug/dup-ACKs-firmware-31.pcap
Sean: can you take a look at the pcap dump?
This is ancient. Adam, unless you're still seeing this we'll close it.
Marking as fixed. Adam: if it's still an issue, please reopen.
Closing fixed bugs