Bug 323 – TCP streams randomly break

Bug 323 - TCP streams randomly break

Summary:

TCP streams randomly break

Status:	CLOSED FIXED

Product:	SB 1
Classification:	Unclassified
Component:	Firmware
Version:	31
Platform:	PC All

Importance:	P2 major (vote)
Target Milestone:	8.0.0
Assigned To:	Sean Adams

URL:	http://adamspiers.org/upload/Squeezeb...
Keywords:

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2004-05-23 03:53 UTC by Adam Spiers
Modified:	2009-01-14 10:13 UTC (History)
CC List:	0 users

See Also:
Category:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Adam Spiers 2004-05-23 03:53:10 UTC

The symptoms are that now and again the TCP stream will cut out for
anything between a few seconds and 30 minutes or so.  During this
period the player displays the standard error about being unable to
contact the server, even though the squeezebox is still pingable fine
from the server.  Sometimes the player recovers, sometimes not.  When
it does, it can recover within seconds or can take half an hour.

The SlimServer runs on a Red Hat Linux machine connected to the
wireless network via a D-Link DWL-2000AP+ in client mode.  The
access point is a Netgear DG834G.  The Squeezebox shows a rock steady
92% signal strength.

N.B. This is almost certainly NOT a wireless-related issue.  The wireless
network works perfectly for everything else.  Please see the analysis of
the network traffic below:

http://adamspiers.org/upload/Squeezebox-TCP-bug/

which shows a session where the music plays great for over 700 seconds before
the stream falters with TCP retransmissions, then breaks and ICMP Port
Unreachables are seen originating from the squeezebox.  Towards the end of the
session I hit 'play' on the remote and it started up ok again.

This problem occurs with all releases/firmware I have tried.  Currently on
2004/05/15 nightly and firmware 23 (please note that your Bugzilla setup only
goes up to version 22!)

I have lots of experience with both Perl and networking and am very keen to
work with you guys to debug this.  Thanks!

Comment 1 Adam Spiers 2004-05-23 08:54:12 UTC

I've just upgraded to slimserver-2004_05_23-1.noarch.rpm and am now seeing
multiple duplicate ACKs from the squeezebox for most packets coming from the
server!  This is in addition to the TCP retransmissions and ICMP Port
Unreachables which were seen before.  Something very odd is surely going on...
New packet dump uploaded:

http://adamspiers.org/upload/Squeezebox-TCP-bug/dup-ACKs.pcap

Comment 2 Adam Spiers 2004-05-23 09:13:10 UTC

Yet more progress ...  I was playing around with the debugging settings,
and as luck would have it, only seconds after I enabled d_slimproto, the
stream cut out again (it's actually happening every few minutes at the
moment).

http://adamspiers.org/upload/Squeezebox-TCP-bug/slimserver.log.cut-out

If I'm interpreting the debug log correctly, the buffers suddenly start
emptying despite a 92% signal strength, and then a short while later 
the player seems to give up on the current stream and re-establish a new
socket pair (see "hello from existing client" in the log).  This seems to
agree with the attached graph and first packet capture, which shows that
the high (unprivileged) TCP port on the Squeezebox gets incremented by
one every time there's a glitch.

Comment 3 Adam Spiers 2004-05-30 15:27:07 UTC

Didn't mention this before - I get *lots* of both of these types of error in
/tmp/slimserver.log:

2004-05-23 20:45:51.3427 Metadata byte not read, trying again: Resource
temporarily unavailable
2004-05-23 20:45:51.3430 Metadata byte not read, trying again: Resource
temporarily unavailable
2004-05-23 20:45:51.3433 Metadata byte not read, trying again: Resource
temporarily unavailable
2004-05-23 20:45:51.3435 Metadata byte not read, trying again: Resource
temporarily unavailable
2004-05-23 20:45:51.3438 Metadata byte not read, trying again: Resource
temporarily unavailable


getpeername() on closed socket GEN1538 at
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206.
getpeername() on closed socket GEN1539 at
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206.
getpeername() on closed socket GEN1556 at
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206.
getpeername() on closed socket GEN1557 at
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206.
getpeername() on closed socket GEN1568 at
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206.
getpeername() on closed socket GEN1577 at
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206.
getpeername() on closed socket GEN1580 at
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/IO/Socket.pm line 206.

Comment 4 Simon Spooner 2004-06-10 14:18:20 UTC

I also have experienced the same problems.
I see massive DUP TCP ACK, increasing the TCP window size each time.
Once that has settled, I see loads of TCP retransmissions.

My audio drops out, the s'box loses coms and displays the standard error messages.
Also, during playback of long audio files I get box crashses and reboots.
I cannot play AAC unprotected at all for more than a few seconds, audio repeatedly drops out.

The log for my ethereal capture (which shows a crash) can be found at :

www.findsimon.com/slimcrash

Comment 5 Sean Adams 2004-06-11 11:01:28 UTC

Adam,

I am looking at stopped.pcap (the data connection on port 9000) and the thing that immediately struck
me is that the window size from the client is always 16000, but the server is not sending full-size
packets. This leads me to believe that what we're seeing in ethereal is not really what's on the wire, and
that there may be flow control and/or re-packetizing going on in the ethernet card. It reminds me of
similar things I saw when trying to track down the problem with "TCP segmentation offloading" on
Windows XP.

Also interesting is that whereas the data connection (port 9000) "stalls out" at 694 seconds, it seems to
resume again briefly at 722 seconds even though the control connection has been dropped and
subsequently reestablished during this time period. So it doesn't actually die until after the control
connection is reestablished (which is expected actually - the server doesn't try to keep streaming in this
case).

So aside from the weird window size while streaming, what it looks like is just that the network "went
dead" completely for a little while.... the control channel times out after 10 sec and then it gets
reestablished.

I would suggest the first step here is to test over ethernet instead of wireless, to determine if there's an
intermittent outage in the signal that is causing the squeezebox to time out on the control connection.

As a separate thing I would see if you have any way to do a capture from another machine using a hub
off of this server. It's not clear if there really is a flow control problem but the server sending small
(<MSS) segments against a large (16KB) window leads me to believe that this trace isn't telling the
whole story.

Comment 6 Adam Spiers 2004-06-14 16:19:41 UTC

OK I'll give that a try.  Further to your comment that it's strange that the
server only sends small packets, the largest I can see it sending is 1500 bytes
(not including the ethernet headers) which is what the MTU of eth0 on the server
is set to.  So surely that's not surprising?

One thing which does concern me however is some dubious settings I have on my
DLink AP (which is acting as an ethernet/wirless bridge):

RTS threshold set to 4096
Fragmentation set to 4096

I now quote from the AP's online help:

RTS Threshold
"Request to Send" threshold is measured in bytes. This value should remain at
its default setting of 2432 bytes. Any packet size exceeding 2432 bytes will
trigger the use of RTS/CTS (Request to Send/Clear to Send) mechanism for
wireless transmission. If you encounter inconsistent data flow, only minor
modifications to the value range between 256 and 2432 are recommended.
 
Fragmentation
This value should remain at its default setting of 2346 bytes. The fragmentation
threshold determines whether packets will be fragmented before wireless
transmission. If you experience a high packet error rate, you may slightly
increase your "Fragmentation" value within the value range of 256 to 2346.
Setting the Fragmentation value too low may result in poor performance.

I don't know if the RTS threshold being too high could be a problem.  I doubt
the fragmentation one is, given previous MTU comments.  Anyway, I'll test with
more sensible settings.

I'll also give firmware 24 a go, and try hooking the Squeezebox directly into
the router wirefully, and report back.

Comment 7 Adam Spiers 2004-06-17 15:28:14 UTC

I'm still seeing cutouts with firmware 25, albeit with quicker recovery in
between each (at first impressions, anyway).

http://adamspiers.org/upload/Squeezebox-TCP-bug/cutout-firmware25-cropped.pcap

I'll try out firmware 24 soon.

Comment 8 Blackketter Dean 2004-06-24 16:12:15 UTC

Hey Adam: did you have a chance to look at version 24?  Also, does this behavior happen when 
connected via ethernet or is it a wireless only problem?

Comment 9 Blackketter Dean 2004-06-24 16:38:23 UTC

Adam, one more thing... Please update to the latest nightly and see if it continues happening.  The 
most recent builds contain a fix in some memory timing that might explain the problem that you are 
seeing.

Comment 10 Adam Spiers 2004-06-25 04:00:57 UTC

I haven't tried wired, firmwares 24 or 26 yet.  Will try this
weekend and let you know.

I'm beginning to suspect that it's some strange interaction between
the player's TCP stack and my PC's wireless ethernet bridge (actually
a DWL-2000AP+ WAP in client mode).  But I am very keen to get to the
bottom of what precisely is going wrong with the stream.  Still haven't
heard any conjectures from anyone as to why I saw ICMP port unreachables
for instance, although I'm not sure they're occurring with the newer
firmware.

Comment 11 Adam Spiers 2004-08-08 06:52:40 UTC

It still happened regularly with firmwares up to 28, although at least the
stream always seems to recover after about 30 seconds.

I have just upgraded to 5.3.0b1 and firmware 31 and it still happens although on
first impressions maybe less regularly?  But I am still seeing the dup ACKs for
every packet!  Surely this could be related.  Please can you at least confirm or
deny that these dup ACKs imply a semi-broken TCP stack in the player firmware?

Comment 12 Adam Spiers 2004-08-08 07:00:50 UTC

Sample capture of dup ACKs from firmware 31 here:

http://www.adamspiers.org/upload/Squeezebox-TCP-bug/dup-ACKs-firmware-31.pcap

Comment 13 Blackketter Dean 2005-03-28 09:08:04 UTC

Sean: can you take a look at the pcap dump?

Comment 14 Sean Adams 2005-03-28 17:31:28 UTC

This is ancient. Adam, unless you're still seeing this we'll close it.

Comment 15 Blackketter Dean 2005-04-20 13:07:31 UTC

Marking as fixed.  Adam: if it's still an issue, please reopen.

Comment 16 James Richardson 2009-01-14 10:13:17 UTC

Closing fixed bugs