Bugzilla – Bug 15846
USB drive disconnects during stress testing
Last modified: 2019-01-25 10:11:41 UTC
In 3 instances with different versions of Touch firmware (r8622, 8627) on two different Touch setups, I've experienced this after running tests for about 24 hours. Here are the two setups:

1: Fab4 with USB drive, playing FLAC tracks continuously. Not a server for other players.
2: Fab4 with USB-IDE converter drive, playing MP3 tracks continuously. Serving MP3 tracks to 5 other players (4 IP3K, one Baby).

Not sure whether rebooting is needed, since I always get a firmware update prompt. Also not sure if I have to restart TinySC, since a firmware update always starts a rescan.

Here's the log info. By the time I see it, the log just has this stuff repeating. Will try turning on more log switches, but not sure.

Mar 5 10:56:59 squeezeplay: stack traceback:
Mar 5 10:56:59 squeezeplay: /usr/share/jive/jive/net/SocketHttp.lua:388: in function 'pump'
Mar 5 10:56:59 squeezeplay: /usr/share/jive/jive/net/SocketTcp.lua:200: in function 'writePump'
Mar 5 10:56:59 squeezeplay: /usr/share/jive/jive/net/Socket.lua:186: in function </usr/share/jive/jive/net/Socket.lua:184>
Mar 5 10:56:59 squeezeplay: INFO net.comet - Comet.lua:804 Comet {Mickeys Fab4 in Office (USB)}: _getEventSink error: connection refused
Mar 5 10:56:59 squeezeplay: INFO net.comet - Comet.lua:997 Comet {Mickeys Fab4 in Office (USB)}: handleAdvice state=CONNECTING
Mar 5 10:56:59 squeezeplay: INFO squeezebox.server - SlimServer.lua:714 disconnected Mickeys Fab4 in Office (USB) idleTimeoutTriggered: nil
Mar 5 10:56:59 squeezeplay: INFO applet.AlarmSnooze - AlarmSnoozeApplet.lua:310 notify_serverDisconnected: SlimServer {Mickeys Fab4 in Office (USB)} is now disconnected
Mar 5 10:56:59 squeezeplay: WARN applet.AlarmSnooze - AlarmSnoozeApplet.lua:323 notify_serverDisconnected: SlimServer {Mickeys Fab4 in Office (USB)} - disconnected, but no server alarm in progress : nil
Mar 5 10:56:59 squeezeplay: INFO net.comet - Comet.lua:1038 Comet {Mickeys Fab4 in Office (USB)}: advice is retry, connect in 2.679 seconds
Mar 5 10:57:01 squeezeplay: INFO net.slimproto - SlimProto.lua:599 connect to 172.19.120.150 (172.19.120.150)
Mar 5 10:57:01 squeezeplay: INFO net.slimproto - SlimProto.lua:773 connection error: closed, reconnecting in 1.297 seconds
Mar 5 10:57:02 squeezeplay: ERROR net.http - SocketHttp.lua:388 SocketHttp {Mickeys Fab4 in Office (USB)_Chunked}:t_sendRequest.pump: connection refused
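Since the log only shows the repeating reconnect loop by the time anyone looks at it, it may help to summarize a captured log by when the failures first appear and which modules report them. The following is a rough sketch in Python, not part of the firmware or of any existing tool. The function name `summarize` and the two regular expressions are assumptions based only on the line format seen in the log above.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log above:
#   "Mar 5 10:56:59 squeezeplay: INFO net.comet - Comet.lua:804 ..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+squeezeplay:\s+(?P<msg>.*)$"
)
EVENT_RE = re.compile(
    r"^(?P<level>DEBUG|INFO|WARN|ERROR)\s+(?P<module>\S+)\s+-\s+(?P<rest>.*)$"
)

# Substrings treated as connection failures (an assumption for this sketch).
FAILURE_MARKERS = ("connection refused", "connection error")


def summarize(lines):
    """Return (timestamp of first failure or None, Counter of failures per module)."""
    first_failure = None
    per_module = Counter()
    for line in lines:
        line_match = LINE_RE.match(line.strip())
        if not line_match:
            continue  # not a squeezeplay syslog line
        event = EVENT_RE.match(line_match.group("msg"))
        if not event:
            continue  # e.g. stack traceback lines carry no level/module
        if any(marker in event.group("rest") for marker in FAILURE_MARKERS):
            if first_failure is None:
                first_failure = line_match.group("ts")
            per_module[event.group("module")] += 1
    return first_failure, per_module
```

Feeding it the full log from the device (for example, `summarize(open(path))` with `path` pointing at a saved copy) would give the earliest failure timestamp, which could then be lined up against the kernel log to see whether a USB event preceded it.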
Can someone check whether the latest Linux bug fixes for the storage and USB drivers have been applied? Most likely they have not been updated in a year.
Switching to latest Fab4 units and not using PB1 or PB2 units.
I have enabled kernel logging and am running the test to reproduce the problem and capture the logs. Meanwhile, I will be looking into the driver and the latest changes in the kernel.
I have been running the test for more than 24 hours now and have not seen it happen. I looked at the kernel changes and there have been a lot of modifications. As soon as the bug happens, I should be able to understand its nature and fix it.
I think this bug should be targeted for post 7.5.0. It is too late to make drastic changes to the USB driver in the kernel. I think the risk outweighs the reward.
There have been a lot of changes in the kernel related to USB and SCSI support after 2.6.26. To find the right changes and apply them, I need to reproduce the bug and have the logs. Here are the links to the kernel changes:
http://www.kernel.org/
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/
My system is still running after almost three days. I will load and run some test tools that put a lot of load on the device, and I will check the performance and reliability of the system. This bug's target date should be postponed to after 7.5.0.
Here is the link to the dt test tool, which could be used to run stress tests on the device. I will run the test, and it should give us a good indication of the performance and reliability of the USB/SCSI connections.
http://www.scsifaq.org/RMiller_Tools/index.html
Here is an example of how I would run dt on my system. I believe we should use dt as part of our DVT process if we are planning to support external drives.

#dt of=/dev/sda1 capacity=124m iotype=random bs=512 pattern=iot aios=14 dlimit=1024 prefix='logitech' dtype=disk runtime=03d00h00m0s enable=aio,debug records=26000 oncerr=abort enable=aio,debug records=26000 oncerr=abort enable=noprog alarm=10s noprogt=20s

I think we should also run vmstat at the same time to monitor I/O, memory, and CPU load while this operation is in progress:

#vmstat 10

Here is a sample result of running dt:
"
Write Statistics:
    Total records processed: 26000 @ 512 bytes/record (0.500 Kbytes)
    Total bytes transferred: 13312000 (13000.000 Kbytes, 12.695 Mbytes)
    Average transfer rates: 48124 bytes/sec, 46.996 Kbytes/sec
    Number I/O's per second: 93.992
    Total passes completed: 0
    Total errors detected: 0/1
    Total elapsed time: 04m36.62s
    Total system time: 00m06.49s
    Total user time: 00m02.54s
    Starting time: Mon Mar 15 22:33:07 2010
    Ending time: Mon Mar 15 22:37:43 2010
dt: All requests completed before cancel...
dt: Closing file '/dev/sda1', fd = 3...
dt: Attempting to reopen file '/dev/sda1', open flags = 0 (0)...
dt: File '/dev/sda1' successfully reopened, fd = 3
dt: All requests completed before cancel...
Read Statistics:
    Total records processed: 26000 @ 512 bytes/record (0.500 Kbytes)
    Total bytes transferred: 13312000 (13000.000 Kbytes, 12.695 Mbytes)
    Average transfer rates: 363716 bytes/sec, 355.191 Kbytes/sec
    Number I/O's per second: 710.383
    Total passes completed: 1
    Total errors detected: 0/1
    Total elapsed time: 00m36.60s
    Total system time: 00m07.14s
    Total user time: 00m02.71s
    Starting time: Mon Mar 15 22:33:07 2010
    Ending time: Mon Mar 15 22:38:55 2010
"

Here is a sample result of running vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  3      0  12008  70388  11544    0    0   352   342 1416  5549  5 13  0 82
 0  2      0  10032  71820  11544    0    0   143   453 1244  4925  9 10  0 81
 0  2      0  10412  71960  11544    0    0    14   427 1105  4173  2  7  0 91
 1  2      0  10440  71960  11544    0    0     0   311  974  3968  2  5  0 93
 0  1      0  10440  71960  11544    0    0     0   232  916  3535  3 42  0 55
 1  1      0  67256  16584  11544    0    0  1532     0 2588 12807 53 40  0  8
 1  0      0  47696  36696  11544    0    0  2011     0 3130 16724 17 52  0 31
 1  0      0  27060  55848  11544    0    0  1915     0 2994 17108 22 53  0 25
 1  1      0  81040   2840  11544    0    0  1769     0 2847 17386 44 42  0 14
 0  2      0  73004  10804  11544    0    0   796   192 1782  7877  8 24  0 67
 1  3      0  70456  13344  11544    0    0   254   230 1192  4459  3 10  0 88
 1  3      0  67684  16580  11544    0    0   324   250 1300  5086  5 12  0 83
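If vmstat is run alongside dt for a multi-day test, nobody will read the output by eye, so a small script could flag the samples where the CPU is mostly waiting on I/O. The sketch below is only an illustration in Python. The helper names `parse_vmstat` and `high_iowait` and the 80% threshold are assumptions of mine, not part of vmstat or dt, and the column list assumes the 16-column layout shown in the sample above.

```python
# Column order assumed from the vmstat header in the sample above.
VMSTAT_COLUMNS = [
    "r", "b", "swpd", "free", "buff", "cache",
    "si", "so", "bi", "bo", "in", "cs",
    "us", "sy", "id", "wa",
]


def parse_vmstat(text):
    """Parse vmstat output into a list of dicts, skipping header lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly one integer per column; headers do not.
        if len(fields) != len(VMSTAT_COLUMNS):
            continue
        if not all(field.isdigit() for field in fields):
            continue
        rows.append(dict(zip(VMSTAT_COLUMNS, (int(f) for f in fields))))
    return rows


def high_iowait(rows, threshold=80):
    """Return indices of samples whose I/O wait ("wa") is at or above threshold.

    The default of 80 percent is an arbitrary value chosen for illustration.
    """
    return [i for i, row in enumerate(rows) if row["wa"] >= threshold]
```

With `#vmstat 10` each sample covers ten seconds, so the returned indices map to elapsed time as roughly `index * 10` seconds from the start of the run. In the sample above, the write phase of dt shows up as the run of samples with `wa` above 80.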
In my home setup, I am seeing my Western Digital Elements 500 GB drive disconnect under zero-stress conditions. The drive was connected with no activity for most of the day. Three times, I had to power off the SB, plug the drive into a PC, delete the Squeezebox folder, power up the SB Touch, plug in the drive, and wait for the server to re-find all the files. Sometimes when this happens, the SB seems to find twice the number of songs that I have on the drive. I cannot reliably reproduce this. I have an SB Boom and an SB Radio hooked to the server, but I have seen this issue happen when nothing else is attached. As far as I know, I am using the latest FW version.
Vahid is no longer working for us.
TinySC will not be updated any more.