Bugzilla – Bug 11766
Firmware Update is not getting downloaded properly
Last modified: 2009-10-05 14:36:20 UTC
Windows XP MCE SP3. Many versions of SC 7.4. Probably related to bug 10745 but this is about fab4 so NOT adding on to that one for security reasons. The download of the firmware for the fab4 isn't always working correctly. There have been several cases, unfortunately not reproducible on demand, where SC 7.4 downloads the new fab4 version file but doesn't download the actual firmware file. I still see the prior version firmware file. The unit never gets an update message and SC never tries to get the firmware again so it is skipped completely. Stopping/starting SC does NOT have any effect - the firmware is still not downloaded. The latest occurrence of this is for the r5625 firmware and with SC 25912. WORKAROUND: Manually delete the fab4 version file from the cache folder and restart SC. A new version file will be downloaded and the firmware download will be started again. Also, waiting until the next firmware update is released will often have SC download the newest version and firmware. This seems rather serious since a firmware upgrade is required for proper fab4 operation. I'd say that an additional check by SC (perhaps whenever a unit connects?) to ensure that there is actually a firmware file of the correct version, and then restarting the download if there isn't, would take care of this case - even if a root cause wasn't identified.
michael: can you take a look?
James - as you were able to reproduce bug 10745, can you reproduce this one too? Adrian - you fixed above bug 10745. What did you have to change back then? Doug - could you please follow some of the debugging steps as described in bug 10745 and get us some log files?
Bug 10949 may be related to this one. I'll investigate and try to reproduce on all OS versions. In my testing I used OSX only; hope it's not OS specific.
Just to be clear - this seems to happen only intermittently. I have not been able to determine a pattern. I have tracked it working for 3 times in a row and then I'm surprised that it skips one - usually when I see people commenting on the newest version on the forum that I don't have yet. I'm using a wired connection from the SBt to the SC server (Windows XP MCE SP3) and comcast as my internet provider. The service has been stable/up with no outages that I've caught/caused for weeks now. I generally update SC weekly (Saturday morning, US central time) so it is usually running for about a week at a time without restart. 10745 and 10949 both seem more about getting the firmware from SC to the unit. This is about SC getting the firmware from the internet to SC's cache folder. I'm not sure what extra logging I should turn on and look for the results of with this issue. Can you also give me some text I can search for in the SC log file to see if anything is there when it is attempting to pull down the update? I've never seen it have an initial failure on a startup check, only a periodic check (might be because of a small sample size, though). Of course, once it has the new version file and no actual firmware file it will not repair itself until the next firmware is released even if SC is restarted. Once I manually downloaded the firmware file and put it in the cache folder. I then reconnected the fab and it prompted to upgrade so I do think that that part is OK. Could it be a timing thing where the version file is put out on the website before the build has been deployed and that is when SC is checking for it so it only gets the one piece??? Just making completely uninformed guesses here, I've no idea exactly how that update process is setup. Also, I was wondering if a fall back plan would be enough insurance if the true root cause isn't identified: just have SC compare the version file info against an actual local copy of the firmware file (i.e. just make sure that there is one) and, if it isn't there for some reason, go and get it.
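The fall-back check proposed above could look roughly like the following. This is only a sketch in Python, not SC's actual code; the function name and the file names passed to it are hypothetical, and it only tests whether a non-empty firmware file sits next to the cached version file.

```python
from pathlib import Path


def firmware_missing(cache_dir, version_file_name, firmware_file_name):
    """Return True when a version file is cached but its firmware file is not.

    All three arguments are hypothetical; the real cache layout and file
    naming used by SC are not described in this report.
    """
    cache = Path(cache_dir)
    has_version = (cache / version_file_name).is_file()
    firmware = cache / firmware_file_name
    # Treat a zero-length file as missing, since a failed download may leave one.
    has_firmware = firmware.is_file() and firmware.stat().st_size > 0
    return has_version and not has_firmware
```

A caller would run this when a player connects or on each periodic check and restart the firmware download whenever it returns True.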
Please could you try to reproduce with player.firmware debugging on. It's not easy to speculate what is causing the problem you see without knowing the exact state, which will be seen from the debugs. Michael - can't see the fix for 10745 impacting this - the last fix was only for controllers and made sure that the server defaulted no machine name to jive so that old controllers could be upgraded. That's not relevant to this case.
Well, I guess that I did manage to get it to happen. I started SC then deleted the fab version and firmware files. More than 12 hours later I checked it and I have a fab version file, but no firmware file... I will upload the printscreen of the cache directory and the log file.
Created attachment 5136 [details] Log file Looks like there may be more than 1 problem in the log. Please let me know if I should attempt to duplicate and/or file other bugs.
Created attachment 5137 [details] print screen of cache folder contents
Created attachment 5138 [details] contents of fab version file
All of these logs seem to suggest the downloads are failing due to timeouts or network connectivity issues. It's possible that this is caused by the download server rate limiting your connection. Difficult to say from this but the server is giving up as it can't download the file in most cases. Matt, Brandon - Andy suggested I cc you - can we remove or make the rate limiting for the download server less aggressive? Do you produce stats on how many users are rate limited and are these per file - if so can we check whether beta access for fab4 is getting rate limited?
The rate-limiting we have in place on SN is only for http connections on 80/443 (as opposed to other traffic on 9000/3483), and rate-limits SYNs from a given source IP as opposed to actual "files transferred" or bytes or anything of the sort. The ratelimit is currently set at one connection every 5 seconds on average, with a burst of 120 requests. We originally put this in place because of abusive clients spamming repeated download requests on new connections (most likely, SC itself, although I don't think we ever tracked down the root cause). 12 requests per minute should be more than enough for normal usage, I think, and the 120 burst can handle any periodic burst of activity. Currently somewhere in the vicinity of 10% of all connection requests to SN for http service are rejected by the ratelimiting (which is higher than we'd like, but afaik that's just due to the original unsolved SC bug). If we can get the IP address of the failing client fairly quickly, we can check the ratelimit tables at the time to see if that IP address is being limited or not.
10% sounds worryingly high... I hope fab4 can cope with this if the assumption is that it will download new firmware during the out of the box installation... Could you divert the ones being rejected to allow you to log the url they are requesting?
(In reply to comment #12) > 10% sounds worryingly high... I hope fab4 can cope with this if the assumption > is that it will download new firmware during the out of the box installation... > > Could you divert the ones being rejected to allow you to log the url they are > requesting? Another parameter I didn't point out above is that aside from the 120-connection burst allowed, the ratelimiter forgets clients that haven't sent a request in 30 seconds. We took logs of this before we implemented the rate-limiting. You have to remember that this isn't 10% of all legitimate requests. The broken SC's out there (or whatever it is) spam at a rate of several independent requests per second, all on new connections. Legitimate SC's query at a much lower rate, and real web browsers tend to reuse the same connection for several requests. We're not denying 10% of legitimate users, we're denying 10% of incoming connection requests over the long term, the majority of which come in giant high-speed bursts from a small handful of users with very abusive patterns (which we believe to be broken SC's). We can log the IPs that get denied relatively easily, but logging the requests automatically (as opposed to manually with a sniffer) may be more difficult, I'll have to look into it. The easiest way to prove or disprove anything about this would be to reproduce the problem and have Andy or I check the table of ratelimited IP addresses in realtime to see if that IP had been limited or not.
Well I can't reproduce the specific problem seen by Doug, but I do get rate limited sometimes when stopping and starting the server a lot for development. I suspect several plugin authors will do this too - if there is a jive/fab4 on the lan then stopping and restarting the server will cause it to fetch the version files. However I am not sure what is happening for Doug - perhaps he can retry and if he sees the same error can post his IP address.
I may be confused anyways, the ratelimiting I was discussing is on SN itself. These firmware requests in the log actually go to update.squeezenetwork.com, which is just an alias for downloads.slimdevices.com (not hosted on SN). Matt would know more, but I believe we also implemented similar (but different) ratelimiting there as well. There seems to be some mismatch between version numbers served by fab4.squeezenetwork.com and the files available at http://update.squeezenetwork.com/update/firmware/7.4/ right now anyways, I'm not sure if that's related to this or not.
Yep - this conversation is only relevant to update. It's broken for fab4 at present as the sha file gives 404. This is actually bad as SC will download the entire bin file and then ditch it when it can't get the sha1 file - so this creates its own DoS mechanism.
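One way to avoid the wasted transfer described above would be to request the small checksum file first and only fetch the bin file once the checksum is in hand. The sketch below is in Python and is not how SC is implemented; the `fetch` callable, the `.sha` suffix and the URL layout are all assumptions made for illustration.

```python
import hashlib


def download_firmware(fetch, base_url, name):
    """Fetch the checksum first, then the firmware, and verify the SHA-1.

    `fetch(url)` is a hypothetical callable returning the body as bytes,
    or None on any failure (for example a 404). The ".sha" suffix is an
    assumption, not a confirmed naming scheme.
    """
    checksum_body = fetch(f"{base_url}/{name}.sha")
    if checksum_body is None:
        return None  # no checksum available: skip the large download entirely
    parts = checksum_body.decode("ascii", "replace").split()
    if not parts:
        return None  # empty checksum file
    data = fetch(f"{base_url}/{name}")
    if data is None:
        return None
    # Reject the file if its digest does not match the published checksum.
    if hashlib.sha1(data).hexdigest() != parts[0].lower():
        return None
    return data
```

With this ordering a missing checksum costs one small request instead of a full firmware transfer.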
So it looks like I am being rate limited at present - could you check the logs for 86.139.208.114? I think we should change SC so that it avoids fetching the version file if it is less than 30 mins old (or some value); this would avoid the update server being hit by developers each time they restart the server. However we should agree on this value as I would like it to mean that we don't rate limit in this case - i.e. update requests every 30 mins for each type of player should pass the rate limiter.
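The age check suggested here can be sketched as follows. This is an illustration in Python rather than the change that went into SC; the function name is hypothetical and the 30 minute threshold is the value floated in the comment, not an agreed setting.

```python
import os
import time

# Threshold floated in the comment above; the agreed value may differ.
MIN_RECHECK_SECONDS = 30 * 60


def should_fetch_version_file(path, now=None, min_age=MIN_RECHECK_SECONDS):
    """Return True if the cached version file is absent or at least min_age old."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return True  # no cached copy yet, so a fetch is needed
    return age >= min_age
```

Because the decision depends only on the file's modification time, restarting the server repeatedly within the window would not trigger new requests.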
Updated SC to avoid hammering the update server if frequently restarted. However I do have to point out that selling a device which requires an update when it starts up, and then having rate limiting on the update server which will refuse to serve the file so that the update process essentially locks up, is not a smart thing to do! If any of the SC users who are rate limited buy a fab4 then they will be very frustrated!
Agreed, we're working really hard to make sure that the firmware can get to that first (and subsequent) updates reliably. We need to make sure that the whole upgrade system is extremely reliable end-to-end. Matthew and Brandon: Can you review the download back-end to make sure that we don't have a problem here with rate limiting, etc?
Just another data point: I was recently blocked from the server for several hours. Couldn't even pull up the Fab4 update URL in a web browser. The only things running at the time were SC 7.4 and the Fab4 sitting in the "could not connect screen" from trying to connect to mysqueezebox.com for the firmware update. Brandon mentions just a 30 minute block, so something must have been continuously hitting the server. Could it be that while Fab4 is sitting in that screen it's also continually trying to connect to the server, in spite of the one menu selection there that says "Try again"? If so, then it's going to just keep causing itself to be blocked indefinitely. I was able to connect the next day after leaving Fab4 unplugged overnight.
I think that it's about time that SN do the hosting for firmware updates. It makes sense in terms of reliability, performance and cost. The current restrictions on updates.slimdevices.com are not designed to slow down connections from new SC installations, but rather are in place to deal with old buggy SC releases that could get into situations where they were re-downloading firmwares 24x7. It protects both the customer as well as us from huge bandwidth bills. It's not perfect, but it's a tricky balance... Moving firmwares to SN servers allows us to spread the bandwidth usage out across the 3 datacenters globally. Additionally there's probably some usefulness to building a script that serves up these files rather than apache doing it directly -- we could be much smarter about bandwidth limiting heavy-users, as well as allowing beta products to be more aggressive about their firmware downloads.
I agree, SN should be serving up our downloads (including SC and firmware). I don't know about apache vs a custom server, we don't have "heavy" users, only regular users and broken ones. Brandon: can you work with Matt and Andy to move our regular downloads to SN?
Agree any form of rate limiting is a tricky balance, but we should aim for no rate limiting if it's part of the mainstream upgrade process which is critical to maintaining the out of box experience. If you do need rate limiting, then we need to make sure SC and SP act in sympathy with it - so it would be useful to review the client implementation against the rate limiting algorithm [currently SC will retry aggressively when it fails, so this was probably the reason beta testers had a problem]. I hope that is reduced now, but we may want to change the backoff algorithm in SC. SP is different - I don't think that has a retry/backoff algorithm at all?
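For reference, a capped exponential backoff is one common shape for the kind of retry policy being discussed. The Python sketch below is not the algorithm SC or SP uses; the function name, the 60 second base and the one hour cap are assumptions picked only to show the idea.

```python
def retry_delay(attempt, base=60.0, cap=3600.0):
    """Seconds to wait before retry number `attempt` (0 = first retry).

    The delay doubles after every failure and is capped. Both `base` and
    `cap` are illustrative values, not settings taken from SC.
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    return min(cap, base * (2 ** min(attempt, 30)))
```

With these values the waits run 60, 120, 240, 480 seconds and so on until they reach the cap, so a client that keeps failing spaces its new connections further and further apart.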
Firmware updates are now moved to SN. update.slimdevices.com, update.squeezenetwork.com, and update.mysqueezebox.com now all resolve to the 3 SN racks via GeoIP. This also changes the ratelimiting, as SN implements limiting a bit differently than the previous host (downloads.slimdevices.com). The ratelimits on SN have been bumped a bit to compensate, and are now set at 20/min with a burst of 200 (and again, with a timeout of 30s). These are implemented via linux's ipt_hashlimit module, which uses a Token Bucket Filter. In practical terms, this means the ratelimit allows a long-term average of one new TCP connection to our HTTP services (multiple requests on a single HTTP/1.1 connection don't count) every 3 seconds per source IP address. The burst limit means (assuming a clean starting slate) you can connect up to 200 times in a row as fast as you want initially before the ratelimit really starts taking effect. If you ever stop making new connections for a full 30 seconds, you will be forgotten (all ratelimit parameters reset). I don't expect anything but abusive (broken) clients or poor DoS attempts to break these limits, but we'll keep an eye on it.
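The behaviour described above can be modelled for a single source IP with a small token bucket. The Python sketch below is a simplification for reasoning about client behaviour, not the ipt_hashlimit implementation; the class name is hypothetical, the defaults come from the figures in this comment, and treating every attempt (allowed or denied) as activity for the 30 second timeout is an assumption.

```python
class TokenBucket:
    """Simplified per-IP model: 20 new connections/min, burst 200, 30 s expiry."""

    def __init__(self, rate_per_min=20, burst=200, expire_s=30.0):
        self.rate = rate_per_min / 60.0  # tokens regained per second
        self.burst = burst
        self.expire_s = expire_s
        self.tokens = float(burst)
        self.last = None  # time of the most recent connection attempt

    def allow(self, now):
        """Return True if a new connection at time `now` (seconds) is accepted."""
        if self.last is None or now - self.last >= self.expire_s:
            # Unknown or forgotten client: start again from a clean slate.
            self.tokens = float(self.burst)
        else:
            # Refill in proportion to elapsed time, never beyond the burst size.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now  # assumption: denied attempts also count as activity
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In this model a client that opens 200 connections at once is refused on the next one, regains roughly one connection every 3 seconds, and gets its full burst back after 30 seconds of silence.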
*** Bug 11848 has been marked as a duplicate of this bug. ***
Since this bug is still open, can I request that the MIME Content-type header for the .version files be changed to 'text/plain' on the new server? Right now it's sending an 'application/octet-stream' header, which is seen as a binary file. You used to be able to simply click on the .version file and view it in your web browser, which was handy.
(In reply to comment #26) > Since this bug is still open, can I request that the MIME Content-type header > for the .version files be changed to 'text/plain' on the new server? Right now > it's sending an 'application/octet-stream' header, which is seen as a binary > file. You used to be able to simply click on the .version file and view it in > your web browser, which was handy. Done, thanks for noticing.
This bug has been marked as fixed in the 7.4.0 release version of SqueezeBox Server! * SqueezeCenter: 28672 * Squeezebox 2 and 3: 130 * Transporter: 80 * Receiver: 65 * Boom: 50 * Controller: 7790 * Radio: 7790 Please see the Release Notes for all the details: http://wiki.slimdevices.com/index.php/Release_Notes If you haven't already, please download and install the new version from http://www.logitechsqueezebox.com/support/download-squeezebox-server.html If you are still experiencing this problem, feel free to reopen the bug with your new comments and we'll have another look.