Bugzilla – Bug 4423
Crash on full rescan during a disk defrag
Last modified: 2009-07-30 14:02:38 UTC
Got this error message while attempting to do a clear and rescan of the data. A disk defragmenter was running at the time; perhaps a file lock could not be acquired.

Carp::Clan::__ANON__(): Error executing 'UPDATE tracks SET thumb = ? WHERE ( id = ? )': DBD::mysql::st execute failed: Lock wait timeout exceeded; try restarting transaction at C:\Program Files\SlimServer\server\CPAN/DBIx/Class/Storage/DBI.pm line 771
Basically, the only thing I can think of that we might do about this bug is to trap the error message and present a simpler one to the end user. Maybe something like 'SlimServer timed out trying to scan [filename]. Please try again.' Is that what you had in mind, Alex?
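To illustrate the trap-and-translate idea, here is a minimal sketch (the resultset variable, helper name, and message wording are hypothetical, not actual SlimServer code):

    use strict;
    use warnings;

    # Hypothetical helper: trap the raw DBIx::Class error and show a
    # friendlier message instead of crashing the scan.
    sub update_thumb_friendly {
        my ($rs, $id, $thumb, $file) = @_;

        eval {
            $rs->search({ id => $id })->update({ thumb => $thumb });
        };

        if (my $err = $@) {
            if ($err =~ /Lock wait timeout exceeded/) {
                warn "SlimServer timed out trying to scan $file. Please try again.\n";
            }
            else {
                die $err;    # anything unexpected still propagates
            }
        }
    }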
Perhaps SS should retry a fixed/configurable number of times. I've always thought that computers should try to fix themselves before asking humans for help. :) (And the program should never crash, no matter what error occurs in the environment.) Is there a way for MySQL to hold an exclusive lock on the file if it needs one to function? And can an exclusive file lock be checked for when errors occur?
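And the retry idea, as a rough sketch (the $max_retries value and the helper are made up for illustration; a real implementation would read a preference):

    use strict;
    use warnings;
    use Time::HiRes qw(sleep);

    my $max_retries = 3;    # could come from a user-configurable preference

    sub update_with_retry {
        my ($rs, $id, $cols) = @_;

        for my $attempt (1 .. $max_retries) {
            eval { $rs->search({ id => $id })->update($cols) };
            return 1 unless $@;    # success, we're done

            # Only retry on the lock wait timeout; rethrow anything else.
            die $@ unless $@ =~ /Lock wait timeout exceeded/;
            sleep(0.5 * $attempt);    # back off a little before retrying
        }

        die "Gave up after $max_retries lock wait timeouts\n";
    }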
*** Bug 4439 has been marked as a duplicate of this bug. ***
Maybe the timeout just needs to be increased?
Brandon: Can you pitch in here? What's the timeout set to and can we increase it dramatically? We've seen this in another bug and it's not clear what the timeout buys us.
(In reply to comment #5)
> Brandon: Can you pitch in here? What's the timeout set to and can we increase
> it dramatically? We've seen this in another bug and it's not clear what the
> timeout buys us.

First off, the error message doesn't relate to OS-level file locks; it's an internal lock timeout within the MySQL server (meaning the UPDATE statement was waiting for a conflicting statement from another parallel MySQL client connection to complete). The two parallel connections to the MySQL server come from the main slimserver process and from scanner.pl (which is spawned off as a separate process to handle the library rescan). Basically, scanner.pl holds very long-lived locks (due to long-lived transactions) on the tables it operates on, and slimserver itself wants to update those same tables (in the dupe bug, for example, to set lastPlayed). It could probably happen randomly even on a lightly loaded Linux box, but the heavy I/O load of a defrag likely slows the scanner.pl transactions down enough to greatly increase the probability of hitting the timeout.

The default innodb_lock_wait_timeout is 50 seconds, which is what we're using. We could raise it by adding an explicitly higher setting in server/MySQL/my.tt (see the snippet below). That would reduce the probability of running into the issue and is the quickest fix. The downside is that it won't totally eliminate the issue, and, as a corollary to the good behaviors Alex described earlier in this ticket, it's better to crash than to have something hang seemingly indefinitely.

The better (but more involved) fix is to clean up the flow of scanner.pl (which will also require changes in various scanning-related modules, like Slim::Utils::Scanner, Slim::Music::Import, etc.) so that it operates in smaller discrete transactions; a rough sketch follows this comment. There are comments in scanner.pl about the bulk of the work deliberately not being wrapped in txn_do for just this purpose, but (at least on my initial reading of the code) that work effectively runs as one large bulk transaction anyway, since AutoCommit is off and the Slim::Schema->newTrack calls don't seem to be invoked with the 'commit' argument. The downside of smaller transactions is that a scan failing in the middle leaves the track database in an inconsistent state, but that can probably already happen (scanner.pl is already broken up into at least three discrete transactions), and it's easily fixed by running a full rescan again.

Another (perhaps orthogonal) way to shorten scanner.pl's transaction times would be to do all of the scanning work in memory first and commit the metadata to the DB afterwards, so the DB transaction isn't left open waiting on scan I/O to complete. That might not be viable, however, as large music libraries could produce more temporary results in memory than our minimum memory requirements allow.

AFAICS, the basics of this issue haven't changed from 6.5.1 (as in the report) through trunk.
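For reference, the quick fix would be a stanza like the following in the generated MySQL config (assuming my.tt passes [mysqld] directives straight through to my.cnf; 300 seconds is an arbitrary example value):

    [mysqld]
    innodb_lock_wait_timeout = 300

And the smaller-transactions idea as a very rough sketch (the scan_metadata helper, the batch size, and the newTrack call signature are all simplified or hypothetical; this is not actual scanner.pl code):

    # Collect metadata outside the DB first, then commit in small batches
    # via txn_do so table locks are only held briefly per batch.
    my @pending;

    for my $file (@files_to_scan) {
        push @pending, scan_metadata($file);    # hypothetical: file I/O only, no DB work

        if (@pending >= 50) {                   # arbitrary batch size
            Slim::Schema->txn_do(sub {
                Slim::Schema->newTrack($_) for splice(@pending);
            });
        }
    }

    # Commit whatever is left over.
    Slim::Schema->txn_do(sub {
        Slim::Schema->newTrack($_) for splice(@pending);
    }) if @pending;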
New scanner should take care of this issue.