Bugzilla – Bug 6690
Asian UTF-8 tags not rendered on Squeezebox
Last modified: 2009-07-31 10:16:20 UTC
In the transition from 6.5.4 to 7.* nightly builds, my Asian tags are no longer rendered on the SB display -- just some garbled characters. Interestingly the Asian UTF-8 filenames are rendered correctly if I browse directly to the Music Folder. But any attempt to view the tags is garbled.
Add some tracing in Slim/Formats/FLAC.pm: [08-01-21 06:49:35.4464] Slim::Formats::FLAC::_decodeUTF8 (919) Warning: Before decode 周華健 [08-01-21 06:49:35.5786] main::__ANON__ (307) Warning: [06:49:35.5772] Wide character in print at /usr/lib/perl5/vendor_perl/5.8.8/Log/Log4perl/Appender/Screen.pm line 32. [08-01-21 06:49:35.5760] Slim::Formats::FLAC::_decodeUTF8 (925) Warning: After decode 周è This means the problem is either in Slim::Utils::Unicode::utf8decode or Slim::Utils::Unicode::utf8toLatin1. Some more tracing shows the problem to be Slim::Utils::Unicode::utf8decode: [08-01-21 07:02:38.1204] Slim::Formats::FLAC::_decodeUTF8 (919) Warning: Before decode 周華健 [08-01-21 07:02:38.1242] Slim::Formats::FLAC::_decodeUTF8 (921) Warning: Calling utf8decode [08-01-21 07:02:38.1569] Log::Log4perl::Appender::log (189) Warning: Wide character in print at /usr/lib/perl5/vendor_perl/5.8.8/Log/Log4perl/Appender/Screen.pm line 32. [08-01-21 07:02:38.1538] Slim::Formats::FLAC::_decodeUTF8 (927) Warning: After decode 周è
Earl: can you attach a file that exhibits this behavior?
(In reply to comment #2) > Earl: can you attach a file that exhibits this behavior? > Yes, I'll create some meta-data that is appropriate. Most likely I'll get some time to investigate a bit more tonight to narrow in on the culprit.
Created attachment 2700 [details] UTF8 FLAC tags showing Asian font
Created attachment 2701 [details] FLAC file containing UTF8 tags using Asian font
> Most likely I'll get some time to investigate a bit more tonight to narrow in > on the culprit. I don't understand why _decodeUTF8 in FLAC.pm doesn't set the Perl UTF8 flag to make life simpler for utf8decode in Unicode.pm. It tries: # Bail early if it's just ascii if (looks_like_ascii($string) || Encode::is_utf8($string)) { ... } but looks_like_ascii() fails because of the UTF8 encoding, and Encode::is_utf8() fails because the Perl UTF8 flag isn't set. In fact, if _decodeUTF8 sets the UTF8 flag, there is no need to call utf8decode at all! So perhaps this is sufficient: if ($] > 5.007) { $value = Encode::decode_utf8($value); # $value = Slim::Utils::Unicode::utf8decode($value, 'utf8'); }
> So perhaps this is sufficient: > > if ($] > 5.007) { > $value = Encode::decode_utf8($value); > # $value = > Slim::Utils::Unicode::utf8decode($value, 'utf8'); > } Yup, this seems to do the trick, though I don't know if this is the "right" way from your point of view. Patch is attached.
Created attachment 2702 [details] Patch to decode UTF8 directly
It's very likely not that simple, as many Windows users (most!) don't use utf8 yet. We have to go through some guessing. I'll take a closer look.
Could you add some more logging to the following lines to find out which encoding it finally assumes? (lines 460+)
(In reply to comment #10) > Could you add some more logging to the following lines to find out which > encoding it finally assumes? (lines 460+) Yes, I can do this. I've attached a FLAC file (see previous response) that you can use to reproduce the problem at your end.
(In reply to comment #10) > Could you add some more logging to the following lines to find out which > encoding it finally assumes? (lines 460+) Ok, so from the trace below it seems that encodingFromString() is confused. It in turn relies on looks_like_utf8() to yield the right answer. This in turn is quite simple: sub looks_like_utf8 { use bytes; return 1 if $_[0] =~ /^\xef\xbb\xbf/; return 1 if $_[0] =~ /($utf8_re_bits)/o; return 0; } So it's either got to start with "\xef\xbb\xbf" (which it doesn't), or contain UTF8 (which it also apparently doesn't)! In fact the string contains: % echo "周華健" | od -t x1 e5 91 a8 e8 8f af e5 81 a5 0a The last 0a is the newline. So why doesn't the regexp match? The regexp is constructed using: $utf8_re_bits = join "|", map { latin1toUTF8(chr($_)) } (127..255); Hmmm ... I suspect this is broken partly because latin1toUTF8() returns Perl UTF8 characters (wide) for codes 128..255 whereas I suspect the characters in the string are still narrow. The original 6.5.1 code didn't attempt to use the localised guessing, instead relying on Encode::Guess::guess_encoding(): if (looks_like_ascii($string)) { return $string; } my $orig = $string; if ($string && $] > 5.007 && !Encode::is_utf8($string)) { eval { my $icode = Encode::Guess::guess_encoding($string); ---------------------------------------------------------------------------------------------------------------------- [08-01-22 20:31:27.5589] Slim::Utils::Unicode::utf8decode_guess (463) Warning: Guess encoding charset from string 周華健 [08-01-22 20:31:27.5642] Slim::Utils::Unicode::utf8decode_guess (466) Warning: Guessed encoding charset is cp1252 [08-01-22 20:31:27.6561] Slim::Utils::Unicode::utf8decode_guess (471) Warning: Find encoding from charset Encode::XS=SCALAR(0xab79ff4) [08-01-22 20:31:27.6636] Log::Log4perl::Appender::log (189) Warning: Wide character in print at /usr/lib/perl5/vendor_perl/5.8.8/Log/Log4perl/Appender/Screen.pm line 32. [08-01-22 20:31:27.6603] Slim::Utils::Unicode::utf8decode_guess (481) Warning: Transformed string using 4 is 周è ---------------------------------------------------------------------------------------------------------------------- print STDERR "Guess encoding charset from string $string\n"; my $charset = encodingFromString($string); my $encoding = undef; print STDERR "Guessed encoding charset is $charset\n"; if ($charset && $charset ne 'raw') { $encoding = Encode::find_encoding($charset); print STDERR "Find encoding from charset $encoding\n"; } else { $encoding = Encode::Guess::guess_encoding($string); } if (ref $encoding) { my $transform = $encoding->decode($string, $FB_QUIET); print STDERR "Transformed string using $FB_QUIET is $transform\n"; return $encoding->decode($string, $FB_QUIET);
Thanks for all the feedback. I'm still trying to understand what's going on here. Here's another try: could you please install Encode::Detect::Detector and see whether this changes the behaviour? It's optionally used to determine the encoding - according to the comment in the code with better results.
(In reply to comment #13) > Thanks for all the feedback. I'm still trying to understand what's going on > here. > > Here's another try: could you please install Encode::Detect::Detector and see > whether this changes the behaviour? It's optionally used to determine the > encoding - according to the comment in the code with better results. > Is there a particular site or version you want me to use? For example: http://search.cpan.org/~jgmyers/Encode-Detect-1.00/Detect.pm
No, no special version. Take what's easiest to install, either using your package manager or CPAN directly.
(In reply to comment #13) > Thanks for all the feedback. I'm still trying to understand what's going on > here. > > Here's another try: could you please install Encode::Detect::Detector and see > whether this changes the behaviour? It's optionally used to determine the > encoding - according to the comment in the code with better results. Ok. I've tried two things. First, I followed up my hunch that $utf8_re_bits is misguided: # $utf8_re_bits = join "|", map { latin1toUTF8(chr($_)) } (127..255); $utf8_re_bits = join "|", map { chr($_) } (127..255); This regexp is used by looks_like_utf8() which is only called if Encode::is_utf8($string) returns false. Because Encode::is_utf8() returns false, we know the string *must* contain octets (narrow chars) rather than Perl wide characters. The original $utf8_re_bits explicitly makes wide characters, and I think that: wide_chr(128) != '\x7f' Changing the definition of $utf8_re_bits to just contain narrow characters: chr(128) == '\x7f' Having $utf8_re_bits match wide characters is pointless because Encode::is_utf8($string) will indicate a positive result before we even get a chance to try the regular expression! Now the characters are guessed correctly. Here is the output: ---------------------------------------------------------------------- [08-01-23 21:31:48.9325] Slim::Utils::Unicode::utf8decode_guess (464) Warning: Guess encoding charset from string 周華健 [08-01-23 21:31:48.9376] Slim::Utils::Unicode::encodingFromString (784) Warning: Matched by looks_like_utf8() [08-01-23 21:31:48.9414] Slim::Utils::Unicode::utf8decode_guess (467) Warning: Guessed encoding charset is utf8 [08-01-23 21:31:48.9452] Slim::Utils::Unicode::utf8decode_guess (472) Warning: Find encoding from charset Encode::utf8=HASH(0x9b63134) [08-01-23 21:31:48.9524] Log::Log4perl::Appender::log (189) Warning: Wide character in print at /usr/lib/perl5/vendor_perl/5.8.8/Log/Log4perl/Appender/Screen.pm line 32. [08-01-23 21:31:48.9492] Slim::Utils::Unicode::utf8decode_guess (482) Warning: Transformed string using 4 is 周華健 ---------------------------------------------------------------------- Second, I returned $utf8_re_bits to what I believe is the broken form (ie containing wide chars). I installed Encode::Detect as you suggested and instrumented to verify that it is indeed being invoked. Yes, it generates the correct result: ---------------------------------------------------------------------- [08-01-23 21:26:51.9529] Slim::Utils::Unicode::utf8decode_guess (463) Warning: Guess encoding charset from string 周華健 [08-01-23 21:26:51.9582] Slim::Utils::Unicode::encodingFromString (792) Warning: Encode::Detect::Detector returns UTF-8 [08-01-23 21:26:51.9614] Slim::Utils::Unicode::utf8decode_guess (466) Warning: Guessed encoding charset is utf-8 [08-01-23 21:26:51.9655] Slim::Utils::Unicode::utf8decode_guess (471) Warning: Find encoding from charset Encode::utf8=HASH(0x937fcac) [08-01-23 21:26:51.9719] Log::Log4perl::Appender::log (189) Warning: Wide character in print at /usr/lib/perl5/vendor_perl/5.8.8/Log/Log4perl/Appender/Screen.pm line 32. [08-01-23 21:26:51.9687] Slim::Utils::Unicode::utf8decode_guess (481) Warning: Transformed string using 4 is 周華健 ---------------------------------------------------------------------- My conclusion: 1. $utf8_re_bits is wrong in using latin1toUTF8() and should just use chr() otherwise looks_like_utf8() will always return false! 2. Encode::Detect::Detector is good to have as a backup.
Ok, I've tested your suggested change on OSX/Windows/CentOS4 based systems and didn't find any negative sideeffects yet. I'm still reluctant committing this for 7.0. Dan - do you see potential issues with this change? Index: /Users/mh/Documents/workspace/SC7.0/Slim/Utils/Unicode.pm =================================================================== --- /Users/mh/Documents/workspace/SC7.0/Slim/Utils/Unicode.pm (revision 16692) +++ /Users/mh/Documents/workspace/SC7.0/Slim/Utils/Unicode.pm (working copy) @@ -169,7 +169,7 @@ } # Create a regex for looks_like_utf8() - $utf8_re_bits = join "|", map { latin1toUTF8(chr($_)) } (127..255); + $utf8_re_bits = join "|", map { chr($_) } (127..255); $recomposeTable = { "o\x{30c}" => "\x{1d2}",
After looking some more at this code I'm not sure your conclusion is correct: "Because Encode::is_utf8() returns false, we know the string *must* contain octets (narrow chars) rather than Perl wide characters." That might be true for this case (utf8decode_guess()), but encodingFromString or looks_like_utf8 are called from other places, too. We can't rely on Encode::is_utf8 being false in all cases. You said Encode::Detect::Detector with a plain installation does work for you?
(In reply to comment #18) > After looking some more at this code I'm not sure your conclusion is correct: > > "Because Encode::is_utf8() returns false, we know the string *must* contain > octets (narrow chars) rather than Perl wide characters." > > That might be true for this case (utf8decode_guess()), but encodingFromString > or looks_like_utf8 are called from other places, too. We can't rely on > Encode::is_utf8 being false in all cases. > > You said Encode::Detect::Detector with a plain installation does work for you? > Yes Encode::Detect::Detector with a plain installation (yum installed on FC6) works. The current source makes it clear that Encode::Detect::Detector is optional, so unless you guys make it mandatory you shouldn't rely on its presence to fix this issue. Your comment about encodingFromString() and looks_like_utf8() being called from other places is true. Are these methods private to the module, or available to external clients? If looks_like_utf8() is not permitted to make assumptions about the text its handed then perhaps it is more prudent to include the use of Encode::is_utf8() there: sub looks_like_utf8() { return 1 if (Encode::is_utf8(...)); return 1 if (... $utf8_re_bits ) # Use narrow match here ... blah blah ... } Actually when using narrow characters, I think $utf8_re_bits is now overkill. A simple regexp with something like /[\x7f..\xff]/ is clearer and more concise.
Earl - Dan (who wrote that code) suggested not doing this change without thorough testing by QA. This won't happen before 7.0. Targetting for 7.0.1. Thanks for your understanding.
(In reply to comment #20) > Earl - Dan (who wrote that code) suggested not doing this change without > thorough testing by QA. This won't happen before 7.0. Targetting for 7.0.1. > Thanks for your understanding. Thanks for the update. I'm ok with that. I know the source of the problem and can work around it for my setup.
One more question: are you using a 64 bit distribution? How did you install SC on your box? I just noticed that we're including Encode::Detect::Detector with SC7 (in the CPAN folder). You should not have needed to install this manually. The only reason why this should have failed for you is either you use a non x86 platform or you installed form the *-nocpan.* archive.
(In reply to comment #22) > One more question: are you using a 64 bit distribution? How did you install SC > on your box? I just noticed that we're including Encode::Detect::Detector with > SC7 (in the CPAN folder). You should not have needed to install this manually. > > The only reason why this should have failed for you is either you use a non x86 > platform or you installed form the *-nocpan.* archive. Ahh ... interesting. No, I'm using a 32 bit installation. My latest installation is indeed squeezecenter-7.0-16473-noCPAN.tgz. You're right, Encode::Detect::Detector fixes the problem. I still think $utf8_re_bits = join "|", map { latin1toUTF8(chr($_)) } (127..255); is broken (or at least ineffective) because it effectively only matches [\x{7f}..\x{ff}]. Look at latin1toUTF8(): sub latin1toUTF8 { my $data = shift; if ($] > 5.007) { $data = eval { Encode::encode('utf8', $data, $FB_QUIET) } || $da ta; } else { $data =~ s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0 x3F)/eg; } return $data; } Each character code from \x80 to \xff is mapped to *two* octets. Now look at looks_like_utf8(): sub looks_like_utf8 { use bytes; return 1 if $_[0] =~ /^\xef\xbb\xbf/; return 1 if $_[0] =~ /($utf8_re_bits)/o; return 0; } The "use bytes" pragma says to ignore encoding, and just look at the raw octets. That means that $utf8_re_bits is looking for *pairs of octets* that encode \x{7f}..\x{ff}. I note the difference between \xff and \x{ff}: \x1B hex char (example: ESC) \x{263a} long hex char (example: Unicode SMILEY) I'm pretty sure now the reason that looks_like_utf8() fails is that it will not match all the *other* wide characters from \x{100}..\x{ffff} because the octet pairs it's looking for are just those in the range \x{7f}..\x{ff} (constructed via latin1toUTF8 using chr(0x7f) .. chr(0xff)). Fundamentally looks_like_utf8() is flawed because $utf8_re_bits() misses the vast majority of encoded characters.
we'll require Encode::Detect::Detector in 7.0.1
change 17825 in trunk - Encode::Detect::Detector is now a mandatory module, as it give us more reliable results in Unicode handling
Earl: please open a new bug if you find more instances of UTF-8 errors
This bug has recently been fixed in the latest release of SqueezeCenter 7.0.1 Please try that version, if you still see the error, then reopen this bug. To download this version, please navigate to: http://www.slimdevices.com/su_downloads.html
Reduce number of active targets for SC