Bug 6690 – Asian UTF-8 tags not rendered on Squeezebox

Bug 6690 - Asian UTF-8 tags not rendered on Squeezebox

Summary:

Asian UTF-8 tags not rendered on Squeezebox

Status:	CLOSED FIXED

Product:	Logitech Media Server
Classification:	Unclassified
Component:	Display
Version:	7.0
Platform:	PC Linux (other)

Importance:	P1 normal (vote)
Target Milestone:	7.x
Assigned To:	Michael Herger

URL:
Keywords:

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2008-01-20 21:50 UTC by quiet.dragon
Modified:	2009-07-31 10:16 UTC (History)
CC List:	0 users

See Also:
Category:	---

Attachments
UTF8 FLAC tags showing Asian font (40.55 KB, image/jpeg) 2008-01-21 19:00 UTC, quiet.dragon	Details
FLAC file containing UTF8 tags using Asian font (10.85 KB, application/octet-stream) 2008-01-21 19:00 UTC, quiet.dragon	Details
Patch to decode UTF8 directly (359 bytes, patch) 2008-01-21 19:52 UTC, quiet.dragon	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description quiet.dragon 2008-01-20 21:50:44 UTC

In the transition from 6.5.4 to 7.* nightly builds, my Asian tags are no longer rendered
on the SB display -- just some garbled characters.

Interestingly the Asian UTF-8 filenames are rendered correctly if I browse directly
to the Music Folder. But any attempt to view the tags is garbled.

Comment 1 quiet.dragon 2008-01-21 07:05:37 UTC

Add some tracing in Slim/Formats/FLAC.pm:

[08-01-21 06:49:35.4464] Slim::Formats::FLAC::_decodeUTF8 (919) Warning: Before decode 周華健
[08-01-21 06:49:35.5786] main::__ANON__ (307) Warning: [06:49:35.5772] Wide character in print at /usr/lib/perl5/vendor_perl/5.8.8/Log/Log4perl/Appender/Screen.pm line 32.
[08-01-21 06:49:35.5760] Slim::Formats::FLAC::_decodeUTF8 (925) Warning: After decode å‘¨è

This means the problem is either in Slim::Utils::Unicode::utf8decode or Slim::Utils::Unicode::utf8toLatin1.

Some more tracing shows the problem to be Slim::Utils::Unicode::utf8decode:

[08-01-21 07:02:38.1204] Slim::Formats::FLAC::_decodeUTF8 (919) Warning: Before decode 周華健
[08-01-21 07:02:38.1242] Slim::Formats::FLAC::_decodeUTF8 (921) Warning: Calling utf8decode
[08-01-21 07:02:38.1569] Log::Log4perl::Appender::log (189) Warning: Wide character in print at /usr/lib/perl5/vendor_perl/5.8.8/Log/Log4perl/Appender/Screen.pm line 32.
[08-01-21 07:02:38.1538] Slim::Formats::FLAC::_decodeUTF8 (927) Warning: After decode å‘¨è

Comment 2 Blackketter Dean 2008-01-21 09:48:16 UTC

Earl: can you attach a file that exhibits this behavior?

Comment 3 quiet.dragon 2008-01-21 09:58:55 UTC

(In reply to comment #2)
> Earl: can you attach a file that exhibits this behavior?
> 

Yes, I'll create some meta-data that is appropriate.

Most likely I'll get some time to investigate a bit more tonight to narrow in on the culprit.

Comment 4 quiet.dragon 2008-01-21 19:00:09 UTC

Created attachment 2700 [details]
UTF8 FLAC tags showing Asian font

Comment 5 quiet.dragon 2008-01-21 19:00:54 UTC

Created attachment 2701 [details]
FLAC file containing UTF8 tags using Asian font

Comment 6 quiet.dragon 2008-01-21 19:11:04 UTC

> Most likely I'll get some time to investigate a bit more tonight to narrow in
> on the culprit.

I don't understand why  _decodeUTF8 in FLAC.pm doesn't set the Perl UTF8 flag to
make life simpler for utf8decode in Unicode.pm.

It tries:

        # Bail early if it's just ascii
        if (looks_like_ascii($string) || Encode::is_utf8($string)) {
           ...
        }

but looks_like_ascii() fails because of the UTF8 encoding, and Encode::is_utf8() fails
because the Perl UTF8 flag isn't set.

In fact, if _decodeUTF8 sets the UTF8 flag, there is no need to call utf8decode at all!
So perhaps this is sufficient:

                        if ($] > 5.007) {
                                 $value = Encode::decode_utf8($value);
                        #        $value = Slim::Utils::Unicode::utf8decode($value, 'utf8');
                        }

Comment 7 quiet.dragon 2008-01-21 19:52:18 UTC

> So perhaps this is sufficient:
> 
>                         if ($] > 5.007) {
>                                  $value = Encode::decode_utf8($value);
>                         #        $value =
> Slim::Utils::Unicode::utf8decode($value, 'utf8');
>                         }

Yup, this seems to do the trick, though I don't know if this is the "right" way
from your point of view.

Patch is attached.

Comment 8 quiet.dragon 2008-01-21 19:52:52 UTC

Created attachment 2702 [details]
Patch to decode UTF8 directly

Comment 9 Michael Herger 2008-01-22 06:41:35 UTC

It's very likely not that simple, as many Windows users (most!) don't use utf8 yet. We have to go through some guessing. I'll take a closer look.

Comment 10 Michael Herger 2008-01-22 06:43:53 UTC

Could you add some more logging to the following lines to find out which encoding it finally assumes? (lines 460+)

Comment 11 quiet.dragon 2008-01-22 09:14:53 UTC

(In reply to comment #10)
> Could you add some more logging to the following lines to find out which
> encoding it finally assumes? (lines 460+)

Yes, I can do this.

I've attached a FLAC file (see previous response) that you can use to reproduce
the problem at your end.

Comment 12 quiet.dragon 2008-01-22 20:56:12 UTC

(In reply to comment #10)
> Could you add some more logging to the following lines to find out which
> encoding it finally assumes? (lines 460+)

Ok, so from the trace below it seems that encodingFromString() is confused. It in turn relies
on looks_like_utf8() to yield the right answer. This in turn is quite simple:

sub looks_like_utf8 {
        use bytes;

        return 1 if $_[0] =~ /^\xef\xbb\xbf/;
        return 1 if $_[0] =~ /($utf8_re_bits)/o;
        return 0;
}

So it's either got to start with "\xef\xbb\xbf" (which it doesn't), or contain UTF8 (which it also
apparently doesn't)! In fact the string contains:

            % echo "周華健" | od -t x1
            e5 91 a8 e8 8f af e5 81 a5 0a

The last 0a is the newline. So why doesn't the regexp match? The regexp is constructed using:

        $utf8_re_bits = join "|", map { latin1toUTF8(chr($_)) } (127..255);

Hmmm ... I suspect this is broken partly because latin1toUTF8() returns Perl UTF8 characters (wide)
for codes 128..255 whereas I suspect the characters in the string are still narrow.

The original 6.5.1 code didn't attempt to use the localised guessing, instead relying on
Encode::Guess::guess_encoding():

        if (looks_like_ascii($string)) {
                return $string;
        }

        my $orig = $string;

        if ($string && $] > 5.007 && !Encode::is_utf8($string)) {

                eval {

                        my $icode = Encode::Guess::guess_encoding($string);



----------------------------------------------------------------------------------------------------------------------

[08-01-22 20:31:27.5589] Slim::Utils::Unicode::utf8decode_guess (463) Warning: Guess encoding charset from string 周華健
[08-01-22 20:31:27.5642] Slim::Utils::Unicode::utf8decode_guess (466) Warning: Guessed encoding charset is cp1252
[08-01-22 20:31:27.6561] Slim::Utils::Unicode::utf8decode_guess (471) Warning: Find encoding from charset Encode::XS=SCALAR(0xab79ff4)
[08-01-22 20:31:27.6636] Log::Log4perl::Appender::log (189) Warning: Wide character in print at /usr/lib/perl5/vendor_perl/5.8.8/Log/Log4perl/Appender/Screen.pm line 32.
[08-01-22 20:31:27.6603] Slim::Utils::Unicode::utf8decode_guess (481) Warning: Transformed string using 4 is å‘¨è

----------------------------------------------------------------------------------------------------------------------

print STDERR "Guess encoding charset from string $string\n";
        my $charset  = encodingFromString($string);
        my $encoding = undef;
print STDERR "Guessed encoding charset is $charset\n";

        if ($charset && $charset ne 'raw') {

                $encoding = Encode::find_encoding($charset);
print STDERR "Find encoding from charset $encoding\n";

        } else {

                $encoding = Encode::Guess::guess_encoding($string);
        }

        if (ref $encoding) {

                my $transform = $encoding->decode($string, $FB_QUIET);
print STDERR "Transformed string using $FB_QUIET is $transform\n";
                return $encoding->decode($string, $FB_QUIET);

Comment 13 Michael Herger 2008-01-23 02:44:16 UTC

Thanks for all the feedback. I'm still trying to understand what's going on here. 

Here's another try: could you please install Encode::Detect::Detector and see whether this changes the behaviour? It's optionally used to determine the encoding - according to the comment in the code with better results.

Comment 14 quiet.dragon 2008-01-23 09:11:39 UTC

(In reply to comment #13)
> Thanks for all the feedback. I'm still trying to understand what's going on
> here. 
> 
> Here's another try: could you please install Encode::Detect::Detector and see
> whether this changes the behaviour? It's optionally used to determine the
> encoding - according to the comment in the code with better results.
> 

Is there a particular site or version you want me to use?

For example:

http://search.cpan.org/~jgmyers/Encode-Detect-1.00/Detect.pm

Comment 15 Michael Herger 2008-01-23 12:45:07 UTC

No, no special version. Take what's easiest to install, either using your package manager or CPAN directly.

Comment 16 quiet.dragon 2008-01-23 21:51:07 UTC

(In reply to comment #13)
> Thanks for all the feedback. I'm still trying to understand what's going on
> here. 
> 
> Here's another try: could you please install Encode::Detect::Detector and see
> whether this changes the behaviour? It's optionally used to determine the
> encoding - according to the comment in the code with better results.

Ok. I've tried two things.

First, I followed up my hunch that $utf8_re_bits is misguided:

        # $utf8_re_bits = join "|", map { latin1toUTF8(chr($_)) } (127..255);
        $utf8_re_bits = join "|", map { chr($_) } (127..255);

This regexp is used by looks_like_utf8() which is only called if Encode::is_utf8($string)
returns false. Because Encode::is_utf8() returns false, we know the string *must* contain
octets (narrow chars) rather than Perl wide characters.

The original $utf8_re_bits explicitly makes wide characters, and I think that:

        wide_chr(128) != '\x7f'

Changing the definition of $utf8_re_bits to just contain narrow characters:

        chr(128) == '\x7f'


Having $utf8_re_bits match wide characters is pointless because Encode::is_utf8($string) will
indicate a positive result before we even get a chance to try the regular expression!


Now the characters are guessed correctly. Here is the output:
----------------------------------------------------------------------
[08-01-23 21:31:48.9325] Slim::Utils::Unicode::utf8decode_guess (464) Warning: Guess encoding charset from string 周華健
[08-01-23 21:31:48.9376] Slim::Utils::Unicode::encodingFromString (784) Warning: Matched by looks_like_utf8()
[08-01-23 21:31:48.9414] Slim::Utils::Unicode::utf8decode_guess (467) Warning: Guessed encoding charset is utf8
[08-01-23 21:31:48.9452] Slim::Utils::Unicode::utf8decode_guess (472) Warning: Find encoding from charset Encode::utf8=HASH(0x9b63134)
[08-01-23 21:31:48.9524] Log::Log4perl::Appender::log (189) Warning: Wide character in print at /usr/lib/perl5/vendor_perl/5.8.8/Log/Log4perl/Appender/Screen.pm line 32.
[08-01-23 21:31:48.9492] Slim::Utils::Unicode::utf8decode_guess (482) Warning: Transformed string using 4 is 周華健
----------------------------------------------------------------------


Second, I returned $utf8_re_bits to what I believe is the broken form (ie containing wide chars). I installed
Encode::Detect as you suggested and instrumented to verify that it is indeed being invoked.


Yes, it generates the correct result:
----------------------------------------------------------------------
[08-01-23 21:26:51.9529] Slim::Utils::Unicode::utf8decode_guess (463) Warning: Guess encoding charset from string 周華健
[08-01-23 21:26:51.9582] Slim::Utils::Unicode::encodingFromString (792) Warning: Encode::Detect::Detector returns UTF-8
[08-01-23 21:26:51.9614] Slim::Utils::Unicode::utf8decode_guess (466) Warning: Guessed encoding charset is utf-8
[08-01-23 21:26:51.9655] Slim::Utils::Unicode::utf8decode_guess (471) Warning: Find encoding from charset Encode::utf8=HASH(0x937fcac)
[08-01-23 21:26:51.9719] Log::Log4perl::Appender::log (189) Warning: Wide character in print at /usr/lib/perl5/vendor_perl/5.8.8/Log/Log4perl/Appender/Screen.pm line 32.
[08-01-23 21:26:51.9687] Slim::Utils::Unicode::utf8decode_guess (481) Warning: Transformed string using 4 is 周華健
----------------------------------------------------------------------


My conclusion:

1. $utf8_re_bits is wrong in using latin1toUTF8() and should just use chr() otherwise
   looks_like_utf8() will always return false!

2. Encode::Detect::Detector is good to have as a backup.

Comment 17 Michael Herger 2008-01-24 02:50:06 UTC

Ok, I've tested your suggested change on OSX/Windows/CentOS4 based systems and didn't find any negative sideeffects yet. I'm still reluctant committing this for 7.0.

Dan - do you see potential issues with this change?

Index: /Users/mh/Documents/workspace/SC7.0/Slim/Utils/Unicode.pm
===================================================================
--- /Users/mh/Documents/workspace/SC7.0/Slim/Utils/Unicode.pm	(revision 16692)
+++ /Users/mh/Documents/workspace/SC7.0/Slim/Utils/Unicode.pm	(working copy)
@@ -169,7 +169,7 @@
 	}
 
 	# Create a regex for looks_like_utf8()
-	$utf8_re_bits = join "|", map { latin1toUTF8(chr($_)) } (127..255);
+	$utf8_re_bits = join "|", map { chr($_) } (127..255);
 
 	$recomposeTable = {
 		"o\x{30c}" => "\x{1d2}",

Comment 18 Michael Herger 2008-01-25 02:02:17 UTC

After looking some more at this code I'm not sure your conclusion is correct:

"Because Encode::is_utf8() returns false, we know the string *must* contain octets (narrow chars) rather than Perl wide characters."

That might be true for this case (utf8decode_guess()), but encodingFromString or looks_like_utf8 are called from other places, too. We can't rely on Encode::is_utf8 being false in all cases.

You said Encode::Detect::Detector with a plain installation does work for you?

Comment 19 quiet.dragon 2008-01-25 09:44:46 UTC

(In reply to comment #18)
> After looking some more at this code I'm not sure your conclusion is correct:
> 
> "Because Encode::is_utf8() returns false, we know the string *must* contain
> octets (narrow chars) rather than Perl wide characters."
> 
> That might be true for this case (utf8decode_guess()), but encodingFromString
> or looks_like_utf8 are called from other places, too. We can't rely on
> Encode::is_utf8 being false in all cases.
> 
> You said Encode::Detect::Detector with a plain installation does work for you? 
> 

Yes Encode::Detect::Detector with a plain installation (yum installed on FC6) works.

The current source makes it clear that Encode::Detect::Detector is optional, so unless
you guys make it mandatory you shouldn't rely on its presence to fix this issue.

Your comment about encodingFromString() and looks_like_utf8() being called from other
places is true.

Are these methods private to the module, or available to external clients?



If looks_like_utf8() is not permitted to make assumptions about the text its handed
then perhaps it is more prudent to include the use of Encode::is_utf8() there:

sub looks_like_utf8() {

   return 1 if (Encode::is_utf8(...));
   return 1 if (... $utf8_re_bits )   # Use narrow match here
   ... blah blah ...
}

Actually when using narrow characters, I think $utf8_re_bits is now overkill.

A simple regexp with something like /[\x7f..\xff]/ is clearer and more concise.

Comment 20 Michael Herger 2008-01-28 09:20:25 UTC

Earl - Dan (who wrote that code) suggested not doing this change without thorough testing by QA. This won't happen before 7.0. Targetting for 7.0.1. Thanks for your understanding.

Comment 21 quiet.dragon 2008-01-28 13:21:57 UTC

(In reply to comment #20)
> Earl - Dan (who wrote that code) suggested not doing this change without
> thorough testing by QA. This won't happen before 7.0. Targetting for 7.0.1.
> Thanks for your understanding.

Thanks for the update.

I'm ok with that. I know the source of the problem and can work around it for my setup.

Comment 22 Michael Herger 2008-01-30 04:07:40 UTC

One more question: are you using a 64 bit distribution? How did you install SC on your box? I just noticed that we're including Encode::Detect::Detector with SC7 (in the CPAN folder). You should not have needed to install this manually.

The only reason why this should have failed for you is either you use a non x86 platform or you installed form the *-nocpan.* archive.

Comment 23 quiet.dragon 2008-01-30 07:14:16 UTC

(In reply to comment #22)
> One more question: are you using a 64 bit distribution? How did you install SC
> on your box? I just noticed that we're including Encode::Detect::Detector with
> SC7 (in the CPAN folder). You should not have needed to install this manually.
> 
> The only reason why this should have failed for you is either you use a non x86
> platform or you installed form the *-nocpan.* archive.

Ahh ... interesting.

No, I'm using a 32 bit installation.

My latest installation is indeed squeezecenter-7.0-16473-noCPAN.tgz.

You're right, Encode::Detect::Detector fixes the problem.

I still think

    $utf8_re_bits = join "|", map { latin1toUTF8(chr($_)) } (127..255);

is broken (or at least ineffective) because it effectively only matches
[\x{7f}..\x{ff}].


Look at latin1toUTF8():

sub latin1toUTF8 {
        my $data = shift;

        if ($] > 5.007) {

                $data = eval { Encode::encode('utf8', $data, $FB_QUIET) } || $da
ta;

        } else {

                $data =~ s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0
x3F)/eg;
        }

        return $data;
}

Each character code from \x80 to \xff is mapped to *two* octets.


Now look at looks_like_utf8():

sub looks_like_utf8 {
        use bytes;

        return 1 if $_[0] =~ /^\xef\xbb\xbf/;
        return 1 if $_[0] =~ /($utf8_re_bits)/o;
        return 0;
}

The "use bytes" pragma says to ignore encoding, and just look at the raw octets.
That means that $utf8_re_bits is looking for *pairs of octets* that encode
\x{7f}..\x{ff}.

I note the difference between \xff and \x{ff}:

    \x1B	hex char              (example: ESC)
    \x{263a}	long hex char         (example: Unicode SMILEY)


I'm pretty sure now the reason that looks_like_utf8() fails is that it will
not match all the *other* wide characters from \x{100}..\x{ffff} because
the octet pairs it's looking for are just those in the range \x{7f}..\x{ff}
(constructed via latin1toUTF8 using chr(0x7f) .. chr(0xff)).

Fundamentally looks_like_utf8() is flawed because $utf8_re_bits() misses
the vast majority of encoded characters.

Comment 24 Michael Herger 2008-01-30 09:00:19 UTC

we'll require Encode::Detect::Detector in 7.0.1

Comment 25 Michael Herger 2008-03-07 18:08:35 UTC

change 17825 in trunk - Encode::Detect::Detector is now a mandatory module, as it give us more reliable results in Unicode handling

Comment 26 James Richardson 2008-05-07 13:37:06 UTC

Earl: please open a new bug if you find more instances of UTF-8 errors

Comment 27 James Richardson 2008-05-15 12:27:16 UTC

This bug has recently been fixed in the latest release of SqueezeCenter 7.0.1

Please try that version, if you still see the error, then reopen this bug.

To download this version, please navigate to: http://www.slimdevices.com/su_downloads.html

Comment 28 Chris Owens 2009-07-31 10:16:20 UTC

Reduce number of active targets for SC