Bug 9868 – MP3::Info should use all tag fields to determine charset

Summary:

MP3::Info should use all tag fields to determine charset

Status:	NEW

Product:	Logitech Media Server
Classification:	Unclassified
Component:	Tagging
Version:	unspecified
Platform:	PC All

Importance:	-- normal with 1 vote
Target Milestone:	Future
Assigned To:	Unassigned bug - please assign me!

URL:
Keywords:	charset_issues

Depends on:
Blocks:

Reported:	2008-11-01 06:17 UTC by Kensaku Yamaguchi
Modified:	2008-12-22 15:54 UTC
CC List:	3 users

See Also:
Category:	---

Attachments
patch for lib/MP3/Info.pm (1.89 KB, patch) 2008-11-01 06:17 UTC, Kensaku Yamaguchi

Description Kensaku Yamaguchi 2008-11-01 06:17:53 UTC

Created attachment 4193
patch for lib/MP3/Info.pm

Some artist, album and track names in my library with Japanese characters get displayed incorrectly by SqueezeCenter and SqueezeBox Controller.
For example, an artist tag that says "Ozaki Yutaka" (see http://en.wikipedia.org/wiki/Yutaka_Ozaki for actual kanji characters) in Shift_JIS encoding will get displayed in SqueezeCenter with garbled characters.

It appears that the MP3::Info module used to extract tags from MP3 files is not guessing the correct character set for tag values.
MP3::Info uses Encode::Detect::Detector (if available) to detect the charset for each individual tag string, but a single artist or album tag is usually too short for it to reliably detect the charset.
A typical Japanese artist name consists of only four characters, and I suppose the situation is similar for Chinese names, too.
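
To illustrate why a short tag gives a detector so little to work with, here is a minimal Python sketch (not Perl, and not part of the attached patch). It only checks which encodings can decode the bytes without error, which is a much cruder test than the statistical detection Encode::Detect::Detector performs. The helper name `decodable_as` and the candidate list are assumptions made for this example.

```python
from typing import List

# Candidate encodings to try; this list is an assumption for the example.
CANDIDATES = ["shift_jis", "euc_jp", "utf-8", "latin-1"]


def decodable_as(data: bytes, encodings: List[str] = CANDIDATES) -> List[str]:
    """Return every candidate encoding that decodes `data` without error."""
    ok = []
    for enc in encodings:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok


# The artist name from the report, stored as Shift_JIS bytes (six bytes total).
artist_bytes = "尾崎豊".encode("shift_jis")

# More than one encoding accepts these bytes, so strict decoding alone
# cannot tell which one the tagger actually used.
print(decodable_as(artist_bytes))
```

Latin-1 accepts any byte sequence, so it always appears in the result. With only a handful of bytes, a detector has very little evidence to prefer Shift_JIS over the alternatives.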

I would like to suggest that MP3::Info could identify tag charsets more accurately if it used all of the tags in a track to detect which charset those tags use.
This fix won't solve the problem entirely, because some tracks use CJK characters only in the artist tag (for example) and Latin characters in the other tags, but for most files it would be a great improvement.

I've attached a patch for Info.pm that works for me.
However, I'm not sure which tag fields ought to be used to detect the charset.
(The patch currently uses all tags whose IDs begin with a 'T' or "COM".)
Also, it only works if Encode::Detect::Detector is being used, and not Encode::Guess.
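
The idea described above can be sketched as follows. This is a Python illustration under stated assumptions, not the attached Perl patch: `default_detector` is a simple stand-in that returns the first candidate encoding able to decode the bytes, where the real module would call Encode::Detect::Detector. The names `is_text_frame` and `decode_tags`, the newline used to join values, and the Latin-1 fallback are all hypothetical choices made for this example.

```python
from typing import Callable, Dict, Optional


def default_detector(data: bytes) -> Optional[str]:
    """Stand-in detector: first candidate encoding that decodes strictly."""
    for enc in ("ascii", "utf-8", "shift_jis", "euc_jp"):
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            pass
    return None


def is_text_frame(frame_id: str) -> bool:
    """Select frames whose IDs begin with 'T' or 'COM', as the report describes."""
    return frame_id.startswith("T") or frame_id.startswith("COM")


def decode_tags(
    raw: Dict[str, bytes],
    detect: Callable[[bytes], Optional[str]] = default_detector,
    fallback: str = "latin-1",
) -> Dict[str, str]:
    """Detect one charset from all text frames pooled together, then decode each."""
    text = {k: v for k, v in raw.items() if is_text_frame(k)}
    # Pool the values so the detector sees far more bytes than any single tag.
    pooled = b"\n".join(text.values())
    charset = detect(pooled) or fallback
    decoded = {}
    for frame_id, value in text.items():
        try:
            decoded[frame_id] = value.decode(charset)
        except UnicodeDecodeError:
            # A frame that does not fit the pooled guess falls back to Latin-1.
            decoded[frame_id] = value.decode(fallback)
    return decoded


# Illustrative input: the album and comment values are made up for the example.
raw_tags = {
    "TPE1": "尾崎豊".encode("shift_jis"),
    "TALB": "東京の夜".encode("shift_jis"),
    "TIT2": b"I LOVE YOU",
    "COMM": "卒業".encode("shift_jis"),
    "APIC": b"\xff\xd8\xff",  # binary frame, never passed to the detector
}

print(decode_tags(raw_tags)["TPE1"])
```

Because the charset is chosen once per track, a track that mixes encodings between frames would still be decoded inconsistently, which matches the limitation noted above.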

Comment 1 Chris Owens 2008-11-10 09:16:27 UTC

cc'ing Dan per Andy

Comment 2 Chris Owens 2008-12-22 09:38:10 UTC

Dan, did you ever see this patch?  Do you have an opinion?

Comment 3 Chris Owens 2008-12-22 15:54:08 UTC

Some feedback came in that this patch would add a lot of processing for the common case, and might cause unforeseen bugs.

I'll continue to keep an eye on this to see if it grows in popularity.