Bug 8585 – CLI search and foreign letters

Summary:

CLI search and foreign letters

Status:	CLOSED FIXED

Product:	Logitech Media Server
Classification:	Unclassified
Component:	CLI
Version:	7.0.1
Platform:	PC Windows XP

Importance:	P5 normal
Target Milestone:	7.x
Assigned To:	Michael Herger

Reported:	2008-06-27 11:25 UTC by Barry Gordon
Modified:	2009-07-31 10:23 UTC
CC List:	3 users

Description Barry Gordon 2008-06-27 11:25:18 UTC

I am working with a Pronto PRO remote using the CLI interface.

I am having a bit of trouble with the CLI search command. I have a client's album in my collection for testing that uses Swedish characters.

When I pull the album data I get what I expect: a URI encoded stream containing a UTF-8 encoded sequence for the Swedish characters. I decode them into the Basic Multilingual Plane (BMP) and they show fine.

Now I wish to allow for a search using the foreign language characters which are in the BMP. I cannot get SC7 to find the album. I have tried raw text, UTF-8 encoded text, URI encoded text and URIencoded(UTF-8encoded) text. I felt the last case should work. The data I am sending is as follows:

UTF-8 encoded = search 0 500 term:N(C3)(83)(C2)(A4)
URI encoded = search 0 500 term:N%C3%A4
UTF-8 and then URI encoded = search 0 500 term:N%C3%83%C2%A4

In the first example, UTF-8 encoded, the items in parens, e.g. (C3), are a single character and its hex value is shown in the parens.

None of the above seem to work. What should I be sending?
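The candidate forms can be generated mechanically, which makes it easier to see which bytes each one puts on the wire. What follows is a minimal Python sketch, not SqueezeCenter code; the helper name cli_term_variants is hypothetical, and the "search 0 500 term:" prefix is copied from the commands above. By this reckoning N%E4 is the Latin-1 form, N%C3%A4 is the single UTF-8 form, and N%C3%83%C2%A4 is UTF-8 applied twice.

```python
from urllib.parse import quote


def cli_term_variants(term: str) -> dict:
    """Return percent-encoded forms of a search term under three byte encodings.

    Hypothetical helper for illustration; it says nothing about which form
    the server actually expects.
    """
    utf8 = term.encode("utf-8")
    return {
        # One byte per character; only possible for ISO-8859-1 characters.
        "latin1": quote(term.encode("latin-1"), safe=""),
        # Standard UTF-8, then percent-encoded.
        "utf8": quote(utf8, safe=""),
        # UTF-8 bytes misread as Latin-1 and encoded to UTF-8 a second time.
        "utf8_twice": quote(utf8.decode("latin-1").encode("utf-8"), safe=""),
    }


if __name__ == "__main__":
    for name, encoded in cli_term_variants("Nä").items():
        print(f"{name:10s} search 0 500 term:{encoded}")
```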

As per a request from Fred re the web interface,

I tried to search for Swedish artists from Squeezecenter and I couldn't find them if I use åäö. If, for example, I search with "Lars" the result is "Lars Winnerbäck", but if I put in "Lars Winnerbäck" I get no result.

However if I replace
å, ä, á with a
ö with o
é with e
the correct artist is found.

Replacing å with a etc. doesn't work if I change the language in Squeezecenter to Swedish instead of English. And it's still not working with åäö...

Fred then asked me to try it with the charset:utf8

Here is a trace from my system for four cases of trying to get the search command to work properly with some Swedish characters; the character in question is the lowercase a with two dots over it. It is in the Basic Multilingual Plane at hex E4. I am sending an N followed by the "funny character".

I have tried all combinations of UTF-8 encoding and URL encoding, with the charset directive. In two cases it finds nothing and in two cases it finds 25 items. There is only one item in the collection that has that letter combination, but there are 25 items that have the combination Na.

I thought that the charset directive only affected what was returned and did not alter the interpretation of what was sent.

As I pointed out earlier the web interface has the same/similar problems.

case 1
SEND: search 0 500 charset:utf8 term:N%E4
RCVD: search 0 500 charset%3Autf8 term%3AN%C3%AF%C2%BF%C2%BD count%3A0

case 2
SEND: search 0 500 charset:utf8 term:N%C3%83%C2%A4
RCVD: search 0 500 charset%3Autf8 term%3AN%C3%83%C2%83%C3%82%C2%A4 count%3A0

case 3
SEND: search 0 500 charset:utf8 term:N%C3%A4
RCVD: search 0 500 charset%3Autf8 term%3AN%C3%83%C2%A4 count%3A25 contributors_count%3A2 contribut

case 4
SEND: search 0 500 charset:utf8 term:N(C3)(83)(C2)(A4)
RCVD: search 0 500 charset%3Autf8 term%3AN%C3%83%C2%A4 count%3A25 contributors_count%3A2 contribut
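The echoed terms in these traces can be unwound by repeatedly decoding as UTF-8 and narrowing back to Latin-1 bytes. The sketch below is a Python heuristic for reading such traces, with a hypothetical function name (peel_utf8_layers); it is not part of SqueezeCenter, and it can over-peel text that legitimately contains sequences such as "Ã¤".

```python
from urllib.parse import unquote_to_bytes


def peel_utf8_layers(echoed: str):
    """Undo stacked UTF-8 encodings in a percent-encoded string.

    Returns (text, layers): the last text that decoded cleanly and the
    number of UTF-8 decoding passes it took to get there.
    """
    data = unquote_to_bytes(echoed)
    text = data.decode("latin-1")  # fallback if nothing decodes as UTF-8
    layers = 0
    while True:
        try:
            candidate = data.decode("utf-8")
        except UnicodeDecodeError:
            break  # bytes are no longer valid UTF-8: stop peeling
        text, layers = candidate, layers + 1
        try:
            narrowed = candidate.encode("latin-1")
        except UnicodeEncodeError:
            break  # contains characters outside Latin-1: cannot peel further
        if narrowed == data:
            break  # pure ASCII: another pass would change nothing
        data = narrowed
    return text, layers


if __name__ == "__main__":
    # Echoed terms copied from cases 1 and 3 above.
    for echoed in ("N%C3%AF%C2%BF%C2%BD", "N%C3%83%C2%A4"):
        text, layers = peel_utf8_layers(echoed)
        print(echoed, "->", repr(text), f"({layers} UTF-8 passes)")
```

Under this reading, the case 3 echo unwinds in two passes to "Nä", while the case 1 echo unwinds to "N" followed by U+FFFD, the Unicode replacement character. That would be consistent with the lone byte E4 having been rejected as invalid UTF-8 before the echo was built, but this is an inference from the trace, not something confirmed in the report.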

Comment 1 Michael Herger 2008-06-28 05:10:56 UTC

Probably a long standing issue: as our devices don't allow for accented characters we never stumbled across this :-).

BTW: what would happen if you replaced the characters with their (don't know how you call them correctly) non-accented version? Eg. Winnerback instead of Winnerbäck?

And I'm not aware of any issue with the web UI. I can search "Björk", "Touré Kunda" without any problem.

Comment 2 Barry Gordon 2008-06-28 21:20:28 UTC

If I search for songs containing Na (the standard English a) then all things that have Na are found, including all things having N and the a with two dots over it.

That would not be the desired result if I were looking for Na where a is the a with two dots over it (hex code E4).

When I search on the web interface using Na the same thing happens.  I have not tried to search on the web interface using N and the double-dotted a as I do not know how to enter that from the keyboard.  I probably need to switch the language to Swedish.

I feel the real problem lies in the decoding on the CLI interface.  The first question is what should be sent: UTF-8 encoded, UTF-8 encoded and URI escaped, or just URI escaped (N%E4)? Nothing seems to work over the CLI interface, as I described in my original bug report.

A lot of people in Europe are now turned onto the Slim products because of the 9600. I have sold about 120 copies of the Pronto Pro code for controlling Squeezecenter and at least half of them are in Europe. I have added all of the code on the search screens to allow for foreign language characters for languages that use the Roman (Latin) set (no Cyrillic, Japanese, Chinese, Hebrew or Arabic), but I cannot make it work as I can see no way to encode it so that Squeezecenter will do a proper search.

Comment 3 Fred 2008-06-29 03:12:38 UTC

Here, mac and web interface, 'Myle' matches 'Mylè', but searching 'Mylè' finds nothing. This generates SQL 'where xxx like 'Mylè%', the same sort of query the CLI does.

Questions for Logitech, then:

A. What is the desired behaviour with regard to search and accented letters?
B. What must be done at SQL level?
C. What must be done at the upper levels?

Right now, it seems B is the culprit when it comes to returning all forms of the letter, accented or not, when searching "myle". Sounds right to me but Barry disagrees - hence A.

Regarding C, it seems the CLI may have an issue but sending it correct utf8 does not seem to work - hence the B question. Looking at the DB in hex from a tool here it seems it stores Latin or ISO something, not UTF, so would 'like "<utf8>" ' match non utf8 strings ?

Comment 4 Barry Gordon 2008-06-29 04:29:08 UTC

Reading over Fred's reply I did a little more research on diacritic markings using Wikipedia. Whether a diacritic is treated as a new letter or as the underlying letter for purposes of spelling and collation is language dependent. Since searching is generally collation dependent, the search, to be absolutely proper, should follow the language's collation rules. E.g. in French the diacritically marked vowels are treated the same as the unmarked vowel, while in Finnish and Swedish they are treated as distinct letters.

This being a language dependent issue makes the search issue a little more complex.  The following might be an acceptable solution if the aim is simplicity (language independence) and a reasonable compromise:

1) If an unmarked vowel is used in the search string it should return all matches in the collection containing either marked or unmarked instances of the vowel

2) If a marked vowel is used then it should only match items in the collection with identically marked vowels.  In this case guidance on how the marked vowel is to be represented is needed (e.g. URIescaped UTF8 )

IMHO the proper solution is to have the search obey the language setting in Squeezecenter.  That is, if French were selected then marked and unmarked vowels are treated the same no matter which one is used, while if Swedish were selected then an unmarked a and its diacritically marked versions are different and distinct, matching only their identical instances in the collection.
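The language-independent compromise in points 1 and 2 can be sketched as a per-character comparison. This is a minimal Python illustration using the standard unicodedata module; the function names (strip_marks, char_matches, term_matches) are hypothetical and nothing here reflects how Squeezecenter actually searches. One assumption to note: letters with no canonical decomposition, such as ø, count as unmarked in this sketch.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop combining marks after canonical decomposition (ä -> a, é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def char_matches(query_ch: str, item_ch: str) -> bool:
    q, i = query_ch.casefold(), item_ch.casefold()
    if strip_marks(q) == q:
        # Rule 1: an unmarked query letter matches marked and unmarked forms.
        return strip_marks(i) == q
    # Rule 2: a marked query letter matches only the identically marked letter.
    return i == q


def term_matches(query: str, item: str) -> bool:
    """Case-insensitive substring match applying rules 1 and 2 per character."""
    query = unicodedata.normalize("NFC", query)
    item = unicodedata.normalize("NFC", item)
    n = len(query)
    return any(
        all(char_matches(q, c) for q, c in zip(query, item[start:start + n]))
        for start in range(len(item) - n + 1)
    )


if __name__ == "__main__":
    for query in ("Na", "Nä"):
        for title in ("Långa Nätter", "Nat King Cole"):
            print(query, "|", title, "|", term_matches(query, title))
```

With these rules "Na" matches both titles, while "Nä" matches only "Långa Nätter".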

Comment 5 Barry Gordon 2008-07-10 09:59:45 UTC

Hey guys; Any traction on this one?  Not driving me crazy but I do have some foreign users who would like to better control searches.  Fred summed it up pretty well

Comment 6 Michael Herger 2008-10-08 05:30:55 UTC

Barry - I tried this again with SC 7.3. As long as I correctly uri encode the search string, this is working for me:

titles 0 10 search:c%E2n

finds a track titled "Cân...". Do you still see this issue?

Comment 7 Barry Gordon 2008-10-08 08:34:38 UTC

I will have to get into this again.  My old man memory tends to drop things.  I do URI encode the sequences before sending and I use UTF-8 encoding, so I assume in your test case %E2 is the URI escaped UTF-8 character for the a with the ^ over it.

Any particular build version of 7.3? I was planning on installing 7.3 in my test bed; I am now testing against a version of 7.2.1 build 23323. What version of 7.3 has this fixed?

By the way it would be nice if you expanded the version tag of status to include the build.  It would make for easier compatibility checks and testing. Perhaps something like version:7.2.1 23323 so version and build could be parsed on a blank.

Any idea on release dates for official 7.2.1 or 7.3? Is the search 'fixed' in 7.2.1 xxx or only 7.3?

Comment 8 Barry Gordon 2008-10-13 11:04:32 UTC

I finally got back to this.  I am running against 7.3

The album is titled "Långa Nätter".  I believe the "å" is hex code E5 and the "ä" is hex code E4.  This is from my basic understanding of the standard multi-language plane.

There is an album in the collection as follows:
Långa Nätter which in HEX would look like  4C E5 6E 67 61 20 4E E4 74 74 65 72

On a search for Langa (the English a) it is found, which it probably should not be.
SEND: Sending search 0 100 term:Langa
RCVD: Data Rcvd=search 0 100 term:Langa count:1 albums_count:1 album_id:131 album:L%C3%83%C2%A5nga%20%C3%83%C2%A4)tter(CR)

On a search for Lå using L%E5:
SEND: Sending search 0 100 term:L%E5
RCVD: Data Rcvd=search 0 100 term:L%C3%83%C2%A5 count:0(CR)

Based upon what is coming back I thought that perhaps a UTF-8 encoding would solve the problem, so I sent:

SEND: search 0 500 charset:utf8 term:L%C3%83%C2%A5
RCVD: search 0 500 charset%3Autf8 term%3AL%C3%83%C2%83%C3%82%C2%A5 count%3A0

And here it is without the charset:utf8 clause.
SEND: search 0 500 term:L%C3%83%C2%A5
RCVD: search 0 500 term%3AL%C3%83%C2%83%C3%82%C2%83%C3%83%C2%82%C3%82%C2%A5 count%3A0

I am now totally confused, especially about the double, triple, .... encoding of the return!

I do not understand what I am sending incorrectly.
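The growing echoes are consistent with the same UTF-8 bytes being reinterpreted as Latin-1 and encoded to UTF-8 again at each step. The Python sketch below (hypothetical helper name relayer) reproduces the byte patterns seen in the traces above; it illustrates the pattern only and makes no claim about where in the server the extra passes happen.

```python
from urllib.parse import quote


def relayer(text: str, extra_passes: int) -> str:
    """Percent-encode text as UTF-8, then re-encode the bytes extra_passes times.

    Each extra pass treats the current bytes as Latin-1 characters and
    encodes them to UTF-8 again, which is the classic mojibake pattern.
    """
    data = text.encode("utf-8")
    for _ in range(extra_passes):
        data = data.decode("latin-1").encode("utf-8")
    return quote(data, safe="")


if __name__ == "__main__":
    for passes in range(4):
        print(passes, relayer("Lå", passes))
```

One extra pass yields the L%C3%83%C2%A5 echoed for term:L%E5, two yield the echo of the charset:utf8 request, and three yield the 16-byte echo of the request sent without charset:utf8.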

Comment 9 Barry Gordon 2008-10-29 08:21:08 UTC

Why is this now such a low priority?  It should not be hard to get fixed.

Comment 10 Andy Grundman 2008-10-29 08:23:11 UTC

OK, so provide a patch. :)

Comment 11 Barry Gordon 2008-10-29 08:26:15 UTC

Touché, Andy. If I knew enough about the architecture of SC and programmed in Perl I would, but I don't, so I can't.

Comment 12 Michael Herger 2008-11-04 07:55:20 UTC

change 23810 - could you please give this a try? A few minor tweaks to the character encoding which could help your issues.

Comment 13 Barry Gordon 2008-11-04 08:05:56 UTC

Glad to do that; I just need a little info. 7.3?  Current nightly?

What should I send?  Simple URI encoded, or UTF-8 then URI encoded? Or should I try them all?

Comment 14 Michael Herger 2008-11-04 08:13:09 UTC

not in the nightly yet. utf8 should be fine. If it doesn't work, try the others :-)

Comment 15 Barry Gordon 2008-11-04 10:00:29 UTC

Okay, just let me know when it is in the nightly so I can try it.

Thanks

Comment 16 Michael Herger 2008-11-04 23:06:37 UTC

They will be in tonight's (nov. 5) builds

Comment 17 Michael Herger 2008-11-05 01:28:12 UTC

This should be fixed. Have a CLI based iP* application here which now does successfully return results for eg. "björk" or "blöd" (it didn't before).

Comment 18 Barry Gordon 2008-11-05 07:00:00 UTC

Just checked the nightly builds and no statement re bug fix 8585 (this one) so I will wait till tomorrow and check again.  I am set up to test it so it won't take me long.  I assume we are talking 7.3 and not the 7.2.1 fix or 7.2.2

Comment 19 Michael Herger 2008-11-05 07:53:13 UTC

> Just checked the nightly builds and no statement re bug fix 8585 (this one) so

I don't understand. What are you missing?

Yes, this is 7.3 only.

Comment 20 Michael Herger 2008-11-05 23:22:51 UTC

I hate to say so, but while my change might fix the search, it breaks all sorts of other things. I'll have to revert this and investigate a more specific fix.

Comment 21 Michael Herger 2008-11-12 01:28:47 UTC

This bug seems to be related to Slim::Utils::Text being utf8 encoded instead of iso-8859-1. Dan wrote in revision 1809: "The latin1 diacriticals really need to be in latin1 and not utf8 to work.". If I change the file's encoding, using Slim::Utils::Text::matchCase() on the search string does fix this issue for me. Please give it a try.

change 23907 - transcribe umlauts/accented characters before querying the DB. Please give it a try.


For the record: Andy mentioned not touching that file unless it's broken. I think it is. I accidentally changed the encoding when I did mass-copyright updates. Looking at where matchCase() is being used, we luckily just didn't hit its brokenness, because it's rarely used:

- Slim::Display::Lib::TextVFD::doubleSize(): used on old SliMP3/SB1 only. Probably breaks double sized font on these old players. The bug might have gone unseen thanks to the rather exotic use case

- Slim::Music::Info::sortFilename(): used in BMF. Uncritical, as at worst the sort order would be slightly broken (ab < az < äb instead of ab < äb < az)

- Slim::Player::Playlist::reshuffle(): the function is used to assign a variable which is never(!?!) used?

- Slim::Web::Pages::anchor(): this function is commented "TODO: find where this is used?". And indeed I couldn't find any call to it.
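The two sort orders mentioned for sortFilename() can be reproduced with a small Python sketch. It is an illustration only: fold_key is a hypothetical helper built on the standard unicodedata module, not the logic of Slim::Utils::Text::matchCase().

```python
import unicodedata


def fold_key(name: str) -> str:
    """Sort key that ignores diacritics and case (äb sorts like ab)."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


names = ["az", "äb", "ab"]

# Plain code point order puts ä after every unaccented letter.
by_code_point = sorted(names)

# Folded order, with the original string as a tie-breaker.
by_folded = sorted(names, key=lambda n: (fold_key(n), n))

if __name__ == "__main__":
    print(by_code_point)
    print(by_folded)
```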

Comment 22 Barry Gordon 2008-11-12 07:18:45 UTC

Michael, I am not sure what you want me to do.  If you want me to test it, let me know which build, and whether I have to do any special encoding before issuing the CLI search.

Comment 23 Michael Herger 2008-11-12 07:27:19 UTC

The latest 7.3 build:
http://downloads.slimdevices.com/nightly/latest/7.3/

Generally, any build with the same or a higher revision number than the change noted here.

Comment 24 Barry Gordon 2008-11-14 05:48:35 UTC

Squeezecenter is 7.3 29317

The collection contains an album by Melissa Horn with the title "Långa Nätter"

If I do a search for Na many albums are returned; however, Långa Nätter is one of those returned and it should not be.

SEND: search 0 500 charset:utf8 term:Na
RCVD: search 0 500 charset%3Autf8 term%3ANa count%3A25 contributors_count%3A2 contributor_id%3A88
DONE: 12:56:4 Page=SSsearch, TCPIP Caller=5, Clink=, Cret=SearchDone

I am not sure what I should be sending for the case where I logically want to send "Nä".  Assuming that the ä character is at collating position E4 hex, do I send:

SEND: search 0 500 charset:utf8 term:N%E4           or
SEND: search 0 500 charset:utf8 term:Nä             or
SEND: search 0 500 charset:utf8 term:N%F0%E4        or
SEND: search 0 500 charset:utf8 term:N%EF%83%E4

None of the above seem to work

Comment 25 Michael Herger 2008-11-17 03:02:43 UTC

Could you please try running these queries from a simple telnet session?

search 0 10 term:dupré
search 0 10 term%3Adupr%C2%82 count%3A1 contributors_count%3A1 contributor_id%3A48 contributor%3ADupr%C3%A9%2C%20M

Still working fine for me (as does Squedgy, one of the iPod apps)

Comment 26 Michael Herger 2008-11-17 10:10:29 UTC

We're approaching code freeze. Will keep looking into this issue for 7.3.1

Comment 27 Barry Gordon 2008-11-17 15:32:34 UTC

The condition that I am getting is that the search is not limiting properly.

There is an album in the collection as follows:
Långa Nätter which in HEX would look like  4C E5 6E 67 61 20 4E E4 74 74 65 72

If I search for Nätt that is the only thing found. If I search for Natt it is the only thing found.

If I search for Nät (or Nat) it finds the album Långa Nätter as it should, but it also finds everything with the word Nat, as in Nat King Cole; Second Nature; National Orchestra . . .

Comment 28 Michael Herger 2008-11-17 23:23:35 UTC

> The condition that I am getting is that the search is not limiting properly.

Consider this a feature: as eg. the player or Controller UIs don't allow for accented characters, SC treats ä the same as A internally, é like E etc.

But you now do find "Nätt"? Then this issue is fixed?
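The behaviour described here, where accented and unaccented letters are treated alike, can be illustrated with a folding sketch. This is Python written for this thread, not SC's Perl code: search_key and search are hypothetical names, and the titles are taken from the comments in this bug.

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold text to unaccented upper case (ä -> A, é -> E)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.upper()


def search(term: str, titles: list) -> list:
    """Return titles whose folded form contains the folded term."""
    key = search_key(term)
    return [title for title in titles if key in search_key(title)]


TITLES = ["Långa Nätter", "Nat King Cole", "Second Nature", "Puer natus es nobis"]

if __name__ == "__main__":
    for term in ("Nät", "Nat", "Nätt", "Natt"):
        print(term, "->", search(term, TITLES))
```

Because both sides are folded before comparison, "Nät" and "Nat" return every title containing NAT, while "Nätt" and "Natt" return only "Långa Nätter".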

Comment 29 Barry Gordon 2008-11-18 07:11:10 UTC

Were you ever one of my students?  I used to teach two things:

1) If it ain't broke don't fix it
2) If you can't fix it feature it

Being an English-speaking person, the only diacritically marked things normally in my library are works by the contributor Beyoncé, so I have no issue.  By the way, a search on the full name Beyoncé works but a search on the partial name ncé fails.

The real issue is foreign sales. The product I develop is for the Philips Pronto PRO line of remote controls, which has a large following in Europe.  Denmark, Sweden, Finland, Germany, Norway and the Netherlands all use diacritically marked characters. A lot of Slim players have been sold due to the availability of the control program on the Pronto.

It is those customers of ours that would like to see it fixed properly. For me and my music listening, it is not an issue.

Bottom Line, I do not consider this resolved. However Squeezecenter is your product and they are your (and mine) customers.

Comment 30 Michael Herger 2008-11-18 07:35:10 UTC

> Were you ever one of my students?

I doubt it :-)

> Bottom Line, I do not consider this resolved. However Squeezecenter is your
> product and they are your (and mine) customers.

I still don't understand. Your search does work, but you don't want the additional results? Or it doesn't return the Beyoncé album at all?

Comment 31 Barry Gordon 2008-11-18 07:49:36 UTC

The search does "work", but I feel incorrectly, and I understand your issue.

Let's take it from the perspective of an anal compulsive music lover, not a computer scientist or programmer. (I assume the former applies to neither of us.)

If I do a search for Nät using my collection as the music repository then the ONLY thing that should be matched is the album "Långa Nätter", as that is the only item in my entire collection with those three characters in sequence.  It should NOT match on:

Nat King Cole
National Orchestra of the ORTF...
National Folk
Brasil Native (Bonus Track)
Nature of the Game
Puer natus es nobis . . .
Puer natus in Bethlehelm . . .
Second Nature

Which it does. From a layman's perspective, not knowing the idiosyncrasies of computer searching and matching, it does not work.  The filter (that is all a search is) is letting things through that it should not.

One answer is to use Nätt to do the search, which works properly.  With regard to Beyoncé I want to look further but I believe I am correct.

Comment 32 Michael Herger 2008-11-18 08:14:44 UTC

Ok, I've opened new bug 10049 which should cover the remaining issue. I'd like to close this bug, as the search does work "as designed", if poorly designed. But that's something we best reconsider with the new schema rework. 

Feel free to re-open once again if I've totally misunderstood this bug. Thanks!

Comment 33 Barry Gordon 2008-11-18 08:21:43 UTC

Ahhh My third teaching lecture, The Manana rule (needs a diacritic marking)

Put off till tomorrow what is too hard to do today


I agree with your approach.

Comment 34 James Richardson 2008-12-22 11:36:30 UTC

This bug has been fixed in the 7.3.1 release version of SqueezeCenter!

Please download the new version from http://www.slimdevices.com/su_downloads.html if you haven't already.  

If you are still experiencing this problem, feel free to reopen the bug with your new comments and we'll have another look.

Comment 35 Chris Owens 2009-07-31 10:23:23 UTC

Reduce number of active targets for SC