Re: [Rhythmbox-devel] discussion question



On Fri, 2003-08-08 at 16:47, in7y118@public.uni-hamburg.de wrote:
> Instead of replying to every post one by one, I'll do it in a batch. So here it 
> goes...
> 
> You might want to collect similar names into the same group. You'll run into
> quite a few problems when defining "similar" though. Do you just use strings that
> are the same? Do you only do uppercase/lowercase? Is "R.E.M." similar to "REM"?
> Is "Britney Spears" the same as "Spears, Britney" or even "Spears Britney"?
> I spent quite some time reading about and implementing fuzzy matching
> algorithms; I even sent one to this list once.
> The biggest problem with this is that this has to be i18n-safe. So all your 
> optimizations must work for people in China or whatever, too. And they probably 
> think differently about some English-specific optimizations. And I wouldn't want 
> to get into doing language-specific stuff.
> What I did, however, was rely on Unicode-specific character information to
> simplify a name by stripping/changing characters. This makes it possible, for
> example, to remove signs ("R.E.M." => "REM") or make everything uppercase (when
> there is a corresponding uppercase character - German "ß" doesn't have one). The
> advantage is that all of this works within glib, so you don't have to put
> information into the algorithms.
> So my advice would be: Use the most sophisticated algorithms that are possible 
> with the information you get, but don't put more information into the lib. So 
> leave out rules for pattern matching (like "$X and $Y" == "$X & $Y" or "$Y, $X" 
> == "$X $Y"). And be sure to use stuff that's not language-specific.
> 
> You get very, very far with that. All my searches worked satisfactorily, even when 
> I wrote stuff as wrong as I could imagine.
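
For reference, the Unicode-based simplification described above could look
roughly like this. It is only a sketch, written in Python with the standard
unicodedata module instead of glib to keep it short; simplify_name is a
made-up name, and the exact set of character categories to strip is my
assumption, not what the original patch did.

```python
import unicodedata


def simplify_name(name: str) -> str:
    """Simplify a name using only Unicode character properties.

    Assumption: "signs" means the Unicode punctuation (P*) and
    symbol (S*) general categories. No language-specific rules here.
    """
    kept = []
    for ch in name:
        category = unicodedata.category(ch)
        # Drop punctuation and symbols: "R.E.M." becomes "REM".
        if category[0] in ("P", "S"):
            continue
        kept.append(ch)
    # Python maps "ß" to "SS" when uppercasing, whereas a per-character
    # uppercase lookup would leave it unchanged; either way both sides of
    # a comparison go through the same function.
    return "".join(kept).upper()
```

With this, "R.E.M.", "r.e.m." and "REM" all compare equal, while
"Britney Spears" and "Spears, Britney" still differ, since reordering words
would need a pattern rule, which is exactly what the advice above leaves out.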

My take on this:
- Hard-code a bunch of easy ones with no false positives.
- Let people add new ones for different languages (Die Toten Hosen ==
Toten Hosen, Die == Toten Hosen ...)
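
Something along these lines, as a rough Python sketch. The names
(ARTICLES, register_articles, canonical_name) and the idea of keying the
rules by language code are made up for illustration, not existing
Rhythmbox code.

```python
# Built-in rules: language code -> leading articles (hypothetical table).
ARTICLES = {"en": {"the"}}


def register_articles(lang, articles):
    """Let users add article rules for another language."""
    ARTICLES.setdefault(lang, set()).update(a.casefold() for a in articles)


def canonical_name(name, langs=("en",)):
    """Strip a leading article, or a trailing ', Article', from a name."""
    articles = set()
    for lang in langs:
        articles |= ARTICLES.get(lang, set())
    words = name.split()
    if len(words) > 1:
        # "Die Toten Hosen" -> "Toten Hosen"
        if words[0].casefold() in articles:
            return " ".join(words[1:])
        # "Toten Hosen, Die" -> "Toten Hosen"
        if words[-1].casefold() in articles and words[-2].endswith(","):
            return " ".join(words[:-1]).rstrip(",")
    return name


# A user-supplied rule set for German.
register_articles("de", ["die", "der", "das"])
```

canonical_name("Die Toten Hosen", ["de"]) and
canonical_name("Toten Hosen, Die", ["de"]) both give "Toten Hosen", and the
German rules only apply when "de" is asked for.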

-- 
/Bastien Nocera
http://hadess.net

#2  0x4205a2cc in printf ("Oh my %s\n", preferred_deity) from
/lib/i686/libc.so.6 printf ("Oh my %s\n", preferred_deity);
Segmentation fault



