Re: [Tracker] Syntax of searchings



Currently it is not possible, and that is not right... I do not know if
QDBM (which stores file names associated with keywords) can be set to
split a string like "ziegler-nichols" into "ziegler" and "nichols"
automatically for searching, or if we need to split strings ourselves.

This question was raised before, regarding filenames with dashes and underscores in them. At that time, the reply was that in C code, dashes and underscores often have a meaning. I think we'll need to be context-sensitive in this case: regular documents and filenames usually require word-splitting, while source code usually doesn't. (However, in C, the string "difference=alpha-beta" actually has three interesting lexemes, so even here a dash should not create the lexeme "alpha-beta".)
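To make the context-sensitive idea concrete, here is a minimal sketch in Python (not Tracker's actual code, and not a QDBM feature). The function name split_words and the is_source flag are made up for illustration, and it assumes the indexer already knows whether a file is source code.

```python
import re

# Prose and file names: a word is a run of letters or digits, so both
# dashes and underscores act as separators.
PROSE_WORD = re.compile(r"[^\W_]+")

# Source code: underscores belong to identifiers, everything else
# (operators such as '-' and '=') separates lexemes.
SOURCE_WORD = re.compile(r"\w+")


def split_words(text, is_source=False):
    """Return the lowercased search words found in text.

    is_source is a hypothetical flag; how the indexer would decide
    that a file is source code is left open here.
    """
    pattern = SOURCE_WORD if is_source else PROSE_WORD
    return [word.lower() for word in pattern.findall(text)]


if __name__ == "__main__":
    print(split_words("ziegler-nichols"))
    print(split_words("difference=alpha-beta", is_source=True))
    print(split_words("my_var-1", is_source=True))
```

With this split, "ziegler-nichols" is indexed as two words in a regular document, while in source code "my_var" stays whole and the dash still separates "alpha" from "beta".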

What I also dislike about libstemmer (which aims to "reduce" strings to
their stems, to ignore plurals for instance) is that it does not ignore
accented characters, so if I have a file which contains "éléphant",
then searching for "élephant" or "elephant" will not find it. "éléphant"
is the correct spelling, but it happens very often that French people
miss some accents or add superfluous ones... and it is the same problem
in other languages.

Unfortunately, this is not always applicable. For instance, in Swedish there is a big difference between the words "öst" and "ost", which mean "east" and "cheese", respectively. However, "café" is often spelled "cafe", with the same meaning.

I'm not sure at all how to handle this.
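
One possible direction, as a sketch only: strip accents by Unicode decomposition, but keep a per-language list of letters that must not be folded. The function fold_accents and the KEEP table below are made up for illustration (this is not something libstemmer provides), and the approach assumes the language of the text is known, which is an open problem in itself.

```python
import unicodedata

# Hypothetical per-language table of letters that are letters in their
# own right and must not be folded to the base letter.
KEEP = {
    "sv": set("åäöÅÄÖ"),
}


def fold_accents(word, lang=None):
    """Strip combining accents from word, except for letters in KEEP[lang]."""
    keep = KEEP.get(lang, set())
    folded = []
    # Compose first so every accented letter is a single code point.
    for ch in unicodedata.normalize("NFC", word):
        if ch in keep:
            folded.append(ch)
            continue
        # Decompose into base letter + combining marks, drop the marks.
        decomposed = unicodedata.normalize("NFD", ch)
        folded.append("".join(c for c in decomposed
                              if not unicodedata.combining(c)))
    return "".join(folded)


if __name__ == "__main__":
    print(fold_accents("éléphant"))
    print(fold_accents("öst", lang="sv"))
    print(fold_accents("café", lang="sv"))
```

Applying the same folding both when indexing and when searching would let "élephant" and "elephant" match "éléphant", while Swedish "öst" stays distinct from "ost" and "café" still matches "cafe".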


