[tracker] libtracker-fts: Only let stop words go through on prefix queries
- From: Carlos Garnacho <carlosg src gnome org>
- To: commits-list gnome org
- Cc:
- Subject: [tracker] libtracker-fts: Only let stop words go through on prefix queries
- Date: Sat, 16 Sep 2017 19:52:28 +0000 (UTC)
commit 7789e3ac0bdb3bafd68795f7bc2f381bcfa04cfd
Author: Carlos Garnacho <carlosg gnome org>
Date: Fri Sep 8 23:03:16 2017 +0200
libtracker-fts: Only let stop words go through on prefix queries
Commit 63e507865 made stop words go through when tokenizing FTS5 query
search terms, in order to still provide matches for incompletely typed
search terms that happen to match a stop word.
This however brought the side effect that searching for a stop word in
combination with other terms renders the latter ineffective. As the stop
word has no tokens in the FTS5 table to match with, the whole query brings
no results.
Since that commit, SQLite fixed FTS5_TOKENIZE_PREFIX to work as advertised,
so limit the bypass to prefix queries (e.g. "onto*"), since it only makes
sense there. Also, invert the way we look for stop words (i.e. always look up
those in search terms as per config, and do the bypass once we know we are
dealing with a stop word) for the sake of readability.
https://bugzilla.gnome.org/show_bug.cgi?id=787452
src/libtracker-fts/tracker-fts-tokenizer.c | 18 ++++++++++--------
1 files changed, 10 insertions(+), 8 deletions(-)
---
diff --git a/src/libtracker-fts/tracker-fts-tokenizer.c b/src/libtracker-fts/tracker-fts-tokenizer.c
index 7b0af2d..a58a8ba 100644
--- a/src/libtracker-fts/tracker-fts-tokenizer.c
+++ b/src/libtracker-fts/tracker-fts-tokenizer.c
@@ -102,24 +102,21 @@ tracker_tokenizer_tokenize (Fts5Tokenizer *fts5_tokenizer,
 	TrackerTokenizer *tokenizer = (TrackerTokenizer *) fts5_tokenizer;
 	TrackerTokenizerData *data = tokenizer->data;
 	const gchar *token;
-	gboolean stop_word, ignore_stop_words = data->ignore_stop_words;
+	gboolean stop_word, is_prefix_query;
 	int n_tokens = 0, pos, start, end, len;
 	int rc = SQLITE_OK;

 	if (length <= 0)
 		return rc;

-	/* When tokenizing the query, we don't want to ignore stop words,
-	 * we might ignore otherwise valid matches.
-	 */
-	if (flags & FTS5_TOKENIZE_QUERY)
-		ignore_stop_words = FALSE;
+	is_prefix_query = ((flags & (FTS5_TOKENIZE_QUERY | FTS5_TOKENIZE_PREFIX)) ==
+	                   (FTS5_TOKENIZE_QUERY | FTS5_TOKENIZE_PREFIX));

 	tracker_parser_reset (tokenizer->parser, text, length,
 	                      data->max_word_length,
 	                      data->enable_stemmer,
 	                      data->enable_unaccent,
-	                      ignore_stop_words,
+	                      data->ignore_stop_words,
 	                      TRUE,
 	                      data->ignore_numbers);

@@ -133,7 +130,12 @@ tracker_tokenizer_tokenize (Fts5Tokenizer *fts5_tokenizer,
 		if (!token)
 			break;

-		if (stop_word && data->ignore_stop_words)
+		/* When tokenizing prefix query tokens we don't want to
+		 * mistake incomplete words as stop words (eg. "onto" when
+		 * typing "ontology"), we thus let them go through
+		 * even if the parser marked it as a stop word.
+		 */
+		if (stop_word && !is_prefix_query)
 			continue;

 		rc = token_func (ctx, 0, token, len, start, end);
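
The decision the patch encodes can be sketched in isolation as follows. This is only an illustration, not code from Tracker: the `sketch_*` helpers and `SKETCH_*` macros are hypothetical names, the macro values are assumed to mirror the `FTS5_TOKENIZE_*` flags declared in `sqlite3.h`, and `stop_word` stands for whatever the parser reported for the current token.

```c
#include <stdbool.h>

/* Assumed to mirror the FTS5_TOKENIZE_* flag values in sqlite3.h;
 * redefined under hypothetical names so the sketch builds without
 * the SQLite headers. */
#define SKETCH_TOKENIZE_QUERY    0x0001
#define SKETCH_TOKENIZE_PREFIX   0x0002
#define SKETCH_TOKENIZE_DOCUMENT 0x0004

/* Hypothetical helper: true only when BOTH the QUERY and PREFIX bits
 * are set, i.e. the token comes from a query term such as "onto*". */
static bool
sketch_is_prefix_query (int flags)
{
	const int mask = SKETCH_TOKENIZE_QUERY | SKETCH_TOKENIZE_PREFIX;

	return (flags & mask) == mask;
}

/* Hypothetical helper: a token is handed to the FTS5 callback unless
 * the parser flagged it as a stop word and we are not tokenizing a
 * prefix query. */
static bool
sketch_should_emit (int flags, bool stop_word)
{
	return !stop_word || sketch_is_prefix_query (flags);
}
```

Under these assumptions, a stop word seen with `SKETCH_TOKENIZE_QUERY | SKETCH_TOKENIZE_PREFIX` is emitted, while the same stop word seen with only `SKETCH_TOKENIZE_QUERY` or with `SKETCH_TOKENIZE_DOCUMENT` is skipped. Testing both bits against the combined mask, rather than testing `flags & mask` for non-zero, is what keeps plain (non-prefix) queries from bypassing the stop word filter.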