[tracker] libtracker-fts: Only let stop words go through on prefix queries



commit 7789e3ac0bdb3bafd68795f7bc2f381bcfa04cfd
Author: Carlos Garnacho <carlosg gnome org>
Date:   Fri Sep 8 23:03:16 2017 +0200

    libtracker-fts: Only let stop words go through on prefix queries
    
    Commit 63e507865 made stop words go through when tokenizing FTS5 query
    search terms, in order to still provide matches for incompletely typed
    search terms that happen to match a stop word.
    
    This however had the side effect that searching for a stop word in
    combination with other terms renders the latter ineffective: as the
    stop word has no tokens to match in the FTS5 table, the whole query
    yields no results.
    
    Since that commit, SQLite fixed FTS5_TOKENIZE_PREFIX to work as
    advertised, so limit the bypass to prefix queries (e.g. "onto*"), as
    it only makes sense there. Also, invert the way we look for stop
    words (i.e. always look them up in search terms as per config, and
    do the bypass once we know we are dealing with a stop word) for the
    sake of readability.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=787452
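
To illustrate the side effect described in the commit message, the following
standalone sketch (not Tracker or SQLite code; the token lists and the query
"the ontology" are invented for the example) models FTS5's implicit AND over
query tokens, showing how a single stop-word token with nothing to match in
the index makes the whole query fail:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Tokens stored for one indexed document; the stop word "the" was
 * already dropped at indexing time. */
static const char *indexed[] = { "tracker", "ontology", NULL };

static bool
indexed_contains (const char *token)
{
        int i;

        for (i = 0; indexed[i]; i++) {
                if (strcmp (indexed[i], token) == 0)
                        return true;
        }

        return false;
}

/* FTS5 combines query terms with an implicit AND, so every query
 * token must be found in the index for the document to match. */
static bool
matches (const char **query_tokens)
{
        int i;

        for (i = 0; query_tokens[i]; i++) {
                if (!indexed_contains (query_tokens[i]))
                        return false;
        }

        return true;
}

int
main (void)
{
        /* Stop word kept at query time: "the" has no indexed token,
         * so the whole query fails although "ontology" is there. */
        const char *kept[] = { "the", "ontology", NULL };
        /* Stop word dropped at query time: only "ontology" is left. */
        const char *dropped[] = { "ontology", NULL };

        printf ("keep stop word: %s\n", matches (kept) ? "match" : "no match");
        printf ("drop stop word: %s\n", matches (dropped) ? "match" : "no match");

        return 0;
}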

 src/libtracker-fts/tracker-fts-tokenizer.c |   18 ++++++++++--------
 1 files changed, 10 insertions(+), 8 deletions(-)
---
diff --git a/src/libtracker-fts/tracker-fts-tokenizer.c b/src/libtracker-fts/tracker-fts-tokenizer.c
index 7b0af2d..a58a8ba 100644
--- a/src/libtracker-fts/tracker-fts-tokenizer.c
+++ b/src/libtracker-fts/tracker-fts-tokenizer.c
@@ -102,24 +102,21 @@ tracker_tokenizer_tokenize (Fts5Tokenizer *fts5_tokenizer,
        TrackerTokenizer *tokenizer = (TrackerTokenizer *) fts5_tokenizer;
        TrackerTokenizerData *data = tokenizer->data;
        const gchar *token;
-       gboolean stop_word, ignore_stop_words = data->ignore_stop_words;
+       gboolean stop_word, is_prefix_query;
        int n_tokens = 0, pos, start, end, len;
        int rc = SQLITE_OK;
 
        if (length <= 0)
                return rc;
 
-       /* When tokenizing the query, we don't want to ignore stop words,
-        * we might ignore otherwise valid matches.
-        */
-       if (flags & FTS5_TOKENIZE_QUERY)
-               ignore_stop_words = FALSE;
+       is_prefix_query = ((flags & (FTS5_TOKENIZE_QUERY | FTS5_TOKENIZE_PREFIX)) ==
+                          (FTS5_TOKENIZE_QUERY | FTS5_TOKENIZE_PREFIX));
 
        tracker_parser_reset (tokenizer->parser, text, length,
                              data->max_word_length,
                              data->enable_stemmer,
                              data->enable_unaccent,
-                             ignore_stop_words,
+                             data->ignore_stop_words,
                              TRUE,
                              data->ignore_numbers);
 
@@ -133,7 +130,12 @@ tracker_tokenizer_tokenize (Fts5Tokenizer *fts5_tokenizer,
                if (!token)
                        break;
 
-               if (stop_word && data->ignore_stop_words)
+               /* When tokenizing prefix query tokens, we don't want to
+                * mistake incomplete words for stop words (e.g. "onto"
+                * while typing "ontology"), so we let them go through
+                * even if the parser marked them as stop words.
+                */
+               if (stop_word && !is_prefix_query)
                        continue;
 
                rc = token_func (ctx, 0, token, len, start, end);

