[tracker/wip/sam/meson] fts: Do not apply stop-words when tokenizing query search terms

From: Sam Thursfield <sthursfield src gnome org>
To: commits-list gnome org
Cc:
Subject: [tracker/wip/sam/meson] fts: Do not apply stop-words when tokenizing query search terms
Date: Sun, 25 Sep 2016 13:15:26 +0000 (UTC)

commit c52bf1b60bbe9829b13a1223205888e53c19b723
Author: Carlos Garnacho <carlosg gnome org>
Date:   Tue Jun 7 00:49:15 2016 +0200

    fts: Do not apply stop-words when tokenizing query search terms
    
    FTS5 passes the purpose of the tokenization to the tokenizer's
    xTokenize vfunc. Check for the FTS5_TOKENIZE_QUERY flag, which
    indicates that the search terms of a query are being tokenized, and
    do not apply the stop words list in that case.
    
    One example where applying stop words is potentially harmful is
    "search as you type" UIs. E.g. on the way to typing the word
    "ontology" you would type "onto", which is an ignored word. Only
    after typing the next character would you get matches, which is
    inconsistent behavior.

 src/libtracker-fts/tracker-fts-tokenizer.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)
---
diff --git a/src/libtracker-fts/tracker-fts-tokenizer.c b/src/libtracker-fts/tracker-fts-tokenizer.c
index 26764aa..e055029 100644
--- a/src/libtracker-fts/tracker-fts-tokenizer.c
+++ b/src/libtracker-fts/tracker-fts-tokenizer.c
@@ -95,18 +95,24 @@ tracker_tokenizer_tokenize (Fts5Tokenizer *fts5_tokenizer,
        TrackerTokenizer *tokenizer = (TrackerTokenizer *) fts5_tokenizer;
        TrackerTokenizerData *data = tokenizer->data;
        const gchar *token;
-       gboolean stop_word;
+       gboolean stop_word, ignore_stop_words = data->ignore_stop_words;
        int n_tokens = 0, pos, start, end, len;
        int rc = SQLITE_OK;
 
        if (length <= 0)
                return rc;
 
+       /* When tokenizing the query, we don't want to ignore stop words,
+        * we might ignore otherwise valid matches.
+        */
+       if (flags & FTS5_TOKENIZE_QUERY)
+               ignore_stop_words = FALSE;
+
        tracker_parser_reset (tokenizer->parser, text, length,
                              data->max_word_length,
                              data->enable_stemmer,
                              data->enable_unaccent,
-                             data->ignore_stop_words,
+                             ignore_stop_words,
                              TRUE,
                              data->ignore_numbers);
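
The effect of the flag check can be illustrated outside Tracker with a minimal, self-contained sketch. Everything in it is hypothetical and is not Tracker or SQLite code: SKETCH_TOKENIZE_QUERY stands in for FTS5_TOKENIZE_QUERY, the three-word stop-word list is a toy, and sketch_count_tokens() splits on whitespace only, whereas the real tokenizer delegates to TrackerParser. It only mirrors the decision made in the patch: the configured ignore_stop_words setting is overridden when the query flag is set.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for FTS5_TOKENIZE_QUERY (assumption: not the real header). */
#define SKETCH_TOKENIZE_QUERY 0x0001

/* Toy stop-word list, for illustration only. */
static const char *const sketch_stop_words[] = { "onto", "the", "and" };

/* Returns true if the len-byte word is in the toy list (case-sensitive). */
bool
sketch_is_stop_word (const char *word, size_t len)
{
        size_t i, n = sizeof sketch_stop_words / sizeof sketch_stop_words[0];

        for (i = 0; i < n; i++) {
                if (strlen (sketch_stop_words[i]) == len &&
                    strncmp (sketch_stop_words[i], word, len) == 0)
                        return true;
        }

        return false;
}

/* Counts the whitespace-separated tokens that survive stop-word filtering. */
int
sketch_count_tokens (const char *text, int flags, bool ignore_stop_words)
{
        const char *p, *start;
        int n_tokens = 0;

        assert (text != NULL);

        /* Same decision as the patch: never drop stop words from query terms. */
        if (flags & SKETCH_TOKENIZE_QUERY)
                ignore_stop_words = false;

        p = text;
        while (*p != '\0') {
                /* Skip separators, then scan one word. */
                while (*p != '\0' && isspace ((unsigned char) *p))
                        p++;
                start = p;
                while (*p != '\0' && !isspace ((unsigned char) *p))
                        p++;

                if (p == start)
                        break; /* only trailing whitespace was left */

                if (ignore_stop_words &&
                    sketch_is_stop_word (start, (size_t) (p - start)))
                        continue; /* filtered out, emits no token */

                n_tokens++;
        }

        return n_tokens;
}
```

With stop words enabled, sketch_count_tokens ("onto", 0, true) yields 0 tokens, so a query built from that text would have nothing to match. Passing SKETCH_TOKENIZE_QUERY for the same text yields 1 token, which is the behavior the patch restores for query search terms while leaving document indexing unchanged.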
