[Tracker] Analysis of Sqlite FTS 3

From: Jamie McCracken <jamie mccrack googlemail com>
To: Tracker-List <tracker-list gnome org>
Subject: [Tracker] Analysis of Sqlite FTS 3
Date: Tue, 29 Jul 2008 14:49:18 -0400

Here's a quick analysis of the full text search extension that can be
found in recent versions of sqlite 3.

You can find the source for this extension in the ext/fts3 directory of
the sqlite 3.6.0 tarball:

http://www.sqlite.org/sqlite-3.6.0.tar.gz

This extension is not built into sqlite and exists externally - it's a
dynamically loaded module. It is unlikely to be shipped by distros, which
means we can include it in our source tree.

How to use:
http://www.sqlite.org/cvstrac/wiki?p=FtsOne

FTS version 3 includes prefix searches (e.g. match 'tra*').
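
For anyone who wants to try the query syntax, here is a minimal sketch
using Python's sqlite3 module. The table name, columns and sample rows
are made up for illustration, and it assumes the sqlite library behind
Python was built with FTS3 (if not, the function just returns None).

```python
import sqlite3


def fts3_demo():
    """Run a prefix, a phrase and a field query against a tiny FTS3 table."""
    conn = sqlite3.connect(":memory:")
    try:
        # "docs", "title" and "body" are hypothetical names for this sketch.
        conn.execute("CREATE VIRTUAL TABLE docs USING fts3(title, body)")
    except sqlite3.OperationalError:
        conn.close()
        return None  # this sqlite build has no FTS3 module

    conn.executemany(
        "INSERT INTO docs(title, body) VALUES (?, ?)",
        [
            ("Tracker notes", "tracker indexes files and metadata"),
            ("Kill Bill", "a film about revenge"),
            ("Travel log", "notes from a train trip"),
        ],
    )

    queries = {
        "prefix": "tra*",             # any word starting with "tra"
        "phrase": '"indexes files"',  # exact phrase
        "field": "title:kill",        # restrict the term to one column
    }
    results = {}
    for name, query in queries.items():
        rows = conn.execute(
            "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rowid",
            (query,),
        ).fetchall()
        results[name] = [row[0] for row in rows]
    conn.close()
    return results


if __name__ == "__main__":
    print(fts3_demo())
```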

Advantages:

* fast and scalable

* stores position and byte offset of each word - makes creating snippets
fast (as we have the byte offset in the text). Also provides a NEAR
function and could be used to improve ranking of documents where
multiple terms are close or next to each other

* compact B+tree structure - much more space efficient than our qdbm
hashtable

* query language is the same as the xesam user search language - supports
exact phrases, prefixes and specific field searching (match 'title:kill')

* all access is done via sql - makes it more maintainable and easy to
use in things like xesam query

* is supported by google and is a key part of google gears

* no size limitations

* supports multiple fields

* supports sqlite transactions

* uses variable length integers (varints) in the index to keep positions
and offset storage to a minimum

* significantly reduces the amount of code in tracker
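
To show why varints keep position storage small, here is a rough sketch.
It is a generic little-endian base-128 encoding with delta-encoded
positions, written for illustration only; the function names are made up
and it does not claim to reproduce the FTS3 on-disk format byte for byte.

```python
def encode_varint(n):
    """Encode a non-negative int, 7 bits per byte, low bits first.

    The high bit of each byte is set when more bytes follow.
    """
    if n < 0:
        raise ValueError("varints here are non-negative only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos=0):
    """Decode one varint from buf at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_positions(positions):
    """Store sorted word positions as varint deltas from the previous one."""
    out = bytearray()
    prev = 0
    for p in positions:
        out += encode_varint(p - prev)
        prev = p
    return bytes(out)


def decode_positions(buf):
    """Inverse of encode_positions."""
    positions = []
    pos = 0
    prev = 0
    while pos < len(buf):
        delta, pos = decode_varint(buf, pos)
        prev += delta
        positions.append(prev)
    return positions
```

Because the deltas between neighbouring positions are usually small, most
of them fit in a single byte instead of a fixed 4 or 8.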




Disadvantages:

* built-in parser is not adequate (only supports the porter stemmer and
has no language-specific stop words)

* no ranking or weighting

* stores the full text in a separate contents table, with no compression

* no support for non full text columns - these must be kept in a regular
sqlite table instead and joined
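
The join mentioned in the last point could look roughly like this. The
table and column names (files, files_fts, mime) are hypothetical, and the
sketch again assumes an sqlite build with FTS3 available; it returns None
otherwise.

```python
import sqlite3


def search_with_metadata(term, mime):
    """Full text match in an FTS3 table, filtered on a regular table column."""
    conn = sqlite3.connect(":memory:")
    # Ordinary table for the non full text columns.
    conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, mime TEXT)")
    try:
        conn.execute("CREATE VIRTUAL TABLE files_fts USING fts3(content)")
    except sqlite3.OperationalError:
        conn.close()
        return None  # this sqlite build has no FTS3 module

    conn.executemany(
        "INSERT INTO files(id, mime) VALUES (?, ?)",
        [(1, "text/plain"), (2, "application/pdf"), (3, "text/plain")],
    )
    # docid ties each FTS row to the matching row in the files table.
    conn.executemany(
        "INSERT INTO files_fts(docid, content) VALUES (?, ?)",
        [
            (1, "tracker indexes text files"),
            (2, "tracker manual in pdf"),
            (3, "shopping list"),
        ],
    )

    rows = conn.execute(
        "SELECT f.id, f.mime "
        "FROM files_fts JOIN files AS f ON f.id = files_fts.docid "
        "WHERE files_fts MATCH ? AND f.mime = ? "
        "ORDER BY f.id",
        (term, mime),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(search_with_metadata("tracker", "text/plain"))
```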




All the above disadvantages can be sorted out by forking FTS 3 and
including:

1) tracker-parser.c, to be used instead of the porter stemmer

2) adding a rank to each word per doc. Fields will need a weight against
them so that we can add the rank and store it in the index

3) providing an extern function so that the extension can access the
text of any field or metadata item. That way FTS will not need to store
the text and just do the indexing (it only needs the text to handle
deletions and updates to the index and the included snippet function
also needs it)
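
To make point 2 a bit more concrete, here is a toy sketch of a per-word,
per-doc rank built from field weights. It is plain Python, not FTS code;
the weights, the tokenizer and the names are all made up for illustration
and are not taken from tracker or FTS 3.

```python
import re
from collections import defaultdict

# Hypothetical weights: a hit in the title counts ten times a hit in the body.
FIELD_WEIGHTS = {"title": 10, "body": 1}


def build_index(docs):
    """Map word -> {doc_id: rank}; rank is the sum of the field weights
    over every occurrence of the word in that doc."""
    index = defaultdict(dict)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            weight = FIELD_WEIGHTS.get(field, 1)
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[word][doc_id] = index[word].get(doc_id, 0) + weight
    return index


def search(index, word):
    """Return (doc_id, rank) pairs for one word, best rank first."""
    hits = index.get(word.lower(), {})
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    docs = {
        1: {"title": "Tracker notes", "body": "about the tracker indexer"},
        2: {"title": "Misc", "body": "tracker tracker tracker"},
    }
    print(search(build_index(docs), "tracker"))
```

Storing the summed rank next to each word/doc entry means a query can
order results without reading the documents again.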

I will likely try to integrate this when we implement the xesam db
changes, as that will require a reindex anyhow.

Due to the wide-ranging changes involved I don't see much point in
supporting multiple indexer backends.

jamie
