Re: [Tracker] FTS4 branch review
- From: Martyn Russell <martyn lanedo com>
- To: tracker-list <tracker-list gnome org>
- Cc: Jürg Billeter <juerg billeter codethink co uk>
- Date: Thu, 14 Feb 2013 19:29:12 +0000
On 05/02/13 17:07, Martyn Russell wrote:
Hello all,
So Carlos recently finished the fts4 branch for review. For those who
don't know what this is, there is a nice blog post from Carlos here:
http://blogs.gnome.org/carlosg/2013/01/28/snippets-in-trackers-full-text-search-results/
One of the things that we've done with the FTS4 branch is to remove
the tracker:fulltextNoLimit property. To compensate, we now
index ALL words, not just those with a length >= min-word-length
(a configuration option which defaults to 3 characters).
We've done this because we now use the upstream fts4 module, and at the
level where the data is tokenised we can't check the property
configuration to know whether we should be indexing ALL words or just
words of a certain length.
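To give a rough idea of what dropping the length check means for the index, here is a toy sketch in Python. It is NOT Tracker's tokeniser or the one in the upstream fts4 module. The whitespace split and the tokenise() helper are assumptions made purely for illustration, with min_word_length standing in for the min-word-length option.

```python
def tokenise(text, min_word_length=None):
    """Toy whitespace tokeniser (hypothetical helper, not Tracker's code).

    With min_word_length set, shorter words are dropped, which mimics the
    old behaviour. With None, every word is kept, as on the FTS4 branch.
    """
    words = text.lower().split()
    if min_word_length is None:
        return words
    return [w for w in words if len(w) >= min_word_length]


sample = "Go to the FTS4 branch"

old_style = tokenise(sample, min_word_length=3)  # length check applied
new_style = tokenise(sample)                     # everything indexed

print(old_style)
print(new_style)
```

The only point here is that the second list is a superset of the first, so short words like "go" and "to" become searchable at the cost of more index entries.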
I've recently done some quick analysis of FTS4 vs 0.14.5 to make sure
that we're not causing any serious performance or query regressions with
this, and here is what I have found...
So the data set:
Tags 30
Contacts 329
Audios 8243
Documents 73
Files 10188
Folders 956
Images 669
Applications 285
Videos 499
Albums 1139
Music Tracks 7744
Photos 433
The tracker-stats output for those interested in the details is below,
but NOTE, the stats might be slightly different due to extraction
failures I noticed and mention later on. These are the stats for the
FTS4 work.
$ tracker-stats
Statistics:
mfo:Action = 1
mlo:LandmarkCategory = 15
mto:State = 6
mto:TransferMethod = 2
mtp:ScanType = 6
nao:Tag = 30
nco:AuthorizationStatus = 3
nco:Contact = 329
nco:Gender = 3
nco:IMCapability = 8
nco:PersonContact = 1
nco:PresenceStatus = 9
nco:Role = 2354
nfo:Audio = 8294
nfo:DataContainer = 1120
nfo:Document = 73
nfo:Equipment = 3
nfo:Executable = 285
nfo:FileDataObject = 10188
nfo:Folder = 956
nfo:Image = 667
nfo:Media = 8961
nfo:MediaList = 1134
nfo:Orientation = 8
nfo:PaginatedTextDocument = 37
nfo:PlainTextDocument = 36
nfo:RegionOfInterestContent = 5
nfo:Software = 285
nfo:SoftwareApplication = 285
nfo:SoftwareCategory = 164
nfo:TextDocument = 73
nfo:Video = 499
nfo:Visual = 1166
nie:DataObject = 10188
nie:DataSource = 4
nie:InformationElement = 15295
nmm:Artist = 2025
nmm:Flash = 2
nmm:MeteringMode = 7
nmm:MusicAlbum = 1134
nmm:MusicAlbumDisc = 1265
nmm:MusicPiece = 7795
nmm:Photo = 431
nmm:RadioModulation = 2
nmm:Video = 499
nmm:WhiteBalance = 2
nmo:DeliveryStatus = 5
nmo:PhoneMessageFolder = 5
nmo:ReportReadStatus = 3
nrl:InverseFunctionalProperty = 3
rdf:Property = 629
rdfs:Class = 233
rdfs:Resource = 16321
scal:AccessLevel = 3
scal:AttendanceStatus = 7
scal:AttendeeRole = 4
scal:CalendarUserType = 5
scal:EventStatus = 3
scal:JournalStatus = 4
scal:RSVPValues = 2
scal:TodoStatus = 4
scal:TransparencyValues = 2
slo:LandmarkCategory = 15
tracker:Namespace = 23
tracker:Ontology = 20
tracker:Volume = 3
The tests I did include:
a) Testing tracker-search with "foo", "love" and "martyn" to make sure
we get the same results with FTS queries.
b) Comparing the DB sizes to make sure we're not inflating the
database with the new FTS changes.
c) Comparing indexing time.
--
Test A (FTS4)
=============
$ tracker-search foo
Results:
file:///home/martyn/Documents/Important/%23foo.gpg%23
file:///home/martyn/Documents/tracker-tests-fts4
file:///home/martyn/Remotes/GrapeVine/Music/Santana/Shaman/Disc%201%20-%206%20-%20Foo%20Foo.mp3
$ tracker-search love|wc -l
492
$ tracker-search martyn|wc -l
32
Test A (0.14.5)
===============
EXACTLY the same.
Test B (FTS4)
=============
$ ls -lh ~/.local/share/tracker/data/ ~/.cache/tracker/
/home/martyn/.cache/tracker/:
total 27M
-rw-rw-r-- 1 martyn martyn 11 Feb 14 18:23 db-locale.txt
-rw-rw-r-- 1 martyn martyn 2 Feb 14 18:23 db-version.txt
-rw-rw-r-- 1 martyn martyn 6 Feb 14 18:33 first-index.txt
-rw-rw-r-- 1 martyn martyn 10 Feb 14 18:33 last-crawl.txt
-rw-r--r-- 1 martyn martyn 25M Feb 14 18:33 meta.db
-rw-r--r-- 1 martyn martyn 32K Feb 14 18:34 meta.db-shm
-rw-r--r-- 1 martyn martyn 1.5M Feb 14 18:34 meta.db-wal
-rw-rw-r-- 1 martyn martyn 11 Dec 24 10:24 miner-applications-locale.txt
-rw-rw-r-- 1 martyn martyn 344K Feb 14 18:23 ontologies.gvdb
/home/martyn/.local/share/tracker/data/:
total 16M
-rw-rw---- 1 martyn martyn 9.6M Feb 14 18:34 tracker-store.journal
-rw-rw---- 1 martyn martyn 5.6M Feb 14 18:23 tracker-store.ontology.journal
Test B (0.14.5)
===============
$ ls -lh ~/.local/share/tracker/data/ ~/.cache/tracker/
/home/martyn/.cache/tracker/:
total 34M
-rw-rw-r-- 1 martyn martyn 11 Feb 14 18:40 db-locale.txt
-rw-rw-r-- 1 martyn martyn 2 Feb 14 18:40 db-version.txt
-rw-rw-r-- 1 martyn martyn 6 Feb 14 18:49 first-index.txt
-rw-rw-r-- 1 martyn martyn 10 Feb 14 18:49 last-crawl.txt
-rw-r--r-- 1 martyn martyn 24M Feb 14 18:49 meta.db
-rw-r--r-- 1 martyn martyn 96K Feb 14 18:53 meta.db-shm
-rw-r--r-- 1 martyn martyn 9.8M Feb 14 18:53 meta.db-wal
-rw-rw-r-- 1 martyn martyn 11 Dec 24 10:24 miner-applications-locale.txt
-rw-rw-r-- 1 martyn martyn 344K Feb 14 18:40 ontologies.gvdb
/home/martyn/.local/share/tracker/data/:
total 16M
-rw-rw---- 1 martyn martyn 9.6M Feb 14 18:53 tracker-store.journal
-rw-rw---- 1 martyn martyn 5.7M Feb 14 18:40 tracker-store.ontology.journal
Test C (FTS4)
=============
Tracker-INFO: --------------------------------------------------
Tracker-INFO: Total directories : 1061 (107 ignored)
Tracker-INFO: Total files : 8997 (148 ignored)
Tracker-INFO: Total processed : 9804 (9804 notified, 0 with error)
Tracker-INFO: --------------------------------------------------
Tracker-INFO: Idle
Tracker-INFO: Finished mining in seconds:569.902854, total directories:1061, total files:8997
Test C (0.14.5)
===============
Tracker-INFO: --------------------------------------------------
Tracker-INFO: Total directories : 1061 (107 ignored)
Tracker-INFO: Total files : 8997 (148 ignored)
Tracker-INFO: Total processed : 9803 (9803 notified, 58 with error)
Tracker-INFO: --------------------------------------------------
Tracker-INFO: Idle
Tracker-INFO: Finished mining in seconds:538.760078, total directories:1061, total files:8997
Conclusions:
============
For Test A, we can see nothing has changed with our simple tests. So the
data set seems intact for FTS searches.
For Test B, the on-disk footprint for Tracker with FTS4 is smaller overall
(27M vs 34M in ~/.cache/tracker), although meta.db itself is about the
same size (25M vs 24M) and most of the difference is in meta.db-wal
(1.5M vs 9.8M). So while we might be indexing more words (i.e. those
shorter than 3 characters), we're not inflating the database. One reason
could be that we were previously duplicating data (Carlos can confirm
this) and now we're storing it only once. Either way, a smaller
footprint is always preferred if we can have it.
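When comparing the two listings it helps to look at meta.db separately from its -wal and -shm companions, because the WAL file grows and shrinks with SQLite checkpoints and can skew the directory total. A minimal sketch of that breakdown follows. The helper names sizes_by_name() and summarise() are made up for this example and are not part of Tracker.

```python
from pathlib import Path


def sizes_by_name(directory):
    """Map file name -> size in bytes for regular files directly in directory."""
    root = Path(directory).expanduser()
    if not root.is_dir():
        return {}  # nothing to report if the directory does not exist
    return {p.name: p.stat().st_size for p in root.iterdir() if p.is_file()}


def summarise(directory, db_name="meta.db"):
    """Split the directory total into the main DB file and its WAL/SHM files."""
    sizes = sizes_by_name(directory)
    wal_shm = sum(sizes.get(db_name + suffix, 0) for suffix in ("-wal", "-shm"))
    return {
        "total": sum(sizes.values()),
        "db": sizes.get(db_name, 0),
        "wal_shm": wal_shm,
    }


# Path taken from the listings above; prints zeros if it is not present.
print(summarise("~/.cache/tracker"))
```

Running this once per Tracker version, after indexing has gone idle, gives numbers that are easier to compare than the "total" line from ls.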
For Test C, this might not be an accurate portrayal of the situation.
First, you may notice we had errors with 0.14.5, which means 58 items
were not indexed. That will definitely affect the time to finish
indexing. Second, ALL the music (which accounts for the majority of the
data indexed here) was being indexed over an encfs mounted directory on a
server (with a GB connection) on my local network. I was also playing
music from the same server at the time, and that will affect the
bandwidth available too. So I am not convinced the speed test was
entirely fair. However, if you work out an approximation for time per
item processed (mining time divided by total processed), it's ca. 0.058
secs (FTS4) vs 0.055 secs (0.14.5). There isn't much in it. So
performance-wise, I don't think we're noticeably worse than we were.
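For anyone who wants to check the per-item figures, this is the arithmetic as a short Python sketch. The numbers are copied from the "Finished mining" and "Total processed" lines in the Test C logs; the variable names are just for this example.

```python
# (mining time in seconds, total items processed), taken from the Test C logs
runs = {
    "FTS4": (569.902854, 9804),
    "0.14.5": (538.760078, 9803),
}

# Approximate time per processed item for each run
per_item = {name: secs / items for name, (secs, items) in runs.items()}

for name, value in per_item.items():
    print(f"{name}: {value:.3f} s/item")

# Relative cost of FTS4 compared with 0.14.5 (a value above 1.0 means slower)
ratio = per_item["FTS4"] / per_item["0.14.5"]
print(f"FTS4 / 0.14.5 = {ratio:.3f}")
```

The ratio comes out at roughly 6% more time per item for FTS4, which is well within what the network and playback noise described above could account for.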
--
If anyone has any comments, they're welcome. I plan to release 0.15.2
tomorrow with the FTS4 work and if there are no complaints, we may
release a 0.16.0 in time for the GNOME 3.8 release.
--
Regards,
Martyn
Founder and CEO of Lanedo GmbH.