Re: [Tracker] Question on tracking changes..
- From: Jamie McCracken <jamiemcc blueyonder co uk>
- To: Sriram Ramkrishna <sri aracnet com>
- Cc: Tracker List <tracker-list gnome org>
- Subject: Re: [Tracker] Question on tracking changes..
- Date: Thu, 10 Aug 2006 19:29:35 +0100
Sriram Ramkrishna wrote:
On Thu, Aug 10, 2006 at 09:51:14AM +0100, Jamie McCracken wrote:
Sri Ramkrishna wrote:
So I'm working a bit on a rhythmbox backend using tracker. I want to
know when tracker has detected a new music file. Is there some kind of
DBus signal I can listen to for this? How would I do that?
not at the moment - I have disabled signals until we have live query
support.
OK. So for now, I'll have to just set it up so that it fills the database
and then worry about what happens afterwards.
yes, but you can expect live query support in the version after next
(i.e. end of the month?)
With live query support you will be able to listen, by filtering on the
live_query_id (using dbus match rules), for new hits and also deletes
for any particular query (any dbus method that has a live_query_id
parameter).
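As a rough sketch of what such filtering might look like once this
lands (the interface and member names below are hypothetical, since the
live query API had not shipped yet), a D-Bus match rule keyed on the
query ID could be built like this:

```python
def live_query_match_rule(live_query_id):
    """Build a D-Bus match rule string that filters signals by their
    first argument, assumed here to carry the live_query_id.  The
    interface and member names are made up for illustration."""
    return ("type='signal',"
            "interface='org.freedesktop.Tracker',"
            "member='LiveQueryHit',"
            "arg0='%s'" % live_query_id)

print(live_query_match_rule("42"))
```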
Sounds good.
Implementing live query support is not hard but I am doing structural
changes at the moment and would like to get these completed before
adding new stuff.
Cool; to use a building analogy, a good foundation is no doubt a
requirement before adding new stories. :-)
I am experimenting with replacing mysql's fulltext stuff with a much
faster and more scalable hash table using the super fast QDBM
(http://qdbm.sourceforge.net/spex.html). As hash tables are O(1) for
lookups, it will be super fast no matter how much stuff you index, and
it also means I can add custom stuff like stemming and custom ranking.
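A minimal sketch of the idea in Python, with a plain dict standing in
for QDBM's on-disk hash table (names here are illustrative):

```python
from collections import defaultdict

# Inverted word index: word -> postings list of (doc_id, score).
index = defaultdict(list)

def index_document(doc_id, words):
    # Count occurrences per word, then append one posting per word.
    counts = defaultdict(int)
    for w in words:
        counts[w.lower()] += 1
    for word, n in counts.items():
        index[word].append((doc_id, n))

def lookup(word):
    # A single O(1) hash probe, independent of how many docs are indexed.
    return index.get(word.lower(), [])

index_document(1, ["penguin", "linux", "penguin"])
index_document(2, ["linux", "kernel"])
print(lookup("penguin"))   # -> [(1, 2)]
```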
Gotcha ..
If above is successful it could also pave the way to replace mysql with
the lighter and faster sqlite (as we are only using mysql for its full
text support)
I fully approve of that. But wouldn't you have to use two database
engines? Or can sqlite use QDBM somehow? Sorry, I haven't looked at the
link as of yet.
QDBM is just a hash table index - it's not a database table as in
mysql/sqlite.
So you can feed it a word and it returns a list of document IDs and
associated rank scores. These results are then fed into a temporary
"in-memory" table in mysql or sqlite for further querying/processing
using SQL. (The SQL converts the IDs into URIs etc., or for RDF query
support would filter the result set.)
The way it works is no different from an index in mysql, so it's not
really two full blown databases as such - more a database and a hash
table index.
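The flow above can be sketched with Python's sqlite3 module (the hit
list, table names, and URIs are made up for illustration):

```python
import sqlite3

# Hits as they might come back from the QDBM index: (doc_id, rank).
hits = [(1, 50), (2, 3), (3, 12)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, uri TEXT)")
db.executemany("INSERT INTO files VALUES (?, ?)",
               [(1, "file:///a.txt"), (2, "file:///b.txt"),
                (3, "file:///c.txt")])

# Temporary in-memory table holding the raw hits ...
db.execute("CREATE TEMP TABLE hits (id INTEGER, rank INTEGER)")
db.executemany("INSERT INTO hits VALUES (?, ?)", hits)

# ... which SQL then joins against to turn IDs into URIs, ordered by
# rank (further filtering for RDF query support would slot in here).
rows = db.execute("""SELECT f.uri, h.rank FROM hits h
                     JOIN files f ON f.id = h.id
                     ORDER BY h.rank DESC""").fetchall()
print(rows)
```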
Advantages of using QDBM over mysql fulltext:
1) It's faster and scales a lot better. Search results should take less
than 0.1 seconds regardless of how many docs/contents you have indexed.
Performance of mysql fulltext nose-dives when the full text index
cannot fit in available RAM. On a 512 MB machine, searching a 1GB+
index takes approx 5 seconds with mysql. Every single indexer including
Google, AltaVista, Lucene, etc. uses a hash table for indexing (aka an
inverted word index), whereas mysql uses a much slower dual btree.
2) We can index text files and source code without copying their
contents into the index. Currently mysql needs the text of these files
in the database before it can index them, so if you have 1GB of text
files/source code then the DB size would be 1GB + a 200MB fulltext
index. With QDBM it should be in the region of 200MB-300MB - a huge
saving in disk space.
3) I can parse stuff better and stem words (so searching for "penguin"
would return all docs containing "penguins" as well as "penguin".)
4) I can use custom ranking to score document hits, so ranks can be
weighted according to the metadata type that was hit. E.g. if it's a
keyword that matches, we can assign it a score of 50 for each
occurrence. Likewise, if the match occurs in the file's content we can
assign it a score of 1 per occurrence of the word, so to outrank a
keyword a doc would have to contain 51 occurrences of a word in its
text contents.
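Point 3's stemming can be illustrated with a toy suffix-stripper (a
real implementation would use a proper algorithm such as Porter's; this
is only a sketch):

```python
def crude_stem(word):
    # Toy suffix-stripping stemmer, just to show the idea: map
    # inflected forms onto one index key.
    for suffix in ("ings", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Both forms map to the same key, so a search for "penguin" also
# finds documents containing "penguins".
print(crude_stem("penguins"), crude_stem("penguin"))
```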
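Point 4's weighting is simple arithmetic; a sketch with the weights
described above (field names are illustrative):

```python
# Per-occurrence weights by metadata type: a keyword hit counts 50,
# a plain content hit counts 1.
WEIGHTS = {"keyword": 50, "content": 1}

def score(hits):
    # hits: list of (field, occurrences) pairs for one document.
    return sum(WEIGHTS[field] * n for field, n in hits)

# One keyword match outranks 50 content occurrences ...
print(score([("keyword", 1)]))      # 50
# ... so a content-only doc needs 51 occurrences to beat it.
print(score([("content", 51)]))     # 51
```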
Disadvantages:
1) Slower indexing. Hash tables are a lot slower to update than btrees
due to the file relocations involved when the stored values grow in
size. The process of indexing files will therefore be significantly slower.
2) Cannot delete from the hash table, as we don't store the words of
all the text files (even if we did, it would be too slow to manually
delete every occurrence, and an update would involve a delete followed
by an insert). Instead we never delete but use an incrementing ID for
updates. To clear dud hits, a sweep must be done periodically to remove
them (after every 10,000 updates or so we can sweep incrementally).
Duds do not return false hits when searching, as the IDs only count if
they match what's in the database.
3) Fragmentation. Hash tables are bigger than btrees but can fragment
quite badly after loads of updates so a periodic resize is needed (after
every 10,000 updates or so) to reclaim lost disk space and prevent the
file from exploding in size.
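Point 2's append-only update scheme might look like this in miniature
(names and structures here are illustrative, not Tracker's actual
code):

```python
# A changed document is re-indexed under a fresh ID; old postings are
# left behind as "duds" and filtered out at query time by checking
# IDs against the set currently live in the database.
index = {"penguin": [(1, 2)]}       # word -> [(doc_id, score)]
live_ids = {1}                      # IDs the database knows about
next_id = 2

def update_document(old_id, words):
    global next_id
    new_id = next_id
    next_id += 1
    live_ids.discard(old_id)        # old ID becomes a dud
    live_ids.add(new_id)
    for w in set(words):
        index.setdefault(w, []).append((new_id, words.count(w)))
    return new_id

def search(word):
    # Duds never surface: only IDs still in the database count.
    return [(i, s) for i, s in index.get(word, []) if i in live_ids]

update_document(1, ["penguin"])
print(search("penguin"))            # -> [(2, 1)]
```

A periodic sweep would then physically drop the dud postings and
compact the file, as described above.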
--
Mr Jamie McCracken
http://jamiemcc.livejournal.com/