Re: [Tracker] Question on tracking changes..



Sriram Ramkrishna wrote:
On Thu, Aug 10, 2006 at 09:51:14AM +0100, Jamie McCracken wrote:
Sri Ramkrishna wrote:
So I'm working a bit on a rhythmbox backend using tracker.  I want to
know when tracker has detected a new music file.  Is there some kind of
DBus signal I can listen to for this?  How would I do that?

Not at the moment - I have disabled signals until we have live query support.

OK.  So for now, I'll have to just set it up so that it fills the database
and then worry about what happens afterwards.


Yes, but you can expect live query support in the version after next (i.e. end of the month?)


With live query support you will be able to listen, by filtering on the live_query_id (using DBus match rules), for new hits and also deletes for any particular query (any DBus method that has a live_query_id parameter).
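
For example, a client could add a match rule along these lines with libdbus (rough sketch only - the interface and signal names here are made up, the final API may well differ):

#include <dbus/dbus.h>
#include <stdio.h>

int main(void)
{
    DBusError err;
    DBusConnection *conn;

    dbus_error_init(&err);
    conn = dbus_bus_get(DBUS_BUS_SESSION, &err);
    if (conn == NULL) {
        fprintf(stderr, "connect failed: %s\n", err.message);
        return 1;
    }

    /* Only receive signals whose first argument matches our
       live_query_id ("42" here). Interface name is illustrative. */
    dbus_bus_add_match(conn,
        "type='signal',interface='org.freedesktop.Tracker.Search',arg0='42'",
        &err);
    dbus_connection_flush(conn);

    while (dbus_connection_read_write(conn, -1)) {
        DBusMessage *msg = dbus_connection_pop_message(conn);
        if (msg != NULL) {
            /* inspect hit/delete signals here */
            dbus_message_unref(msg);
        }
    }
    return 0;
}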

Sounds good.

Implementing live query support is not hard but I am doing structural changes at the moment and would like to get these completed before adding new stuff.

Cool, no doubt a good foundation is a requirement before adding new
stories, to use a building analogy. :-)

I am experimenting with replacing mysql's fulltext stuff with a much faster and more scalable hash table using the super fast QDBM (http://qdbm.sourceforge.net/spex.html). As hash table lookups are O(1), searching will stay super fast no matter how much stuff you index, and it also means I can add custom stuff like stemming and custom ranking.
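
To give a feel for it, here is a minimal sketch using QDBM's Depot API (the key/value layout below is just an illustration of an inverted word index, not the actual format Tracker will use):

#include <depot.h>   /* QDBM's Depot API */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DEPOT *index;
    struct { int doc_id; int rank; } hit = { 1001, 50 };
    char *val;
    int size;

    /* Open (or create) the inverted word index. */
    if ((index = dpopen("word-index", DP_OWRITER | DP_OCREAT, -1)) == NULL) {
        fprintf(stderr, "dpopen: %s\n", dperrmsg(dpecode));
        return 1;
    }

    /* Append a (doc ID, rank) pair to the posting list for "penguin";
       DP_DCAT concatenates the value onto any existing record. */
    dpput(index, "penguin", -1, (char *) &hit, sizeof hit, DP_DCAT);

    /* Searching is then a single O(1) lookup returning the whole list. */
    if ((val = dpget(index, "penguin", -1, 0, -1, &size)) != NULL) {
        printf("got %d bytes of hits\n", size);
        free(val);
    }

    dpclose(index);
    return 0;
}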

Gotcha ..
If the above is successful it could also pave the way to replacing mysql with the lighter and faster sqlite (as we are only using mysql for its fulltext support).

I fully approve of that.  But wouldn't you have to use two database
engines? Or can sqlite use QDBM somehow? Sorry, I haven't looked at the link yet.

QDBM is just a hash table index - it's not a database table as in mysql/sqlite.

So you can feed it a word and it returns a list of document IDs and associated rank scores. These results are then fed into a temporary "in-memory" table in mysql or sqlite for further querying/processing using SQL. (The SQL converts the IDs into URIs etc., or for RDF query support would filter the result set.)

The way it works is no different from an index in mysql, so it's not really two full-blown databases as such - more a database plus a hash table index.
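
A sketch of that second stage with sqlite (the "files" table and column names here are invented for the example):

#include <sqlite3.h>
#include <stdio.h>

static void query_stage(sqlite3 *db, const int *ids, const int *ranks, int n)
{
    sqlite3_stmt *stmt;
    int i;

    /* Load the (ID, rank) pairs returned by the QDBM lookup into a
       temporary in-memory table. */
    sqlite3_exec(db, "CREATE TEMP TABLE hits (id INTEGER, rank INTEGER)",
                 NULL, NULL, NULL);
    sqlite3_prepare_v2(db, "INSERT INTO hits VALUES (?, ?)", -1, &stmt, NULL);
    for (i = 0; i < n; i++) {
        sqlite3_bind_int(stmt, 1, ids[i]);
        sqlite3_bind_int(stmt, 2, ranks[i]);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_finalize(stmt);

    /* Further processing is then plain SQL, e.g. turning IDs into URIs. */
    sqlite3_prepare_v2(db,
        "SELECT f.uri, h.rank FROM hits h JOIN files f ON f.id = h.id "
        "ORDER BY h.rank DESC", -1, &stmt, NULL);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        printf("%s (rank %d)\n",
               (const char *) sqlite3_column_text(stmt, 0),
               sqlite3_column_int(stmt, 1));
    sqlite3_finalize(stmt);
}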

Advantages of using QDBM over mysql Fulltext:

1) It's faster and scales a lot better. Search results should take less than 0.1 seconds regardless of how many docs/contents you have indexed. Performance of mysql fulltext nose-dives when the fulltext index cannot fit in available RAM: on a 512 MB machine, searching a 1GB+ index takes approx 5 seconds with mysql. Every major indexer, including Google, AltaVista, Lucene, etc., uses a hash table for indexing (aka an inverted word index), whereas mysql uses a much slower dual btree.

2) We can index text files and source code without copying their contents into the index. Currently mysql needs the text of these files in the database before it can index them, so if you have 1GB of text files/source code then the DB size would be 1GB plus a 200MB fulltext index. With QDBM the total should be in the region of 200MB-300MB - a huge saving in disk space.

3) I can parse content better and stem words (so searching for "penguin" would return all docs containing "penguins" as well as "penguin").

4) I can use custom ranking to score document hits, so ranks can be weighted according to the metadata type that was hit. E.g. if a keyword matches, we can assign it a score of 50 for each occurrence; likewise, if the match occurs in the file's contents, we can assign a score of 1 per occurrence of the word. So to outrank a keyword, a doc would have to contain 51 occurrences of a word in its text contents.
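
In code the weighting is trivial - something like this, with 50 and 1 being example weights rather than final values:

/* Illustrative scoring: 50 per keyword occurrence, 1 per occurrence
   in the file's text contents. */
static int score_hit(int keyword_occurrences, int content_occurrences)
{
    return keyword_occurrences * 50 + content_occurrences;
}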

Disadvantages:

1) Slower indexing. Hash tables are a lot slower to update than btrees due to the file relocations involved when the stored values grow in size. The process of indexing files will therefore be significantly slower.

2) Cannot delete from the hash table, as we don't store the words of all the text files (even if we did, it would be too slow to manually delete every occurrence, and an update would involve a delete followed by an insert). Instead we never delete but use an incrementing ID for updates. To clear dud hits, a sweep must be done periodically to remove them (after every 10,000 updates or so we can sweep incrementally). Duds do not return false hits when searching, as the IDs only count if they match what's in the database.

3) Fragmentation. Hash tables are bigger than btrees and can fragment quite badly after lots of updates, so a periodic resize is needed (after every 10,000 updates or so) to reclaim lost disk space and prevent the file from exploding in size.
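
QDBM's dpoptimize() does that rebuild, so the periodic pass could look something like this (the 10,000-update threshold being the figure mentioned above):

#include <depot.h>

static int updates_since_sweep = 0;

/* Call after each index update; every 10,000 updates, rebuild the
   database to reclaim space lost to relocated and dud records. */
static void maybe_optimize(DEPOT *index)
{
    if (++updates_since_sweep >= 10000) {
        dpoptimize(index, -1);   /* negative keeps the default bucket count */
        updates_since_sweep = 0;
    }
}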

--
Mr Jamie McCracken
http://jamiemcc.livejournal.com/



