Re: [Tracker] Question on tracking changes..



Sriram Ramkrishna wrote:
On Thu, Aug 10, 2006 at 09:51:14AM +0100, Jamie McCracken wrote:
Sri Ramkrishna wrote:
So I'm working a bit on a rhythmbox backend using tracker.  I want to
know when tracker has detected a new music file.  Is there some kind of
DBus signal I can listen to for this?  How would I do that?

Not at the moment - I have disabled signals until we have live query support.

OK.  So for now, I'll have to just set it up so that it fills the database
and then worry about what happens afterwards.


Yes, but you can expect live query support in the version after next (i.e. end of the month?)


With live query support you will be able to listen, by filtering on the live_query_id (using DBus match rules), for new hits and also deletes for any particular query (any DBus method that has a live_query_id parameter).
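
For example, a client could add a match rule along these lines with libdbus (rough sketch only - the interface and signal names here are made up, the final API may well differ):

#include <dbus/dbus.h>
#include <stdio.h>

int main(void)
{
    DBusError err;
    DBusConnection *conn;

    dbus_error_init(&err);
    conn = dbus_bus_get(DBUS_BUS_SESSION, &err);
    if (conn == NULL) {
        fprintf(stderr, "connect failed: %s\n", err.message);
        return 1;
    }

    /* Only receive signals whose first argument matches our
       live_query_id ("42" here). Interface name is illustrative. */
    dbus_bus_add_match(conn,
        "type='signal',interface='org.freedesktop.Tracker.Search',arg0='42'",
        &err);
    dbus_connection_flush(conn);

    while (dbus_connection_read_write(conn, -1)) {
        DBusMessage *msg = dbus_connection_pop_message(conn);
        if (msg != NULL) {
            /* inspect hit/delete signals here */
            dbus_message_unref(msg);
        }
    }
    return 0;
}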

Sounds good.

Implementing live query support is not hard but I am doing structural changes at the moment and would like to get these completed before adding new stuff.

Cool, no doubt a good foundation is a requirement before adding new
stories, to use a building analogy. :-)

I am experimenting with replacing mysql's fulltext stuff with a much faster and more scalable hash table using the super fast QDBM (http://qdbm.sourceforge.net/spex.html). As hash table lookups are O(1), searching will stay super fast no matter how much stuff you index, and it also means I can add custom stuff like stemming and custom ranking.
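
To give a feel for it, here is a minimal sketch using QDBM's Depot API (the key/value layout below is just an illustration of an inverted word index, not the actual format Tracker will use):

#include <depot.h>   /* QDBM's Depot API */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DEPOT *index;
    struct { int doc_id; int rank; } hit = { 1001, 50 };
    char *val;
    int size;

    /* Open (or create) the inverted word index. */
    if ((index = dpopen("word-index", DP_OWRITER | DP_OCREAT, -1)) == NULL) {
        fprintf(stderr, "dpopen: %s\n", dperrmsg(dpecode));
        return 1;
    }

    /* Append a (doc ID, rank) pair to the posting list for "penguin";
       DP_DCAT concatenates the value onto any existing record. */
    dpput(index, "penguin", -1, (char *) &hit, sizeof hit, DP_DCAT);

    /* Searching is then a single O(1) lookup returning the whole list. */
    if ((val = dpget(index, "penguin", -1, 0, -1, &size)) != NULL) {
        printf("got %d bytes of hits\n", size);
        free(val);
    }

    dpclose(index);
    return 0;
}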

Gotcha ..
If the above is successful it could also pave the way to replacing mysql with the lighter and faster sqlite (as we are only using mysql for its fulltext support).

I fully approve of that.  But wouldn't you have to use two database
engines? Or can sqlite use QDBM somehow? Sorry, I haven't looked at the link yet.

QDBM is just a hash table index - it's not a database table as in mysql/sqlite.

So you can feed it a word and it returns a list of document IDs and associated rank scores. These results are then fed into a temporary "in-memory" table in mysql or sqlite for further querying/processing using SQL. (The SQL converts the IDs into URIs etc., or for RDF query support would filter the result set.)

The way it works is no different from an index in mysql, so it's not really two full-blown databases as such - more a database plus a hash table index.
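
A sketch of that second stage with sqlite (the "files" table and column names here are invented for the example):

#include <sqlite3.h>
#include <stdio.h>

static void query_stage(sqlite3 *db, const int *ids, const int *ranks, int n)
{
    sqlite3_stmt *stmt;
    int i;

    /* Load the (ID, rank) pairs returned by the QDBM lookup into a
       temporary in-memory table. */
    sqlite3_exec(db, "CREATE TEMP TABLE hits (id INTEGER, rank INTEGER)",
                 NULL, NULL, NULL);
    sqlite3_prepare_v2(db, "INSERT INTO hits VALUES (?, ?)", -1, &stmt, NULL);
    for (i = 0; i < n; i++) {
        sqlite3_bind_int(stmt, 1, ids[i]);
        sqlite3_bind_int(stmt, 2, ranks[i]);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_finalize(stmt);

    /* Further processing is then plain SQL, e.g. turning IDs into URIs. */
    sqlite3_prepare_v2(db,
        "SELECT f.uri, h.rank FROM hits h JOIN files f ON f.id = h.id "
        "ORDER BY h.rank DESC", -1, &stmt, NULL);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        printf("%s (rank %d)\n",
               (const char *) sqlite3_column_text(stmt, 0),
               sqlite3_column_int(stmt, 1));
    sqlite3_finalize(stmt);
}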

Advantages of using QDBM over mysql Fulltext:

1) It's faster and scales a lot better. Search results should take less than 0.1 seconds regardless of how many docs/contents you have indexed. Performance of mysql fulltext nose-dives when the fulltext index cannot fit in available RAM: on a 512 MB machine, searching a 1GB+ index takes approx 5 seconds with mysql. Every major indexer, including Google, AltaVista, Lucene, etc., uses a hash table for indexing (aka an inverted word index), whereas mysql uses a much slower dual btree.

2) We can index text files and source code without copying their contents into the index. Currently mysql needs the text of these files in the database before it can index them, so if you have 1GB of text files/source code then the DB size would be 1GB plus a 200MB fulltext index. With QDBM the total should be in the region of 200MB-300MB - a huge saving in disk space.

3) I can parse content better and stem words (so searching for "penguin" would return all docs containing "penguins" as well as "penguin").

4) I can use custom ranking to score document hits, so ranks can be weighted according to the metadata type that was hit. E.g. if a keyword matches, we can assign it a score of 50 for each occurrence; likewise, if the match occurs in the file's contents, we can assign a score of 1 per occurrence of the word. So to outrank a keyword, a doc would have to contain 51 occurrences of a word in its text contents.
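
In code the weighting is trivial - something like this, with 50 and 1 being example weights rather than final values:

/* Illustrative scoring: 50 per keyword occurrence, 1 per occurrence
   in the file's text contents. */
static int score_hit(int keyword_occurrences, int content_occurrences)
{
    return keyword_occurrences * 50 + content_occurrences;
}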

Disadvantages:

1) Slower indexing. Hash tables are a lot slower to update than btrees due to the file relocations involved when the stored values grow in size. The process of indexing files will therefore be significantly slower.

2) Cannot delete from the hash table, as we don't store the words of all the text files (even if we did, it would be too slow to manually delete every occurrence, and an update would involve a delete followed by an insert). Instead we never delete but use an incrementing ID for updates. To clear dud hits, a sweep must be done periodically to remove them (after every 10,000 updates or so we can sweep incrementally). Duds do not return false hits when searching, as the IDs only count if they match what's in the database.

3) Fragmentation. Hash tables are bigger than btrees and can fragment quite badly after lots of updates, so a periodic resize is needed (after every 10,000 updates or so) to reclaim lost disk space and prevent the file from exploding in size.
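
QDBM's dpoptimize() does that rebuild, so the periodic pass could look something like this (the 10,000-update threshold being the figure mentioned above):

#include <depot.h>

static int updates_since_sweep = 0;

/* Call after each index update; every 10,000 updates, rebuild the
   database to reclaim space lost to relocated and dud records. */
static void maybe_optimize(DEPOT *index)
{
    if (++updates_since_sweep >= 10000) {
        dpoptimize(index, -1);   /* negative keeps the default bucket count */
        updates_since_sweep = 0;
    }
}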

--
Mr Jamie McCracken
http://jamiemcc.livejournal.com/



