Re: [Tracker] Running tracker on an Ubuntu server box?



On Tue, 2009-04-21 at 14:20 +0200, tigerf wrote:

I should mention that for Tracker 0.6.x, there is a hard limitation on
the amount of data you can store for full text searching. We currently
use QDBM for the index and it has a 2 GB file size limit.

High time to overcome this limitation, in times where 2 TB drives
cost less than 250 Euros.

Full agreement. This week we have been working on integrating SQLite FTS
with Tracker's master repository.

We don't hope for FTS to be faster, but we do hope for it to scale
better and allow larger datasets. We also expect less corruption.



Another reason, a more technical one that is mostly relevant for our
implementation, is that it doesn't require us to pause and flush one
process and then reopen in the other process. FTS is multi-process safe.

QDBM is supposed to be multi-process safe, but adaptations made earlier
to improve its performance removed this capability from our internally
used QDBM. This means that we have to flush it in the indexer before we
can use it in the daemon. That creates round trips between daemon and
indexer, and on top of that we lose the time spent flushing during
interactive usage (which is what matters most to the end-user).

Add these reasons together and that's why we are moving towards FTS.

This means that once you get to a sizeable index, it will not index any
further. This has recently led to the "Can not index word" error we
have been seeing in bug reports.

In the 0.7 branch, which we are working on in parallel, we are using
SQLite instead of QDBM. This should extend the possibilities here, not
to mention add partial match searching (i.e. "foo*" finds "foobar"),
which is another missing feature.
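
To make that concrete, here is a minimal sketch of what prefix matching
looks like at the plain SQLite FTS3 level (assuming an SQLite built with
FTS3). The database file, table and column names are made up for
illustration only; this is not Tracker's actual schema.

#include <stdio.h>
#include <sqlite3.h>

int
main (void)
{
        sqlite3 *db;
        sqlite3_stmt *stmt;

        sqlite3_open ("fts-demo.db", &db);

        /* An FTS3 virtual table holding the extracted text per document */
        sqlite3_exec (db, "CREATE VIRTUAL TABLE docs USING fts3 (content)",
                      NULL, NULL, NULL);
        sqlite3_exec (db, "INSERT INTO docs (content) VALUES ('foobar baz')",
                      NULL, NULL, NULL);

        /* Prefix match: 'foo*' finds the row containing 'foobar' */
        sqlite3_prepare_v2 (db,
                            "SELECT rowid FROM docs WHERE content MATCH 'foo*'",
                            -1, &stmt, NULL);
        while (sqlite3_step (stmt) == SQLITE_ROW) {
                printf ("matched rowid %lld\n",
                        (long long) sqlite3_column_int64 (stmt, 0));
        }

        sqlite3_finalize (stmt);
        sqlite3_close (db);

        return 0;
}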

Unfortunately, I couldn't say whether 60k files would actually reach
the limit, because it really depends on how many words are in those
files. My estimate is that you wouldn't be far off the limit with that
many files, though. Perhaps others who have hit this QDBM error can
comment on how many files they have, to give a rough estimate here.

Thanks for mentioning this. It doesn't hurt too much, because for the
moment it's not more than 13,000 files, but the solution needs headroom
for much more over the server's lifetime of 5-20 years.

I like the SQLite idea, because PHP offers a proven interface to SQLite,
and SQL is widely known nowadays. Is it thinkable that I query the
database via PHP in a read-only manner while the tracker daemon is
updating it "from the other side"?

No, this is unthinkable. The reasons are:

 - We have a decomposed schema; this isn't at all what you expect if you
   are used to normalized database schemas. Your SQL queries would be
   hideously difficult.

 - We have longstanding transactions. This means that your process will
   very often see its sqlite3_step() yield SQLITE_BUSY. In fact, it'll
   yield that result the majority of the time. This means that your
   webserver (if you call the SQLite API in-process with Apache) will be
   constantly waiting for us to release the transaction, and we hold
   transactions for as long as possible (a small sketch of what that
   looks like follows below).

 - We have internal caching, too. Direct access to the database will
   often simply yield incorrect and inconsistent results.

The reason why we have longstanding transactions is that this
aggressively improves SQLite's INSERT performance. Put differently, if
we didn't do this, then SQLite would be aggressively slow.
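
To illustrate the SQLITE_BUSY point, here is a minimal sketch of what a
second, read-only process is up against, using the plain sqlite3 C API.
The file name and table name are made up; the point is only the busy
handling, not Tracker's real schema.

#include <stdio.h>
#include <sqlite3.h>

int
main (void)
{
        sqlite3 *db;
        sqlite3_stmt *stmt;
        int rc;

        /* Database file and table names are made up for this sketch */
        sqlite3_open ("meta.db", &db);

        /* Wait up to 5 seconds for the writer instead of failing at once */
        sqlite3_busy_timeout (db, 5000);

        sqlite3_prepare_v2 (db, "SELECT COUNT(*) FROM Resource",
                            -1, &stmt, NULL);

        rc = sqlite3_step (stmt);
        if (rc == SQLITE_BUSY) {
                /* The writer still holds its transaction; retry later */
                fprintf (stderr, "database is busy, try again later\n");
        } else if (rc == SQLITE_ROW) {
                printf ("%d rows\n", sqlite3_column_int (stmt, 0));
        }

        sqlite3_finalize (stmt);
        sqlite3_close (db);

        return 0;
}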

Instead we provide you with a SPARQL query interface:

 - Currently over DBus

 - We might someday provide a thin API to call SPARQL queries avoiding
   DBus involvement. This would improve performance a very small amount,
   mostly because of DBus marshaling not having to be performed.

   DBus is indeed not a very good IPC for transferring a lot of data
   between processes. Such a thin API would be most beneficial for
   use-cases where your queries yield large result sets, which, to be
   honest, Tracker is in general not designed for at the moment (Tracker
   aims more at desktop usage, which means fetching pages of results per
   round trip instead of the entire result set in one round trip).

 - We have a mechanism in place that'll tell you about changes that
   might (or will) require your client-side to synchronize itself.

Instead of a SQL schema we provide you with Nepomuk as the ontology to
use in your SPARQL queries. We have also added a few specialized
ontologies, and we have plans to make it possible for application
developers to extend Nepomuk's ontologies with their own
application-specific custom ontologies.
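
Purely as an illustration of the direction, here is a rough sketch of
what such a SPARQL query over DBus could look like from C with
dbus-glib. The service, object path, interface and method names
(org.freedesktop.Tracker1, Resources, SparqlQuery) and the nfo:, nie:
and fts: terms are assumptions about what the 0.7 branch will expose,
not settled API.

#include <dbus/dbus-glib.h>

int
main (void)
{
        GError *error = NULL;
        DBusGConnection *connection;
        DBusGProxy *proxy;
        GPtrArray *results = NULL;
        guint i;
        /* Assumes Tracker's built-in prefixes (nfo:, nie:, fts:) */
        const gchar *query =
                "SELECT ?url WHERE { "
                "  ?file a nfo:FileDataObject ; "
                "        nie:url ?url ; "
                "        fts:match \"foo*\" "
                "}";

        g_type_init ();

        connection = dbus_g_bus_get (DBUS_BUS_SESSION, &error);
        if (!connection) {
                g_printerr ("Could not connect to session bus: %s\n",
                            error->message);
                return 1;
        }

        /* Assumed service/object/interface names for the 0.7 branch */
        proxy = dbus_g_proxy_new_for_name (connection,
                                           "org.freedesktop.Tracker1",
                                           "/org/freedesktop/Tracker1/Resources",
                                           "org.freedesktop.Tracker1.Resources");

        /* SparqlQuery (s) -> (aas): an array of string arrays (rows) */
        if (dbus_g_proxy_call (proxy, "SparqlQuery", &error,
                               G_TYPE_STRING, query,
                               G_TYPE_INVALID,
                               dbus_g_type_get_collection ("GPtrArray", G_TYPE_STRV),
                               &results,
                               G_TYPE_INVALID)) {
                for (i = 0; i < results->len; i++) {
                        gchar **row = g_ptr_array_index (results, i);
                        g_print ("%s\n", row[0]);
                        g_strfreev (row);
                }
                g_ptr_array_free (results, TRUE);
        } else {
                g_printerr ("Query failed: %s\n", error->message);
        }

        g_object_unref (proxy);
        return 0;
}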



This would shortcut the overhead of accessing tracker's index quite
a bit.

Is there a how-to somewhere, or is the whole idea simply unrealistic?

I don't think so; Tracker just might not be able to cope with the
volumes for now. Of course, the limit I mentioned above is per user;
if you are doing this at a multi-user level, things get trickier, but
the QDBM limit is less of a problem.

When will 0.7 be (or: is it already) reliable enough for tests? I'm not
in a hurry here because I have this half-baked solution with find, as I
mentioned.

I like the tracker -> SQLite -> PHP/Apache -> web browser -> client
filesystem approach, because it is very flexible and modular. It would
allow very nice, platform-independent document retrieval solutions with
little effort, once it runs. ;)

As I mentioned, I'm a Linux noob but I have some C/SQL/PHP/HTML/JS
knowledge. So the first step for me would be to install tracker and
get it indexing on my Ubuntu server. Any idea where to start?


-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be



