Re: Proposing Tracker for inclusion into GNOME 2.18



Joe Shaw wrote:
Hi,


I didn't mean to imply that files were written to or that Tracker might
accidentally delete them; I don't think that's the case.

What I meant is that Tracker has to read and process all that
information.  One thing I've learned from Beagle is that there is a lot
of broken data out there, or that our code to process that data was
broken.  (This has particularly been a problem with non-free file
formats.)

That is a complete non-issue for Tracker: all text and metadata are extracted out of process, so a broken file cannot cause leaks or crashes in Tracker itself.
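
As a rough illustration of why out-of-process extraction contains failures, here is a minimal Python sketch. It is not Tracker's actual code (Tracker is written in C); the function name and the two inline extractor snippets are hypothetical. The only point is that when the child process crashes or hangs, the parent sees a failed result and carries on.

```python
import subprocess
import sys

# Hypothetical extractors, passed to a child interpreter via "-c".
GOOD_EXTRACTOR = "import sys; print('title=' + sys.argv[1])"
CRASHING_EXTRACTOR = "import os; os.abort()"  # simulates a crash on broken data


def extract_out_of_process(extractor_code, path, timeout=10.0):
    """Run an extractor in a child process.

    Returns the extractor's stdout, or None if the child crashed,
    exited with an error, or exceeded the timeout.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", extractor_code, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # a hung extractor is killed, the parent is unaffected
    if result.returncode != 0:
        return None  # a crash in the child never reaches the parent
    return result.stdout
```

The parent (the indexer daemon in this sketch) only ever handles the returned text or None, so a leak or crash stays confined to the short-lived child.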

        * The scope of Tracker isn't clear.  Is the point to be fast
        search for files, or for all the user's data?  To what end has
        Tracker achieved its goal?
It should be clear - to be the best

To be the best what?  This is exactly my point.  Is the idea to index
everything or is the idea to index some subset?  Is the idea to store
all application data, or specific metadata?  Part of the problem here I
think is that it isn't clear to people how we use Tracker to improve our
desktop experience.


To be the best first class object database, indexer and search tool.



Tracker is mature at indexing files and does a much better job than some leading contenders, including Spotlight and other commercial indexers, which all seem to consume significant resources. Tracker is different and arguably better in this area.

Yes, the key phrase here is "indexing files".  There are two problems
with searching only files:

        * It doesn't lend itself to a generic design, suitable for
        searching all different types of data.  A spec or an abstraction
        layer is only any good if it has at least two implementations.

Our database layer is independent of files. We have preliminary email support, but it is disabled in this release because it needs to get the correct URIs from Evolution, plus other things that are not documented. It is not hard, just time consuming, so I have punted it until the next release.

        * It doesn't reflect a large portion of the user's data,
        including (most importantly) email, IM conversations,
        addressbook and calendar items, notes, etc. etc.  If that's not
        the focus of Tracker that's okay, but it's not clear.


The focus is on all first class objects, so the things you listed will apply. Other, more exotic things like mind maps are out, I'm afraid (though I can be convinced otherwise)!

As to the resource usage, it's difficult to measure this, and I
encourage others to do their own testing, but from my indexing runs with
Tracker today while it did use substantially less memory than Beagle, it
used a lot more CPU, thrashed the disk a lot more, and generally made
the system slower over time.  It would be helpful to profile this and
quantify the usage.

If you are not running in turbo mode, Tracker will periodically fsync to flush data, which prevents a system pause caused by dirty buffers filling up too quickly. Tracker runs at nice +19 and will use spare CPU cycles to speed up indexing, so if it is at 90% CPU that is because your system is idle and not because Tracker is a hog. To prove my point, it is possible to play Quake while it is indexing and see no side effects or pauses.
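
To make the two mechanisms concrete, here is a minimal Python sketch of the general technique. It is not Tracker's implementation: the function names and the flush interval of 100 records are assumptions for illustration. One function lowers the process priority to nice +19, and the other forces data to disk at regular intervals so dirty buffers never pile up into one long stall.

```python
import os
import tempfile


def lower_priority():
    """Drop to the lowest scheduling priority where supported.

    os.nice() adds to the current niceness and returns the new value;
    it is only available on Unix-like systems.
    """
    if hasattr(os, "nice"):
        return os.nice(19)
    return None


def write_with_periodic_fsync(path, records, flush_every=100):
    """Append records to path, calling fsync every flush_every records.

    Returns the number of fsync calls made, including the final one.
    """
    syncs = 0
    with open(path, "ab") as f:
        for count, record in enumerate(records, start=1):
            f.write(record)
            if count % flush_every == 0:
                f.flush()              # push Python's buffer to the OS
                os.fsync(f.fileno())   # ask the OS to write it to disk
                syncs += 1
        f.flush()
        os.fsync(f.fileno())           # final sync for any remainder
        syncs += 1
    return syncs


def demo():
    """Write 250 one-byte records and report how many syncs happened."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.dat")
        return write_with_periodic_fsync(path, [b"x"] * 250, flush_every=100)
```

A process at nice +19 only gets CPU time that nothing else wants, which is why high CPU usage on an otherwise idle machine is expected rather than a sign of hogging.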

It should not make the system significantly slower at all (I have tested it on over 30GB of data). Indexing will get slower as the file-based hash table grows, though, as is the case with all indexers.

Out of interest, was Tracker's indexing faster than Beagle's?


Not yet, but I will be doing the bookmark history stuff in Epiphany using this, so time will tell. I note there is no other metadata server available to GNOME, and it is a huge missing part of the platform, so it should not be dismissed out of hand. I'm also a database expert with over 10 years of designing and optimising relational databases, so I'm confident I can deliver here (especially as it's a piece of cake for me!).

I don't think that "just trust me" is a valid argument here.

I wholeheartedly agree that the lack of a larger metadata plan is a
problem for the platform.  Without anyone using Tracker for this
purpose, I think it's premature to approve it.

Well, that's chicken and egg!

I can't use it in Epiphany without some approval, either as a dependency or by getting into the desktop. If the maintainer of Epiphany gives the go-ahead, it should be allowed in my book...


This is the thin wrapper around the D-Bus methods; see the introspection file for details:
http://cvs.gnome.org/viewcvs/tracker/data/tracker-introspect.xml?rev=1.12&view=markup

This helps a little, but it's still not comprehensive.

(As an aside, I thought there was consensus that all new modules had to
be fully documented for acceptance?)
        * It's hard to tell for certain because of the above point, but
        the search APIs appear to have a major usability problem in that
        you can't search for both text and metadata at the same time
        using freeform text entered by the user.  (Think Google here,
        which searches both document content and metadata.)  This will
        be a problem when searching emails, for instance, because people
        will type "Joe Shaw eggplant" and expect it to match from the
        author field and the body of the message.
Of course you can. Have you tried it?

Ok, I apologize then.  I have not tried writing to the APIs, I got the
opposite impression from the different APIs and your recent blog entry:

        http://jamiemcc.livejournal.com/3782.html
In any case, if the searches hit only one source, that's a good thing.

Also, RDF Query allows easy mixing of results from the inverted word index and the SQLite database, so there are no limits here.

But this means you can't do a freeform search, right?  You have to say,
I want text "foo" from the index AND value "bar" from metadata key
"quux".

It depends, but one of the reasons for reusing g-s-t was so we could reuse its extra search options facility (the expander part) for specifying extra metadata to search. The free text approach is also doable with a parser, but the expander is more user friendly, and users are familiar with how g-s-t handles this.
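
For the free text case, a minimal sketch of the idea in Python with SQLite follows. The table layout, the `Email:Sender` key and the function names are all hypothetical and do not reflect Tracker's real schema or its RDF Query support. The "parser" here is just whitespace splitting: each term may match either the word index or any metadata value, and a document must match every term.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a toy word index and metadata."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- stand-in for the inverted word index
        CREATE TABLE word_index (word TEXT, doc_id INTEGER);
        -- stand-in for the metadata store
        CREATE TABLE metadata (doc_id INTEGER, key TEXT, value TEXT);
    """)
    db.executemany(
        "INSERT INTO word_index VALUES (?, ?)",
        [("eggplant", 1), ("recipe", 1), ("eggplant", 2)],
    )
    db.executemany(
        "INSERT INTO metadata VALUES (?, ?, ?)",
        [(1, "Email:Sender", "Joe Shaw"), (2, "Email:Sender", "Someone Else")],
    )
    return db


def freeform_search(db, query):
    """Return sorted ids of documents matching every term in the query.

    A term matches a document if it is an indexed body word or a
    substring of any metadata value for that document.
    """
    result = None
    for term in query.lower().split():
        rows = db.execute(
            "SELECT doc_id FROM word_index WHERE word = ? "
            "UNION "
            "SELECT doc_id FROM metadata WHERE lower(value) LIKE ?",
            (term, "%" + term + "%"),
        ).fetchall()
        ids = {row[0] for row in rows}
        result = ids if result is None else result & ids
    return sorted(result or [])
```

With this toy data, searching for "Joe Shaw eggplant" matches only document 1: "joe" and "shaw" come from the sender field and "eggplant" from the body index.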


Considering that it uses the gnome-search-tool source, which is very stable, and that I have simply removed some code and added a little here and there, it is not as bad as you make out. The application is pretty stable, but the only way to confirm this is to try it out.

Ok, I was looking only at the CVS history.  If it's built upon g-s-t
that's a good start.

Tracker is miles ahead on the database side (which is non-existent in Beagle, bar your backup for systems without EAs): we have tags/keywords, extensible metadata, first class object storage etc.

The database is an implementation detail.  Beagle from the very
beginning has extracted metadata from documents and allowed people to
search them.

Not really. There's a ton of stuff you can do in a relational database that's not possible in a dedicated indexer like Lucene (e.g. extensible metadata, a tag database, using it as a common metadata database etc.).
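
As a sketch of what that looks like in a relational store, here is a small Python example using SQLite. The schema, the `Note:Colour` key and the function names are made up for illustration and are not Tracker's actual tables. New metadata types are registered as rows rather than as schema changes, and tags live in their own table.

```python
import sqlite3


def make_db():
    """Create an in-memory database with metadata-type, metadata and tag tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE metadata_types (name TEXT PRIMARY KEY, datatype TEXT);
        CREATE TABLE metadata (uri TEXT, type_name TEXT, value TEXT,
                               UNIQUE (uri, type_name));
        CREATE TABLE tags (uri TEXT, tag TEXT, UNIQUE (uri, tag));
    """)
    return db


def register_type(db, name, datatype):
    """Extend the set of known metadata keys at runtime."""
    db.execute("INSERT OR IGNORE INTO metadata_types VALUES (?, ?)", (name, datatype))


def set_metadata(db, uri, name, value):
    """Store a value for a registered key; raise KeyError for unknown keys."""
    known = db.execute(
        "SELECT 1 FROM metadata_types WHERE name = ?", (name,)
    ).fetchone()
    if known is None:
        raise KeyError(name)
    db.execute("INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)", (uri, name, value))


def get_metadata(db, uri, name):
    """Return the stored value for (uri, name), or None if absent."""
    row = db.execute(
        "SELECT value FROM metadata WHERE uri = ? AND type_name = ?", (uri, name)
    ).fetchone()
    return row[0] if row else None


def add_tag(db, uri, tag):
    """Attach a tag to a uri; adding the same tag twice is a no-op."""
    db.execute("INSERT OR IGNORE INTO tags VALUES (?, ?)", (uri, tag))


def uris_with_tag(db, tag):
    """Return all uris carrying the given tag, sorted."""
    rows = db.execute("SELECT uri FROM tags WHERE tag = ? ORDER BY uri", (tag,))
    return [row[0] for row in rows]
```

Any application can register its own keys and then query them alongside everyone else's, which is the sense in which the store acts as a common metadata database.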



As for "first class object storage", I don't really know what that
means.  What is an object in this case?  It's not like you throw a
GObject at Tracker and it stores it.  Short of that, it's just a schema
that you define and it's not really anything different than a document.
Beagle has this too (as long as there is a URI to associate with it).

No, these will be predefined initially: a Note, for example, will have a set of metadata associated with it, which can be stored in the DB instead of in a separate file (aka persistent storage). Anyway, this all requires a relational DB to implement.


Beagle is only ahead on what it indexes, and here's the big point: Tracker's goal is not to index everything under the sun but only the important stuff.

What is the important stuff?  The uncertainty of scope is one of the
murkiest things in my mind.  On the one hand you seem to want everything
in the desktop to use it to store information, etc. but on the other you
only want to focus on indexing a subset of the user's data.  I perceive
a conflict here.

No, I have specified all the objects we are interested in on the web page. Currently files are the main thing, but emails are on their way...


Well, I'm proposing it to replace gnome-search-tool with tracker-search-tool, and nothing you said above has any bearing on this (ignoring the FUD).

No FUD intended.

It fills in vital holes in our platform (tagging, extensible metadata and persistent storage).

"Extensible metadata" is a lot larger realm than just Tracker or Beagle
or any one piece of software can address.  How is metadata propagated
between copies?  How does metadata propagate between users?  This is a
large problem, but orthogonal to this discussion.

VFS should solve that, probably with sidecar companion files.


However, as Beagle is not being proposed, cannot get into the platform (only C is allowed in the platform) and is plagued by a significant number of problems (none of which apply to Tracker), I don't see how comparing Beagle to Tracker is relevant to this discussion.

In the end maybe it isn't.  But considering Beagle is pretty widely
deployed today and used in both GNOME and KDE environments, Tracker
would need to exceed Beagle in terms of developer and user experience.

Why? If it performs as well, if not better, and is just as stable or more so, then surely that does not matter. I don't think you can judge two architecturally different projects in that way.


(Also, I hardly think that Beagle is "plagued by a significant number of
problems", but whatever.)

FWIW though, if Tracker and Beagle can share a common D-Bus interface then letting Tracker in would benefit everyone. That's the way I see it!

Yeah, we'll prototype something and see.

Great, hopefully that can remove much of the conflict here and get a search framework into GNOME 2.18. Strigi also uses D-Bus, so it might be cool to liaise with them too.

I take it you don't have a problem with Tracker being used as a standalone metadata DB in conjunction with Beagle?
(I can add a compile option to remove the indexing parts if it will help.)


--
Mr Jamie McCracken
http://jamiemcc.livejournal.com/



