Re: Finding and Reminding, tech issues, 3.0 and beyond

From: Owen Taylor <otaylor redhat com>
To: jamie mccrack gmail com
Cc: gnome-shell-list gnome org, desktop-devel-list gnome org
Subject: Re: Finding and Reminding, tech issues, 3.0 and beyond
Date: Sat, 10 Apr 2010 17:10:50 -0400

On Sat, 2010-04-10 at 11:43 -0400, Jamie McCracken wrote:
> On Fri, 2010-04-09 at 18:09 -0400, Owen Taylor wrote:
> 
> > Tracker
> > =======
> > 
> > In some testing, Tracker 0.8 seems enormously better behaved
> > than Tracker 0.6. It has very significant optimizations in how
> > it stores the tracker database on disk, and also, by default,
> > only indexes defined subdirs of $HOME. So, as of right now,
> > system-impact of Tracker isn't a big concern of mine, as it
> > would be for 0.6.
> > 
> > Possible concerns and considerations with Tracker:
> > 
> >  * RDF + SPARQL + a large collection of ontologies does present
> >    a significant new barrier to someone coming to the GNOME
> >    platform. While the basic concepts of RDF are quite simple,
> >    RDF serialization formats and SPARQL are new learning people
> >    will have to do, and there are some intimidating terms
> >    like "ontology"
> 
> ontology = schema, so just semantics really :)
> 
> >    RDF is also popularly (and perhaps unfairly) seen as
> >    yesterday's fad.
> 
> RDF is indeed not as nice or succinct as it could be but what would the
> alternative be?
> 
> Also bear in mind that the Nepomuk ontology, which Tracker uses, is shared
> with KDE and hence it should result in some nice freedesktop cooperation
> and allow Tracker to meet the needs of KDE apps as well as GNOME.
> 
> It would also be unwise to store shareable metadata without some
> onto/schema which all apps can agree on

Well, certainly tracking and indexing file metadata doesn't *require*
anything as complex, or general purpose as RDF. I have some concerns
about the complexity, but as long as we don't get to the point where
understanding RDF and ontologies is a prerequisite for developing a
GNOME app, we're probably fine.

> >    
> >  * There is a large abstraction barrier between the application
> >    and the underlying data storage. It's very hard to decipher
> >    or influence how storing data in RDF and running SPARQL queries
> >    maps into low-level database operations.
> 
> Not sure why that is relevant?
> 
> FYI, resources are stored as individual tables and all properties of a
> resource are fields in those tables so the end result would not be much
> different from a traditional sql database. 
> 
> Indeed it's nothing like an off-the-shelf triple store, which stores
> individual properties as rows in a gigantic table and so has poor
> scalability (although great extensibility). Ergo Tracker should be seen
> as an optimised SQL database, but one which uses RDF/SPARQL as its table
> schema and query language rather than SQL

The reason I consider storage relevant is that throwing data into "an
optimized SQL database" where you don't have any ability to control what
is indexed or understanding how query plans are executed is usually a
recipe for application performance disaster. There are many people who
make an excellent living going in and fixing these sorts of application
performance disasters.

Now, to the extent that we're building GNOME and Tracker together as a
system and we know what queries we need to make fast - what standard
properties need to be indexed - we're OK. For our "Finding and
Reminding" plans I don't see a problem.

But if we go beyond that and start encouraging people to start putting
all sorts of application data into Tracker and relying on Tracker to do
efficient queries on it, then it definitely is a concern. Based on how
Tracker is mapping RDF into SQL tables, some SPARQL queries are going to
be fast, some are going to be dead slow and people need to be able to
come to an understanding of which are which.
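
To make the indexing point concrete, here is a minimal sketch using plain
SQLite through Python's sqlite3 module. It is not Tracker's schema or API:
the "documents" table, its columns, and the index name are hypothetical,
chosen only to show how the same query goes from a full table scan to an
index search once the filtered column is indexed.

```python
import sqlite3

# Hypothetical table standing in for "one table per resource class".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, author TEXT)"
)
conn.executemany(
    "INSERT INTO documents (title, author) VALUES (?, ?)",
    [(f"doc-{i}", f"author-{i % 50}") for i in range(1000)],
)

QUERY = "SELECT title FROM documents WHERE author = ?"


def query_plan(connection, sql, params):
    """Return SQLite's human-readable plan for a query as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the plan description.
    return " | ".join(row[-1] for row in rows)


# Same query, before and after the filtered column is indexed.
plan_without_index = query_plan(conn, QUERY, ("author-7",))
conn.execute("CREATE INDEX idx_documents_author ON documents (author)")
plan_with_index = query_plan(conn, QUERY, ("author-7",))

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

An application talking to the database directly can run this kind of check
itself. With a SPARQL layer in between, the application author cannot see or
influence which of the two plans a given query ends up with.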

> > Zeitgeist
> > =========
> > 
> > The "properties of files" approach of Tracker works for a lot
> > of things. However, it is pretty much unsuitable for storing
> > time-based histories of actions. We can store the last time
> > a file was edited as a Tracker property. It's slightly harder
> > to store all the times the file was edited. It's considerably
> > harder to store all the times the file was edited including
> > the editing application for each access.
> > 
> > (Of course, anything can be stored in RDF; it's a perfectly
> > general format; however, the more that we have to create
> > anonymous nodes, the more different structures that we are
> > storing in the tracker triple store, the harder it is going
> > to be to optimize, and the less suitable a straightforward
> > implementation of the triple-store backed by a sqlite database
> > is.)
> > 
> > My understanding is that the Tracker people have disclaimed
> > the log storage problem. 
> 
> Not really. Storing timeline info is not a big deal for Tracker. Just
> like a file can have many tags in Tracker, it could also have many
> histories or audit trails. It could also simply be a multi-value
> date property if all you stored was the datetime stamp
> 
> We would want a timeline ontology to be part of nepomuk if possible so
> discussions with them would be needed first. Failing that, a tracker
> specific timeline property could easily be added to all objects
>
> tracker is definitely the right place to add timeline info if you intend
> to do queries like "get me all music files I played last week" or "get
> me all documents I viewed recently with author blah". I have heard of
> one project which uses tracker to get data but then uses zeitgeist to
> filter it for timeline info, which is clearly not a good solution and
> won't scale if the Tracker results were huge
>
> General event logging could also be added to Tracker, but its usefulness
> is not as great as, say, timelining, and we don't really want to step on
> the Zeitgeist team's toes at this point. In the future, Zeitgeist may
> well decide to use Tracker as an event logging framework, but it is their
> decision at the end of the day

What do you see as the distinction between "general event
logging" and "timeline info"?

This kind of "we can do some sorts of event logging, but other event
logging is Zeitgeist" distinction is, generally speaking, Not Helpful. 
Everybody needs to have a solid idea of what the components are and how
they fit together.

I certainly do have a concern that if some data is stored in an event
log that Zeitgeist maintains and some data is stored in the Tracker
database, queries will be inefficient. If displaying a list of the
last 100 files tagged as "download" by time requires getting a list of
all files tagged as "download", querying Zeitgeist for all events on all
those files, then sorting by time in the application, that potentially
will suck.
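
As a rough sketch of that difference, the following uses a single in-memory
SQLite database to stand in for both stores. The "tags" and "events" tables
and both function names are hypothetical and are not the Tracker or Zeitgeist
APIs. The first function mimics the two-store flow (fetch tagged files, query
the event log per file, sort in the application); the second lets one store
do the join, sort, and limit.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tags (file TEXT, tag TEXT);          -- stands in for the metadata store
    CREATE TABLE events (file TEXT, timestamp INTEGER); -- stands in for the event log
""")

# 500 hypothetical files: even-numbered ones are tagged "download",
# and every file has three events with distinct timestamps.
files = [f"file-{i}" for i in range(500)]
db.executemany("INSERT INTO tags VALUES (?, 'download')", [(f,) for f in files[::2]])
db.executemany("INSERT INTO tags VALUES (?, 'other')", [(f,) for f in files[1::2]])
db.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(f, i * 10 + k) for i, f in enumerate(files) for k in range(3)],
)


def recent_downloads_two_stores(limit):
    """Two-store flow: one query for the tag, one per file for events."""
    tagged = [row[0] for row in db.execute("SELECT file FROM tags WHERE tag = 'download'")]
    latest = []
    for name in tagged:  # one round trip to the "event log" per tagged file
        (ts,) = db.execute(
            "SELECT MAX(timestamp) FROM events WHERE file = ?", (name,)
        ).fetchone()
        if ts is not None:
            latest.append((ts, name))
    latest.sort(reverse=True)  # sorting by time happens in the application
    return [name for _, name in latest[:limit]]


def recent_downloads_single_query(limit):
    """Single-store flow: the database joins, sorts, and limits."""
    rows = db.execute(
        """
        SELECT t.file
        FROM tags AS t JOIN events AS e ON e.file = t.file
        WHERE t.tag = 'download'
        GROUP BY t.file
        ORDER BY MAX(e.timestamp) DESC
        LIMIT ?
        """,
        (limit,),
    )
    return [row[0] for row in rows]


print(recent_downloads_single_query(5))
```

Both functions return the same list, but the first touches every tagged file
regardless of the limit, while the second gives the store a chance to use an
index and stop early.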

If in the future Zeitgeist is using Tracker as a backend and is
primarily about *writing* information into Tracker, and applications can
query that information directly through the Tracker API, then that
problem does largely go away. If Tracker is only used behind the scenes
for storage in an opaque way, that wouldn't help, because the app would
still have to fetch two separate sorts of information and integrate
them, even if both were coming from the same place under the hood.

- Owen

Follow-Ups:
- Re: Finding and Reminding, tech issues, 3.0 and beyond
  - From: Jamie McCracken
- Re: Finding and Reminding, tech issues, 3.0 and beyond
  - From: Zeeshan Ali (Khattak)
- Re: Finding and Reminding, tech issues, 3.0 and beyond
  - From: Martyn Russell

References:
- Finding and Reminding, tech issues, 3.0 and beyond
  - From: Owen Taylor
- Re: Finding and Reminding, tech issues, 3.0 and beyond
  - From: Jamie McCracken
