Reviving Semantic Relationships in Beagle



Some of you may remember Max and his 2006 GSoC project to implement a
separate metadata store for Beagle. Hours of hard work later, it
became apparent that it wasn't so much the storage of metadata that
was important (Lucene stores Properties, or 'Fields', just fine) but
the relationships between data.

The Beagle++ project has attempted to utilize some of these
relationships by building an RDF store and querying that. While such a
system does have its benefits, and may even have been a choice to
consider when Beagle was first written, at this point Beagle is
locked into its Lucene-based backend, and we don't want to give up our
lightning-fast searches. However, an RDF graph, or map/hierarchy of
indexed entities, has some appeal when we look at situations like
Archives, Mail Attachments, Downloads etc., which all utilize the idea
of 'Parent' or 'Children' sources. Lucene is not particularly well
suited to representing such relationships, and while Beagle has built
a system for handling such cases, it is far from perfect or universal.

What I am proposing is a universal (as in backend-independent) RDF
graph of URIs only. Graphing and storing all of Beagle's metadata in
RDF graphs would not only make querying more difficult, but would also
result in data duplication and mean reworking a system which (for all
intents and purposes) is fine in its current state.

The new RDF map would be useless without the API elements to access
it, so I propose the following means of 'hooking up' an RDF store to
Beagle.

-New Query_Part which allows a raw RDF-type query against the store.
-Wire into LuceneQueryDriver and LuceneIndexDriver to store new
relationships in the RDF store and query them upon creation of a Hit.
-Add a more accessible API to Filter for adding Parents/Children to
indexables. (I'm thinking of adding addParent(Uri) and addChild(Uri)
methods, but it's a first thought. The issue is that most of the time
these relationships are only visible at a higher level, not as each
item is filtered for indexing. Noticing that a document in my home
directory is the same as an attachment in my inbox, and linking the
two, is a difficult use case to work with.)
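
To make the hooks above a bit more concrete, here is a minimal sketch
in Python (Beagle itself is C#, so this is illustration only). Every
name in it (RelationStore, add_parent, add_child, annotate_hit) and the
example URIs are hypothetical, not existing Beagle API. It assumes the
store keeps nothing but URI-to-URI edges, and that the query driver
looks up relations when it builds a hit.

```python
from collections import defaultdict


class RelationStore:
    """Hypothetical URI-only store: parent/child edges, no other metadata."""

    def __init__(self):
        self._children = defaultdict(set)
        self._parents = defaultdict(set)

    def add_child(self, parent_uri, child_uri):
        # Record the edge in both directions so either end can be queried.
        self._children[parent_uri].add(child_uri)
        self._parents[child_uri].add(parent_uri)

    def add_parent(self, child_uri, parent_uri):
        # Same edge, expressed from the child's point of view.
        self.add_child(parent_uri, child_uri)

    def children(self, uri):
        return sorted(self._children.get(uri, ()))

    def parents(self, uri):
        return sorted(self._parents.get(uri, ()))


def annotate_hit(hit, store):
    """Stand-in for the query-driver hook: attach related URIs to a hit."""
    uri = hit["uri"]
    return {**hit, "parents": store.parents(uri), "children": store.children(uri)}


store = RelationStore()
store.add_child("email://inbox/42", "email://inbox/42#attachment-1")
# Adding the same edge from the other side does not create a duplicate.
store.add_parent("email://inbox/42#attachment-1", "email://inbox/42")

hit = annotate_hit({"uri": "email://inbox/42#attachment-1", "score": 1.0}, store)
```

Keeping both directions of each edge means a filter can call whichever
of add_parent or add_child is natural at the point where it discovers
the relationship.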

It is also important to note that the RDF store is _only storing unique
URIs in a relationship graph_, like the following sketch. (uri1 is an
e-mail, uri2 is a contact, and uri3 and uri4 are oo.org documents.)

uri1
    \
     |-uri3
     |-uri2
     |     \
     |      |-uri4

Both the contact that sent the e-mail and the attachment are
children of the e-mail, and the contact has sent one other document to
us, hence uri2 has that document as a child.
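
The same graph can be written down as plain parent-to-children edges.
The following Python sketch is only an illustration under that
assumption; the dictionary layout and the descendants helper are
hypothetical and not part of Beagle. It shows how a walk from one URI
collects everything related to it.

```python
from collections import deque

# Parent -> children edges, taken from the uri1..uri4 example.
edges = {
    "uri1": ["uri3", "uri2"],  # the e-mail: its attachment and its sender
    "uri2": ["uri4"],          # the contact: another document they sent
}


def descendants(graph, root):
    """Breadth-first walk returning every URI reachable from root."""
    seen = []
    queue = deque(graph.get(root, []))
    while queue:
        uri = queue.popleft()
        if uri in seen:
            continue  # guard against cycles or repeated edges
        seen.append(uri)
        queue.extend(graph.get(uri, []))
    return seen


from_email = descendants(edges, "uri1")
from_contact = descendants(edges, "uri2")
```

Starting from the contact (uri2) yields only uri4, while starting from
the e-mail (uri1) reaches all three of the other entities.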

While this may look like we are replicating much of the data in the
Lucene Fields, it is actually something completely different: we are
referencing an exact entity, not just a name or subject. As a result
of this tree, not only can we adjust our scoring to account for
related items, but we can provide right-click options like 'See all
files by this author' etc. in a more intelligent manner.
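
One possible way to let related items influence scoring, purely as a
sketch: give each hit a small bonus for every related URI that also
appears in the same result set. The rescore function, the bonus value
and the data layout are all assumptions made up for illustration; this
is not how Beagle or Lucene scores results.

```python
def rescore(hits, related, bonus=0.1):
    """Return new scores; hits maps uri -> score, related maps uri -> set of uris."""
    rescored = {}
    for uri, score in hits.items():
        # Count related URIs that were also returned by this query.
        matches = sum(1 for other in related.get(uri, ()) if other in hits)
        rescored[uri] = score + bonus * matches
    return rescored


hits = {"uri1": 1.0, "uri3": 0.5, "uri9": 0.7}
related = {"uri1": {"uri3", "uri2"}, "uri3": {"uri1"}}
new_scores = rescore(hits, related)
```

Here uri1 and uri3 each gain one bonus because they are related and
both matched, while uri9 keeps its original score.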

I'm interested to see what people think, and what (if any) experience
people have had with similar work.
-- 
Cheers,
Kevin Kubasik
http://kubasik.net/blog
