Re: Metadata Store

Hey Enrico,

On Wed, 2006-08-09 at 12:21 +0200, Enrico Minack wrote:
> So, how is going? Any news concerning implementation, design or used 
> libraries?

Max has been working on it as part of his Summer of Code project.  He's
been putting a lot of his progress up on the wiki, and he's also been
doing weekly status reports on the beagle-soc-2006 mailing list.

> I put some effort in getting Sesame and Sesame2 [1] working under C# and it 
> worked quite well. 

Using IKVM or a source translation?

> I was able to write into a Sesame native store (which is 
> quite fast! see [2]). I created 100,000 simple triples and stored them into 
> the local repository with
> - 1.47k triples/second on my laptop @ 800 MHz
> - 1.93k triples/second on my laptop @ 2000 MHz
> - 7.14k triples/second on our server @ 2000 MHz (Athlon 64-bit 3800+)
> Surprisingly, the memory consumption of the C# test program was lower than 
> the Java version. But the Java program was faster :-(. Maybe IKVM can make 
> some improvements on that.

The more interesting memory benchmark for us is Sesame vs. no Sesame.
What are the differences there?  Maybe Max can provide some rough
numbers for SemWeb vs. no SemWeb as well.

> Now let's come to some technical and implementation questions:
> How do you plan to integrate the rdf store into Beagle's architecture?
> - Hard-coded like the Lucene indexes or dynamically linked like the Filters 
> and the Queryables?

Yes.  The Lucene index and the metadata store will be paired and (at
least initially) be one each per backend.

> I could imagine an implementation where possible RDF stores share a common 
> API (as all Filters do), and they are compiled against Beagle and stored in 
> a specific folder where Beagle recognizes its presence. Via configuration 
> the preferred RDF store can be selected. Therefore one could easily replace 
> the RDF store with any kind of implementation: file-based, rdbms-based, 
> remote server, different libraries as semweb, Jena, sesame, yars, kowari, 

Pluggable filters and backends make sense because people can drop in or
remove the ones they want to use.  There is a concrete end user benefit
there.  I don't see that with a pluggable RDF store.  You create
potential on-disk file format incompatibilities and put in (IMO)
unnecessary work creating an abstraction layer.
> How about the Ontology used within the store?
> - Do the Filters have to comply with one?
> - Does every filter have its own way to describe metadata?

Exactly how we define the ontology hasn't been decided yet, but this is
largely an implementation detail.  The most important thing, I think, is
consistency between the backends and filters, so they should all try to
comply with a single ontology as best they can.

> How shall the metadata be queried?
> - Full-text search on the attributes using the query keywords?
> - special queries like "metadata:..."?

All text searches will be done through Lucene.  All metadata searches
will be done through the store.
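To make the split concrete, here's a rough Python sketch (the function
and the "metadata:" prefix handling are made up for illustration, not
Beagle's actual API): free-text terms go to the full-text index,
"metadata:" terms go to the store, and the hit sets are intersected.

```python
# Hypothetical query routing: text terms -> full-text index,
# "metadata:" terms -> metadata store, results intersected.
def route_query(terms, text_index, meta_store):
    text_terms = [t for t in terms if not t.startswith("metadata:")]
    meta_terms = [t[len("metadata:"):] for t in terms if t.startswith("metadata:")]

    hits = None
    if text_terms:
        hits = text_index(text_terms)        # stands in for a Lucene search
    if meta_terms:
        meta_hits = meta_store(meta_terms)   # stands in for a store lookup
        hits = meta_hits if hits is None else hits & meta_hits
    return hits or set()

# Stub "backends" standing in for Lucene and the metadata store:
text_index = lambda terms: {"doc1", "doc2"}
meta_store = lambda terms: {"doc1"}
print(route_query(["report", "metadata:author=X"], text_index, meta_store))
# -> {'doc1'}
```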

> - what about paths of metadata like "document of author X received as 
> attachment via email from Y" which matches
>      document hasAuthor X
>      document isAttachmentOf EMail
>      EMail from Y

You can already programmatically create queries with the Beagle API, so
I think this would just be a matter of representing the query correctly.
Using an alternative syntax or perhaps providing an alternative API to
easily walk the graph would be in order here.
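As a toy illustration of what "representing the query correctly" could
look like (this is not Beagle, SemWeb, or Sesame code, and the predicate
names just follow the example above): the path query becomes a
conjunction of triple patterns, solved by a small backtracking join over
an in-memory triple list.

```python
# Toy conjunctive triple-pattern matching over an in-memory graph.
TRIPLES = [
    ("doc1", "hasAuthor", "X"),
    ("doc1", "isAttachmentOf", "mail1"),
    ("mail1", "from", "Y"),
    ("doc2", "hasAuthor", "X"),   # right author, but not an attachment
    ("mail2", "from", "Y"),
]

def is_var(term):
    return term.startswith("?")

def match(pattern, triple, binding):
    """Try to unify one pattern with one triple under an existing binding."""
    b = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if p in b and b[p] != t:
                return None
            b[p] = t
        elif p != t:
            return None
    return b

def solve(patterns, binding=None):
    """Backtracking join over all patterns; yields complete bindings."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in TRIPLES:
        b = match(first, triple, binding)
        if b is not None:
            yield from solve(rest, b)

# "document of author X received as attachment via email from Y":
query = [
    ("?doc", "hasAuthor", "X"),
    ("?doc", "isAttachmentOf", "?mail"),
    ("?mail", "from", "Y"),
]
print([b["?doc"] for b in solve(query)])   # -> ['doc1']
```

A graph-walking API on top of the store would essentially be building
and joining these patterns for you.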

> How are results ranked if they are found in the rdf store but not in the 
> lucene index?
> - how can these scores be merged with Lucene scores?

Beagle doesn't really use Lucene scores today; seeing as we search 2n
Lucene indexes (where n is the number of backends), we don't normalize
the scores.  On the user interface side of things, we found through user
testing that sorting by date is generally much more useful.

Presumably, though, any metadata match would be exact, so the score
would be 1.0.  (We could probably add an importance factor to alter the
score, but that would be pretty arbitrary.)
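In other words, merging would be trivial: something like the following
sketch (hypothetical data structures, not Beagle code), where metadata
hits get a fixed score of 1.0 and the final ordering ignores the
unnormalized scores in favor of dates anyway.

```python
# Hypothetical merge: metadata matches are exact, so they score 1.0;
# the UI then sorts everything newest-first rather than by score.
from datetime import date

def merge_hits(lucene_hits, metadata_hits):
    """lucene_hits: {uri: (score, date)}; metadata_hits: {uri: date}."""
    merged = dict(lucene_hits)
    for uri, d in metadata_hits.items():
        merged[uri] = (1.0, d)               # exact metadata match -> 1.0
    # Sort newest first, ignoring the (non-comparable) scores entirely.
    return sorted(merged, key=lambda uri: merged[uri][1], reverse=True)

lucene_hits = {"old.txt": (0.3, date(2006, 1, 5))}
metadata_hits = {"new.txt": date(2006, 8, 1)}
print(merge_hits(lucene_hits, metadata_hits))   # -> ['new.txt', 'old.txt']
```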
