Re: [Tracker] [cc-devel] License Metadata Extraction and Search, Summer of Code

From: Jason Kivlighn <jkivlighn gmail com>
To: Luke Hoersten <luke hoersten gmail com>
Cc: CC Developer Mailing List <cc-devel lists ibiblio org>, tracker-list gnome org
Subject: Re: [Tracker] [cc-devel] License Metadata Extraction and Search, Summer of Code
Date: Wed, 21 Mar 2007 17:25:19 -0700

I think I've settled on Tracker.  I got an okay from them as well as
someone who volunteered to mentor me with Tracker code while working
under Creative Commons.

I like the idea of separating it into two parts.  Since there are so many
indexers out there, separating the parser means we have an
application/library that any indexer can use.  Looking at Tracker's
infrastructure it should work nicely.  Even using Tracker, cc-sharp may
come in handy, since Tracker can call external processes to extract the
search data.  Here's the list of formats I was hoping to support: MP3,
OGG, RSS, SVG, HTML, XML, JPEG, PDF, SMIL.  The big problem I see with
cc-sharp is working with C#.  I'd consider myself fairly fluent in
C, C++, Java, and Python.

I notice that ccPublisher already attaches licenses, and ccLookup reads
licenses in anything with RDF metadata as well as in mp3s.  In response
to your second email, Luke, it might work to extend ccLookup to support
more formats and then have the Tracker extractor call this program.
That way I'd be sticking with a high-level language I'm familiar with.
I'm not sure that will bode well for performance, though.
The extraction process needs to be fast, so a C library might be a
better option.  Given the scope of formats, our extractor would be run
quite often for the typical desktop.
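One rough way to put a number on that concern is to time how long it takes just to start an interpreter process, since an extractor launched once per file pays that cost every time.  This is only a sketch: the function name is made up for illustration, and it measures a do-nothing Python process, not ccLookup or any real Tracker extractor.

```python
import subprocess
import sys
import time


def average_startup_seconds(runs: int = 5) -> float:
    """Average wall-clock time to start and exit a do-nothing Python process.

    Hypothetical helper: it only approximates the fixed per-file cost of
    calling an external, interpreter-based extractor.
    """
    start = time.perf_counter()
    for _ in range(runs):
        # "-c pass" runs an empty program, so this is almost pure startup cost.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    per_file = average_startup_seconds()
    print(f"approx. startup cost per file: {per_file * 1000:.1f} ms")
```

Multiplying that figure by the number of files an indexer touches gives a lower bound on the overhead of the external-process approach, before any actual parsing is done.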

The Tracker code base, from what I've seen, looks very manageable, but I
hope to get more feedback from the Tracker folks soon.

Cheers,
Jason

Jason,
I did something similar to this last year for SoC and it resulted in a
new CC library called cc-sharp:
http://code.google.com/p/cc-sharp/

So your project could have two parts: the 1) license handling and then
2) integrating that data with the desktop search application. If you
wanted to use C# (Beagle), I'd help flesh out cc-sharp with you and
you could work on the integration.

The other C# CC lib around is CCLicenseLib which hasn't been developed
in four years.
http://workspaces.gotdotnet.com/cclib

It contains object representations of the older CC licenses. It would
be nice to make one condensed lib for CC stuff in C# so developers for
other projects could easily integrate it with their software. I see it
being laid out as such:

- Attaching licenses to media
- Reading licenses from media
- Verifying licenses

This desktop search idea would primarily use reading and verifying.
Right now all cc-sharp does is verify because I was originally working
on Banshee. Banshee already read the metadata from the MP3 via my
patch, so my lib was really just an abstraction of the
verification. Since verification is done over the Internet, that's not
really something you want to include by default in core application
code.

I'd like to abstract license reading so we can just "plug" support for
different file types to be read whether they are images, audio, etc.
Kind of like vfs.
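
A minimal sketch of that "plug" idea, written in Python only for brevity (it is not cc-sharp code).  Every name here (register_reader, read_license, the regex-based reader) is hypothetical, and the reader itself is deliberately naive: it just looks for a Creative Commons license URL in the raw bytes.

```python
import re
from typing import Callable, Dict, Optional

# A reader takes the raw file contents and returns a license URL, or None.
LicenseReader = Callable[[bytes], Optional[str]]

# Hypothetical registry mapping a lowercase file extension to its reader.
_READERS: Dict[str, LicenseReader] = {}


def register_reader(*extensions: str):
    """Decorator that plugs a reader in for one or more file extensions."""
    def decorator(func: LicenseReader) -> LicenseReader:
        for ext in extensions:
            _READERS[ext.lower()] = func
        return func
    return decorator


# Naive pattern for illustration; a real reader would parse the format.
_CC_URL = re.compile(rb"https?://creativecommons\.org/licenses/[A-Za-z0-9./-]+")


@register_reader(".html", ".svg", ".xml")
def read_markup_license(data: bytes) -> Optional[str]:
    match = _CC_URL.search(data)
    return match.group(0).decode("ascii") if match else None


def read_license(filename: str, data: bytes) -> Optional[str]:
    """Dispatch to whichever reader is registered for the file's extension."""
    dot = filename.rfind(".")
    ext = filename[dot:].lower() if dot != -1 else ""
    reader = _READERS.get(ext)
    return reader(data) if reader else None
```

Supporting a new file type then means writing one function and decorating it; the calling code (a desktop search filter, a media player) never changes.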

What are your thoughts?

-Luke

On 3/21/07, Jason K <jkivlighn gmail com> wrote:

Hi,

I'm looking into adding support for searching/indexing licenses for a
service such as Tracker, Beagle, or Strigi for a Google SoC project.  My
first hurdle, though, is picking an indexer.  The ideal service would
be cross-desktop, to avoid implementing extraction filters over and over
again for different indexers.  It also needs to be widely adopted.

Tracker is looking like a good candidate, given that it is a
Freedesktop.org project, is desktop-neutral, and appears to have the
intention of following standards as well as creating standards for other
search services to use.  I get the impression GNOME will be including
this soon.

Strigi is also desktop-neutral, though favored by KDE and is going to be
used by KDE 4.  It doesn't rely on KDE, though.  In fact, Strigi's only
requirements are the stdc++ libraries, while Tracker is glib-based.

And for Beagle, Mono is one significant reason I'm shying away from it.
Tracker or Strigi appear more interoperable and look to be getting wider
adoption.

Formats I plan to include are:
  HTML, SVG, SMIL, XML in general (RDF)
  PDF, JPEG, other images (XMP)
  MP3, OGG, other audio/video
  RSS

From what I've seen, most license data is either in RDF or XMP form.

MP3, OGG, and RSS are exceptions.  For all these formats, I would follow
the embedding specification on the Creative Commons website, at
http://creativecommons.org/technology/usingmarkup

Since most licenses are placed in RDF or XMP, that code can be separated
out and reused across the various extraction modules.
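
As a rough sketch of what that shared piece might look like, here is a small Python function that pulls license URLs out of an RDF/XML fragment.  It assumes the license is expressed as a "license" element with an rdf:resource attribute, in either of two Creative Commons namespaces; the function name is made up, and each format-specific module would still have to locate the RDF or XMP block itself before calling it.

```python
import xml.etree.ElementTree as ET
from typing import List

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumption: these two namespaces cover the CC license element we care about.
CC_NAMESPACES = (
    "http://creativecommons.org/ns#",
    "http://web.resource.org/cc/",
)


def license_urls_from_rdf(rdf_xml: str) -> List[str]:
    """Return the distinct license URLs found in an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    urls: List[str] = []
    for ns in CC_NAMESPACES:
        # ElementTree spells qualified names as "{namespace}localname".
        for elem in root.iter(f"{{{ns}}}license"):
            url = elem.get(f"{{{RDF_NS}}}resource")
            if url and url not in urls:
                urls.append(url)
    return urls
```

An SVG, HTML, or XMP module would hand its embedded RDF to this one function instead of reimplementing the lookup.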

So enough rambling... thoughts?

-Jason Kivlighn
_______________________________________________
cc-devel mailing list
cc-devel lists ibiblio org
http://lists.ibiblio.org/mailman/listinfo/cc-devel
