Re: [Tracker] Extracting the extractors



Hey Sam :),

On Sat, Apr 9, 2016 at 1:39 AM, Sam Thursfield <ssssam gmail com> wrote:
Hi all

I've always felt like Tracker's extractors should be reusable outside
Tracker. The design makes that possible but right now they output their
results as a series of slightly non-standard SPARQL update commands,
which I don't think is useful for many folk. Lots of people aren't using
SPARQL databases at all, believe it or not :-)

Just to provide some context, extractors used to generate pieces of
SPARQL that were passed to tracker-miner-fs, which composed the entire
SPARQL update(s) out of tracker-extract's and its own extracted
information. Nowadays, tracker-extract is a miner itself, and those
pieces of SPARQL are executed more or less as-is there.

Today, we can indeed do without this split: tracker-extract modules
no longer need to logically split the updates they perform, those can
be accumulated and executed at once.


The whole point of RDF is to make data interchange easy so I think we
can do better than that. I've been looking at making the extractors
optionally output their results in JSON-LD[1] format instead. The cool
thing about JSON-LD is that if you squint, it's just good old JSON that
everyone's familiar with. If you look closely it's also Linked Data,
but in a more human-friendly serialization format than any of the more
traditional RDF formats.

Cool :), will read up on the JSON-LD format. I see though that it's
yet another JSON format, alongside application/sparql-results+json as
described in https://www.w3.org/TR/2013/REC-sparql11-results-json-20130321/.
I guess it just caters for a different use case, but these W3 guys
could sit and breathe before churning out a new format :P


The catch here is that Tracker's extractor modules are all hardwired to
generate SPARQL using TrackerSparqlBuilder. To be honest I've never
liked this approach, it's pretty incomprehensible to newcomers and
overly verbose, especially where we explicitly generate DELETE queries
to go along with the INSERT queries.

Wholeheartedly agree. It only caters for a very specific subset of
insertion scenarios, not even sufficient for some real cases in
tracker extract modules, and barely useful at all outside
tracker(-extract). eg. things like
https://git.gnome.org/browse/tracker/tree/src/tracker-extract/tracker-extract-gstreamer.c#n1066


so, inspired by something in the Python RDFLib library, I came up with a
TrackerResource class that the extractors can use instead. This is a
work in progress, but I have a branch in git.gnome.org that adds
TrackerResource, and converts some of the extractors to use it. The
TrackerResource class can serialize either to SPARQL update commands or
to JSON-LD. The branch also adds the `tracker extract` command from
<https://bugzilla.gnome.org/show_bug.cgi?id=751991> so you can try out
the extractors easily and specify `-o json` or `-o sparql` as you prefer.

Nice! Should it have a Turtle serializer too? Do you think this could
also be used on the tracker-store side to serialize contents?

I'm also wondering these days about exposing better backups/restores.
Tracker has traditionally been used to store data that could easily be
re-extracted; if a database is reset, the worst that would usually
happen is that a few nao:Tags are lost. But other stored data might be
considerably harder, or impossible, to restore [1].

I know we have org.freedesktop.Tracker1.Backup and the journal, but
some fields (most notably nie:plainTextContent) are taken away when
saving the journal files, so when restored it gives you a randomly
trimmed down database we can't easily recover further from. IIRC it
was done for backup size concerns, but kinda defeats its purpose :(.

So I was wondering whether it should be possible to serialize certain
contents (per class? per graph?) into some format we could restore
from. I first thought of good old Turtle, but could JSON-LD be a nice
alternative? It could also be enough to make the backup D-Bus call
include the usually ignored properties; after all, it's just doing
what the user asked for...

Sorry to drift off a bit :), I thought it's mildly related.


[1] See eg. https://git.gnome.org/browse/tracker-miner-chatlog/


The results for the extractors I've converted so far are promising in
terms of reducing code size:

     src/tracker-extract/tracker-extract-abw.c       |  51 ++--
     src/tracker-extract/tracker-extract-bmp.c       |  18 +-
     src/tracker-extract/tracker-extract-dvi.c       |  17 +-
     src/tracker-extract/tracker-extract-epub.c      | 131 +++-----
     src/tracker-extract/tracker-extract-gstreamer.c | 910 ++++++++++++++++++-------------------------------------
     src/tracker-extract/tracker-extract-mp3.c       | 378 ++++++++---------------
     6 files changed, 511 insertions(+), 994 deletions(-)

Looks like a nice reduction :)


Here's an example of auto-generated SPARQL for an MP3 extraction:

<snip>


Note there are a lot more DELETE statements than before. I figured that
anywhere we want to replace the existing data we need a DELETE
statement, and the reason we don't normally do it is because previously
it had to be done manually. That said, the TrackerResource class does
have a way of avoiding this. If you ever call _set_value() for a property then
it assumes you want to *overwrite* it, and will generate a DELETE. If you
only use _add_value() then it will assume you want to *add* to it, and won't
generate a DELETE. The latter case is needed for stuff like nao:hasTag.
I may be misunderstanding things here of course, I didn't actually write any
of the extractors myself.
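
To make the set/add distinction concrete, here is a minimal Python sketch.
The Resource class below is a hypothetical stand-in, not the actual
TrackerResource API, and the exact shape of the generated SPARQL is only
illustrative. It also ignores the properties of nested resources and uses
JSON-style quoting for string literals as a simplification.

```python
import json


class Resource:
    """Hypothetical stand-in for TrackerResource (not the real API)."""

    def __init__(self, identifier):
        self.identifier = identifier
        self.values = {}          # property name -> list of values
        self.overwritten = set()  # properties touched through set_value()

    def set_value(self, prop, value):
        # Overwrite semantics: drop pending values and request a DELETE.
        self.values[prop] = [value]
        self.overwritten.add(prop)

    def add_value(self, prop, value):
        # Append semantics: existing data in the store is left alone.
        self.values.setdefault(prop, []).append(value)

    def _term(self, value):
        if isinstance(value, Resource):
            return f"<{value.identifier}>"
        if isinstance(value, str):
            return json.dumps(value)  # simplification: JSON-style quoting
        return str(value)

    def to_sparql(self):
        subject = f"<{self.identifier}>"
        # One DELETE per overwritten property, none for add-only properties.
        statements = [
            f"DELETE {{ {subject} {prop} ?v }} WHERE {{ {subject} {prop} ?v }}"
            for prop in sorted(self.overwritten)
        ]
        triples = [
            f"{subject} {prop} {self._term(value)}"
            for prop, values in self.values.items()
            for value in values
        ]
        statements.append("INSERT { " + " . ".join(triples) + " }")
        return "\n".join(statements)


track = Resource("urn:example:track")
track.set_value("nie:title", "The Only Place")              # generates a DELETE
track.add_value("nao:hasTag", Resource("urn:example:tag"))  # no DELETE
print(track.to_sparql())
```

Only nie:title ends up with a DELETE here; nao:hasTag is inserted on top of
whatever is already stored.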

Sounds good :), it seems to me that the generated SPARQL already
ensures some correctness, which is great. The difference between set
and add makes sense, given that we have to deal with single and
multivalued properties. The only potentially harmful combination would
be doing add_value() on a single valued property, is there any way
that could raise a warning in tracker-extract, rather than being
caught late due to the failed insert?
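
Roughly what I have in mind, as a Python sketch: the SINGLE_VALUED set is
hardcoded and assumed purely for illustration (a real check would have to
read the cardinality from the ontology), and add_value() here is a
hypothetical free function, not the TrackerResource method.

```python
import warnings

# Assumed for illustration; a real check would query the ontology instead.
SINGLE_VALUED = {"nie:title", "nfo:duration"}


def add_value(values, prop, value):
    """Append value to values[prop], warning early on single-valued properties."""
    if prop in SINGLE_VALUED:
        # add semantics generate no DELETE, so the insert may fail later
        # if the store already holds a value for this property.
        warnings.warn(
            f"add_value() used on single-valued property {prop}",
            stacklevel=2,
        )
    values.setdefault(prop, []).append(value)


values = {}
add_value(values, "nao:hasTag", "urn:example:tag1")  # fine, multivalued
add_value(values, "nie:title", "The Only Place")     # emits a warning
```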


Here's an example of JSON-LD output:

{
  "nie:comment" : "Free download from http://www.last.fm/music/Best+Coast and http://MP3.com",
  "nmm:trackNumber" : 1,
  "nmm:performer" : {
    "@id" : "urn:artist:Best%20Coast",
    "nmm:artistName" : "Best Coast",
    "@type" : "nmm:Artist"
  },
  "nfo:averageBitrate" : 128000,
  "nmm:musicAlbum" : {
    "@id" : "urn:album:The%20Only%20Place",
    "nmm:albumTitle" : "The Only Place",
    "@type" : "nmm:MusicAlbum",
    "nmm:albumArtist" : {
      "@id" : "urn:artist:Best%20Coast",
      "nmm:artistName" : "Best Coast",
      "@type" : "nmm:Artist"
    }
  },
  "nfo:channels" : 2,
  "nmm:dlnaProfile" : "MP3",
  "nmm:musicAlbumDisc" : {
    "@id" : "urn:album-disc:%C0:L%01:Disc1",
    "nmm:setNumber" : 1,
    "nmm:albumDiscAlbum" : {
      "@id" : "urn:album:The%20Only%20Place",
      "nmm:albumTitle" : "The Only Place",
      "@type" : "nmm:MusicAlbum",
      "nmm:albumArtist" : {
        "@id" : "urn:artist:Best%20Coast",
        "nmm:artistName" : "Best Coast",
        "@type" : "nmm:Artist"
      }
    },
    "@type" : "nmm:MusicAlbumDisc"
  },
  "nfo:duration" : 164,
  "nfo:codec" : "MPEG",
  "nmm:dlnaMime" : "audio/mpeg",
  "nfo:sampleRate" : 44100,
  "nie:title" : "The Only Place"
}

We can actually do much better than this. Right now there's no
@context, so it kind of misses the point of JSON-LD. I need to
finish writing a NamespaceManager class that can track all of the
prefixes and generate a suitable JSON-LD context, so that instead
of stuff like "nie:title" it can just say "title", and the @context
will link that to
<http://www.semanticdesktop.org/ontologies/2007/01/19/nie#title>.
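
As a rough illustration of that idea, here is a Python sketch. The
NamespaceManager class and the compact() helper are hypothetical and are
not the code in the branch; only the nie namespace IRI is taken from this
message. It handles top-level keys only and keeps the prefixed form when
two properties would collide on the same short name.

```python
import json


class NamespaceManager:
    """Hypothetical prefix registry (not the class from the branch)."""

    def __init__(self):
        self.prefixes = {}

    def add_prefix(self, prefix, iri):
        self.prefixes[prefix] = iri

    def expand(self, name):
        # "nie:title" -> full IRI, or None for keywords and unknown prefixes.
        prefix, sep, local = name.partition(":")
        if sep and prefix in self.prefixes:
            return self.prefixes[prefix] + local
        return None


def compact(document, namespaces):
    """Replace prefixed top-level keys with short names plus an @context."""
    context, body = {}, {}
    for key, value in document.items():
        iri = namespaces.expand(key)
        local = key.partition(":")[2]
        if iri is None or context.get(local, iri) != iri:
            body[key] = value  # keep @id, @type, unknown or clashing keys
            continue
        context[local] = iri
        body[local] = value
    return {"@context": context, **body}


ns = NamespaceManager()
ns.add_prefix("nie", "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#")
doc = {"@id": "urn:example:track", "nie:title": "The Only Place"}
print(json.dumps(compact(doc, ns), indent=2))
```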

The code is in branch wip/sam/resource:
<https://git.gnome.org/browse/tracker/log/?h=wip/sam/resource>.

It's still of course a work in progress but I think it's pretty much taken
shape, so please have a look and give feedback on whether you think
this is a sane approach!

The improvements to the extract modules sound really promising :). I'm
not too thrilled about yet another format being around, but I'm sure
there are other places where serialization could be useful.

Cheers,
  Carlos

