Re: [Tracker] Extracting the extractors

From: Philip Van Hoof <philip codeminded be>
To: Sam Thursfield <ssssam gmail com>
Cc: Tracker mailing list <tracker-list gnome org>
Subject: Re: [Tracker] Extracting the extractors
Date: Mon, 11 Apr 2016 15:06:31 +0200

Hi Sam,

So, what happens when I read a blog like this, and I find this:

        Decision 3: Kick RDF in the Nuts

        RDF is a shitty data model. It doesn’t have native support for
        lists. LISTS for fuck’s sake! The key data structure that’s used
        by almost every programmer on this planet and RDF starts out by
        giving developers a big fat middle finger in that area. Blank
        nodes are an abomination that we need, but they are applied
        inconsistently in the RDF data model


is the following. My mind thinks: this guy is ranting. I used to be like
this guy. I have better things to do. And: Oh my god, not another guy
who wants to create a standard.

As said, it's fine to add an output format for tracker-extract, but
between the processes tracker-extract, tracker-miner-fs and
tracker-store there's absolutely no need, whatsoever, to have JSON.

TTL is the format that we focus on, and that we parse without effort
given that it's part of SPARQL. JSON is not.

I think that something that converts from our output format to JSON-LD
is probably the task of an AngularJS or cgi-bin frontend for some web
server. That this web-server contacts tracker-extract's IPC instead of
tracker-store is something we right now don't support. But that doesn't
mean it wouldn't be a good idea to create such a nice tracker-extract
API (just note that you will have to become the maintainer of that API).

I think your frontend thingy could convert it to this format:
https://www.w3.org/TR/sparql11-results-json/ , or in JSON-LD, however,
tracker-extract could return it in TTL and/or using the FD passing
technique which is also in use between libtracker-sparql and
tracker-store.
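
To illustrate the frontend-side conversion mentioned above, here is a
minimal Python sketch that wraps triples in the SPARQL 1.1 Query Results
JSON layout. The function names, the input triple layout and the
example.org namespace IRI are assumptions made for this example; none of
them come from Tracker.

```python
import json

def term(value, kind="uri", datatype=None):
    """Build one RDF term in SPARQL 1.1 Query Results JSON form."""
    out = {"type": kind, "value": value}
    if datatype is not None:
        out["datatype"] = datatype
    return out

def triples_to_sparql_json(triples):
    """Wrap (subject, predicate, object) terms as a results document."""
    bindings = [{"s": s, "p": p, "o": o} for s, p, o in triples]
    return {"head": {"vars": ["s", "p", "o"]},
            "results": {"bindings": bindings}}

# Hypothetical input: one triple from the extractor output, already parsed.
# The predicate IRI uses a placeholder namespace, not the real nmm one.
triples = [
    (term("urn:artist:Best%20Coast"),
     term("http://example.org/nmm#artistName"),
     term("Best Coast", kind="literal")),
]
document = triples_to_sparql_json(triples)
print(json.dumps(document, indent=2))
```

The frontend would still need to parse the TTL it receives into terms
first; that step is left out of this sketch.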

And then tomorrow we'll all read another ranter's blog, and we will use
whatever that one proposes instead of JSON-LD in the frontend thingy. Fine.


Kind regards,

Philip



On Sun, 2016-04-10 at 22:15 +0100, Sam Thursfield wrote:

Thanks for the quick feedback!

You're right that I should have implemented Turtle output. I've done
that now, this is the result (as you'd expect):

<urn:artist:Best%20Coast> nmm:artistName "Best Coast" ;
  rdf:type nmm:Artist .

<urn:album:The%20Only%20Place> nmm:albumTitle "The Only Place" ;
  rdf:type nmm:MusicAlbum ;
  nmm:albumArtist <urn:artist:Best%20Coast> .

<urn:album-disc:The%20Only%20Place:Disc1> nmm:setNumber 1 ;
  nmm:albumDiscAlbum <urn:album:The%20Only%20Place> ;
  rdf:type nmm:MusicAlbumDisc .

<file:///home/sam/Downloads/Best%20Coast%20-%20The%20Only%20Place.mp3>
  nie:comment "Free download from http://www.last.fm/music/Best+Coast and http://MP3.com" ;
  nmm:trackNumber 1 ;
  nmm:performer <urn:artist:Best%20Coast> ;
  nfo:averageBitrate 128000 ;
  nmm:musicAlbum <urn:album:The%20Only%20Place> ;
  nfo:channels 2 ;
  nmm:dlnaProfile "MP3" ;
  nmm:musicAlbumDisc <urn:album-disc:The%20Only%20Place:Disc1> ;
  rdf:type nmm:MusicPiece , nfo:Audio ;
  nfo:duration 164 ;
  nfo:codec "MPEG" ;
  nmm:dlnaMime "audio/mpeg" ;
  nfo:sampleRate 44100 ;
  nie:title "The Only Place" .


I'm still kinda interested in JSON-LD, because JSON (though not
JSON-LD) has such a massive user base already. Philip, JSON-LD *is* a
W3C standard: <https://www.w3.org/TR/json-ld/>. The great thing about
standards is there are so many!

That said, all the W3C's previous attempts at RDF-in-JSON are quite
bad; I think JSON-LD is definitely an improvement. There's a great
blog post from the main guy behind the standard called "JSON-LD and
Why I Hate the Semantic Web" which I recommend reading :-)
<http://manu.sporny.org/2014/json-ld-origins-2/>

Anyway, for my purposes, Turtle output from the extractors is fine
(and a big improvement on SPARQL). I'll keep the JSON-LD stuff around
in a separate commit.


On Sat, Apr 9, 2016 at 12:49 PM, Carlos Garnacho <carlosg gnome org> wrote:

Hey Sam :),

so, inspired by something in the Python RDFLib library, I came up with a
TrackerResource class that the extractors can use instead. This is a
work in progress, but I have a branch in git.gnome.org that adds
TrackerResource, and converts some of the extractors to use it. The
TrackerResource class can serialize either to SPARQL update commands or
to JSON-LD. The branch also adds the `tracker extract` command from
<https://bugzilla.gnome.org/show_bug.cgi?id=751991> so you can try out
the extractors easily and specify `-o json` or `-o sparql` as you prefer.
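
As a rough idea of what a JSON-LD serialization of one resource could
look like, here is a small Python sketch. It is not the output of the
branch: the `to_jsonld` helper is hypothetical and the namespace IRI in
the context is a placeholder.

```python
import json

# Placeholder namespace IRI, assumed for illustration only.
CONTEXT = {"nmm": "http://example.org/nmm#"}

def to_jsonld(uri, rdf_type, properties):
    """Return a JSON-LD node object for one resource."""
    node = {"@context": CONTEXT, "@id": uri, "@type": rdf_type}
    node.update(properties)
    return node

artist = to_jsonld("urn:artist:Best%20Coast", "nmm:Artist",
                   {"nmm:artistName": "Best Coast"})
print(json.dumps(artist, indent=2))
```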


Nice! Should it have a turtle serializer too? Do you think this can be
possibly used in the tracker store side to serialize contents?


I hadn't thought of that, but it's definitely possible. You could have
a `tracker serialize-the-whole-database` command :-)

In terms of backups, part of me thinks we should use an efficient
binary format, but then it's hard to trust a backup that is an opaque
binary format. If we could serialize to Turtle or JSON-LD then you
could tell just by looking whether it was valid or not. We can just
gzip it to make it small.
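
The gzip idea can be sketched in a few lines of Python. The helper names
are made up for this example, and the "backup" is just one Turtle
snippet repeated many times to stand in for a real dump.

```python
import gzip

# Stand-in for a serialized database: one Turtle snippet, repeated.
turtle = (
    '<urn:artist:Best%20Coast> nmm:artistName "Best Coast" ;\n'
    '  rdf:type nmm:Artist .\n'
)
payload = turtle * 1000

def compress_backup(text):
    """Encode the Turtle text as UTF-8 and gzip it."""
    return gzip.compress(text.encode("utf-8"))

def restore_backup(blob):
    """Reverse compress_backup(): gunzip and decode back to text."""
    return gzip.decompress(blob).decode("utf-8")

blob = compress_backup(payload)
restored = restore_backup(blob)
print(len(payload), "characters of Turtle ->", len(blob), "bytes gzipped")
```

The restored text is plain Turtle again, so it can still be inspected by
eye after decompression.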

...


Here's an example of auto-generated SPARQL for an MP3 extraction:


<snip>



Note there are a lot more DELETE statements than before. I figured that
anywhere we want to replace the existing data we need a DELETE
statement, and the reason we don't normally do it is because previously
it had to be done manually. That said, the TrackerResource class does
have a way of avoiding this. If you ever call _set_value() for a property then
it assumes you want to *overwrite* it, and will generate a DELETE. If you
only use _add_value() then it will assume you want to *add* to it, and won't
generate a DELETE. The latter case is needed for stuff like nao:hasTag.
I may be misunderstanding things here of course, I didn't actually write any
of the extractors myself.
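
The set/add distinction described above can be sketched in Python as
follows. This is a hypothetical stand-in, not the real TrackerResource
API, and the exact shape of the generated SPARQL is an assumption (the
real output is snipped above).

```python
class Resource:
    """Hypothetical stand-in for TrackerResource; not the real API."""

    def __init__(self, uri):
        self.uri = uri
        self.values = {}        # property -> list of preformatted values
        self.overwrite = set()  # properties touched by set_value()

    def set_value(self, prop, value):
        # Overwrite semantics: drop earlier values and request a DELETE.
        self.values[prop] = [value]
        self.overwrite.add(prop)

    def add_value(self, prop, value):
        # Append semantics: keep earlier values, never request a DELETE.
        self.values.setdefault(prop, []).append(value)

    def to_sparql(self):
        lines = []
        # One DELETE per overwritten property, before the inserts.
        for prop in self.values:
            if prop in self.overwrite:
                lines.append(
                    "DELETE { <%s> %s ?v } WHERE { <%s> %s ?v } ;"
                    % (self.uri, prop, self.uri, prop))
        lines.append("INSERT DATA {")
        for prop, vals in self.values.items():
            for v in vals:
                lines.append("  <%s> %s %s ." % (self.uri, prop, v))
        lines.append("}")
        return "\n".join(lines)

# Hypothetical file URI and tag URN, for illustration only.
track = Resource("file:///home/sam/example.mp3")
track.set_value("nie:title", '"The Only Place"')    # generates a DELETE
track.add_value("nao:hasTag", "<urn:tag:example>")  # no DELETE
sparql = track.to_sparql()
print(sparql)
```

Only nie:title gets a DELETE here; the nao:hasTag value is inserted
alongside whatever tags already exist.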


Sounds good :). It seems to me that the generated SPARQL already
ensures some correctness, which is great. The difference between set
and add makes sense, given that we have to deal with single and
multivalued properties. The only potentially harmful combination would
be doing add_value() on a single-valued property. Is there any way
that could raise a warning in tracker-extract, rather than being
caught late due to the failed insert?


I don't think that's possible because libtracker-sparql doesn't have
any knowledge of the ontologies. We could move a bunch of code from
libtracker-data to libtracker-sparql to make it happen, but I actually
think it's a good design to have libtracker-sparql separate from
Tracker's own database and Tracker's own ontologies.

Sam