Re: [Tracker] Extracting the extractors



On 09/04/16 00:39, Sam Thursfield wrote:
Hi all

Hi Sam, :)

I've always felt like Tracker's extractors should be reusable outside
Tracker. The design makes that possible but right now they output their
results as a series of slightly non-standard SPARQL update commands,
which I don't think is useful for many folk. Lots of people aren't using
SPARQL databases at all, believe it or not :-)

Yea.

I also thought that while it was easy enough for Tracker developers to read the SPARQL output from tracker-extract, it would be easier for others (and us) to read a simple key/value output (or other formats).

For debugging purposes, it was always a bit awkward using tracker-extract (not in /usr/bin, daemon based, etc) and I would have liked it to be more command line ready.

The whole point of RDF is to make data interchange easy so I think we
can do better than that. I've been looking at making the extractors
optionally output their results in JSON-LD[1] format instead. The cool
thing about JSON-LD is that if you squint, it's just good old JSON that
everyone's familiar with. If you look closely it's also Linked Data,
but in a more human-friendly serialization format than any of the more
traditional RDF formats.

I've been using JSON recently a lot with Android and PHP/DB based development and quite like it. Philip is right, it's used a lot these days (likely because Google and other big players push it).

JSON is great and there are quite a few conversion tools to/from other formats too. So I like what you're suggesting here.

The catch here is that Tracker's extractor modules are all hardwired to
generate SPARQL using TrackerSparqlBuilder. To be honest I've never
liked this approach, it's pretty incomprehensible to newcomers and
overly verbose, especially where we explicitly generate DELETE queries
to go along with the INSERT queries.

I kind of agree with you.

I did the last conversion from what we had to SPARQL on most of the extractors (with Carlos IIRC).

Sadly, we currently use the SPARQL / RDF approach for its relational qualities. I suppose JSON could also fit as a replacement here, but it sounds like you're not suggesting changing the extractors like that below:

My thought here is that if we were (hypothetically) to move to something that makes the extractors more useful, the format we use should work in a relational way and give us maximum format conversion opportunities.

so, inspired by something in the Python RDFLib library, I came up with a
TrackerResource class that the extractors can use instead. This is a
work in progress, but I have a branch in git.gnome.org that adds
TrackerResource, and converts some of the extractors to use it. The
TrackerResource class can serialize either to SPARQL update commands or
to JSON-LD. The branch also adds the `tracker extract` command from
<https://bugzilla.gnome.org/show_bug.cgi?id=751991> so you can try out
the extractors easily and specify `-o json` or `-o sparql` as you prefer.

Pretty cool :)

The results for the extractors I've converted so far are promising in
terms of reducing code size:

      src/tracker-extract/tracker-extract-abw.c       |  51 ++--
      src/tracker-extract/tracker-extract-bmp.c       |  18 +-
      src/tracker-extract/tracker-extract-dvi.c       |  17 +-
      src/tracker-extract/tracker-extract-epub.c      | 131 +++-----
      src/tracker-extract/tracker-extract-gstreamer.c | 910 ++++++++++++++++++-------------------------------------
      src/tracker-extract/tracker-extract-mp3.c       | 378 ++++++++---------------
      6 files changed, 511 insertions(+), 994 deletions(-)

That's good, less code to maintain :)

Here's an example of auto-generated SPARQL for an MP3 extraction:

     DELETE {
     }
     WHERE {
     <file:///home/sam/Downloads/Best%20Coast%20-%20The%20Only%20Place.mp3>
          nie:comment ?nie_comment ;
          nmm:trackNumber ?nmm_trackNumber ;
          nmm:performer ?nmm_performer ;
          nfo:averageBitrate ?nfo_averageBitrate ;
          nmm:musicAlbum ?nmm_musicAlbum ;
          nfo:channels ?nfo_channels ;
          nmm:dlnaProfile ?nmm_dlnaProfile ;
          nmm:musicAlbumDisc ?nmm_musicAlbumDisc ;
          rdf:type ?rdf_type ;
          nfo:duration ?nfo_duration ;
          nfo:codec ?nfo_codec ;
          nmm:dlnaMime ?nmm_dlnaMime ;
          nfo:sampleRate ?nfo_sampleRate ;
          nie:title ?nie_title .
     }
     DELETE {
     }
     WHERE {
     <urn:artist:Best%20Coast> nmm:artistName ?nmm_artistName ;
          rdf:type ?rdf_type .
     }
     INSERT {
     <urn:artist:Best%20Coast> a nmm:Artist ;
          nmm:artistName "Best Coast" .
     }
     DELETE {
     }
     WHERE {
     <urn:album:The%20Only%20Place> nmm:albumTitle ?nmm_albumTitle ;
          rdf:type ?rdf_type ;
          nmm:albumArtist ?nmm_albumArtist .
     }
     INSERT {
     <urn:album:The%20Only%20Place> a nmm:MusicAlbum ;
          nmm:albumTitle "The Only Place" ;
          nmm:albumArtist <urn:artist:Best%20Coast> .
     }
     DELETE {
     }
     WHERE {
     <urn:album-disc:%D0:%06%02:Disc1> nmm:setNumber ?nmm_setNumber ;
          nmm:albumDiscAlbum ?nmm_albumDiscAlbum ;
          rdf:type ?rdf_type .
     }
     INSERT {
     <urn:album-disc:%D0:%06%02:Disc1> a nmm:MusicAlbumDisc ;
          nmm:setNumber 1 ;
          nmm:albumDiscAlbum <urn:album:The%20Only%20Place> .
     }
     INSERT {
     <file:///home/sam/Downloads/Best%20Coast%20-%20The%20Only%20Place.mp3>
          a nmm:MusicPiece , nfo:Audio ;
          nie:comment "Free download from http://www.last.fm/music/Best+Coast and http://MP3.com" ;
          nmm:trackNumber 1 ;
          nmm:performer <urn:artist:Best%20Coast> ;
          nfo:averageBitrate 128000 ;
          nmm:musicAlbum <urn:album:The%20Only%20Place> ;
          nfo:channels 2 ;
          nmm:dlnaProfile "MP3" ;
          nmm:musicAlbumDisc <urn:album-disc:%D0:%06%02:Disc1> ;
          nfo:duration 164 ;
          nfo:codec "MPEG" ;
          nmm:dlnaMime "audio/mpeg" ;
          nfo:sampleRate 44100 ;
          nie:title "The Only Place" .
     }

My only concern with auto-generated SPARQL is that the Tracker DB/engine is quite sensitive to missing required properties. We have many workarounds for this sort of thing.

Note there are a lot more DELETE statements than before. I figured that
anywhere we want to replace the existing data we need a DELETE
statement, and the reason we don't normally do it is because previously
it had to be done manually. That said, the TrackerResource class does
have a way of avoiding this. If you ever call _set_value() for a property then
it assumes you want to *overwrite* it, and will generate a DELETE. If you
only use _add_value() then it will assume you want to *add* to it, and won't
generate a DELETE. The latter case is needed for stuff like nao:hasTag.
I may be misunderstanding things here of course, I didn't actually write any
of the extractors myself.
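
To make the overwrite/add distinction concrete, here is a minimal Python
sketch. It is not the TrackerResource API from the branch: the Resource
class, its method names, and the urn:example / urn:tag identifiers are
all hypothetical, and emitting one DELETE per overwritten property is a
choice made for this sketch rather than something the branch is known
to do.

```python
import json


class Resource:
    """Hypothetical stand-in for TrackerResource, not the real C API."""

    def __init__(self, identifier):
        self.identifier = identifier
        self.values = {}          # property name -> list of values
        self.overwritten = set()  # properties written via set_value()

    def set_value(self, prop, value):
        # Overwrite semantics: keep one value and request a DELETE.
        self.values[prop] = [value]
        self.overwritten.add(prop)

    def add_value(self, prop, value):
        # Additive semantics: append, never request a DELETE.
        self.values.setdefault(prop, []).append(value)

    def to_sparql(self):
        subject = f"<{self.identifier}>"
        statements = []
        # One DELETE per overwritten property (an assumption of this
        # sketch), so a missing property cannot stop the others matching.
        for prop in sorted(self.overwritten):
            var = "?" + prop.replace(":", "_")
            triple = f"{subject} {prop} {var}"
            statements.append(f"DELETE {{ {triple} }} WHERE {{ {triple} }}")
        triples = [
            f"{subject} {prop} {_term(value)}"
            for prop, values in self.values.items()
            for value in values
        ]
        statements.append("INSERT DATA { " + " . ".join(triples) + " }")
        return "\n".join(statements)


def _term(value):
    # Resources are referenced by IRI, strings are quoted, numbers are bare.
    if isinstance(value, Resource):
        return f"<{value.identifier}>"
    return json.dumps(value) if isinstance(value, str) else str(value)


song = Resource("urn:example:song")
song.set_value("nie:title", "The Only Place")         # overwrite: DELETE emitted
song.add_value("nao:hasTag", Resource("urn:tag:indie"))  # add: no DELETE
sparql = song.to_sparql()
print(sparql)
```

Only nie:title gets a DELETE here; the nao:hasTag triple is inserted
alongside whatever tags already exist.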

Here's an example of JSON-LD output:

{
   "nie:comment" : "Free download from http://www.last.fm/music/Best+Coast and http://MP3.com",
   "nmm:trackNumber" : 1,
   "nmm:performer" : {
     "@id" : "urn:artist:Best%20Coast",
     "nmm:artistName" : "Best Coast",
     "@type" : "nmm:Artist"
   },
   "nfo:averageBitrate" : 128000,
   "nmm:musicAlbum" : {
     "@id" : "urn:album:The%20Only%20Place",
     "nmm:albumTitle" : "The Only Place",
     "@type" : "nmm:MusicAlbum",
     "nmm:albumArtist" : {
       "@id" : "urn:artist:Best%20Coast",
       "nmm:artistName" : "Best Coast",
       "@type" : "nmm:Artist"
     }
   },
   "nfo:channels" : 2,
   "nmm:dlnaProfile" : "MP3",
   "nmm:musicAlbumDisc" : {
     "@id" : "urn:album-disc:%C0:L%01:Disc1",
     "nmm:setNumber" : 1,
     "nmm:albumDiscAlbum" : {
       "@id" : "urn:album:The%20Only%20Place",
       "nmm:albumTitle" : "The Only Place",
       "@type" : "nmm:MusicAlbum",
       "nmm:albumArtist" : {
         "@id" : "urn:artist:Best%20Coast",
         "nmm:artistName" : "Best Coast",
         "@type" : "nmm:Artist"
       }
     },
     "@type" : "nmm:MusicAlbumDisc"
   },
   "nfo:duration" : 164,
   "nfo:codec" : "MPEG",
   "nmm:dlnaMime" : "audio/mpeg",
   "nfo:sampleRate" : 44100,
   "nie:title" : "The Only Place"
}

Yea, it's nice and much clearer than SPARQL in my opinion. This could make quite a difference to the learning curve for people getting into Tracker too. SPARQL has always been problematic for the project in that respect.

We can actually do much better than this. Right now there's no
@context, so it kind of misses the point of JSON-LD. I need to
finish writing a NamespaceManager class that can track all of the
prefixes and generate a suitable JSON-LD context, so that instead
of stuff like "nie:title", it can just say "title" and then the @context
will link that to
<http://www.semanticdesktop.org/ontologies/2007/01/19/nie#title>.
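
As a rough illustration of that idea, the Python sketch below strips
known prefixes from the keys of a flat document and records the full
IRIs in an @context. It is not the NamespaceManager from the branch:
the compact() helper and the PREFIXES table are hypothetical, only the
nie prefix from this message is registered, and it ignores nested
resources and clashes between local names.

```python
import json

# Only the nie prefix appears in full in this thread; a real namespace
# manager would track every prefix the ontologies define.
PREFIXES = {
    "nie": "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#",
}


def compact(document, prefixes):
    """Return a copy of a flat document with short keys plus an @context."""
    context = {}
    result = {}
    for key, value in document.items():
        prefix, sep, local = key.partition(":")
        if sep and prefix in prefixes:
            # Map the short name to the full property IRI.
            context[local] = prefixes[prefix] + local
            result[local] = value
        else:
            # Unknown prefix or JSON-LD keyword: keep the key unchanged.
            result[key] = value
    return {"@context": context, **result}


doc = compact({"nie:title": "The Only Place", "nfo:channels": 2}, PREFIXES)
print(json.dumps(doc, indent=2))
```

Keys with unregistered prefixes, such as nfo:channels here, pass
through untouched, so the output stays usable while the prefix table is
incomplete.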

The code is in branch wip/sam/resource:
<https://git.gnome.org/browse/tracker/log/?h=wip/sam/resource>.

It's still of course a work in progress but I think it's pretty much taken
shape, so please have a look and give feedback on whether you think
this is a sane approach!

Seriously cool stuff.

On top of that, I like my command lines (as you can see from the work I started with the 'tracker' command) and like to have things easily accessible there, so I think this would go a long way towards making data available for other apps.

I like it a lot Sam, thanks again.

--
Regards,
Martyn

