Re: [Tracker] A use-case for SPARQL and Nepomuk



On 26/06/2013 0:24, Philip Van Hoof wrote:

I can't resist writing down my ideas on this part. I must warn that it's very crazy. Heh ;-)

- The FS miner and tracker-extract join together; tracker-extract becomes a user of libtracker-extract
  - By default it has a backend that does SPARQL UPDATE on a libtracker-sparql setup with the desktop-nepomuk ontology package (not very different from now, just separate)

- libtracker-extract gets extended with public API that aids miner writers with metadata extraction from streams, files and buffers (an offset into an mmap, for example)
Basically, in libstreamanalyzer you make a Strigi::IndexWriter that you use as the behaviour for a Strigi::StreamAnalyzer (the setIndexWriter method on StreamAnalyzer takes what I call a strategy behaviour). The strategy can, for example, merge Nepomuk statements together to form a SPARQL UPDATE sentence. This is also what tracker-topanalyzer.cpp did (back when it actually worked).
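
To make the strategy idea concrete, here's a toy sketch of such a writer (the method names mirror IndexWriter's hooks as I remember them - startAnalysis, addTriplet, finishAnalysis - so don't take the names or signatures literally):

#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for Strigi::IndexWriter: the analyzers report RDF
// statements one by one, and the strategy buffers them so they can
// be merged into a single SPARQL UPDATE at the end of the analysis.
class SparqlUpdateWriter {
public:
    void startAnalysis(const std::string& uri) {
        triples.clear();
        addTriplet("<" + uri + ">", "a", "nie:InformationElement");
    }

    void addTriplet(const std::string& s, const std::string& p,
                    const std::string& o) {
        triples.push_back(s + " " + p + " " + o);
    }

    // Everything collected for this resource becomes one UPDATE
    // sentence, ready to hand to libtracker-sparql.
    std::string finishAnalysis() const {
        std::ostringstream q;
        q << "INSERT {\n";
        for (size_t i = 0; i < triples.size(); ++i)
            q << "  " << triples[i] << " .\n";
        q << "}";
        return q.str();
    }

private:
    std::vector<std::string> triples;
};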

With streams this merging can usually be done while the streaming itself is taking place. Because of that I agree with Jos's design for stream-based access, for files and archive file types alike (in my opinion it's more capable than libtracker-extract and tracker-extract).

Because it's more difficult to add file format support for it, the Tracker project (during the Nokia times) didn't pick libstreamanalyzer; instead we wrote our own extractor modules for tracker-extract. I must say that it was a close call: I discussed it a few times with libstreamanalyzer's maintainer, Jos, but we didn't have hard requirements from Nokia to recurse into archive file types (notwithstanding that we all agreed Strigi was *really* cool stuff). Not important for the customer in Scrum means no sprint task. No sprint task means it never happens (unless the open source community picks it up, in which case it would have happened: this is also why I was allowed to experiment during one sprint task with that tracker-topanalyzer.cpp).

The world didn't stop. So if I got carte blanche today to design a new metadata extraction framework, I'd take the following into consideration (the world of metadata has changed massively):

METADATA:
  1. Is sometimes minable from online resources (social media, services, etc.) (you do a SPARQL UPDATE from a web miner which gets it);
  2. Can arrive packaged together with the dataobject in many synchronization solutions and use-cases (you do a SPARQL UPDATE from sync apps);
  3. Is indeed often minable from local resources (files) (which is why the FS miner exists; Strigi's libstreamanalyzer does this too);
  4. Must sometimes be mined by recursively going into archives (MIME documents, ZIP and tar archives) (libstreamanalyzer's main use-case);
  5. Must sometimes be aggregated from multiple resources to become meaningful (for relationships between domains);
    1. Metadata is everywhere, and its dataobjects are everywhere. Information elements ideally relate to other information elements;
    2. For example, a contact's geographical location is available on a web resource, his address was filled in by the user in the contacts application, his phone number got mined from the phone, and his photo was taken using a camera (and his face was recognized and fingerprinted by the camera);
    3. Ideally all of that metadata is found and then inserted as one transactional SPARQL INSERT (a sketch follows right after this list);
      1. The utopian "I have a dream"
      2. This helps applications consuming contacts a lot: all info is instantly available at transaction commit, a single change signal, etc
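
To make 5.3 concrete, here is roughly what that single transactional INSERT could look like through libtracker-sparql's C API (just a sketch: the URNs and values are made up, and do check the function signatures against your libtracker-sparql version):

#include <libtracker-sparql/tracker-sparql.h>

static void
insert_contact (void)
{
  GError *error = NULL;
  TrackerSparqlConnection *conn;

  conn = tracker_sparql_connection_get (NULL, &error);
  if (!conn) {
    g_printerr ("No Tracker connection: %s\n", error->message);
    g_error_free (error);
    return;
  }

  /* Everything the miners collected about Jos, committed at once:
   * one transaction, one change signal for the consumers. */
  const gchar *sparql =
    "PREFIX nco: <http://www.semanticdesktop.org/ontologies/2007/03/22/nco#> "
    "INSERT { "
    "  <urn:contact:jos> a nco:PersonContact ; "
    "    nco:fullname 'Jos' ; "                        /* from the .vcf file */
    "    nco:photo <urn:photo:1234> ; "                /* from the camera */
    "    nco:hasPhoneNumber [ a nco:PhoneNumber ; "
    "      nco:phoneNumber '+32 ...' ] ; "             /* mined from the phone */
    "    nco:hasPostalAddress [ a nco:PostalAddress ; "
    "      nco:locality 'Gent' ] "                     /* from the GEO website */
    "}";

  tracker_sparql_connection_update (conn, sparql, G_PRIORITY_DEFAULT,
                                    NULL, &error);
  if (error) {
    g_printerr ("Update failed: %s\n", error->message);
    g_error_free (error);
  }

  g_object_unref (conn);
}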

I totally prefer Jos's design of libstreamanalyzer over libtracker-extract because it's clearly much more capable for #4 and #2. It'll be slower than a simple open/read or mmap for #3 (stream abstractions make things possible, but they don't make things faster) and more difficult, because file format libraries usually don't offer a stream-based API: try libpng with a 'stream' pointing to a .png file inside a ZIP file -> you have to write a stream-based PNG metadata extractor from scratch, whereas libtracker-extract can just use libpng. I salute Jos for trying to inspire people to "just do that".
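
For the curious, this is roughly what "from scratch" means for the PNG case: walking the chunk layout yourself over a stream you can't seek in. A toy sketch over std::istream (real code would sit on Strigi's InputStream instead, and would also need CRC checks, bounds checking, iTXt/zTXt support, and so on):

#include <cstdint>
#include <iostream>
#include <string>

// PNG stores all integers big-endian.
static uint32_t read_be32(std::istream& in) {
    unsigned char b[4] = { 0, 0, 0, 0 };
    in.read(reinterpret_cast<char*>(b), 4);
    return (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16) |
           (uint32_t(b[2]) << 8)  |  uint32_t(b[3]);
}

void print_png_text_chunks(std::istream& in) {
    char sig[8];
    in.read(sig, 8);   // should be 0x89 "PNG" \r \n 0x1a \n; not verified here
    for (;;) {
        uint32_t length = read_be32(in);
        char type[5] = { 0, 0, 0, 0, 0 };
        in.read(type, 4);
        if (!in || std::string(type) == "IEND") break;
        if (std::string(type) == "tEXt") {
            std::string data(length, '\0');
            in.read(&data[0], length);
            std::string::size_type nul = data.find('\0');
            if (nul != std::string::npos)   // tEXt is "keyword\0text"
                std::cout << data.substr(0, nul) << " = "
                          << data.substr(nul + 1) << "\n";
        } else {
            in.ignore(length);              // chunk types we don't parse
        }
        in.ignore(4);                       // skip the CRC
    }
}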

However, I don't think it's necessarily a good fit for #1 and #5.

What I have in mind for a use-case like 5.2 is something like this:

- We find out that there is a contact to mine
- We find out which resources are available for this contact to get metadata from (very often we can introspect that from the contact's initial metadata)
- We visit those resources and get the metadata

Example

Jos is a contact. Jos's metadata resources are:
   - Camera for photo relationships (is on the photo)
   - Website for geographical location (is in this city)
   - Or the phone for geographical location (is at these coordinates, using the phone's GPS)
   - Certain contact details in a .vcf file
   - Other details in a forwarded MIME part of an E-mail in the INBOX of a known IMAP server, where that MIME part is a ZIP file that contains a .vcf file (Strigi!!)

So, using a metadata collector visitor, we "Visit" the contact. From this first visit we learn that the contact contains other visitable elements: "Camera", "GEO Website", "GEO Phone", "VCF File" and "E-Mail", all of which the visitor also visits.

Basically, instead of (or together with) libstreamanalyzer's strategy-with-streams-and-decorator approach: a visitor. I also think all those projects should run in their own process, so that'll be the visitor's Visit() and Accept() over FD-passed IPC. They should go in their own process because I simply don't trust each and every visitor implementation (which is also why tracker-extract runs in its own process).
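
A rough in-process sketch of that visitor shape (all names here are mine and purely illustrative; the real Visit() and Accept() would cross process boundaries):

#include <iostream>
#include <memory>
#include <string>
#include <vector>

class MetadataVisitor;

class Visitable {
public:
    virtual ~Visitable() {}
    // In the real design this is an IPC method, Accept(Visitor : v),
    // so every miner can live in its own (untrusted) process.
    virtual void accept(MetadataVisitor& v) = 0;
};

class MetadataVisitor {
public:
    void visit(Visitable& v) { v.accept(*this); }
    void collect(const std::string& triple) { triples_.push_back(triple); }
    void discovered(std::shared_ptr<Visitable> v) { queue_.push_back(v); }

    // Visit the root, then every visitable it (transitively) told us
    // about; merge all collected statements into one SPARQL UPDATE.
    std::string run(Visitable& root) {
        visit(root);
        while (!queue_.empty()) {
            std::shared_ptr<Visitable> next = queue_.back();
            queue_.pop_back();
            visit(*next);
        }
        std::string q = "INSERT {\n";
        for (size_t i = 0; i < triples_.size(); ++i)
            q += "  " + triples_[i] + " .\n";
        return q + "}";
    }

private:
    std::vector<std::string> triples_;
    std::vector<std::shared_ptr<Visitable> > queue_;
};

// The "GEO Website" resource from the example above.
class GeoWebsite : public Visitable {
    void accept(MetadataVisitor& v) {
        v.collect("<urn:contact:jos> nco:hasPostalAddress <urn:addr:1>");
    }
};

// Visiting the contact itself is what tells the visitor which other
// visitable resources exist for it.
class Contact : public Visitable {
    void accept(MetadataVisitor& v) {
        v.collect("<urn:contact:jos> a nco:PersonContact");
        v.discovered(std::make_shared<GeoWebsite>());
    }
};

int main() {
    Contact jos;
    MetadataVisitor visitor;
    std::cout << visitor.run(jos) << std::endl;  // one transactional UPDATE
}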
 

- Miners are separate projects (GNOME has already started this here: https://git.gnome.org/browse/gnome-online-miners); the FS miner should be just like this too

So these miners all become visitables, implementing an IPC method Accept(Visitor : v), with the Visitor being something that keeps state about the information element to extract metadata for; the Visitor is also the one that collects the Nepomuk statements and finally merges them together to form a SPARQL UPDATE, which it runs on a libtracker-sparql setup that uses the desktop-nepomuk ontology package (which will be globally unique on a GNOME and/or KDE desktop).

There. That's my idea. Which I know is probably totally insane and crazy.

Kind regards,

Philip


