On 26/06/2013 0:24, Philip Van Hoof wrote:
I can't resist writing down my ideas on this part. I must warn that it's very crazy. Heh ;-)

- The FS miner and tracker-extract join together; tracker-extract becomes a user of libtracker-extract

Basically, in libstreamanalyzer you make a Strigi::IndexWriter that you use as behaviour for Strigi::StreamAnalyzer (the setIndexWriter method on StreamAnalyzer takes what I call a strategy behaviour). The strategy can, for example, implement merging Nepomuk statements together to form a SPARQL UPDATE sentence. This is also what tracker-topanalyzer.cpp did (when it actually worked). With streams this can usually be done while the streaming itself is taking place.

Because of that I agree with Jos's design for stream-based access, for files and archive file types: in my opinion it's more capable than libtracker-extract and tracker-extract. Because it's more difficult to add file format support for it, the Tracker project (during Nokia times) didn't pick libstreamanalyzer; instead we wrote our own extractor modules for tracker-extract. I must say that it was a close call: I discussed it a few times with libstreamanalyzer's maintainer, Jos, but we didn't have hard requirements from Nokia to recurse into archive file types (notwithstanding that we all agreed Strigi was *really* cool stuff). Not important for the customer in Scrum means no sprint task. No sprint task means that it never happens (unless the open source community picks it up, in which case it would have happened: this is also why I was allowed to experiment during one sprint task with that tracker-topanalyzer.cpp). The world didn't stop.

So if I were given carte blanche today to design a new metadata extraction framework, I'd take the following into consideration (the world of metadata has changed massively):

METADATA:
I totally prefer Jos's design of libstreamanalyzer over libtracker-extract because it's clearly much more capable for #4 and #2. It'll be slower than a simple open/read or mmap for #3 (stream abstractions make things possible, but they don't make things faster) and more difficult, because file format libraries usually don't offer a stream-based API: try libpng with a 'stream' pointing to a .png file inside a ZIP file -> you have to write a stream-based PNG metadata extractor from scratch, whereas libtracker-extract can just use libpng. I salute Jos for trying to inspire people to "just do that".

However, I don't think it's necessarily a good fit for #1 and #5. What I have in mind for a use case like 5.2 is something like this:

- We find out that there is a contact to mine
- We find out which resources are available for this contact to get metadata from (very often we can introspect that from the contact's initial metadata)
- We visit those resources and get the metadata

Example: Jos is a contact. Jos's metadata resources are:

- Camera for photo relationships (is on the photo)
- Website for geographical location (is in this city)
- Or phone for geographical location (is at these coordinates, using the phone's GPS)
- Certain contact details in a .vcf file
- Other details in a forwarded MIME part of an e-mail in the INBOX of a known IMAP server, where that MIME part is a ZIP file that contains a .vcf file (Strigi!!)

So, using a metadata-collecting visitor we "visit" the contact. At this first visit we learn that the contact contains other visitable elements: "Camera", "GEO Website", "GEO Phone", "VCF File", "E-Mail", which the visitor then also visits. Basically, instead of (or together with) libstreamanalyzer's strategy with streams and decorator: visitor.

I also think all those projects should run in their own process. So that'll be the visitor's Visit() and Accept() over FD-passed IPC.
I think they should go into their own process because I simply don't trust each and every visitor implementation (just like why tracker-extract runs in its own process).

- Miners are separate projects (GNOME has already started this here: https://git.gnome.org/browse/gnome-online-miners). The FS miner should also be just like this.

So these miners all become visitables, implementing an IPC method Accept(Visitor : v), with Visitor being something that keeps state about an information element to extract metadata for, and Visitor being the one that collects the Nepomuk statements and finally merges them together to form a SPARQL UPDATE. It calls that update on a libtracker-sparql setup that uses the nepomuk-desktop ontology package (which will be globally unique on a GNOME and/or KDE desktop).

There. That's my idea. Which I know is probably totally insane and crazy.

Kind regards,

Philip