Re: [Tracker] wip/passive-extraction (and API cleanup?)
- From: Martyn Russell <martyn@lanedo.com>
- To: Carlos Garnacho <carlosg@gnome.org>, tracker-list@gnome.org
- Subject: Re: [Tracker] wip/passive-extraction (and API cleanup?)
- Date: Thu, 09 Jan 2014 19:20:12 +0000
I've CCd a few people here due to some API changes we're making that
might affect you guys out there.
On 09/01/14 12:48, Carlos Garnacho wrote:
> Hey all,
>
> I've been talking about this branch on #tracker, but now that most work
> is done there, it is worth raising on the ML. In that branch there are
> two extra objects in libtracker-miner:
>
> * TrackerDecorator is a TrackerMiner that implements a passive
>   indexing pattern: instead of being expected to feed data
>   directly to tracker-store, it listens for GraphUpdated signals,
>   so when an item eligible for indexing is added/updated and is
>   still missing a nie:dataSource specific to the decorator, it is
>   queued for processing. On startup it also queries for all
>   elements of the eligible rdf:types that are missing that
>   nie:dataSource, so all elements are guaranteed to be indexed.
>
> * TrackerDecoratorFS is a file-specific implementation of that
>   object, which basically adds volume monitoring, so indexing
>   within just-added volumes is resumed if it was previously
>   interrupted, and queued elements are dropped if their volume is
>   removed.
I quite like keeping the volume stuff separate and I think this fits
nicely with what Xavier and other embedded users have in mind - which is
heavy removable device use.
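For anyone wanting to play with this, here is roughly how I picture a
consumer driving the decorator queue. A rough, untested sketch only -
the API names are lifted from the branch, so treat them as provisional:

#include <libtracker-miner/tracker-miner.h>

static void
decorator_next_cb (GObject      *object,
                   GAsyncResult *result,
                   gpointer      user_data)
{
	TrackerDecorator *decorator = TRACKER_DECORATOR (object);
	TrackerDecoratorInfo *info;
	GError *error = NULL;

	info = tracker_decorator_next_finish (decorator, result, &error);
	if (!info) {
		/* Queue drained, or something went wrong */
		g_clear_error (&error);
		return;
	}

	/* Extract metadata for tracker_decorator_info_get_url (info)
	 * here, hand the resulting SPARQL back to the decorator, then
	 * move on to the next queued item. */
	tracker_decorator_info_unref (info);
	tracker_decorator_next (decorator, NULL, decorator_next_cb, NULL);
}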
> In that branch tracker-extract does use these features: it has been
> turned into a full-blown standalone miner using TrackerDecorator,
> while miner-fs stopped calling it. On the one hand, this leads to
> greatly simplified indexing in tracker-miner-fs, as the task is a lot
> less prone to failure now. On the other hand, this brings in the
> 2-pass indexing that was being requested: miner-fs promptly crawls and
> fetches GFile info, and tracker-extract follows soon after, filling in
> extra information.
Sounds great. Out of interest, do we have any numbers on whether this
improves or degrades the experience?
> Current caveats
> ===============
>
> It is worth noting though that in the branch not much has been done
> yet about handling extraction failures:
>
> * extractor modules blocking or taking too much time
> * crashes in extractor modules
Yea, PDFs come to mind. I should back out the work I did now that we
fixed the alarm() stuff... which I presumed didn't fully work when I
wrote it. That's PDF-specific though, and those cause us some of the
most problems.
> Possible solutions go through adding cancellability of extract tasks,
> and/or having all extraction go into a subprocess that we can watch
> over,
Well, extraction is already its own process precisely for this reason...
So we can just start up a new extractor process if we have a rogue
file/3rd party library/extractor implementation/etc.

As for cancellation... not sure that really helps if it's IO blocking
on something out of our control. I've always thought we need proper
signal handling at the OS level for this sort of thing and a bulldozer
to clean up the mess :)
> so the dbus service itself doesn't go away and doesn't need to be
> restarted. The latter could also help with Philip's idea to run
> extraction in containers. But about these changes...
This really is the weakest part of Tracker I would say, the fact that we
know we will have crashes and extractor problems.
Luckily, with this work, it should have even less effect on standard
file information (name, size, mtime, etc).
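For the record, plain GLib already gives us most of the watchdog we'd
need here. A minimal sketch - the extractor path and the 30 second
budget are made-up values:

#include <glib.h>
#include <signal.h>

static gboolean
extractor_timed_out_cb (gpointer user_data)
{
	GPid pid = GPOINTER_TO_INT (user_data);

	/* Rogue file or blocked IO: bulldozer time */
	g_warning ("Extractor (pid %d) timed out, killing it", (gint) pid);
	kill (pid, SIGKILL);

	return G_SOURCE_REMOVE;
}

static void
extractor_exited_cb (GPid     pid,
                     gint     status,
                     gpointer user_data)
{
	g_spawn_close_pid (pid);
	/* Respawn here so the D-Bus service never goes away; a real
	 * version would also remove the timeout source. */
}

static void
spawn_extractor (void)
{
	/* Hypothetical path, adjust to taste */
	const gchar *argv[] = { "/usr/libexec/tracker-extract", NULL };
	GPid pid;

	if (g_spawn_async (NULL, (gchar **) argv, NULL,
	                   G_SPAWN_DO_NOT_REAP_CHILD,
	                   NULL, NULL, &pid, NULL)) {
		g_child_watch_add (pid, extractor_exited_cb, NULL);
		g_timeout_add_seconds (30, extractor_timed_out_cb,
		                       GINT_TO_POINTER (pid));
	}
}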
> Future plans?
> =============
>
> I'm very seriously proposing to make libtracker-extract private
> altogether. The usefulness of having 3rd party extractors is dubious:
> neither letting them reimplement extraction for a well-known mimetype
> nor having them implement support for a mimetype we don't know well
> enough is positive, it potentially affects Tracker stability and user
> perception, and if a mimetype has enough traction, its support should
> arguably live in the tracker tree anyway. Its API is also a mishmash
> of utility functions that have little to do with the rest of Tracker,
> and it is written in a not quite future-safe way.
Not sure I agree here. Which functions in particular are you thinking
of? I know some of the "guessing" functions like
tracker_guarantee_title_from_file() and tracker_encoding_guess_enca().

Safety-wise, I guess you mean the iptc/xmp/etc type functions with
public structs, right?

I agree with you there. I just couldn't bring myself to write all those
accessor functions :P should have used Vala :)
> Moreover, googling for "tracker_extract_get_metadata" (the function
> that modules must implement), I just see 3 pages of references to
> Tracker code, backtraces, and logs, and very few references to
> external extractors. This API is 1/3 of the Tracker public API, yet
> it's been mostly unused externally for the 3 years it's been
> available.
Or mired by all the crashes ;)
> So, I think Tracker should offer API that helps integrate with
> Tracker, and by that measure this API falls short. I propose to keep
> it in private land, and encourage the use of TrackerDecorator, which
> is also nice in that multiple sources add up information, unlike
> extract modules, which are individually responsible for filling in
> every piece of information.
I don't feel too strongly about this.

I do think you can bring some stability, sure, but I don't think it
makes all that much difference. In my experience, most of the problems
we have are due to difficult files rather than the extractors being
fundamentally broken. We've had some years to improve them, I guess.
> I would like it if people could speak up if they are using 3rd party
> extractors and depend on this feature though!
CCd Jonatan here as I suspect they might be at Pelagicore.
CCd Ralph here as I suspect he might be.
> Actually, I'd like to think we can make 1.0 soon (we technically could
> ASAP, we've remained feature-stable for quite some time now) and make
> longer stability promises than we do currently (having every GNOME
> module depending on Tracker bump .pc file versions every 6 months is a
> PITA). IMO the main milestone is getting the API to a point where we
> can think of forward compatibility, and doing this would help greatly.
I agree.
The bumping thing actually isn't necessary. I started it back when the
changes between versions were a bit bigger (e.g. 0.6, 0.7, 0.8, etc.);
we've mellowed a bit since then :)
> Phew, long email,
Thanks Carlos, comments to follow...
One other thing that occurred to me: if the patch that Philip is trying
to get in from Jolla hits master, it might be problematic, because it
allows the GraphUpdated signal delay to be altered from 1 second (the
current default) to anything ... Technically this is also (kind of) an
ABI break, because we then won't be signalling when _ALL_ the data is
available, but only the basic file data.
As a side note, the class name "sniffer" has a slightly negative
connotation :) Perhaps "watcher" or "listener"? Just a thought...
> --- a/data/ontologies/90-tracker.ontology
> +++ b/data/ontologies/90-tracker.ontology
> @@ -70,3 +70,6 @@ tracker:forceJournal a rdf:Property ;
>  fts: a tracker:Namespace ;
>  	tracker:prefix "fts" .
> +
> +tracker:extractor-data-source a nie:DataSource ;
> +	rdfs:label "Tracker extractor data source" .
You forgot to update the date at the top of the file ;)
> +tracker-marshal.h: tracker-marshal.list
> +	$(AM_V_GEN)$(GLIB_GENMARSHAL) $< --prefix=tracker_marshal --header > $@
> +
> +tracker-marshal.c: tracker-marshal.list
> +	$(AM_V_GEN)(echo "#include \"tracker-marshal.h\""; \
> +	$(GLIB_GENMARSHAL) $< --prefix=tracker_marshal --body) > $@
I am probably going to use Xavier's patch and remove all of these, but I
will do it AFTER your branch I think:
http://cgit.collabora.com/git/user/xclaesse/tracker.git/commit/?h=marshal&id=b2036b5960c832a947b25ffb09e9b0e71e2838a5
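For reference, the trick in that patch is simply passing a NULL
marshaller to g_signal_new(), which makes GLib fall back to
g_cclosure_marshal_generic() (GLib >= 2.30). Sketch only, the signal
details here are made up:

signals[UPDATED] =
	g_signal_new ("updated",
	              G_TYPE_FROM_CLASS (klass),
	              G_SIGNAL_RUN_LAST,
	              0, NULL, NULL,
	              NULL, /* NULL = g_cclosure_marshal_generic */
	              G_TYPE_NONE, 2,
	              G_TYPE_STRING, G_TYPE_STRING);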
> +	if (error) {
> +		g_critical ("Could not query updates: %s", error->message);
> +		g_error_free (error);
> +		return;
> +	}
Whitespace error ;)
> +	while (tracker_sparql_cursor_next (cursor, NULL, NULL)) {
> +		const gchar *urn, *url;
> +
> +		urn = tracker_sparql_cursor_get_string (cursor, 0, NULL);
> +		url = tracker_sparql_cursor_get_string (cursor, 1, NULL);
> +		g_signal_emit (user_data, signals[UPDATED], 0, urn, url);
> +	}
Not worried about async cursor function use? Not that I expect this to
be a long operation, but ...
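For the archives, the async variant would look something along these
lines (untested sketch, reusing the names from the snippet above):

static void
cursor_next_cb (GObject      *object,
                GAsyncResult *result,
                gpointer      user_data)
{
	TrackerSparqlCursor *cursor = TRACKER_SPARQL_CURSOR (object);
	const gchar *urn, *url;
	GError *error = NULL;

	if (!tracker_sparql_cursor_next_finish (cursor, result, &error)) {
		/* End of results, or error */
		g_clear_error (&error);
		g_object_unref (cursor);
		return;
	}

	urn = tracker_sparql_cursor_get_string (cursor, 0, NULL);
	url = tracker_sparql_cursor_get_string (cursor, 1, NULL);
	g_signal_emit (user_data, signals[UPDATED], 0, urn, url);

	tracker_sparql_cursor_next_async (cursor, NULL,
	                                  cursor_next_cb, user_data);
}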
> +static void
> +tracker_data_sniffer_constructed (GObject *object)
> +{
> +	TrackerDataSnifferPrivate *priv;
> +
> +	priv = TRACKER_DATA_SNIFFER (object)->priv;
> +	g_assert (priv->data_source);
I know the data source is a deal breaker if it's broken, but we should
probably avoid asserts in libraries. Thoughts?
> +	query = g_strdup_printf ("SELECT ?urn ?url { "
> +	                         "  ?urn nie:url ?url ; "
> +	                         "       tracker:available true . "
> +	                         "  FILTER (tracker:id(?urn) IN (%s) && "
> +	                         "  NOT EXISTS { ?urn nie:dataSource <%s> })"
> +	                         "}", id_string->str, priv->data_source);
So, a few things...

1. I wonder if tracker:available is useful here. I guess there might be
a race condition where a volume becomes unavailable and we then skip
files which are now !tracker:available - was that your thinking here?

2. Would it make sense to use ?urn a nfo:FileDataObject here? Do we
expect to be updating anything other than files with this approach?
This way we could avoid triggering further actions for non-file based
resources in the database.
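i.e. for point 2 I am thinking of something like this (untested):

query = g_strdup_printf ("SELECT ?urn ?url { "
                         "  ?urn a nfo:FileDataObject ; "
                         "       nie:url ?url ; "
                         "       tracker:available true . "
                         "  FILTER (tracker:id(?urn) IN (%s) && "
                         "  NOT EXISTS { ?urn nie:dataSource <%s> })"
                         "}", id_string->str, priv->data_source);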
> +void tracker_data_sniffer_finish (TrackerDataSniffer *sniffer,
> +                                  GAsyncResult       *res,
> +                                  const gchar        *query);
I don't see this defined in the .c files anywhere?
Dangling declaration? Just wondering what it does :)
> +static void
> +mount_point_removed_cb (TrackerStorage *storage,
> +                        const gchar    *uuid,
> +                        const gchar    *mount_point,
> +                        gpointer        user_data)
> +{
> +	/* FIXME: remove queued elements from processing */
> +}
Still need finishing?
> +static void
> +miner_files_add_rdf_types (TrackerSparqlBuilder *sparql,
> +                           GFile                *file,
> +                           const gchar          *mime_type)
> +{
> +	GStrv rdf_types;
> +	gint i = 0;
> +
> +	rdf_types = tracker_extract_module_manager_get_fallback_rdf_types (mime_type);
> +
> +	if (!rdf_types || !rdf_types[0])
> +		return;
> +
> +	tracker_sparql_builder_predicate (sparql, "a");
> +
> +	while (rdf_types[i]) {
> +		tracker_sparql_builder_object (sparql, rdf_types[i]);
> +		i++;
> +	}
> +}
I thought you needed the predicate AND object for each object case? If
this works as is then ignore me.
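Assuming TrackerSparqlBuilder folds repeated object() calls onto the
last predicate as a comma-separated list, the calls above would
generate something like:

  ?resource a nfo:Image , nmm:Photo

in which case it does work as is.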
> @@ -238,20 +238,39 @@ GStrv
>  tracker_extract_module_manager_get_fallback_rdf_types (const gchar *mimetype)
>  {
Looks like this function doesn't g_hash_table_unref(rdf_types).
> +	for (i = 0; r_info->fallback_rdf_types[i]; i++) {
> +		g_hash_table_insert (rdf_types,
> +		                     r_info->fallback_rdf_types[i],
> +		                     r_info->fallback_rdf_types[i]);
Not sure why you add r_info->fallback_rdf_types[i] as both key and
value, since the value is unused?

You maybe want g_hash_table_replace() too, given that's what you're
really doing here logically, even though it makes no actual difference
here AFAICS.
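If the table is really just a set, GLib has had g_hash_table_add()
since 2.32, which sidesteps the key/value question entirely - sketch:

	rdf_types = g_hash_table_new (g_str_hash, g_str_equal);

	for (i = 0; r_info->fallback_rdf_types[i]; i++)
		g_hash_table_add (rdf_types, r_info->fallback_rdf_types[i]);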
> +FallbackRdfTypes=nfo:Image;nmm:Photo;
I love this btw... it's exactly what I was worried about before I did
the review (i.e. finding photos if you just know the basic file data).
Very well considered Carlos! :)

I am confused as to why an MS document is a nfo:Document and a PDF is a
nfo:PaginatedDocument though ;)

I guess this means we can remove all of those statements in the
extractors too, right? Currently each extractor re-adds this ?resource
a nfo:Foo stuff. Didn't see that in your branch yet.
Other than that, it looks great Carlos! :)
--
Regards,
Martyn
Founder & Director @ Lanedo GmbH.
http://www.linkedin.com/in/martynrussell