Re: [Tracker] wip/passive-extraction (and API cleanup?)



I've CC'd a few people here due to some API changes we're making that might affect you guys out there.


On 09/01/14 12:48, Carlos Garnacho wrote:
Hey all,

I've been talking about this branch on #tracker, but now that most of
the work there is done, it is worth raising it on the ML. In that branch
there are two extra objects in libtracker-miner:

       * TrackerDecorator is a TrackerMiner that implements a passive
         indexing pattern: instead of being expected to feed data
         directly to tracker-store, it listens for GraphUpdated signals,
         so when an item eligible for indexing is added/updated and is
         still missing a nie:dataSource specific to the decorator, it is
         queued for processing. On startup it also queries for all
         elements of the eligible rdf:types that are missing that
         nie:dataSource, so all elements are ensured to be indexed.
       * TrackerDecoratorFS is a file-specific implementation of that
         object, which basically adds volume monitoring, so indexing
         within just-added volumes is resumed if it was previously
         interrupted, and elements are removed from the queue if their
         volume is removed.

I quite like keeping the volume stuff separate and I think this fits nicely with what Xavier and other embedded users have in mind - which is heavy removable device use.
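To make the decorator pattern above concrete, I take it the startup query is roughly something like this (the class and data-source URIs here are made up for illustration):

        SELECT ?urn WHERE {
          ?urn a nfo:Document .
          FILTER (NOT EXISTS { ?urn nie:dataSource <urn:example:decorator> })
        }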

In that branch tracker-extract does use these features: it has been
turned into a full-blown standalone miner using TrackerDecorator, while
miner-fs stopped calling it. On one hand, this leads to greatly
simplified indexing in tracker-miner-fs, as the task is a lot less prone
to failure now. On the other hand, this brings in the 2-pass indexing
that was being requested: miner-fs promptly crawls and fetches GFile
info, and tracker-extract follows soon after, filling in extra
information.

Sounds great. Out of interest, do we have any numbers on whether this improves or degrades the experience?

Current caveats
===============

It is worth noting, though, that not much has been done in the branch
yet about handling extraction failures:
       * extractor modules blocking or taking too much time
       * crashes in extractor modules

Yeah, PDFs come to mind. I should back out the work I did now that we fixed the alarm() stuff... which I presumed didn't fully work when I wrote it. That's PDF-specific though, and those files cause us some of the most problems.

Possible solutions involve adding cancellability to extract tasks
and/or having all extraction happen in a subprocess that we can watch,

Well, extraction is already its own process precisely for this reason... So we can just start up a new extractor process if we have a rogue file/3rd party library/extractor implementation/etc.
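i.e. something as simple as respawning on abnormal exit. A minimal sketch of what I mean (the binary path is illustrative and there's no respawn throttling here):

static void extractor_exited (GPid pid, gint status, gpointer user_data);

static void
spawn_extractor (void)
{
        gchar *argv[] = { (gchar *) "/usr/libexec/tracker-extract", NULL };
        GPid pid;
        GError *error = NULL;

        if (!g_spawn_async (NULL, argv, NULL,
                            G_SPAWN_DO_NOT_REAP_CHILD,
                            NULL, NULL, &pid, &error)) {
                g_warning ("Could not spawn extractor: %s", error->message);
                g_error_free (error);
                return;
        }

        g_child_watch_add (pid, extractor_exited, NULL);
}

static void
extractor_exited (GPid     pid,
                  gint     status,
                  gpointer user_data)
{
        g_spawn_close_pid (pid);

        /* Rogue file, 3rd party library or extractor crash:
         * just start a fresh process. */
        if (!g_spawn_check_exit_status (status, NULL))
                spawn_extractor ();
}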

As for cancellations... not sure that really helps if it's IO blocking on something out of our control. I've always thought we need proper signal handling at the OS level for this sort of thing, and a bulldozer to clean up the mess :)

so the D-Bus service itself doesn't go away and doesn't need to be
restarted. The latter could also help with Philip's idea to run
extraction in containers. But about these changes...

This really is the weakest part of Tracker, I would say: the fact that we know we will have crashes and extractor problems. Luckily, with this work, those should have even less effect on standard file information (name, size, mtime, etc.).

Future plans?
=============

I'm very seriously proposing to make libtracker-extract private
altogether. The usefulness of having 3rd party extractors is dubious:
neither letting them reimplement extraction for a well-known mimetype
nor having them implement support for a mimetype we don't know well
enough is positive, and it potentially affects Tracker's stability and
user perception. Keeping the library private also reinforces the point
that if a mimetype has enough traction, it should be supported in the
Tracker tree. Its API is also a mishmash of utility functions that have
little to do with the rest of Tracker, and it is not written in a
particularly future-safe way.

Not sure I agree here. Which functions in particular were you thinking of?
I know about some of the "guessing" functions like tracker_guarantee_title_from_file() and tracker_encoding_guess_enca().

Safety-wise, I guess you mean the iptc/xmp/etc. functions with public structs, right?

I agree with you there. I just couldn't bring myself to write all those accessor functions :P Should have used Vala :)

Moreover, googling for "tracker_extract_get_metadata" (the function that
modules must implement), I just see 3 pages of references to Tracker
code, backtraces, and logs, and very few references to external
extractors. This API is 1/3 of the Tracker public API, yet it's been
mostly unused externally for the 3 years it's been available.

Or mired in all the crashes ;)

So, I think Tracker should offer API that helps integrate with Tracker,
and this API falls short there, so I propose to keep it in private land
and encourage the use of TrackerDecorator. The decorator is also nice in
that multiple sources can add up information, unlike extract modules,
which are individually responsible for filling in every piece of
information.

I don't feel too strongly about this.

I do think you can bring some stability, sure, but I don't think it makes all that much difference. In my experience, most of the problems we have are due to difficult files rather than the extractors being fundamentally broken. We've had some years to improve them, I guess.

I would like people to speak up if they are using 3rd party extractors and depend on this feature though!

CC'd Jonatan here as I suspect they might be at Pelagicore.
CC'd Ralph here as I suspect he might be.

Actually, I'd like to think we can make 1.0 soon (we technically could
ASAP; we've remained feature-stable for quite some time now) and make
longer stability promises than we do currently (having every GNOME
module depending on Tracker bump .pc file versions every 6 months is a
PITA). IMO the main milestone is getting the API to a point where we can
think of forward compatibility, and doing this would help greatly.

I agree.

The bumping thing actually isn't necessary. I started it back when the changes between versions were a bit bigger (e.g. 0.6, 0.7, 0.8, etc.); we've mellowed a bit since then :)

Phew, long email,

Thanks Carlos, comments to follow...

One other thing that occurred to me: if the patch that Philip is trying to get in from Jolla hits master, it might be problematic, because it allows the GraphUpdated signal delay to be altered from 1 second (the current default) to anything ... Technically this is also (kind of) an ABI break, because we now won't be signalling when _ALL_ the data is available, but only the basic file data.

As a side note, the class name "sniffer" has a slightly negative meaning :) Perhaps "watcher" or "listener"? Just a thought...

--- a/data/ontologies/90-tracker.ontology
+++ b/data/ontologies/90-tracker.ontology
@@ -70,3 +70,6 @@ tracker:forceJournal a rdf:Property ;

 fts: a tracker:Namespace ;
        tracker:prefix "fts" .
+
+tracker:extractor-data-source a nie:DataSource ;
+       rdfs:label "Tracker extractor data source" .

You forgot to update the date at the top of the file ;)

+tracker-marshal.h: tracker-marshal.list
+       $(AM_V_GEN)$(GLIB_GENMARSHAL) $< --prefix=tracker_marshal --header > $@
+
+tracker-marshal.c: tracker-marshal.list
+       $(AM_V_GEN)(echo "#include \"tracker-marshal.h\""; \
+                   $(GLIB_GENMARSHAL) $< --prefix=tracker_marshal --body) > $@

I am probably going to use Xavier's patch and remove all of these, but I will do it AFTER your branch I think:

http://cgit.collabora.com/git/user/xclaesse/tracker.git/commit/?h=marshal&id=b2036b5960c832a947b25ffb09e9b0e71e2838a5
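For reference, I believe that patch just drops the generated marshallers in favour of g_cclosure_marshal_generic, which GLib (>= 2.30) picks automatically when the marshaller passed to g_signal_new() is NULL. Assuming the "updated" signal carries the urn/url strings seen below, that would be:

signals[UPDATED] =
        g_signal_new ("updated",
                      G_TYPE_FROM_CLASS (klass),
                      G_SIGNAL_RUN_LAST,
                      0, NULL, NULL,
                      NULL, /* NULL selects g_cclosure_marshal_generic */
                      G_TYPE_NONE, 2,
                      G_TYPE_STRING, G_TYPE_STRING);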


+        if (error) {
+                g_critical ("Could not query updates: %s", error->message);
+                g_error_free (error);
+               return;
+       }

Whitespace error ;)

+       while (tracker_sparql_cursor_next (cursor, NULL, NULL)) {
+               const gchar *urn, *url;
+
+               urn = tracker_sparql_cursor_get_string (cursor, 0, NULL);
+               url = tracker_sparql_cursor_get_string (cursor, 1, NULL);
+               g_signal_emit (user_data, signals[UPDATED], 0, urn, url);
+       }

Not worried about async cursor function use? Not that I expect this to be a long operation, but ...
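For reference, the async variant would look roughly like this (a sketch reusing the names from the excerpt above; the callback re-arms itself until the cursor is exhausted):

static void
cursor_next_cb (GObject      *object,
                GAsyncResult *res,
                gpointer      user_data)
{
        TrackerSparqlCursor *cursor = TRACKER_SPARQL_CURSOR (object);
        GError *error = NULL;

        if (!tracker_sparql_cursor_next_finish (cursor, res, &error)) {
                if (error) {
                        g_critical ("Could not iterate cursor: %s", error->message);
                        g_error_free (error);
                }
                return;
        }

        g_signal_emit (user_data, signals[UPDATED], 0,
                       tracker_sparql_cursor_get_string (cursor, 0, NULL),
                       tracker_sparql_cursor_get_string (cursor, 1, NULL));

        /* Fetch the next row without blocking the main loop */
        tracker_sparql_cursor_next_async (cursor, NULL, cursor_next_cb, user_data);
}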

+static void
+tracker_data_sniffer_constructed (GObject *object)
+{
+       TrackerDataSnifferPrivate *priv;
+
+       priv = TRACKER_DATA_SNIFFER (object)->priv;
+       g_assert (priv->data_source);

I know the data source is a deal breaker if it's broken, but we should probably avoid asserts in libraries. Thoughts?
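Something like this instead would fail loudly without taking the whole process down (just a sketch):

        priv = TRACKER_DATA_SNIFFER (object)->priv;

        if (G_UNLIKELY (!priv->data_source)) {
                g_critical ("TrackerDataSniffer requires a data source at construction");
                return;
        }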

+       query = g_strdup_printf ("SELECT ?urn ?url { "
+                                "  ?urn nie:url ?url ; "
+                                "       tracker:available true . "
+                                "  FILTER (tracker:id(?urn) IN (%s) && "
+                                "          NOT EXISTS { ?urn nie:dataSource <%s> })"
+                                "}", id_string->str, priv->data_source);

So, a few things...

1. I wonder if tracker:available is useful here. I guess there might be a race condition where a volume becomes unavailable and we then skip files which are now !tracker:available - was that your thinking here?

2. Would it make sense to use ?urn a nfo:FileDataObject here? Do we expect to be updating anything other than files with this approach? This way we could avoid triggering further actions for non-file resources in the database.
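For point 2, that would just mean adding the class match to the pattern, i.e. (purely illustrative):

        query = g_strdup_printf ("SELECT ?urn ?url { "
                                 "  ?urn a nfo:FileDataObject ; "
                                 "       nie:url ?url ; "
                                 "       tracker:available true . "
                                 "  FILTER (tracker:id(?urn) IN (%s) && "
                                 "          NOT EXISTS { ?urn nie:dataSource <%s> })"
                                 "}", id_string->str, priv->data_source);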

+void          tracker_data_sniffer_finish          (TrackerDataSniffer *sniffer,
+                                                   GAsyncResult       *res,
+                                                   const gchar        *query);

I don't see this defined in the .c files anywhere?
Hanging declaration? Just wondering what it does :)

+static void
+mount_point_removed_cb (TrackerStorage *storage,
+                       const gchar    *uuid,
+                       const gchar    *mount_point,
+                       gpointer        user_data)
+{
+       /* FIXME: remove queued elements from processing */
+}

Still needs finishing?
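Assuming the pending items end up in some queue of URI strings, one way to do it could be roughly this (priv->queue is hypothetical, the real structure may well differ):

static void
mount_point_removed_cb (TrackerStorage *storage,
                        const gchar    *uuid,
                        const gchar    *mount_point,
                        gpointer        user_data)
{
        TrackerDataSnifferPrivate *priv = TRACKER_DATA_SNIFFER (user_data)->priv;
        gchar *mount_uri = g_filename_to_uri (mount_point, NULL, NULL);
        GList *l = priv->queue->head;

        while (l) {
                GList *next = l->next;

                /* Drop queued elements living under the removed mount point */
                if (g_str_has_prefix (l->data, mount_uri)) {
                        g_free (l->data);
                        g_queue_delete_link (priv->queue, l);
                }

                l = next;
        }

        g_free (mount_uri);
}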

 static void
+miner_files_add_rdf_types (TrackerSparqlBuilder *sparql,
+                           GFile                *file,
+                           const gchar          *mime_type)
+{
+       GStrv rdf_types;
+       gint i = 0;
+
+       rdf_types = tracker_extract_module_manager_get_fallback_rdf_types (mime_type);
+
+       if (!rdf_types || !rdf_types[0])
+               return;
+
+       tracker_sparql_builder_predicate (sparql, "a");
+
+       while (rdf_types[i]) {
+               tracker_sparql_builder_object (sparql, rdf_types[i]);
+               i++;
+       }
+}

I thought you needed the predicate AND object for each object case? If this works as is then ignore me.
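For the record, my understanding is that consecutive _object() calls after a single _predicate() emit a comma-separated object list, so the loop above would generate something like:

        <file> a nfo:Image , nmm:Photo .

If the builder doesn't insert those commas, then it does indeed need a predicate per object.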

@@ -238,20 +238,39 @@ GStrv
 tracker_extract_module_manager_get_fallback_rdf_types (const gchar *mimetype)
 {

Looks like this function doesn't g_hash_table_unref(rdf_types).
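e.g. something along these lines at the end of the function (a sketch matching the names in the excerpt), copying the keys out before releasing the table:

        GHashTableIter iter;
        gpointer key;
        GPtrArray *array = g_ptr_array_new ();

        g_hash_table_iter_init (&iter, rdf_types);
        while (g_hash_table_iter_next (&iter, &key, NULL))
                g_ptr_array_add (array, g_strdup (key));
        g_ptr_array_add (array, NULL);

        g_hash_table_unref (rdf_types);

        return (GStrv) g_ptr_array_free (array, FALSE);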

+               for (i = 0; r_info->fallback_rdf_types[i]; i++) {
+                       g_hash_table_insert (rdf_types,
+                                            r_info->fallback_rdf_types[i],
+                                            r_info->fallback_rdf_types[i]);

Not sure why you add r_info->fallback_rdf_types[i] as both key and value, since the value is unused?

You maybe want g_hash_table_replace() too, given that's what you're really doing here logically, even though it makes no actual difference here AFAICS.
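Or, since the table is really a set, g_hash_table_add() (GLib >= 2.32) says so directly; it is documented as equivalent to g_hash_table_replace() with key == value:

        for (i = 0; r_info->fallback_rdf_types[i]; i++)
                g_hash_table_add (rdf_types, r_info->fallback_rdf_types[i]);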

+FallbackRdfTypes=nfo:Image;nmm:Photo;

I love this btw... it's exactly what I was worried about before I did the review (i.e. finding photos when you just know the basic file data). Very well considered, Carlos! :)

I am confused as to why an MS Document is a nfo:Document and a PDF is a nfo:PaginatedDocument though ;)

I guess this means we can remove all of those statements in the extractors too, right? Currently each extractor re-adds this "?resource a nfo:Foo" stuff. I didn't see that in your branch yet.


Other than that, it looks great Carlos! :)

--
Regards,
Martyn

Founder & Director @ Lanedo GmbH.
http://www.linkedin.com/in/martynrussell

