Re: [Tracker] REVIEW: external-crawler branch



On 06/08/14 22:22, Sam Thursfield wrote:
hi Martyn!

Hi Sam,

I reviewed the branch before reading your overview of what it
does, so there are some comments on patches that are later completely
overwritten. I've left them in case they are helpful in some way. It'd
be nice to squash a bunch of the commits in the branch together though
before merging, if you get time, so the history is clearer!

I've attached some actual comments, as a text file because I imagine
gmail will mangle it otherwise (I should really stop using gmail :)

Thanks for the review, my reply below:

Most of my complaints are about naming here. Code looks solid.

Thanks,

I didn't like the name 'external-crawler' but actually it is quite appropriate,
if we consider a crawler a specific kind of miner which traverses a tree of
resources.

Yea, the scope has changed a bit since the start of the work, so I agree it's not perfect, but it's close enough.

This branch allows adding crawlers for arbitrary URI schemes, but it doesn't
seem to support monitoring at all. Is that planned? Seems that you could

No you're right, it doesn't.
I suppose that's a likely extension of this work.

There is nothing planned right now, but while developing this branch I realised it's quite likely that others making use of this kind of work would eventually want something like this.

subclass TrackerMonitor to implement stuff like monitoring RSS feeds for change
notifications, or whatever else the data source requires.

I agree and it's probably quite simple to set up.

I notice that this 'generic tree crawler' functionality is kind of similar to
the generic 'browse media source' functionality that Grilo provides ...
https://developer.gnome.org/grilo/unstable/GrlSource.html#grl-source-browse

I didn't know that to be honest.

I also looked at the crawler that GIO provides with the GFileEnumerator and I almost used that approach instead, but it's quite closely tied to the backend for a GFile (e.g. GLocalFile, which is normally the default). For remote file systems, implementing an entire GFile backend interface can be a lot of extra work, so I've avoided that here.

I wonder if we could make use of Grilo plugins in implementing miners for some
data sources? Although Grilo already wraps Tracker, so we start to end up in
danger of disappearing up our own fundamentals ...

Possibly, though I would rather not create the dependency problems I would expect from that.

Actually, I've been thinking more and more about breaking up Tracker. The mining side could indeed be some combination of Tracker (MinerFS) and Grilo, with the store as a separate project that's untouched and lower in the stack. That's way down the line though, and now that we're API stable, I don't want to "rock the boat" for users just yet.

diff --git a/src/libtracker-miner/tracker-miner-fs.c b/src/libtracker-miner/tracker-miner-fs.c
index 40bb2bf..9e1e9fb 100644
--- a/src/libtracker-miner/tracker-miner-fs.c
+++ b/src/libtracker-miner/tracker-miner-fs.c
@@ -3943,20 +3943,51 @@ tracker_miner_fs_get_indexing_tree (TrackerMinerFS *fs)
  * tracker_miner_fs_manually_notify_file:
  * @fs: a #TrackerMinerFS
  * @file: a #GFile
+ * @error: a #GError
  *
- * Returns: %TRUE if successful, %FALSE otherwise.
+ * This API is only useful where the @fs was created using the
+ * #TrackerMinerFS:external-crawler property set to %TRUE. By default
+ * this is %FALSE. This allows 3rd party developers to 'push' their
+ * data into Tracker to be processed, rather than requiring Tracker to
+ * index the content itself by crawling the file system and monitoring
+ * changes.
+ *
+ * This is also very helpful for cases where @file is not based on the
+ * local file system (for example a cloud implementation) and you want
+ * to tell Tracker about content to index more directly.
+ *
+ * Returns: %TRUE if successful, otherwise %FALSE and @error will be
+ * set if a pointer is supplied.
  *
  * Since: 1.2.
  **/
 gboolean

Be clearer about what "processed" means: my understanding in this case is that
a file enqueued for "processing" by this method will go through the filesystem
miner, which adds basic stuff about mtime, filename etc., and is then added to
the store. One or more extractor/decorator miners may then add additional
data. For file:/// URIs of certain MIME types the built-in tracker-extract
process will be one of those.

Good question.
Actually, I think the terminology in Tracker really could do with being clearer, and perhaps being put on some wiki page. I know even I use terms interchangeably with different meanings sometimes.

So "processed" here means: we've populated the TrackerSparqlBuilder object for the GFile and we're now telling the TrackerMinerFS (which is really a queue management state machine, badly named and no longer related to file systems) that we can go ahead and do the database COMMIT. I did ask Carlos about renaming this, but it's part of our stable API now, sadly...

How is tracker_miner_fs_manually_notify_file() different from the D-Bus calls
IndexFile() and the proposed (GB#680834) IndexFileForProcess() ? Could one
not be implemented in terms of the other?

For those interested in how the "file_notify" API works, here is some background:

So this function is a bit misleading. Its name suggests what you say, but it is really ONLY to be used by data miners using libtracker-miner, not by external apps wanting to mine data about a file. Let me give you the expected topological view of how this works:

  1. Miner A adds a directory to be processed 'file:///foo' to the
     TrackerIndexingTree.
  2. The TrackerFileSystem / TrackerIndexingTree call the
     TrackerCrawler, (which now calls the TrackerEnumerator interface
     API) to get children of 'file:///foo'
  3. Any hits found are then cached and signalled up the stack through
     TrackerFileNotifier and TrackerMinerFS listens to those signals
     and queues work to be done.
  4. When 'file:///foo/bar.txt' (a file found) is handled in
     TrackerMinerFS, it calls the ::process-file vtable function.
  5. Miner A implements this ::process-file function and adds specific
     metadata that it knows about intimately. IMPORTANT: this can be
     sync OR async.
  6. When Miner A is done adding the "intimate" metadata about
     'bar.txt', it calls tracker_miner_fs_file_notify(), which allows
     the TrackerMinerFS state machine to continue with the COMMIT on the
     database now it has collected everything it needs.

However, I should add that this function, tracker_miner_fs_manually_notify_file(), is not in the final API on that branch. It might be easier to use:

  $ git diff origin/master

That way you won't see things that were later changed in subsequent commits.

+tracker_file_notifier_signal_file  (TrackerFileNotifier     *notifier,
+                                    TrackerMinerFSQueue      queue_type,
+                                    GFile                   *file,
+                                    GFileType                file_type)

Slightly odd spacing here.

Likely it's to be in line with other functions; I use Owen Taylor's egtk emacs functions to line things up. It's part of our coding style IIRC.

+typedef enum {
+       TRACKER_MINER_FS_QUEUE_NONE,
+       TRACKER_MINER_FS_QUEUE_CREATED,
+       TRACKER_MINER_FS_QUEUE_UPDATED,
+       TRACKER_MINER_FS_QUEUE_DELETED,
+       TRACKER_MINER_FS_QUEUE_MOVED
+} TrackerMinerFSQueue;
+

I think TrackerMinerFSEventType would be a much better name. The queues are an
implementation detail.

Sorry, again, this is not in the final diff. I changed the approach.

diff --git a/src/libtracker-miner/tracker-file-system.c b/src/libtracker-miner/tracker-file-system.c
index 7d9a236..dd81f49 100644
--- a/src/libtracker-miner/tracker-file-system.c
+++ b/src/libtracker-miner/tracker-file-system.c
@@ -356,7 +356,7 @@ file_system_finalize (GObject *object)
                         NULL);
        g_node_destroy (priv->file_tree);

-       if (!priv->root) {
+       if (priv->root) {
                g_object_unref (priv->root);
        }

Definitely squash this into 'libtracker-miner: FileSystem _new() now takes GFile for root'

Thanks, I will.

So if an external crawler is used, the monitoring is disabled. How do we get
change notifications? (I'm guessing that's a separate problem).

Yep, indeed. Likely a next step too. Current work here is all internal and private, so we have all our options open to us on this point.

commit 8cdc0188e2c4d341e6e9788370e9381e085ae34b
Author: Martyn Russell <martyn lanedo com>
Date:   Fri May 16 12:11:00 2014 +0100

    libtracker-miner: Fixed finalization logic for root unref, tests were failing

Squash into 'libtracker-miner: IndexingTree _new() now takes a GFile' !

Various commits follow that could also do with being squashed.

OK, yea, I did some cleaning up, but perhaps I should squash more aggressively, as you say. Thanks.

+/**
+ * SECTION:tracker-enumerator
+ * @short_description: Enumerate URI locations
+ * @include: libtracker-miner/miner.h
+ *
+ * #TrackerEnumerator allows you to operate on a set of #GFiles,
+ * returning a #GFileInfo structure for each file enumerated (e.g.
+ * tracker_enumerator_get_children() will return a #GSList with the
+ * children within a parent directory or structure for the URI.
+ *
+ * This function is similar to g_file_enumerator_next_files_async(),
+ * unlike the g_file_enumerator_next() which will return one #GFile at
+ * a time, this will return all children at once.

That sounds like it could risk locking up the process, or even the system.
Some local filesystems take a LONG time to 'ls' sometimes, and chew up huge
amounts of IO while doing so. Is this definitely not going to be a performance
regression? What if the enumerator is operating on some crazy web API that
lists millions of resources?

Actually, you make a good point here. Let me get back to you on this one.

commit 6829ac1c53c995e702efb245f5777835528693d4
Author: Martyn Russell <martyn lanedo com>
Date:   Thu Jun 12 19:08:03 2014 +0100

    libtracker-miner: Make it possible to start the miner via DBus

Commit message should say why this is needed; I don't really understand why.

Right. So at the moment you can pause, resume, etc. via DBus, but not start. I think I even left out stop(), which should also be implemented.

Why should we care?

Well, currently miners are automatically started unless you set the progress to "1.0" during initialization (IIRC). If you do that and want to control starting from another process, or at a moment of your choosing, there is no real way to do it over DBus right now. Also, there is no way to stop a miner right now, even if it was started automatically; you can only pause it. For completeness, we should provide both of these functions.


commit 410923dfc7aaaff9f862d51092d2209d8dad3e13
Author: Martyn Russell <martyn lanedo com>
Date:   Tue Jul 1 10:42:31 2014 +0100

    libtracker-miner: Added tracker_indexing_tree_get_master_root() and set in constructed()

    This function gets the top level root for all roots, e.g. file:///

    We also create the root nodes in constructed() not init() because then the
    root GFile has been set in the properties when the object is initiated


I already find the use of the terms "root" and "config_root" in TrackerIndexingTree
to be quite confusing. Now we have the concept of the "master root", which is
mostly called "root", but is different from the other types of "root".

I think we need some better words :)

"Config roots" should perhaps become "configured index points", and the "master
root"/"root" can just be referred to as "root" everywhere.

Right. So let me clear this up:

- master root = top most level root node,
  e.g. file:///

- config root = a root node from GSettings,
  e.g. file:///home/martyn/Documents

- root = ANY root :) normally config root, but it can also apply to
  roots added for devices, which technically are not a config root or a
  master root ;)

Better terminology is a good idea; I will add the above explanation somewhere (in the documentation/comments) to make it clear.

commit 011eccbd624aa494e42482c0d809ec588ee2a671
Author: Martyn Russell <martyn lanedo com>
Date:   Tue Jul 1 10:48:11 2014 +0100

    libtracker-miner: Create nodes / cahces in constructed() not init()

Typo: caches

Thanks, will fix.

commit e5b5280b5f336b80b00d9c9671a72391ab1f0a9b
Author: Martyn Russell <martyn lanedo com>
Date:   Thu Jul 17 19:32:32 2014 +0100

    libtracker-control: Allow miner status to be gained from running processes

    Using environment variable: TRACKER_MINERS_DIR_DISABLED

That's a weird name for an environment variable ...

Yea, I couldn't come up with a better one. Any suggestions?

One thing is clear: we need a way to get miners from running processes (for dynamically started/running miners). I don't really like the whole .desktop file approach either, but I don't see a better way to know about miners that are not running.

Overall the branch looks very promising! The main use I see is allowing Tracker
to mirror metadata of remote content provided by various internet services. I'm

That's what it's being used for ;)

not quite sure which services it actually makes sense to do this for ... I
guess only the ones which we want to be able to locally search, run queries on
or display in applications like GNOME Documents, GNOME Music etc.

To query content on multiple remote and local systems, basically.

While you could theoretically write a TrackerCrawler for Spotify, I don't
particularly want to mirror the entire Spotify database on my laptop! What

Well, I would expect the list of content you're interested in on Spotify is actually not that big, compared to indexing the content of real files on your local file systems. But anyway, it provides a nice way to centralise and query your content if wanted.

might be more interesting is if we could reuse TrackerEnumerator
implementations in other contexts. I mentioned Grilo above already: they have
grl_source_browse() and grl_source_search() methods that are implemented by
plugins and basically return lists of results for a given query. If they
exposed a TrackerEnumerator interface then Tracker could then crawl and save
metadata from any data source supported by Grilo ... which would be neat!

I think before we get there we need to break up Tracker. I've been trying to make the miner-fs much more streamlined and less of an "all-in-one" local indexer / data miner.

But yea, we should work together more.

Thanks again for your comments Sam!
Much appreciated,

--
Regards,
Martyn

Founder & Director @ Lanedo GmbH.
http://www.linkedin.com/in/martynrussell

