Re: [Tracker] [WIP] Application support: man pages, Tomboy, & Liferea

From: Jamie McCracken <jamiemcc blueyonder co uk>
To: Edward Duffy <eduffy gmail com>
Cc: Tracker List <tracker-list gnome org>
Subject: Re: [Tracker] [WIP] Application support: man pages, Tomboy, & Liferea
Date: Thu, 07 Dec 2006 21:37:40 +0000

Edward Duffy wrote:

This isn't quite worked out, but I want to throw this out to the group
and get some preliminary feedback.  Attached is a patch that allows us
to index system-wide and user installed man pages, Tomboy notes, and
some basic Liferea support.  The external services all use the
out-of-process mechanism used by the text filter and embedded metadata
extractor.  However, there are more operations, and therefore, more
applications for each service.

it would be best to discuss this first before doing the patch (unless
you are content to modify it quite a bit - which is fine!)


I like this in general but there are a few things:

I want it to work with third party packages so it needs to have easy
installation and deinstallation


First, the directory structure:
in tracker/src  there now resides an "external-services" directory.
In this directory you will find one directory for each service.  The
service directories are named after their configuration key in
~/.Tracker/tracker.cfg.  This makes it easy to add new services without
recompiling trackerd (and hopefully encourage other developers to
provide tracker support with their apps!).  For example, you'll find
the directory tracker/src/external-services/IndexManPages and a
IndexManPages key under the Services group in tracker.cfg.  Each
service has five programs:


I would prefer just "services" to "external-services"


1) check-deps
 This program is called at the very beginning, if the user activates the
service's key.  This program may check for any other required programs
that are needed for this service to work.  For example, I check for
xsltproc and w3m for the Liferea indexer.  If non-zero is returned,
the indexer is disabled.

I'm not sure this is needed, but I suppose there's no harm in having it,
but it should be optional
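
A check-deps step along these lines could be sketched as follows. This is
only an illustration, not code from the patch: the function names are
hypothetical, and the only grounded details are the two programs named
above (xsltproc and w3m) and the rule that a non-zero status disables the
indexer.

```python
import shutil

# Programs named above for the Liferea indexer; any other service would
# supply its own list (hypothetical constant, not taken from the patch).
REQUIRED_PROGRAMS = ["xsltproc", "w3m"]


def missing_programs(programs, which=shutil.which):
    """Return the programs from `programs` that cannot be found on PATH."""
    return [name for name in programs if which(name) is None]


def exit_status(missing):
    """Map the list of missing programs to a process exit status.

    Non-zero means "disable this indexer", matching the description above.
    """
    return 1 if missing else 0
```

A real check-deps script would finish by passing
exit_status(missing_programs(REQUIRED_PROGRAMS)) to sys.exit().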


2) watch-list
 This program returns a list of directories to be added to trackerd's
watch list.  You must list each directory; it will not automatically
recurse into subdirectories.  If you need all subdirs, I recommend find:
# find $basedir -type d
See IndexManPages/watch-list for an example.
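
For illustration, the same listing that `find $basedir -type d` produces
can be sketched in Python. The function name is hypothetical; like find
without -L, os.walk does not descend into symlinked directories by
default.

```python
import os


def list_watch_dirs(basedir):
    """Return basedir and every directory below it, one entry per directory."""
    found = []
    for dirpath, dirnames, _filenames in os.walk(basedir):
        dirnames.sort()  # sorting in place makes the traversal order stable
        found.append(dirpath)
    return found
```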

not needed - I prefer to have a service file (like the dbus service
files or .desktop files, which specify the options needed here).

All we need is a directory like /usr/share/tracker/services to hold the
service files. This makes it easy for separate packages to install and
de-install stuff without any hassle.

At start up trackerd can simply read all these service files (and also
watch for new ones!)


3) service-type

This program returns the service type of a file being watched by this
service.

argv[1] == the full path to the file being watched
argv[2] == the mime type of the file
I provide the file path and mime type, in case you need them, but I
imagine this should be constant

not needed - any file in the watch directory above would be passed to
the spawned service-handler (we can include globs in the service file to
filter certain files to pass)

these watched folders will usually be in hidden folders, so they
won't conflict with the file indexer.
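
The glob filtering mentioned above might look roughly like this. It is a
sketch under assumptions: the function name and the "*.note" pattern in
the test are hypothetical, and treating an empty filter as "pass
everything" is a guess at the intended behaviour.

```python
import fnmatch
import os


def matches_filter(path, patterns):
    """Return True if the file name matches any glob in `patterns`.

    An empty pattern list is treated as "no filter", so every file passes.
    """
    if not patterns:
        return True
    name = os.path.basename(path)
    # fnmatchcase keeps matching case-sensitive on every platform.
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)
```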


4) filter-text
This works very similar to the text filters you find in the
tracker/filters directory, except
argv[1] == the full path
argv[2] == the mime type of the file !!
argv[3] == the path to the filtered text !!

5) extract-metadata
Again, behaves like tracker-extract. It takes a file and splits out
Key=Value;\n pairs for each piece of metadata
argv[1] == the full path
argv[2] == the mime type of the file
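
As an illustration of the Key=Value;\n output described above, a reader
for it could be sketched like this. The function name is hypothetical and
so are the "Man.Title" style keys used in the test; collecting repeated
keys into a list is an assumption about how duplicates should be treated.

```python
def parse_metadata(output):
    """Parse 'Key=Value;' lines into a dict of key -> list of values."""
    metadata = {}
    for line in output.splitlines():
        line = line.strip()
        if line.endswith(";"):
            line = line[:-1]  # drop the trailing ';' terminator
        key, sep, value = line.partition("=")
        if not sep or not key:
            continue  # skip blank or malformed lines
        metadata.setdefault(key, []).append(value)
    return metadata
```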

I was planning on migrating the existing metadata extractors' format to
an xml format (our current one is quite hacky!). We also need to handle
multiple values for the same metadata type.


something like:

<extraction>
        <metadata name="Audio.Title">Moonlight Sonata</metadata>
        <metadata name="Audio.Artist">Beethoven</metadata>
</extraction>

Feel free to modify code to match above.
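
A rough sketch of an extractor emitting that xml, with the same name
repeated to carry multiple values. Only the element and attribute names
come from the example above; the function name is hypothetical, and
repeating a metadata element per value is one possible way to handle
multiple values, not a settled format.

```python
import xml.etree.ElementTree as ET


def build_extraction(pairs):
    """Serialize (name, value) pairs as an <extraction> document.

    A name may appear in several pairs, giving several values for one type.
    """
    root = ET.Element("extraction")
    for name, value in pairs:
        item = ET.SubElement(root, "metadata", {"name": name})
        item.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(root, encoding="unicode")
```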

the filter program and metadata extractor program should be specified in
the service file so there's no need to worry about mimes.

We need a function in tracker-utils that determines if a file is
associated with a particular service by looking at its path and matching
it against any path that's registered as a watch by a service. We need
this for the emails so may as well reuse it for all services. (just
needs to call g_str_has_prefix on it)
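
The prefix lookup could be sketched like this (in Python rather than C,
purely for illustration; str.startswith plays the role of
g_str_has_prefix). The function name and the shape of the watches table
are assumptions. The sketch appends a "/" to each watched directory before
comparing, since a bare prefix test would also match a sibling such as
".tomboy-old", and it prefers the longest matching watch.

```python
def service_for_path(path, watches):
    """Return the service whose watched directory contains `path`, or None.

    `watches` maps a watched directory to a service name.
    """
    best = None
    for watch_dir, service in watches.items():
        prefix = watch_dir.rstrip("/") + "/"
        if path.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, service)  # keep the most specific watch
    return best[1] if best else None
```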


So, like I said before, I'm including 3 implementations of this:

1) IndexManPages
The new service type is "Man Pages" and it adds a new "Man" metadata
class.  The class can tag a man page's title, section, date it was
written, source (app + version), and manual name (eg, Debian Project
for debian specific man pages).  It also provides a full text indexer.
Only thing lacking here is the language the man page was written in.
Currently, I reject any non-english directory.  It's easy to index
them all, but it's just faster for me if trackerd just ignores those.

ok great. Maybe we can use the user's locale to work out which translations
to index?
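
One possible way to turn a locale string into translation directory
names, as a sketch only. It assumes translated man pages live in
subdirectories named after the locale (for example "de" or "de_DE"); the
function name is hypothetical, and the real lookup rules used by man are
more involved than this.

```python
def man_language_dirs(locale_name):
    """Return candidate translation directory names for a locale string."""
    # Strip the codeset (".UTF-8") and the modifier ("@euro") if present.
    base = locale_name.split(".", 1)[0].split("@", 1)[0]
    if base in ("", "C", "POSIX"):
        return []  # no translation wanted: index the untranslated pages only
    candidates = [base]
    language = base.split("_", 1)[0]
    if language != base:
        candidates.append(language)  # fall back from "de_DE" to "de"
    return candidates
```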


2) IndexTomboy
This uses the Notes service type, and adds a Title field to the Note
metadata class.  There's obviously more I could grab from the tomboy
files, I just haven't gotten around to it yet.  Full text is
supported.

see Mikkel's tomboy indexer which he sent on this list last month - it
does all the fields I believe. Perhaps you could use some of his code?


3) IndexLiferea
This adds a service type called "Web Channels" and a metadata class
"RSS".  This indexer sucks and I need some help on it. :(
Currently, you only get one entry in the database for each feed.  So
all the text in the feed is associated with the entire feed, instead
of an individual item.  For example, if I was to search for "tracker"
I'd expect a link to a specific post by Jamie, instead I get a link
to planet gnome.  I'm not even sure what I need here, I'd like some way
to associate a file with multiple database items.  Is this possible?

not sure - will have to think. The xml above for the extractor could be
modified to support multiple sub-entities with their own uri in one go.


<extraction>
        <entity uri="/home/jamie/music/moonlight.ogg">
                <metadata name="Audio.Title">Moonlight Sonata</metadata>
                <metadata name="Audio.Artist">Beethoven</metadata>
        </entity>

        <entity uri="/home/jamie/music/moonlight.ogg">
                <metadata name="Audio.Title">Moonlight Sonata</metadata>
                <metadata name="Audio.Artist">Beethoven</metadata>
        </entity>
</extraction>


so in the DB, they should be separate objects "RSS" feed and "RSS Item"
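
To illustrate how one extractor run could yield several database
objects, here is a sketch that reads the multi-entity xml. The element
names follow the example above (written in lowercase so the tags match);
the feed uris and the "RSS.Title" field in the sample are made up for
the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical extractor output for one feed file holding two items.
SAMPLE = """<extraction>
  <entity uri="/home/jamie/.liferea/planet.xml#1">
    <metadata name="RSS.Title">First post</metadata>
  </entity>
  <entity uri="/home/jamie/.liferea/planet.xml#2">
    <metadata name="RSS.Title">Second post</metadata>
  </entity>
</extraction>"""


def parse_entities(xml_text):
    """Return a list of (uri, [(name, value), ...]) tuples, one per entity."""
    root = ET.fromstring(xml_text)
    entities = []
    for entity in root.findall("entity"):
        fields = [(m.get("name"), m.text or "") for m in entity.findall("metadata")]
        entities.append((entity.get("uri"), fields))
    return entities
```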

You could also build the uri to include the rss file and an offset to
the item that matches and the gui can then decode it and show a viewer
for it
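
A sketch of that uri idea, assuming the offset is carried as a "#"
fragment on a file uri. The scheme, the function names and the meaning
of the offset (for example a byte position or an item index) are all
assumptions made for the illustration.

```python
from urllib.parse import quote, unquote


def make_item_uri(feed_path, offset):
    """Build a uri naming one item inside a feed file."""
    # quote() escapes '#' and spaces in the path, so the only '#' left
    # in the result is the one separating the offset.
    return "file://" + quote(feed_path) + "#" + str(offset)


def split_item_uri(uri):
    """Recover (feed_path, offset) from a uri built by make_item_uri."""
    location, _, fragment = uri.partition("#")
    path = unquote(location[len("file://"):])
    return path, int(fragment)
```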

I'm pretty happy with the man pages indexer, I may look into having
Yelp use it some time in the future.  But I'm not calling dibs, so anyone
else looking for a project to work on is more than welcome.

The tomboy indexer works as expected also.  I believe Tomboy is
dbus-ified, so if anyone wants to update tracker-search-tool to
search Notes also and fire up Tomboy when you click on a note,
that'd be awesome.

the service file can contain this - either an exec name or a dbus
interface/object name


Sample service file might look like:

[Service]
Type=Notes
WatchDirs=$HOME/.tomboy
WatchRecursive=false
WatchFilter=

[Metadata]
Exec=/usr/bin/tomboy-extractor

[TextFilter]
Exec=

[Display]
Exec=/usr/bin/tomboy
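
For illustration, a file in that shape can be read with an ini-style
parser. This sketch uses Python's configparser; the function name and
the returned dictionary keys are hypothetical, and splitting WatchDirs
on ";" and treating empty values as "not set" are assumptions about how
the format would work.

```python
import configparser
import os

SAMPLE = """\
[Service]
Type=Notes
WatchDirs=$HOME/.tomboy
WatchRecursive=false
WatchFilter=

[Metadata]
Exec=/usr/bin/tomboy-extractor

[TextFilter]
Exec=

[Display]
Exec=/usr/bin/tomboy
"""


def load_service(text):
    """Parse one service file into a plain dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case as written in the file
    parser.read_string(text)
    service = parser["Service"]
    dirs = [d for d in service["WatchDirs"].split(";") if d]
    return {
        "type": service["Type"],
        # Expand $HOME and similar variables from the environment.
        "watch_dirs": [os.path.expandvars(d) for d in dirs],
        "recursive": service.getboolean("WatchRecursive"),
        "filter": service["WatchFilter"] or None,
        "metadata_exec": parser["Metadata"]["Exec"] or None,
        "text_filter_exec": parser["TextFilter"]["Exec"] or None,
        "display_exec": parser["Display"]["Exec"] or None,
    }
```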


Any comments?

--
Mr Jamie McCracken
http://jamiemcc.livejournal.com/

Follow-Ups:
- Re: [Tracker] [WIP] Application support: man pages, Tomboy, & Liferea
  - From: Edward Duffy
- Re: [Tracker] [WIP] Application support: man pages, Tomboy, & Liferea
  - From: Eyal Oren

References:
- [Tracker] [WIP] Application support: man pages, Tomboy, & Liferea
  - From: Edward Duffy
