Re: [Tracker] tracker as full text index/search tool for a large collection of pdf, ps, djvu, dvi documents?




On Sun, 05 Oct 2008 23:15:09 +0300
Ivan Frade <ivan frade nokia com> wrote:

Hi, Meik,

On Sat, 04-10-2008 at 17:34 +0200, ext Meik Hellmund wrote:

 - It seems that Postscript, Dvi and Djvu documents are not fully
   indexed, only the metadata are used. How can I change this?

 You need to write a filter that prints the content of those files in
the standard output. Check the scripts in /usr/lib/tracker/filters. 

 Our convention is:
/usr/lib/tracker/filters/[mimetype]_filter

so application/pdf files are filtered with:
/usr/lib/tracker/filters/application/pdf_filter

 You need to write the filters for application/postscript,
application/x-dvi, application/x-dvi-tar and image/vnd.djvu

 Use the pdf filter as an example; it is very easy to write more.
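
The naming convention above can be sketched as a small shell helper. This is only an illustration: the function name filter_path and the variable TRACKER_FILTER_DIR are made up here and are not part of tracker; only the directory and the "[mimetype]_filter" pattern come from the message above.

```sh
#!/bin/sh
# Sketch only: filter_path and TRACKER_FILTER_DIR are hypothetical names.
# The directory and the "<mimetype>_filter" pattern follow the convention
# described above.
TRACKER_FILTER_DIR="${TRACKER_FILTER_DIR:-/usr/lib/tracker/filters}"

# Print the path where a filter for the given MIME type would be expected.
filter_path() {
    printf '%s/%s_filter\n' "$TRACKER_FILTER_DIR" "$1"
}

# Succeed if an executable filter exists for the given MIME type.
has_filter() {
    [ -x "$(filter_path "$1")" ]
}
```

For example, "filter_path application/x-dvi" prints the location where a Dvi filter would have to be placed, and "has_filter image/vnd.djvu" tells whether one is already installed there.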

Great. I now use:

/usr/lib/tracker/filters/application/postscript_filter:
     #!/bin/sh
     nice -n19 ps2txt  "$1" "$2"

/usr/lib/tracker/filters/application/x-dvi_filter:
     #!/bin/sh
     nice -n19 catdvi -e 0   "$1" > "$2"

(after "apt-get install catdvi") 

and it works fine for Postscript and Dvi. 
But Djvu is still not working:



 - It seems that Djvu files are classified as "images".
   This may be true in a technical sense, but djvu is a format
   especially adapted for scanned text and most djvu documents are
   scanned books and similar. 
   I think you should reclassify them as "documents".   

 In /usr/share/tracker/services/default.services you can see the
mime-types assigned to each category. Try to move the djvu mimetype to
the documents category (and reindex).

I added "image/vnd.djvu" to the "Mimes=.." line in the
[Documents] section of this file, but it didn't help.


  
On Sun, 5 Oct 2008 22:32:19 +0200
"Michael Biebl" <mbiebl gmail com> wrote:

For djvu, there is already a filter
/usr/lib/tracker/filters/text/djvu_filter

It should index the content of djvu files, but it requires the
djvulibre-bin package to be installed. (The tracker deb package has a
Recommends on this package.)

The filter itself works. According to Ivan's explanation about filter names, I also
copied it to  /usr/lib/tracker/filters/image/vnd.djvu_filter

But it isn't used by trackerd. I still get from "trackerd -v 3 -R":

   processing /home/hellmund/PS/no_series_187.djvu with action TRACKER_ACTION_CHECK and counter 0 mime is image/vnd.djvu
   for /home/hellmund/PS/no_series_187.djvu file extension is djvu
   file /home/hellmund/PS/no_series_187.djvu is indexable
   file /home/hellmund/PS/no_series_187.djvu has fulltext 0 with service Images 
   Indexing /home/hellmund/PS/no_series_187.djvu with service Images and mime image/vnd.djvu (new) 
   service id for Images is 6 and sid is 1279 with mime image/vnd.djvu

So it seems to me that it is not fulltext-indexed since it is categorized as an Image.
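
To check this for many files at once, the relevant log lines can be reduced to "path, fulltext flag, service". The following is a rough sketch that assumes the log format is exactly as in the excerpt above; the function name fulltext_status is made up for illustration.

```sh
#!/bin/sh
# Sketch only: fulltext_status is a hypothetical helper. The pattern is
# derived from the single log excerpt above and may not match other
# trackerd versions or verbosity levels.
# Reads log lines on stdin and prints "<path> <fulltext flag> <service>"
# for every "has fulltext" line; all other lines are dropped.
fulltext_status() {
    sed -n 's/^ *file \(.*\) has fulltext \([0-9][0-9]*\) with service \([A-Za-z]*\).*$/\1 \2 \3/p'
}
```

It could be used as "trackerd -v 3 -R 2>&1 | fulltext_status"; the "2>&1" is there only because I have not checked whether these messages go to stdout or stderr.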

I also did an "strace -f trackerd -R" and found that /usr/share/tracker/services/default.service
is never read by trackerd, only the /usr/share/tracker/services/*.metadata files are opened.

Any ideas?




 - How about compressed files? The documentation mentions that .gz
   files are supported. What about .bz2? Is it possible to add a
filter for other compression methods?

Let me ask this question again. I have a lot of .ps.gz and .ps.bz2 files
and at the moment they are not indexed by tracker. Of course disk space is cheap nowadays and
I could uncompress them all. But what is tracker's expected behaviour? 
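
Independently of what tracker does itself, the decompression step that a filter for such files would need can be sketched as follows. This is an assumption-laden sketch: the function name decompress_to_stdout is made up, and whether trackerd would call a filter at all for the MIME type it assigns to .gz or .bz2 files is exactly the open question here.

```sh
#!/bin/sh
# Sketch only: decompress_to_stdout is a hypothetical helper, not part of
# tracker. It writes the decompressed content of "$1" to stdout, choosing
# the tool by file extension; other files are passed through unchanged.
decompress_to_stdout() {
    case "$1" in
        *.gz)  gzip -dc -- "$1" ;;
        *.bz2) bzip2 -dc -- "$1" ;;
        *)     cat -- "$1" ;;
    esac
}
```

A wrapper filter could write the result to a temporary .ps file and hand that to the Postscript filter shown earlier.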

Many thanks for your time & answers!

Meik

-- 
Meik Hellmund
Mathematisches Institut, Uni Leipzig
e-mail: Meik Hellmund math uni-leipzig de
http://www.math.uni-leipzig.de/~hellmund


