Re: [Tracker] tracker as full text index/search tool for a large collection of pdf, ps, djvu, dvi documents?

From: Ivan Frade <ivan frade nokia com>
To: ext Meik Hellmund <Meik Hellmund math uni-leipzig de>
Cc: tracker-list gnome org
Subject: Re: [Tracker] tracker as full text index/search tool for a large collection of pdf, ps, djvu, dvi documents?
Date: Sun, 05 Oct 2008 23:15:09 +0300

Hi, Meik,

On Sat, 04-10-2008 at 17:34 +0200, ext Meik Hellmund wrote:

 - It seems that Postscript, Dvi and Djvu documents are not fully
   indexed, only the metadata are used. How can I change this?


 You need to write a filter that prints the content of those files to
standard output. Check the scripts in /usr/lib/tracker/filters.

 Our convention is:
/usr/lib/tracker/filters/[mimetype]_filter

so application/pdf files are filtered with:
/usr/lib/tracker/filters/application/pdf_filter
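
The naming convention above can be sketched as a small helper. This is only
an illustration: the function name filter_path is hypothetical and not part
of Tracker; only the directory and the "_filter" suffix come from the
convention described here.

```python
import posixpath

# Directory named in this message; the helper itself is hypothetical.
FILTER_ROOT = "/usr/lib/tracker/filters"


def filter_path(mimetype, root=FILTER_ROOT):
    """Return where the filter for `mimetype` is expected to live.

    The type part of the MIME type ("application") becomes a subdirectory
    and the subtype ("pdf") gets the "_filter" suffix.
    """
    return posixpath.join(root, mimetype + "_filter")


if __name__ == "__main__":
    for mime in ("application/pdf", "image/vnd.djvu"):
        print(mime, "->", filter_path(mime))
```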

 You need to write filters for application/postscript,
application/x-dvi, application/x-dvi-tar and image/vnd.djvu.

 Use the pdf filter as an example; writing more of them is very easy.
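
A minimal sketch of such a filter follows. It assumes, without having checked
the shipped pdf filter, that Tracker runs the filter with the path of the
document as its first argument and reads the extracted text from standard
output. The external tools (ps2ascii from Ghostscript, catdvi, and djvutxt
from DjVuLibre) must be installed separately; the names EXTRACTORS,
extractor_command and run_filter are made up for this example.

```python
#!/usr/bin/env python3
import subprocess
import sys

# MIME type -> command that prints the document text to standard output.
# Each tool is a separate package and is an assumption of this sketch.
EXTRACTORS = {
    "application/postscript": ["ps2ascii"],
    "application/x-dvi": ["catdvi"],
    "image/vnd.djvu": ["djvutxt"],
}


def extractor_command(mimetype, path):
    """Build the argument list that extracts text from `path`."""
    return EXTRACTORS[mimetype] + [path]


def run_filter(mimetype, path):
    """Run the extractor; its output goes straight to our stdout."""
    return subprocess.call(extractor_command(mimetype, path))


# Assumption: the file to index arrives as the first argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(run_filter("image/vnd.djvu", sys.argv[1]))
```

Saved as an executable file named
/usr/lib/tracker/filters/image/vnd.djvu_filter, this would handle DjVu
files; for the other types, copy it and change the MIME type in the last
line. A reindex is probably needed before existing files are picked up.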

 - It seems that Djvu files are classified as "images".
   This may be true in a technical sense, but djvu is a format
   especially adopted for scanned text and most djvu documents are
   scanned books and similar. 
   I think you should reclassify them as "documents".


 In /usr/share/tracker/services/default.services you can see the
mime-types assigned to each category. Try to move the djvu mimetype to
the documents category (and reindex).


 - How about compressed files? The documentation mentions that .gz
   files are supported. What about .bz2? Is it possible to add a filter
   for other compression methods?

 - Are there plans to extend the query capabilities with respect to
   the full text index? E.g., query for documents containing this but
   not that word, or containing some words in a small distance from
   each other?


 We will try to move to sqlite FTS as the index engine. It offers a rich
query language with OR, AND and (not sure) proximity.
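
To give an idea of what such queries look like, here is a small sketch using
SQLite directly from Python. It is not Tracker's query interface. It uses
the FTS5 module, which is newer than this message and has to be compiled
into the local SQLite build; the table name, file names and sample texts are
invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `path` is stored but not indexed; only `body` is searchable.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(path UNINDEXED, body)")
conn.executemany(
    "INSERT INTO docs (path, body) VALUES (?, ?)",
    [
        ("a.pdf", "spectral theory of operators"),
        ("b.ps", "linear algebra and matrices"),
        ("c.djvu", "spectral methods appear in many fields "
                   "far away from the study of operators"),
        ("d.dvi", "spectral sequences in topology"),
    ],
)


def search(query):
    """Return the sorted paths of the documents matching an FTS5 query."""
    rows = conn.execute("SELECT path FROM docs WHERE docs MATCH ?", (query,))
    return sorted(row[0] for row in rows)


# Both words, either word, one word but not the other:
print(search("spectral AND operators"))
print(search("spectral OR algebra"))
print(search("spectral NOT operators"))
# Proximity: at most 3 tokens between the two words.
print(search("NEAR(spectral operators, 3)"))
```

The last two queries correspond to the "this but not that word" and "words
in a small distance from each other" cases asked about above.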

 - At the moment my collection of documents is mostly organized in a
   hierarchy of  directories. Is it possible to take this into
   account  in queries, e.g., query only for documents from a
   subtree of the indexed tree?


 I think so, but I would need to check the RDF query examples.

I know this is quite a list of questions. Any pointers to answers of
any of them are really welcome. 

Of course, tracker may simply be the wrong tool for what I want. Any
pointers to alternatives are also welcome.


I think tracker can fulfil your requirements. Testing and feedback are
always welcome, thanks!

Ivan

