Re: Beagle and supported data sources


On Fri, 2006-11-17 at 14:42 +0100, Michal Pryc wrote:
> So if anyone can tell me how is with the supported data types?

It's important to separate data sources from file types.  A data source
can (in theory) produce items of any file type.  For example, the file
system data source produces text files, image files, PDFs, etc.  In your
list, you are combining the two.  Email data sources produce not only
RFC 822 emails, but also attachments of any file type contained inside
those emails.  Some data sources produce only one file type: the Gaim
log backend, for example, produces only Gaim logs.
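
To make the distinction concrete, here is a minimal sketch in Python.  It
is not Beagle's code (Beagle is written in C#); the Item class, the three
source functions, the item names, and the Gaim log type label are all made
up for illustration.  It only shows that one data source can emit items of
many file types, while another emits just one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Set


@dataclass
class Item:
    source: str     # which data source produced the item
    mime_type: str  # which file type the item is
    name: str


def file_system_source() -> Iterator[Item]:
    # One data source, many file types.
    yield Item("files", "text/plain", "notes.txt")
    yield Item("files", "image/jpeg", "photo.jpg")
    yield Item("files", "application/pdf", "paper.pdf")


def email_source() -> Iterator[Item]:
    # The RFC 822 message itself, plus an attachment of another type.
    yield Item("email", "message/rfc822", "message-1")
    yield Item("email", "application/msword", "message-1/report.doc")


def gaim_log_source() -> Iterator[Item]:
    # A single-type source; the type label here is hypothetical.
    yield Item("gaim", "x-example/gaim-log", "conversation.log")


def types_by_source(
    sources: Iterable[Callable[[], Iterator[Item]]]
) -> Dict[str, Set[str]]:
    """Group the file types seen by the data source that produced them."""
    seen: Dict[str, Set[str]] = {}
    for make_items in sources:
        for item in make_items():
            seen.setdefault(item.source, set()).add(item.mime_type)
    return seen
```

Calling types_by_source([file_system_source, email_source,
gaim_log_source]) groups three types under "files", two under "email",
and one under "gaim".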

For all of the data sources, there is custom code to extract the data.
Only the Evolution Data Server backend uses an external library.  It
uses evolution-sharp, which wraps the evolution-data-server C APIs.

We extract metadata from all of the supported file types, and extract
full text from all supported file types that have it.  In almost all
cases these are handled at the same time, by the same code.

For the file types, many of them have custom parsing code.  I'm only
going to list the ones for which we use special libraries or external
programs:

        * Emails - gmime-sharp
        * MS Word - wv1, optionally gsf-sharp
        * MS Excel - An external program from gnumeric called ssindex
        * MS Powerpoint - gsf-sharp
        * PDF - We run the external pdfinfo and pdftotext programs from
        xpdf and parse the output
        * HTML - A modified HtmlAgilityPack included in the Beagle
        source tree
        * Windows help files (chm) - chmlib
        * Image files - custom code, mostly copied from F-Spot
        * Audio files - entagged-sharp, included in the Beagle source
        tree.  Plans are to move to taglib-sharp, which is what Banshee
        uses.
        * Video files - External programs from either MPlayer or Totem
        * RPM - The rpm program itself
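
As an illustration of the "run an external program and parse the output"
approach used for PDFs, here is a rough Python sketch.  It is not the
Beagle filter; the function names are made up.  It assumes that pdfinfo
and pdftotext are on the PATH, that pdfinfo prints one "Key: value" pair
per line, and that "pdftotext file -" writes the text to standard output.

```python
import subprocess
from typing import Dict, Tuple


def parse_pdfinfo_output(text: str) -> Dict[str, str]:
    """Turn pdfinfo's 'Key:   value' lines into a dictionary."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first colon only, so values that contain colons
        # (dates and times, for instance) stay in one piece.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def extract_pdf(path: str) -> Tuple[Dict[str, str], str]:
    """Return (metadata, full text) for one PDF file."""
    meta = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    ).stdout
    body = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    ).stdout
    return parse_pdfinfo_output(meta), body
```

The parsing is kept apart from the subprocess calls so that it can be
tried on captured output without the xpdf tools installed.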

The full list of filters in the source tree is here:

You can get more information on the specific .NET namespaces (e.g.,
System.Xml) from there.  (Hint: grep for "using")
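
If you would rather have a tally than raw grep output, something along
these lines should work.  It is only a sketch: the function names are
made up, the directory you pass in is whatever local checkout of the
filters you have, and the regular expression deliberately matches only
plain "using Namespace;" directives, not using statements or aliases.

```python
import re
from collections import Counter
from pathlib import Path
from typing import List

# Matches lines such as "using System.Xml;" but not "using (var r = ...)"
# statements or "using Alias = Some.Type;" alias directives.
USING_RE = re.compile(r"^\s*using\s+([A-Za-z_][\w.]*)\s*;", re.MULTILINE)


def namespaces_in(source: str) -> List[str]:
    """Return the namespaces imported by one C# source text."""
    return USING_RE.findall(source)


def count_namespaces(root: str) -> Counter:
    """Count using directives across every .cs file under root."""
    counts: Counter = Counter()
    for path in Path(root).rglob("*.cs"):
        text = path.read_text(encoding="utf-8", errors="replace")
        counts.update(namespaces_in(text))
    return counts
```

count_namespaces(directory).most_common() then lists the namespaces with
the most frequently imported ones first.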

