Re: Beagle roadmap.



On Thu, 2004-10-07 at 05:36, Veerapuram Varadhan wrote:
> On Tue, 2004-10-05 at 10:37 -0400, Nat Friedman wrote:
> > We have no objection to using useful, well-maintained libraries in
> > Beagle, wherever they come from.  There's no corporate agenda on that
> > point.  I doubt customers would even notice.
> > 
> Yes.  We won't mind using such libraries.
> 

Great!

> > I don't know if there are technical issues with using AbiWord libs, but
> > I think in general we'd be utterly psyched to have this functionality.
> > I'll let the full-time hackers chime in if they have issues.  My guess
> > is we'd like to see the code :-).
> IINW, Martin wants to have a "running instance of Abiword" to filter
> word processor docs (??).
>   We really don't have issues if:
> 
> 1)  Instead of Abiword process, a libAbiwordFilter can be used, which
> will make beagle "sort-of" independent and beagle filters don't have to
> spawn "Abiword" process for each and every word processor document.
> (correct me if your code works in a different way)
> 
> 2)  We can filter the contents in single pass, (ie) pulling out the text
> along with the attributes in the first pass itself.  I really don't know
> how Martin's code works, however, I am just listing out things that a
> beagle filter needs. :-)
> 
> The general rules of thumb that beagle filters use are:
> 
> 1)  Try to use well-maintained, usable libraries.
> 2)  If no such libraries exists, write our own filters using file-format
> specifications, as beagle filters require very basic information from
> files/file format structures.
> 3)  If 1) and 2) doesn't exist or not found and there are certain word
> processors available that support these formats, try to extract the part
> of code and try to port it.
> 4)  As a last resort, "spawn a process", but this might not give all the
> information that "beagle" requires.
> 
> Jon.. whats your opinion on these thumb-rules???  [probably we should
> update the wiki with these thumb-rules...]
> 

OK here's what I want to do. A large fraction of the code I'm explaining
already exists.

Firstly I'll explain just a little of how abiword works so you can
understand why I suggest the following techniques.

AbiWord has a document model that is independent of the views that
interpret the document and present it.

Reading a document is then the process of filling the document with
valid data. AbiWord has a number of different document filters that do
this. We currently have filters that do a good job of reading:

1. AbiWord (abw), of course!
2. MS Word
3. RTF
4. WordPerfect
5. OpenOffice
6. Text
7. XHTML

and a not-so-good job of loading many other document types.

Now normally a view is built from this document and the user can
edit it, print it, etc.

However, we also have a series of command-line tools that take this
document and, instead of creating an editable view, do other
things with it. In particular we can:
A. Print it.
B. Export it via a filter to a particular document type.
C. Create a png image of the first page.

In addition to this, AbiWord has a command-line plugin. If you load
this plugin then, instead of jumping into the gtk main loop, you're
put into a command line with access to the AbiWord address space.

Here you can do a number of specific things like load documents, save
documents, find/replace, etc.

OK so here is what I've done.

I've written some C functions (and a little main() function to
demonstrate the code) that popen an abiword binary invoked with
the command-line plugin enabled.

This enables me to remotely control the abiword process via standard
unix pipes.
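
The remote-control idea can be sketched in C roughly as follows. This is not the code described in this mail, only a minimal illustration under stated assumptions: the Remote struct and the remote_start(), remote_command() and remote_stop() names are made up for the example, and the child is assumed to answer each command line with exactly one line of output. Since popen() gives a pipe in one direction only, the sketch uses pipe(), fork() and execvp() to get both directions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handle for a child process driven over two pipes. */
typedef struct {
    pid_t pid;
    FILE *to_child;    /* we write commands here (the child's stdin) */
    FILE *from_child;  /* we read replies here (the child's stdout)  */
} Remote;

/* Start argv[0] with its stdin and stdout connected to us.
 * Returns 0 on success, -1 on failure. */
static int remote_start(Remote *r, char *const argv[])
{
    int to_fd[2], from_fd[2];

    assert(r != NULL && argv != NULL && argv[0] != NULL);
    /* Writing to a dead child should return an error, not kill us. */
    signal(SIGPIPE, SIG_IGN);

    if (pipe(to_fd) < 0)
        return -1;
    if (pipe(from_fd) < 0) {
        close(to_fd[0]); close(to_fd[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(to_fd[0]); close(to_fd[1]);
        close(from_fd[0]); close(from_fd[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: wire the pipe ends to stdin/stdout, then exec. */
        dup2(to_fd[0], STDIN_FILENO);
        dup2(from_fd[1], STDOUT_FILENO);
        close(to_fd[0]); close(to_fd[1]);
        close(from_fd[0]); close(from_fd[1]);
        execvp(argv[0], argv);
        _exit(127);  /* only reached if exec failed */
    }

    /* Parent: keep only the ends we use. */
    close(to_fd[0]);
    close(from_fd[1]);
    r->pid = pid;
    r->to_child = fdopen(to_fd[1], "w");
    r->from_child = fdopen(from_fd[0], "r");
    return (r->to_child && r->from_child) ? 0 : -1;
}

/* Send one command line and read one reply line (newline stripped).
 * Returns 0 on success, -1 if the child has gone away. */
static int remote_command(Remote *r, const char *cmd, char *reply, size_t len)
{
    if (fprintf(r->to_child, "%s\n", cmd) < 0 || fflush(r->to_child) != 0)
        return -1;
    if (fgets(reply, (int)len, r->from_child) == NULL)
        return -1;
    reply[strcspn(reply, "\n")] = '\0';
    return 0;
}

/* Close the pipes (the child sees EOF on stdin) and reap the child. */
static void remote_stop(Remote *r)
{
    fclose(r->to_child);
    fclose(r->from_child);
    waitpid(r->pid, NULL, 0);
}
```

With cat standing in for the child, remote_command(&r, "hello", reply, sizeof reply) leaves "hello" in reply. A real AbiWord session would need the plugin's actual command syntax, which is not shown here.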

The idea is for the index program to pass abiword a document. AbiWord
loads it into its model and then exports it to a Beagle-friendly
format. I've set it up to initially export to text. Then
Beagle can index the document.

Having loaded the document into AbiWord, beagle could do a number of
other things with it. 

Like: 
A. Extract metadata (Title, author name, date)
B. Make a png snapshot of the document.

When Beagle is finished with this document, it simply sends a new
document to the remote AbiWord binary; the old document is discarded
and the new document is loaded.

Now the C functions present a very Beagle-friendly way to use abiword.

Here is the current API:

int  convertFileToText(const char * inFile, const char * outFile);
int  convertFileToPNG(const char * inFile, const char * outFile,
                      int iWidth, int iHeight);
int  finalizeConversions(void);

The C functions take care of everything: making sure the remote
abiword is running, setting up the communication pipes, and
cleaning up afterward.
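
From the caller's side, an indexing loop over such an API might look like the sketch below. It is an illustration only: convert_fn, fake_convert() and convert_all() are made-up names, fake_convert() is a stand-in that does no real conversion, and the convention that 0 means success is an assumption, since the return values of the real functions are not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Same shape as convertFileToText() in the API above. */
typedef int (*convert_fn)(const char *inFile, const char *outFile);

/* Stand-in converter, for illustration only: it "fails" on any name
 * containing "weird" and otherwise writes a placeholder text file. */
static int fake_convert(const char *inFile, const char *outFile)
{
    if (strstr(inFile, "weird") != NULL)
        return -1;
    FILE *out = fopen(outFile, "w");
    if (out == NULL)
        return -1;
    fprintf(out, "text of %s\n", inFile);
    return fclose(out) == 0 ? 0 : -1;
}

/* Convert each document in turn. A document that fails is recorded in
 * failed[] and skipped; the loop carries on with the next one.
 * ASSUMPTION: the converter returns 0 on success.
 * failed[] must have room for n entries. Returns the failure count. */
static size_t convert_all(convert_fn convert, const char *const docs[],
                          size_t n, const char *outFile,
                          const char *failed[])
{
    size_t nfailed = 0;

    assert(convert != NULL);
    for (size_t i = 0; i < n; i++) {
        if (convert(docs[i], outFile) != 0) {
            failed[nfailed++] = docs[i];
            continue;
        }
        /* A real indexer would read outFile and index its text here. */
    }
    return nfailed;
}
```

Passing the converter as a function pointer keeps the sketch independent of the real library; in actual use the pointer would be convertFileToText, followed by one finalizeConversions() call after the loop.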

A new abiword is not spawned for every document. We reuse the currently
running one until it crashes or you call finalizeConversions().
If the remote AbiWord crashes, a new one is spawned.
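
The reuse-until-crash policy can be sketched like this. Again this is a guess at the shape, not the actual code: worker_pid, spawn_count, worker_alive(), ensure_worker() and finalize_worker() are hypothetical names, and the command to run is passed in by the caller rather than hard-coded.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t worker_pid = -1;  /* -1 means no worker is running */
static int   spawn_count = 0;  /* how many workers we have started */

/* Return 1 if the worker exists and has not exited, 0 otherwise.
 * A worker that has exited or crashed is reaped and forgotten. */
static int worker_alive(void)
{
    int status;

    if (worker_pid <= 0)
        return 0;
    if (waitpid(worker_pid, &status, WNOHANG) == 0)
        return 1;              /* still running */
    worker_pid = -1;           /* exited, crashed, or already reaped */
    return 0;
}

/* Make sure a worker is running, spawning one only when needed.
 * Returns 0 on success, -1 if fork() failed. */
static int ensure_worker(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);
    if (worker_alive())
        return 0;              /* reuse the current process */

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);            /* only reached if exec failed */
    }
    worker_pid = pid;
    spawn_count++;
    return 0;
}

/* Counterpart of a "finalize" call: stop the worker if there is one. */
static void finalize_worker(void)
{
    if (worker_pid > 0) {
        kill(worker_pid, SIGTERM);
        waitpid(worker_pid, NULL, 0);
        worker_pid = -1;
    }
}
```

Calling ensure_worker() before each conversion means the common case costs one non-blocking waitpid(), and a crash only costs one extra fork() and exec on the next document.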

The advantages of doing things this way are:

1. We AbiWord developers don't have to maintain a separate library just
for Beagle. As the filters for AbiWord are improved in the continuous
open-source way, Beagle gets them too.

2. You don't have to write a filter for every WP format under the sun.
Just improve the ones we already have. (Which generally happens in the
continuous open-source way anyway.)

3. When the inevitable happens and some weird document crashes
AbiWord, Beagle is totally isolated. All that happens is that the weird
document is not indexed. In fact Beagle might want to make a note of
that since it could be useful information. When the next document is
presented for conversion, a new process is spawned and we carry on
as ever.

4. There is very little performance penalty for doing things this way.
The process of filling the AbiWord document model is far faster
than building a view. All you need to do is pass a filename and then
read the resulting text file.

5. If you want to take the additional time to make a png snapshot of
the first page, you really need a full-blown word processor to lay out
the text the way the user will see it. This approach allows that.

6. I haven't written the code yet but it would be easy to write
additional functions to extract meta-data from the document after
loading it. 

I started to explain all this to trow at GUADEC-5 but unfortunately we
were defeated by lack of time. 

Most of this code was written about 9 months ago and has suffered from
some bit rot. I'll get the C functions running again and post them to
the list. They're only about 200 LOC and very easily converted to C#,
I'm sure.

Cheers

Martin


> Best Regards,
> 
> V. Varadhan.
> 
> 



