Re: [Gnome-OCR] Integration of Tesseract-OCR...



Hi !

Ok, I'm more and more convinced that this project would be worth it and
I think I'll spend some time on it. But before starting, here are a few
questions.

1) Tesseract-OCR software has been released under Apache License 2.0

This license is known to be incompatible with the GPL because it is
more restrictive about patents:

« Apache Software License, version 2.0

This is a free software license but it is incompatible with the GPL. The
Apache Software License is incompatible with the GPL because it has a
specific requirement that is not in the GPL: it has certain patent
termination cases that the GPL does not require. (We don't think those
patent termination cases are inherently a bad idea, but nonetheless they
are incompatible with the GNU GPL.) »

Seen on: http://www.gnu.org/licenses/license-list.html

What are the implications of this for the use of this software inside a
GPLed project?

2) Refactoring the code

The main problem with Tesseract-OCR as it is now is that it has been
coded with C89 standards in mind and does not follow C99 practice at
all. One obvious problem is that porting it to 64-bit platforms will
require quite some work. For example:

...
typedef long INT32;
typedef unsigned int UINT32;
...

[Excerpt from ccutils/host.h]

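These lines just demonstrate that the authors of Tesseract did apply
the (wrong) 'long is an int' belief: on LP64 platforms such as 64-bit
Linux, 'long' is 64 bits wide, so INT32 would no longer be a 32-bit
type there. I can hardly resist quoting Henry Spencer here: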

« Contrary to the heresies espoused by some of the dwellers on the
Western Shore, `int' and `long' are not the same type. The moment of
their equivalence in size and representation is short, and the agony
that awaits believers in their interchangeability shall last forever and
ever once 64-bit machines become common. »
						-- Henry Spencer
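
To illustrate the point about 'long' and 'int', here is a minimal sketch
of how such typedefs could be written with the C99 exact-width types
from <stdint.h>. This is not taken from the Tesseract sources: only the
names INT32 and UINT32 come from the excerpt above. The two check
typedefs and check_fixed_width_types() are hypothetical helpers, and the
sketch assumes a platform that provides int32_t and uint32_t.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Same names as in the excerpt, but built on exact-width C99 types
   instead of 'long' and 'unsigned int'. */
typedef int32_t  INT32;
typedef uint32_t UINT32;

/* Hypothetical compile-time checks that work in plain C99: the array
   size becomes -1 (a compile error) if a type is not 32 bits wide. */
typedef char INT32_is_32_bits[(sizeof(INT32) * CHAR_BIT == 32) ? 1 : -1];
typedef char UINT32_is_32_bits[(sizeof(UINT32) * CHAR_BIT == 32) ? 1 : -1];

/* Hypothetical run-time sanity check: a 32-bit unsigned value must
   wrap around to zero when incremented past its maximum. */
static void check_fixed_width_types(void)
{
    UINT32 u = 0xFFFFFFFFu;

    u += 1u;
    assert(u == 0u);
    assert(sizeof(INT32) == sizeof(UINT32));
}
```

With the original 'typedef long INT32', the first check typedef would
fail to compile on an LP64 platform, which is exactly the kind of early
warning one wants before starting a 64-bit port.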

But that's not the only problem in Tesseract. After browsing the code
and investigating a bit (using Doxygen to generate some extra
documentation about class hierarchy), my conclusions are that:

- The code just breaks the whole spirit of the C99 type system and has
to be redone from scratch if we want some 64-bit compatibility;

- Looking at the (hairy) class hierarchy did not convince me that C++
was really required here; I would really go for C instead;

- Data structures are quite classical and should be taken from an
existing library (glib or others)... but this contradicts the fact that
I want desktop independence... So, for now, I just push this choice
onto the stack, hoping not to have to make a decision too soon;

- As the cleaning and refactoring of the code might take quite some
time, Alan Horkan suggested first coming up with a GNOME wrapper around
the existing interface and letting the back-end evolve behind it. This
is probably the best way to go, and at the same time it keeps alive the
hope of getting Tesseract into other projects.
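
To make the wrapper idea a bit more concrete, here is a rough sketch of
what a thin, desktop-independent C front-end could look like. Everything
in it is an assumption made for illustration: ocr_backend, ocr_run() and
the dummy back-end are hypothetical names, and nothing here reflects the
actual Tesseract interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stable interface: a front-end only sees this table, so
   the back-end behind it can be cleaned up or replaced later. */
typedef struct ocr_backend {
    const char *name;
    /* Writes the recognised text into 'out' (NUL-terminated, at most
       'out_size' bytes) and returns 0 on success, -1 on error. */
    int (*recognize)(const unsigned char *pixels, size_t width,
                     size_t height, char *out, size_t out_size);
} ocr_backend;

/* Placeholder back-end standing in for the existing code: it
   recognises nothing and always reports an empty string. */
static int dummy_recognize(const unsigned char *pixels, size_t width,
                           size_t height, char *out, size_t out_size)
{
    (void)pixels; (void)width; (void)height;
    if (out == NULL || out_size == 0)
        return -1;
    out[0] = '\0';
    return 0;
}

static const ocr_backend dummy_backend = { "dummy", dummy_recognize };

/* Single entry point that a GNOME (or any other) front-end would call. */
static int ocr_run(const ocr_backend *backend, const unsigned char *pixels,
                   size_t width, size_t height, char *out, size_t out_size)
{
    if (backend == NULL || backend->recognize == NULL)
        return -1;
    return backend->recognize(pixels, width, height, out, out_size);
}
```

The point of the function table is only that the front-end never names
the back-end's internals, so the refactoring can proceed independently.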

So, do all these choices appear to be ok, or am I a stupid git that
forgot something vital? :)


Well, that's about all...

As I am quite busy (and sloooow), I'll try to set up a Website and a
small SVN repository around Christmas and I'll keep you informed about
my progress.

Regards
-- 
Emmanuel Fleury              | Office: 261
Associate Professor,         | Phone: +33 (0)5 40 00 69 34
LaBRI, Domaine Universitaire | Fax:   +33 (0)5 40 00 66 69
351, Cours de la Libération  | email: emmanuel fleury labri fr
33405 Talence Cedex, France  | URL: http://www.labri.fr/~fleury


