You need (1) feature extraction, finding the writing, (2) OCR of some
sort, to turn pictures of letters into letters, and then (3) the
linguistic analysis.

Hey Liam:

Thank you, and yes, I could guess the way to go would be through the
steps you outline, but I am pretty sure some other GIMP developers have
trodden these paths before and may have some tips to share.

I doubt it.

There _are_ some people who use GIMP to clean up images preparatory to
running OCR on them, or have been in the past, but there are much
better programs for that.

I asked you about text cleansing (cleaning) because it has different
meanings in different contexts; i'm *certainly* not interested in
losing the page apparatus or hyphenation information, although in my
own work i mark them so software can skip them when wanted.

If you're doing an academic study of a book “manifestation” such things
are important, but i had rather use the Text Encoding Initiative as a
model than Michael Hart’s flailing Gutenberg project.

I do the same kinds of things you do 

I doubt that, at least from your description, but some of it may be a
language issue in reading the tone of your message. If you are doing
natural language processing and semantic-Web-style text mining your
needs for texts overlap with my personal projects but not so much with
GIMP, which is a bitmap image editor. For example, detecting Greek
words and phrases included in a 30,000-page OCR'd text by analyzing the
page images would interest me (and detecting italics for that matter);
if i ever have a spare few days i plan to try the (then) latest
Tesseract for that.
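
As a rough illustration of the text side of that problem only, here is a
minimal Python sketch that flags Greek-script words in text that has
already been through OCR. It does not analyze page images, so it is a
different and much simpler technique than the image-based detection
described above, and it would miss Greek that the OCR engine misread as
Latin letters. The function names (is_greek_char, find_greek_words) are
made up for this example; only the standard library is used.

```python
import string
import unicodedata

# Punctuation to trim from the edges of a token; the extra characters
# (curly quotes, middle dot) are an assumption about typical OCR output.
EDGE_PUNCT = string.punctuation + "“”‘’·"


def is_greek_char(ch: str) -> bool:
    """True if the Unicode name of ch starts with GREEK.

    This covers both the basic Greek block and the polytonic
    (Greek Extended) block.
    """
    return unicodedata.name(ch, "").startswith("GREEK")


def find_greek_words(text: str) -> list[str]:
    """Return whitespace-separated tokens whose letters are all Greek."""
    found = []
    for token in text.split():
        # Combining accents are not alphabetic, so they are ignored here.
        letters = [c for c in token if c.isalpha()]
        if letters and all(is_greek_char(c) for c in letters):
            found.append(token.strip(EDGE_PUNCT))
    return found


if __name__ == "__main__":
    sample = "The word λόγος appears here, as does ἀρχή."
    for word in find_greek_words(sample):
        print(word)
```

Run per page of OCR output, something like this would at least give a
list of candidate pages to check against the images by hand.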

