Re: [Gimp-user] approches used for language detection on images ...
- From: Liam R E Quin <liam holoweb net>
- To: JWein <forums gimpusers com>, gimp-user-list gnome org
- Subject: Re: [Gimp-user] approches used for language detection on images ...
- Date: Wed, 29 Jan 2020 15:07:10 -0500
On Wed, 2020-01-29 at 13:52 +0100, JWein wrote:
You need (1) feature extraction, finding the writing, (2) OCR of some
sort, to turn pictures of letters into letters, and then (3) the
linguistic analysis.
Hey Liam:
Thank you, and yes, I could guess the way to go would be through the
steps you outline, but I am pretty sure some other GIMP developers have
trodden those paths before and may have some tips to share.
I doubt it.
There _are_ some people who use GIMP to clean up images preparatory to
running OCR on them, or have been in the past, but there are much
better programs for that.
I asked you about text cleansing (cleaning) because it has different
meanings in different contexts; i'm *certainly* not interested in
losing the page apparatus or hyphenation information, although in my
own work i mark them so software can skip them when wanted.
If you're doing an academic study of a book “manifestation” such things
are important, but i had rather use the Text Encoding Initiative as a
model than Michael Hart’s flailing Gutenberg project.
I do the same kinds of things you do
I doubt that, at least from your description, but some of it may be a
language issue in reading the tone of your message. If you are doing
natural language processing and semantic-Web-style text mining your
needs for texts overlap with my personal projects but not so much with
GIMP, which is a bitmap image editor. For example, detecting Greek
words and phrases included in a 30,000-page OCR'd text by analyzing the
page images would interest me (and detecting italics for that matter);
if i ever have a spare few days i plan to try the (then) latest
Tesseract for that.
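
As a rough illustration of the linguistic-analysis step only: the sketch
below works on text that OCR has already produced, not on page images,
and it is not the Tesseract-based approach mentioned above. It flags
Greek words by looking at Unicode character names from Python's standard
unicodedata module. The names script_of and greek_runs are made up for
this example, and taking the first word of a character's Unicode name as
its "script" is a simplifying assumption.

```python
import unicodedata

# Punctuation trimmed from the edges of a word before it is reported.
EDGE_PUNCT = ".,;:!?()[]\"'"


def script_of(word):
    """Guess the dominant script of a word from Unicode character names.

    Returns the first token of the most common character name among the
    word's letters (e.g. 'GREEK', 'LATIN'), or None if it has no letters.
    """
    counts = {}
    for ch in word:
        if ch.isalpha():
            # e.g. 'GREEK SMALL LETTER LAMDA' -> 'GREEK'
            script = unicodedata.name(ch, "").split(" ")[0]
            counts[script] = counts.get(script, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)


def greek_runs(text):
    """Return runs of consecutive Greek words found in text."""
    runs = []
    current = []
    for word in text.split():
        if script_of(word) == "GREEK":
            current.append(word.strip(EDGE_PUNCT))
        elif current:
            runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs


if __name__ == "__main__":
    sample = "He wrote γνῶθι σεαυτόν on the wall, and later λόγος."
    print(greek_runs(sample))
```

This only helps if the OCR engine emitted Greek code points in the first
place; an engine run with a Latin-only model would more likely produce
Latin look-alikes, which is why working from the page images is the
harder and more interesting problem.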
--
Liam Quin - web slave for https://www.fromoldbooks.org/