[Gimp-user] approaches used for language detection on images ...



You need (1) feature extraction, finding the writing, (2) OCR of some
sort, to turn pictures of letters into letters, and then (3) the
linguistic analysis.

 Hey Liam:

Thank you, and yes, I had guessed the way to go would be through the steps you
outline, but I am pretty sure some other GIMP developers have trodden those
paths before and may have some tips to share.
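
For step (3), the linguistic analysis, here is a minimal sketch of one common
approach: counting stopwords in the text that OCR has already produced. The
stopword lists and the name guess_language are illustrative assumptions, far
too small for real use, and this is not how any particular GIMP plug-in works.

```python
import re

# Tiny hand-picked stopword lists: illustrative assumptions, not a language model.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "es"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
}


def guess_language(text):
    """Return the code whose stopwords occur most often, or None if none occur."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Running it per sentence or per line rather than per page would be one way to
cope with texts that switch language, at the cost of less evidence per guess.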
 
However, many images contain metadata in plain text (OK, XML or
whatever) that may include language and location information.

Most of the texts I work on are image-based PDF files, that is, pages that
were scanned as images.
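
For files that do carry such metadata, a rough sketch of how one might look for
it: XMP packets are XML and can mark text with xml:lang attributes, so a scan
of the raw bytes sometimes finds them. This assumes the packet is stored
uncompressed; the function name declared_languages is made up for illustration.

```python
import re

# Matches xml:lang="..." or xml:lang='...' in raw bytes.
XML_LANG = re.compile(rb'xml:lang\s*=\s*["\']([A-Za-z0-9-]+)["\']')


def declared_languages(data):
    """Return distinct xml:lang values found in the bytes, in order of first appearance."""
    seen = []
    for match in XML_LANG.finditer(data):
        tag = match.group(1).decode("ascii")
        if tag not in seen:
            seen.append(tag)
    return seen
```

Typical use would be declared_languages(open(path, "rb").read()), where path
is whatever file you are inspecting. A value such as "x-default" only says
that a default was declared, not which language it is.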

I'm interested in the text cleansing, can you tell me more (off list
maybe?)

"Text cleansing" or "text normalization" (as it is also called, although to
most people normalization is just one phase of cleansing, for example making
sure that the text is normalized in a java.text.Normalizer.Form way) means
removing all the distracting visual clutter and the ephemeral commercial
nonsense from pages.
 
 https://www.google.com/search?q="text+cleansing"
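
To make the java.text.Normalizer.Form sense of "normalized" concrete, here is a
small sketch using Python's standard unicodedata module, which offers the same
Unicode normalization forms (NFC, NFD, NFKC, NFKD). The wrapper name
normalize_text is an assumption made for this example.

```python
import unicodedata


def normalize_text(text, form="NFC"):
    """Return text in the given Unicode normalization form."""
    return unicodedata.normalize(form, text)


# A letter followed by a combining accent becomes one precomposed character.
composed = normalize_text("e\u0301", "NFC")

# Compatibility forms also fold ligatures, which OCR output may contain.
folded = normalize_text("\ufb01le", "NFKC")
```

Here composed is the single character U+00E9 and folded is the plain string
"file". Which form to use depends on whether you want to keep such
distinctions or erase them.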

For example, gutenberg.org has made the effort to turn lots of books into
plain text, but the files include a nonsensical header and footer and use hard
line breaks (something that was necessary back when people used mainframes
whose displays were 80 characters wide, ...)
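
A rough sketch of that kind of cleanup, assuming the file uses the
"*** START OF ... PROJECT GUTENBERG ..." and "*** END OF ... PROJECT GUTENBERG
..." marker lines (older files may differ) and that paragraphs are separated
by blank lines. The function names are made up for this example.

```python
import re

# Marker lines as assumed above; anything outside them is treated as boilerplate.
START = re.compile(r"^\*\*\* ?START OF .*PROJECT GUTENBERG.*$", re.MULTILINE)
END = re.compile(r"^\*\*\* ?END OF .*PROJECT GUTENBERG.*$", re.MULTILINE)


def strip_boilerplate(text):
    """Keep only the text between the start and end marker lines, if present."""
    start = START.search(text)
    end = END.search(text)
    begin = start.end() if start else 0
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


def unwrap(text):
    """Join hard-wrapped lines so that each paragraph becomes a single line."""
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())
```

Note that unwrap also flattens verse, tables and anything else whose line
breaks are meaningful, so it is only safe on running prose.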

This kind of nonsense has become the new normal. I work as a teacher and I see
it as abusive, especially when it is done to students and to people who are
just trying to get something done. Companies internally block certain sites,
types of content, pages and sections of pages; it is about time that people
started doing the same more aggressively on their own. Some people will tell
you about "user agreements", "morality" and "capitalism going down if people
start doing that more aggressively" ;-)

I do the same kinds of things you do, but these days I am more interested in
texts, especially if they relate to education. One of my research efforts
relates to a corpus of the Regents exams (going back to the 1860s). They
contain plenty of pictures interspersed with the text, no annotations
whatsoever, frequent language switches in the texts ...

-- 
JWein (via www.gimpusers.com/forums)

