Re: [orca-list] PDF readers....yet again?



While on the subject of extracting data from PDFs, anyone have any
tips for cleaning up the output of pdftotext?

Things like:

-Stripping out page headers/footers that are useful for finding
specific information in printed books but just clutter up a text file
and add interruptions where content bridges pages.
-Removing excessive whitespace... Making lines that only contain
whitespace characters blank, collapsing multiple blank lines into one,
removing hanging indents, etc.
-Autowrapping paragraphs and lists (Nano can do this, but it assumes
everything between one blank line and the next is to be treated as a
single paragraph... not good if there's only a single newline between
paragraphs or for lists/tables with only blank lines at their
beginning and end).
-Ensuring tables survive the conversion.
-Converting mathematical expressions, especially ones using symbols
with no ASCII-friendly equivalent (I have a script that uses iconv for
converting UTF-8 to ASCII since my console screen reader doesn't
handle Unicode well, but iconv will just quit mid-sentence if it
hits a character it has no ASCII equivalent for).
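
To make the whitespace and ASCII items above a bit more concrete, here
is a rough Python sketch. It is only a sketch under assumptions: the
names clean_whitespace, to_ascii and PUNCT, and the idea of spelling out
a symbol's Unicode name in brackets, are made up for illustration.
Stripping leading whitespace also flattens anything that depends on
column alignment, so this does nothing for tables, headers/footers or
rewrapping.

```python
import unicodedata

# Hypothetical table of common typographic characters with an obvious
# ASCII stand-in; extend it as needed.
PUNCT = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "--",  # en dash, em dash
    "\u00a0": " ",                  # no-break space
    "\u2022": "*",                  # bullet
}


def clean_whitespace(text):
    """Blank out whitespace-only lines, drop indents, collapse blank runs."""
    # pdftotext marks page breaks with a form feed; treat it as a line break.
    lines = text.replace("\f", "\n").split("\n")
    out = []
    for line in lines:
        line = line.strip()  # whitespace-only lines become "", indents go away
        if line == "" and out and out[-1] == "":
            continue  # skip the second and later blank lines in a run
        out.append(line)
    return "\n".join(out).strip("\n") + "\n"


def to_ascii(text, placeholder="?"):
    """Convert to ASCII without stopping at characters that have no equivalent."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in PUNCT:
            out.append(PUNCT[ch])
        else:
            ascii_part = ""
            # Only strip accents from letters; decomposing math symbols can
            # silently change their meaning (a "not equal" sign would lose
            # its slash and become a plain "=").
            if unicodedata.category(ch).startswith("L"):
                decomposed = unicodedata.normalize("NFKD", ch)
                ascii_part = "".join(c for c in decomposed if ord(c) < 128)
            if ascii_part:
                out.append(ascii_part)
            else:
                # Fall back to the character's Unicode name in brackets so a
                # screen reader has something to read.
                name = unicodedata.name(ch, None)
                out.append("[" + name.lower() + "]" if name else placeholder)
    return "".join(out)
```

Running the pdftotext output through to_ascii(clean_whitespace(text))
would leave accented letters as their base letters and turn a symbol
like the less-than-or-equal sign into its spelled-out Unicode name,
rather than cutting the text off at that point.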

