Re: [orca-list] PDF readers....yet again?

From: Jeffery Mewtamer <mewtamer gmail com>
To: orca-list gnome org
Subject: Re: [orca-list] PDF readers....yet again?
Date: Fri, 16 Apr 2021 19:49:27 +0000

While on the subject of extracting data from PDFs, anyone have any
tips for cleaning up the output of pdftotext?

Things like:

-Stripping out page headers/footers that are useful for finding
specific information in printed books but just clutter up a text file
and add interruptions where content bridges pages.
-Removing excessive whitespace... making lines that only contain
whitespace characters blank, collapsing multiple blank lines into one,
removing hanging indents, etc.
-Autowrapping paragraphs and lists (Nano can do this, but it assumes
everything between one blank line and the next is to be treated as a
single paragraph... not good if there's only a single newline between
paragraphs, or for lists/tables with blank lines only at their
beginning and end).
-Ensuring tables survive the conversion.
-Converting mathematical expressions, especially ones using symbols
with no ASCII-friendly equivalent (I have a script that uses iconv for
converting UTF-8 to ASCII since my console screen reader doesn't
handle Unicode well, but iconv will just quit mid-sentence if it
hits a character it has no ASCII equivalent for).
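
To make a few of these concrete, here is a rough Python sketch of the
header/footer, whitespace, and ASCII steps. It is only an illustration
under some assumptions: the function names are made up, pages are
assumed to be separated by form feed characters (which is what
pdftotext emits between pages), a header or footer is assumed to be the
first or last non-blank line of a page, and the small punctuation table
is just a starting point. Rewrapping, hanging indents, and tables are
not handled here.

```python
import re
import unicodedata
from collections import Counter

# Common typographic punctuation mapped to plain ASCII (illustrative, not complete).
PUNCT = {
    "\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "--", "\u2026": "...", "\u00a0": " ",
}


def _edge_key(line):
    # Mask digits so "Page 12" and "Page 13" compare as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, min_share=0.5):
    """Drop first/last non-blank lines that recur on enough pages."""
    counts = Counter()
    for page in pages:
        lines = [l for l in page.splitlines() if l.strip()]
        if lines:
            counts.update({_edge_key(lines[0]), _edge_key(lines[-1])})
    threshold = max(2, int(len(pages) * min_share))
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        lines = page.splitlines()
        idx = [i for i, l in enumerate(lines) if l.strip()]
        drop = set()
        if idx and _edge_key(lines[idx[0]]) in repeated:
            drop.add(idx[0])
        if idx and _edge_key(lines[idx[-1]]) in repeated:
            drop.add(idx[-1])
        cleaned.append("\n".join(l for i, l in enumerate(lines) if i not in drop))
    return cleaned


def normalize_whitespace(text):
    """Blank out whitespace-only lines and collapse runs of blank lines."""
    out = []
    for line in (l.rstrip() for l in text.splitlines()):
        if line == "" and out and out[-1] == "":
            continue
        out.append(line)
    return "\n".join(out).strip("\n") + "\n"


def to_ascii(text):
    """Convert to ASCII without stopping at characters that have no equivalent."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in PUNCT:
            out.append(PUNCT[ch])
        else:
            ascii_part = ""
            # Only strip accents from letters; decomposing symbols can change
            # their meaning (NOT EQUAL TO would otherwise collapse to "=").
            if unicodedata.category(ch).startswith("L"):
                decomposed = unicodedata.normalize("NFKD", ch)
                ascii_part = decomposed.encode("ascii", "ignore").decode("ascii")
            # Fall back to the spelled-out Unicode name, or the code point.
            out.append(ascii_part or "[%s]" % unicodedata.name(ch, "U+%04X" % ord(ch)))
    return "".join(out)


def clean(text):
    """Run all three steps over raw pdftotext output."""
    pages = [p for p in text.split("\f") if p.strip()]
    return to_ascii(normalize_whitespace("\n".join(strip_repeated_edges(pages))))
```

Spelling out the Unicode name for anything without an ASCII form is
verbose, but a screen reader can at least speak it, and nothing is
silently dropped or cut off mid-sentence.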

