Re: [orca-list] pdftotext help.



Yeah, pdf files often contain nothing but a collection of page images with no text layer, which leaves pdftotext with nothing to extract. I believe government documents are typically released that way. For example, the Mueller report was like that. I extracted the images from the Mueller report, ran tesseract on them, and concatenated the text into a single file. But it sounds like ocrmypdf does all that by itself.
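
In case it helps anyone, here is a rough sketch of that extract, OCR, and concatenate sequence in Python. It is not the exact procedure used on the Mueller report. It assumes pdfimages (from poppler-utils) for the extraction step, which the message above does not name, and it assumes both pdfimages and tesseract are on the PATH. The function names, the "page" file prefix, and the paths are made up for illustration.

```python
import subprocess
from pathlib import Path


def extract_command(pdf, workdir):
    """pdfimages command that writes each embedded image to workdir/page-NNN.png."""
    return ["pdfimages", "-png", str(pdf), str(Path(workdir) / "page")]


def tesseract_command(image):
    """tesseract command for one image; tesseract adds .txt to the output base itself."""
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix(""))]


def concatenate(text_files, out_path):
    """Join the per-image text files, in the order given, into one file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for name in text_files:
            out.write(Path(name).read_text(encoding="utf-8"))


def ocr_pdf(pdf, workdir, out_path):
    """Run the whole sequence. Needs pdfimages and tesseract installed."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_command(pdf, workdir), check=True)
    # Zero-padded names sort in page order; this is only an assumption
    # that holds while the image count stays within the padding width.
    images = sorted(Path(workdir).glob("page-*.png"))
    for image in images:
        subprocess.run(tesseract_command(image), check=True)
    concatenate([image.with_suffix(".txt") for image in images], out_path)
```

Calling ocr_pdf("report.pdf", "work", "report.txt") would leave the combined text in report.txt. Note that pdfimages dumps every embedded image, so a pdf that mixes scans with logos or masks may need some filtering before the OCR step.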


On 11/25/21 4:14 PM, orca-list gnome org wrote:
Greetings again,

I am providing a work-around solution that I just found. Rather than
use jstor's direct download link for pdf's, I am using the print
option to save a pdf to my hard drive. I suppose then that this makes
it a scan, because pdftotext could do nothing with it. Then I
installed and ran ocrmypdf on it, and practically all of the missing
content now shows!
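
For anyone who wants to script this, a minimal sketch in Python of the two steps (OCR with ocrmypdf, then pdftotext on the result) might look like the following. It assumes both tools are installed and uses only their basic "input output" invocation; the function names and the "-ocr" suffix on the intermediate file are made up for illustration.

```python
import subprocess
from pathlib import Path


def ocr_then_extract_commands(pdf):
    """Build the two commands: add a text layer, then pull the text out."""
    pdf = Path(pdf)
    ocr_pdf = pdf.with_name(pdf.stem + "-ocr.pdf")  # hypothetical naming choice
    txt = pdf.with_suffix(".txt")
    return [
        ["ocrmypdf", str(pdf), str(ocr_pdf)],
        ["pdftotext", str(ocr_pdf), str(txt)],
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Running run_all(ocr_then_extract_commands("article.pdf")) would write article-ocr.pdf and then article.txt next to the original file.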

hth for anyone else out there with a similar problem,

Hwaen Ch'uqi


On 11/25/21, Hwaen Ch'uqi <hwaenchuqi gmail com> wrote:
Greetings All,

Following up on my earlier question about searching pdf files in
firefox and other browsers, I am now using pdftotext. The results are
generally fine, but I have noticed that pdf's from a certain site that
I use quite often - namely, jstor - seem consistently to be missing
characters. It almost seems as if pdftotext assumes certain margins
that are narrower than the documents', as if a4 horizontal margins are
being assumed rather than letter size. This is just a guess. Has
anyone run across this kind of thing and come up with a solution? I
tried playing with the -x and -y flags, setting them to 0, but the
results are the same.

I realize that this isn't precisely an orca question, but I thank you
for any help.

Hwaen Ch'uqi

_______________________________________________
orca-list mailing list
orca-list gnome org
https://mail.gnome.org/mailman/listinfo/orca-list
Orca wiki: https://wiki.gnome.org/Projects/Orca
Orca documentation: https://help.gnome.org/users/orca/stable/
GNOME Universal Access guide: https://help.gnome.org/users/gnome-help/stable/a11y.html


--
###
John G. Heim, 608-263-4189, jheim math wisc edu

