PNG to PDF via eog - command line option?



Hello,

First, thanks for EOG.

I apologize if this is not the proper means to ask this question.  If not, please instruct me as to where / how to do so.
I am new at OCR and image and PDF manipulation.  However, I have spent the last day or two coming up to speed, but am by no means an expert.

My overall goal is to convert PNG files to PDF and use pdfsandwich with tesseract for OCR and combining into a final PDF.
I can accomplish this in a number of ways; however, the one that produces the best quality image along with highlighting and searchable text (hidden/embedded) requires that I use EOG (or I should say I have only accomplished the goal with EOG).  I wish to automate the process in a script.  Hence, my question:

If I open a PNG file in eog (image file saved from a webpage like GoogleBooks), and print to a file as PDF, I get excellent quality PDF that works perfectly with pdfsandwich, for example:
pdfsandwich -tesso -tesseract-config eogcreatedpdfrompng.pdf -o MyFinalOCRpdffile.pdf

I can find no means to execute a command on a command line or similar to obtain the same results with eog, mogrify, or convert.

I have tried the following, which all create a PDF, but have issues with use in pdfsandwich:
mogrify -format pdf mypngfile.png
convert mypngfile.png myfilefrompng.pdf

The error from pdfsandwich is:

libpng error: No IDATs written into file
Error: /VMerror in --showpage--
VM status: 3 852294 2165352
Current allocation mode is local
Last OS error: 2
GPL Ghostscript 9.05: Unrecoverable error, exit code 1
convert: Postscript delegate failed `convertreaderpng2pdf.pdf':  @ error/pdf.c/ReadPDFImage/663.
convert: missing an image filename `png:/tmp/pdfsandwich5e9c99.png' @ error/convert.c/ConvertImageCommand/3011.
sh: 1: cannot open /tmp/pdfsandwich5e0e48.html: No such file
OCR done. Writing "PDFSANDWICHED_convertreaderpng2pdf.pdf"

The lesser quality image in the PDF that is successful is accomplished as follows:
convert -monochrome -density 300 -normalize -depth 8 reader.png reader.pgm
tesseract -psm 1 reader.pgm output.pdf hocr
convert reader.pgm reader.jpg
hocr2pdf -i reader.jpg -s -o readerFinal.pdf < output.pdf.html

I would appreciate any help that one might provide in accomplishing the goal.

Note: the OCR is not very successful for this document; any tips with regards to this would be appreciated if anyone has such knowledge.

For your convenience, here are files of interest:
1.  the original 'reader.png' file downloaded from GoogleBooks
2.  the high quality PDF generated with eog and pdfsandwich
3.  the lower quality PDF generated with convert and pdfsandwich

Perhaps I can accomplish the same with convert if I was aware of the details of the file format used to generate the PDF from eog.
Or, alternately, I would be happy to learn about a command line option for eog or some other means.  FYI - I would be elated if the solution comes by means of running a python script, as I already use python to automatically process web page information, etc.

Thank you so much.
Tim

Attachment: reader.png
Description: PNG image

Attachment: Finalpdfsandwichfile.pdf
Description: Adobe PDF document

Attachment: readerFinal.pdf
Description: Adobe PDF document



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]