[Tracker] The PDF extractor and OCR

From: Philip Van Hoof <philip codeminded be>
To: tracker-list gnome org
Subject: [Tracker] The PDF extractor and OCR
Date: Sat, 24 Feb 2018 12:33:56 +0100

Hey us, the people who made Tracker,

First time in my life I actually used the same software I worked on a
few years ago. I bought myself a Medion laptop and cheap as it is, it
of course spontaneously broke. So I had to find the invoice and other
documents for it, so that I could bring it back to the store for RMA.

Tried to search my PDFs (I'm one of those 'Yes we scan'-people, who
scans all his documents before putting them in folders or bringing them
to the accountant). Of course that didn't work, because the PDFs didn't
have OCR applied to them by my dead-tree scanner apparatus.

However, I made a little script that does that for me:

pvanhoof@lars:~/Documents$ cat /usr/local/bin/fixpdfs.sh 
for a in *pdf; do
  pdftk "$a" cat 1-endwest output "ROT-$a"
  pdfocr -i "ROT-$a" -o "OCR-$a"; rm "ROT-$a"; mv "OCR-$a" "$a"
done
pvanhoof@lars:~/Documents$

Now I was wondering: couldn't we add non-intrusive OCR to Tracker's PDF
extractor? By that I mean we could let it run OCR first and extract the
text that way, without writing the result back to the original PDF (our
extractors should not modify the files). I guess we could use
tracker-writeback (if that still works) to write the OCR into PDF files
in case the user wants that.
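
A rough sketch of what "non-intrusive" could look like with the same
tools as my script, just to illustrate the idea. The function name
ocr_text is made up, and it assumes pdftk, pdfocr and pdftotext (from
poppler-utils) are installed. The real thing would live in the
extractor, not in a shell function.

```shell
# Hypothetical helper: print the OCR'ed text of a PDF on stdout.
# The original file is only read; all intermediate files go to a
# scratch directory that is removed afterwards.
ocr_text() (
    tmp=$(mktemp -d) || exit 1
    status=0
    # Same rotation and OCR steps as fixpdfs.sh, but on temporary copies.
    pdftk "$1" cat 1-endwest output "$tmp/rot.pdf" &&
        pdfocr -i "$tmp/rot.pdf" -o "$tmp/ocr.pdf" &&
        pdftotext "$tmp/ocr.pdf" - || status=$?
    rm -rf "$tmp"
    exit "$status"
)
```

Something like "ocr_text invoice.pdf | grep -i medion" would then search
a scan without ever rewriting it.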

Given that remembering to run that damn script on my recently scanned
PDFs is probably more time consuming, over the span of one year, than
just adding it to tracker-extractor's PDF extractor, I might actually
just do this myself. If somebody wants to beat me to it or join the
fun, let me know.

Thoughts?

I think we'll

a) See if the PDF already has text embedded or not

b) Detect orientation and rotate the PDF to a temporary file, otherwise
OCR will not detect anything

c) Link with an OCR library and enrich-first and/or extract the
detected text

d) SPARQL-insert the text as nie:plainTextContent or something.
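
To make a) and d) a bit more concrete, here is a minimal shell sketch.
Both function names are made up, and it assumes pdftotext from
poppler-utils. I'm also assuming the file's resource can be found via
its nie:url and that the URI passed in is already properly encoded.
Whether a plain INSERT is enough, or an existing value has to be
replaced first, is something the real extractor would have to deal with.

```shell
# Step a): succeed if the PDF already carries a text layer.
# Scans typically yield only whitespace and form feeds here.
pdf_has_text() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Step d): read plain text on stdin and print a SPARQL update that
# attaches it to the resource whose nie:url is $1. Backslashes and
# double quotes are escaped so the text fits in a """long literal""".
sparql_for_text() {
    text=$(sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf 'INSERT { ?f nie:plainTextContent """%s""" }\n' "$text"
    printf 'WHERE { ?f nie:url "%s" }\n' "$1"
}
```

For example, "pdf_has_text scan.pdf || echo needs OCR" covers a), and
piping recognised text into sparql_for_text with the file's URI prints
an update that could be handed to Tracker.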

Kind regards,

Philip


