Hey us, the people who made Tracker,

For the first time in my life I actually used the same software I worked on a few years ago.

I bought myself a Medion laptop and, cheap as it is, it of course spontaneously broke. So I had to find the invoice and other documents for it, so that I could bring it back to the store for RMA.

I tried to search my PDFs (I'm one of those 'Yes we scan' people, who scans all his documents before putting them in folders or bringing them to the accountant). Of course that didn't work, because the PDFs didn't have OCR applied to them by my dead-tree scanner apparatus.

However, I made a little script that does that for me:

pvanhoof@lars:~/Documents$ cat /usr/local/bin/fixpdfs.sh
for a in *pdf; do
    pdftk "$a" cat 1-endwest output "ROT-$a"
    pdfocr -i "ROT-$a" -o "OCR-$a"
    rm "ROT-$a"
    mv "OCR-$a" "$a"
done
pvanhoof@lars:~/Documents$

Now I was wondering: couldn't we add non-intrusive OCR to Tracker's PDF extractor? By that I mean we could let it do an OCR pass first and extract the text that way, but not write the result to the original PDF (as our extractors should not modify the files).

I guess we could use tracker-writeback (if that still works) to write the OCR text into the PDF files in case the user wants that.

Given that not forgetting to run that damn script on my recently scanned PDFs is probably more time consuming over the span of one year than just adding it to tracker-extractor's PDF extractor, I might actually just do this myself.

If somebody wants to beat me to it or join the fun, let me know.

Thoughts? I think we'll:

a) See if the PDF already has text embedded or not
b) Detect orientation and rotate the PDF to a temporary file. Else OCR will not detect anything
c) Link with an OCR library and enrich-first and/or extract the detected text
d) SPARQL-insert the text as nie:plainTextContent or something.

Kind regards,

Philip
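
Step a) above could be sketched roughly like this in shell, in the spirit of fixpdfs.sh. This is only an illustration under assumptions: it shells out to pdftotext from poppler-utils, whereas the real extractor would go through its PDF library instead. The function names has_text and needs_ocr and the 20-character threshold are made up for the example.

```shell
#!/bin/sh
# has_text [MIN_CHARS]: read text on stdin and succeed when it contains at
# least MIN_CHARS non-whitespace characters (default 20, an arbitrary guess).
has_text() {
    min_chars="${1:-20}"
    # Strip all whitespace (including the form feeds pdftotext emits between
    # pages) and count what is left. $(( )) normalises wc's padded output.
    count=$(( $(tr -d '[:space:]' | wc -c) ))
    [ "$count" -ge "$min_chars" ]
}

# needs_ocr FILE: succeed when FILE has no usable text layer.
# Assumes pdftotext (poppler-utils) is installed; "-" sends text to stdout.
needs_ocr() {
    ! pdftotext "$1" - 2>/dev/null | has_text 20
}
```

With something like this, the loop in fixpdfs.sh could skip files that already carry text, for example `needs_ocr "$a" || continue` as its first statement, so that already-processed PDFs are not rotated and OCR'ed a second time.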