Re: [Tracker] The PDF extractor and OCR



Here you go:

https://github.com/pvanhoof/tracker-gnome/tree/wip/pvanhoof/ocr-pdf-support

Note that this doesn't yet do automatic rotation. Also, instead of running pdftoppm I think I could use Poppler's API to convert a page into a temporary PPM file.
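
For illustration, the pdftoppm route could look roughly like the sketch below. It is not taken from the branch: the helper names (pdftoppm_command, render_page), the 300 DPI default and the output prefix are assumptions made up for this example. It only relies on pdftoppm's -f, -l, -r and -singlefile options.

```python
import subprocess


def pdftoppm_command(pdf_path, page, out_prefix, dpi=300):
    """Build the argv that renders one page of pdf_path to out_prefix.ppm.

    Hypothetical helper for this sketch; the 300 DPI default is a guess at
    a resolution that is usually good enough for OCR.
    """
    return [
        "pdftoppm",
        "-f", str(page),   # first page to render
        "-l", str(page),   # last page to render (same page, so just one)
        "-r", str(dpi),    # resolution in DPI
        "-singlefile",     # write out_prefix.ppm without a page-number suffix
        pdf_path,
        out_prefix,
    ]


def render_page(pdf_path, page, out_prefix, dpi=300):
    """Run pdftoppm and return the path of the PPM file it should have written."""
    subprocess.run(pdftoppm_command(pdf_path, page, out_prefix, dpi), check=True)
    return out_prefix + ".ppm"
```

Rendering through Poppler's API instead would avoid spawning a process per page, at the cost of writing out the rendered surface ourselves.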

Note, Carlos: I tried pushing this to a branch on git.gnome.org, but apparently that fails nowadays *.

Kind regards,

Philip


* Attempt to push to git.gnome.org:

pvanhoof@lars:~/repos/gnome/tracker-miners$ git push origin wip/pvanhoof/ocr-pdf-support:wip/pvanhoof/ocr-pdf-support
Counting objects: 6, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 2.02 KiB | 0 bytes/s, done.
Total 6 (delta 5), reused 0 (delta 0)
remote: hooks/pre-receive: line 125: syntax error in conditional expression: unexpected token `;'
remote: hooks/pre-receive: line 125: syntax error near `;'
remote: hooks/pre-receive: line 125: `    if [[ $basedir = '/var/opt/gitlab/git-data/repositories/GNOME' || $basedir = '/git' || $basedir = '/var/opt/gitlab/git-data/repositories/Infrastructure']]; then'
To ssh://git.gnome.org/git/tracker-miners
 ! [remote rejected]     wip/pvanhoof/ocr-pdf-support -> wip/pvanhoof/ocr-pdf-support (pre-receive hook declined)
error: failed to push some refs to 'ssh://pvanhoof@git.gnome.org/git/tracker-miners'
pvanhoof@lars:~/repos/gnome/tracker-miners$ git remote add github git@github.com:pvanhoof/tracker-gnome.git
pvanhoof@lars:~/repos/gnome/tracker-miners$ git push github wip/pvanhoof/ocr-pdf-support:wip/pvanhoof/ocr-pdf-support
Counting objects: 83907, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (17095/17095), done.
Writing objects: 100% (83907/83907), 32.94 MiB | 2.50 MiB/s, done.
Total 83907 (delta 66373), reused 83894 (delta 66363)
remote: Resolving deltas: 100% (66373/66373), done.
To github.com:pvanhoof/tracker-gnome.git
 * [new branch]          wip/pvanhoof/ocr-pdf-support -> wip/pvanhoof/ocr-pdf-support
pvanhoof@lars:~/repos/gnome/tracker-miners$ git push github master:master
Total 0 (delta 0), reused 0 (delta 0)
To github.com:pvanhoof/tracker-gnome.git
 * [new branch]          master -> master
pvanhoof@lars:~/repos/gnome/tracker-miners$



On Sat, 2018-02-24 at 12:33 +0100, Philip Van Hoof wrote:
Hey us, the people who made Tracker,

First time in my life I actually used the same software I worked on a
few years ago. I bought myself a Medion laptop and cheap as it is, it
of course spontaneously broke. So I had to find the invoice and other
documents for it, so that I could bring it back to the store for RMA.

Tried to search my PDFs (I'm one of those 'Yes we scan'-people, who
scans all his documents before putting them in folders or bringing them
to the accountant). Of course that didn't work, because my dead-tree
scanner apparatus doesn't apply OCR to the PDFs it produces.

However, I made a little script that does that for me:

pvanhoof@lars:~/Documents$ cat /usr/local/bin/fixpdfs.sh
# Rotate all pages to west, OCR the rotated copy, then replace the original.
for a in *pdf; do
  pdftk "$a" cat 1-endwest output "ROT-$a"
  pdfocr -i "ROT-$a" -o "OCR-$a"
  rm "ROT-$a"; mv "OCR-$a" "$a"
done
pvanhoof@lars:~/Documents$

Now I was wondering: couldn't we add non-intrusive OCR to Tracker's PDF
extractor? By that I mean we could let it run OCR first and extract the
text that way, without writing it to the original PDF (as our extractors
should not modify the files). I guess we could use tracker-writeback
(if that still works) to write the OCR into PDF files in case the user
wants that.

Given that remembering to run that damn script on my recently scanned
PDFs probably costs more time over the span of one year than just
adding it to tracker-extractor's PDF extractor, I might actually just do
this myself. If somebody wants to beat me to it or join the fun, let me
know.

Thoughts?

I think we'll

a) See if the PDF already has text embedded or not

b) Detect orientation and rotate the PDF to a temporary file. Else OCR
will not detect anything

c) Link with an OCR library and enrich-first and/or extract the
detected text

d) SPARQL-insert the text as nie:plainTextContent or something.
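
As a rough illustration of steps a), c) and d), here is a minimal Python sketch. It is an assumption-laden outline, not the extractor's code: every function name, the min_chars threshold and the shape of the SPARQL update are made up for this example, and it shells out to the pdftotext and tesseract command-line tools rather than linking a library. Step b) (orientation detection) is left out.

```python
import subprocess


def extract_embedded_text(pdf_path):
    """Step a: return whatever text layer the PDF already has (via pdftotext)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],   # "-" sends the text to stdout
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def needs_ocr(embedded_text, min_chars=20):
    """Treat a PDF as scanned when its text layer is (almost) empty.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    return len("".join(embedded_text.split())) < min_chars


def ocr_image(image_path):
    """Step c: OCR one page image with the tesseract CLI, text on stdout."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def escape_sparql_literal(text):
    """Escape a string for use inside a double-quoted SPARQL literal."""
    # Backslash must be handled first so later escapes are not doubled.
    replacements = [("\\", "\\\\"), ('"', '\\"'),
                    ("\n", "\\n"), ("\r", "\\r"), ("\t", "\\t")]
    for old, new in replacements:
        text = text.replace(old, new)
    return text


def plain_text_insert(file_urn, text):
    """Step d: build a SPARQL update attaching the text to the file resource.

    Assumes the store already knows the nie: prefix.
    """
    return 'INSERT { <%s> nie:plainTextContent "%s" }' % (
        file_urn, escape_sparql_literal(text))
```

The idea is that only PDFs for which needs_ocr() returns True would go through page rendering and ocr_image(); everything else keeps the existing text extraction path.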

Kind regards,

Philip


