pvanhoof@lars:~/repos/gnome/tracker-miners$ git push origin wip/pvanhoof/ocr-pdf-support:wip/pvanhoof/ocr-pdf-support
Counting objects: 6, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 2.02 KiB | 0 bytes/s, done.
Total 6 (delta 5), reused 0 (delta 0)
remote: hooks/pre-receive: line 125: syntax error in conditional expression: unexpected token `;'
remote: hooks/pre-receive: line 125: syntax error near `;'
remote: hooks/pre-receive: line 125: ` if [[ $basedir = '/var/opt/gitlab/git-data/repositories/GNOME' || $basedir = '/git' || $basedir = '/var/opt/gitlab/git-data/repositories/Infrastructure']]; then'
To ssh://git.gnome.org/git/tracker-miners
! [remote rejected] wip/pvanhoof/ocr-pdf-support -> wip/pvanhoof/ocr-pdf-support (pre-receive hook declined)
error: failed to push some refs to 'ssh://pvanhoof@git.gnome.org/git/tracker-miners'
pvanhoof@lars:~/repos/gnome/tracker-miners$ git remote add github git@github.com:pvanhoof/tracker-gnome.git
pvanhoof@lars:~/repos/gnome/tracker-miners$ git push github wip/pvanhoof/ocr-pdf-support:wip/pvanhoof/ocr-pdf-support
Counting objects: 83907, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (17095/17095), done.
Writing objects: 100% (83907/83907), 32.94 MiB | 2.50 MiB/s, done.
Total 83907 (delta 66373), reused 83894 (delta 66363)
remote: Resolving deltas: 100% (66373/66373), done.
To github.com:pvanhoof/tracker-gnome.git
* [new branch] wip/pvanhoof/ocr-pdf-support -> wip/pvanhoof/ocr-pdf-support
pvanhoof@lars:~/repos/gnome/tracker-miners$ git push github master:master
Total 0 (delta 0), reused 0 (delta 0)
To github.com:pvanhoof/tracker-gnome.git
* [new branch] master -> master
pvanhoof@lars:~/repos/gnome/tracker-miners$
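
The rejection above comes from the server-side hook, not from the pushed commits: on the quoted line 125 the closing "]]" is glued to the last quoted string, so bash reads it as part of that word and never sees the end of the test. A minimal sketch of the corrected conditional follows. The function name is_managed_basedir is hypothetical, and only the three paths come from the error message; the rest of the hook is unknown.

```shell
#!/bin/bash
# Hypothetical wrapper around the test quoted in the error message.
# Only the three paths are taken from the log; the name is made up.
is_managed_basedir() {
    local basedir=$1
    # Note the space before the closing "]]". Without it, bash parses
    # ".../Infrastructure']]" as a single word and then trips over ";".
    if [[ $basedir = '/var/opt/gitlab/git-data/repositories/GNOME' ||
          $basedir = '/git' ||
          $basedir = '/var/opt/gitlab/git-data/repositories/Infrastructure' ]]; then
        return 0
    fi
    return 1
}
```

Calling is_managed_basedir /git succeeds, while any other directory makes it return a non-zero status.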
Hey us, the people who made Tracker,

First time in my life I actually used the same software I worked on a
few years ago. I bought myself a Medion laptop and, cheap as it is, it
of course spontaneously broke. So I had to find the invoice and other
documents for it, so that I could bring it back to the store for RMA.

I tried to search my PDFs (I'm one of those 'Yes we scan' people, who
scans all his documents before putting them in folders or bringing them
to the accountant). Of course that didn't work, because the PDFs didn't
have OCR applied to them by my dead-tree scanner apparatus.

However, I made a little script that does that for me:

pvanhoof@lars:~/Documents$ cat /usr/local/bin/fixpdfs.sh
for a in *pdf; do pdftk $a cat 1-endwest output ROT-$a; pdfocr -i ROT-$a -o OCR-$a; rm ROT-$a; mv OCR-$a $a; done
pvanhoof@lars:~/Documents$

Now I was wondering: couldn't we add non-intrusive OCR to Tracker's PDF
extractor? By that I mean we could let it do an OCR pass first and
extract the text that way, but not write the result to the original PDF
(as our extractors should not modify the files). I guess we could use
tracker-writeback (if that still works) to write the OCR text into the
PDF files in case the user wants that.

Given that remembering to run that damn script on my recently scanned
PDFs is probably more time consuming over the span of one year than
just adding it to tracker-extractor's PDF extractor, I might actually
just do this myself. If somebody wants to beat me to it or join the
fun, let me know.

Thoughts?

I think we'll:

a) See if the PDF already has text embedded or not
b) Detect orientation and rotate the PDF to a temporary file. Else OCR
   will not detect anything
c) Link with an OCR library and enrich-first and/or extract the
   detected text
d) SPARQL-insert the text as nie:plainTextContent or something.

Kind regards,

Philip

_______________________________________________
tracker-list mailing list
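
Steps a) to d) above could be prototyped on the command line before touching the extractor. This is only a sketch under assumptions, not how tracker-extract does or will do it: it assumes pdftotext and pdftoppm (poppler-utils) and tesseract are installed, it swaps the "link with an OCR library" step for a call to the tesseract binary, and all function names are hypothetical. It also assumes that tesseract's "--psm 1" mode (page segmentation with orientation detection) is enough to cover step b). How the resulting update is submitted to the store is left out.

```shell
#!/bin/bash
# Hypothetical helpers; none of these names exist in Tracker.

# a) Does the PDF already carry a text layer? pdftotext writes it to stdout ("-").
pdf_has_text() {
    local text
    text=$(pdftotext "$1" - 2>/dev/null) || return 1
    # Whitespace-only output (including page-separating form feeds) counts as "no text".
    [[ $text =~ [^[:space:]] ]]
}

# b) + c) OCR the pages from a temporary directory; the original PDF is never modified.
ocr_pdf_text() {
    local pdf=$1 tmpdir page
    tmpdir=$(mktemp -d) || return 1
    pdftoppm -r 300 -png "$pdf" "$tmpdir/page"
    for page in "$tmpdir"/page*.png; do
        # Assumption: "--psm 1" handles rotated scans (needs the osd data file).
        tesseract "$page" stdout --psm 1 2>/dev/null
    done
    rm -rf "$tmpdir"
}

# Escape backslashes, double quotes and newlines for a SPARQL string literal.
sparql_escape() {
    local s=$1
    s=${s//\\/\\\\}
    s=${s//\"/\\\"}
    s=${s//$'\n'/\\n}
    printf '%s' "$s"
}

# d) Build the update; the property is the one suggested above and may need adjusting.
build_text_update() {
    local uri=$1 text=$2
    printf 'INSERT { <%s> nie:plainTextContent "%s" }' "$uri" "$(sparql_escape "$text")"
}
```

A caller would run pdf_has_text first and only fall back to ocr_pdf_text when it fails, then pass the recognized text to build_text_update together with the file's URI.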