FYI: Some PDF files fail to index completely due to invalid UTF-8 generated by pdftotext on Fedora Core 6 (FC6)



FYI:
 
I had problems finding some PDF files with beagle-query.
I think the problem is that pdftotext sometimes returns invalid UTF-8 data, probably in documents containing the Danish letters æøåÆØÅ.
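
One way to test this suspicion is to pass the extracted text through iconv, converting from UTF-8 to UTF-8, which should fail on an invalid byte sequence. This is only a rough sketch: the helper name is_valid_utf8 is made up for illustration, and it assumes an iconv (such as the glibc one) that exits non-zero on illegal input.

```shell
# Hypothetical helper (name is an assumption): succeeds only if the data
# on stdin is valid UTF-8. Relies on iconv exiting non-zero when it meets
# an illegal input sequence, as glibc iconv does.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Possible use, assuming pdftotext is installed and $FILE is the PDF to test:
#   pdftotext -q -nopgbrk -enc UTF-8 "$FILE" - | is_valid_utf8 \
#       || echo "invalid UTF-8 from: $FILE"
```

A file that triggers the message would be a candidate for the problem described here.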
 
Wrapping pdftotext as below seems to work:
 
    /usr/bin/pdftotext -q -nopgbrk -enc Latin1 "$FILE" - | iconv -t UTF-8 -f iso8859-1
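
As a sketch, the same pipeline could be packaged as shell functions. The function names are assumptions for illustration, and how such a wrapper gets hooked into Beagle is not covered here. It also assumes the documents only contain characters that exist in Latin-1; anything outside that set would probably be lost in the first step.

```shell
# Hypothetical function names; only the pdftotext/iconv pipeline itself
# comes from the command above.

# Re-encode ISO-8859-1 (Latin-1) text on stdin as UTF-8 on stdout.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

# Extract text from the PDF given as $1, asking pdftotext for Latin-1
# output and converting that to UTF-8 afterwards.
pdf_to_utf8_text() {
    pdftotext -q -nopgbrk -enc Latin1 "$1" - | latin1_to_utf8
}

# Possible use:
#   pdf_to_utf8_text "$FILE" > out.txt
```

Since every byte is a valid Latin-1 character, the iconv step should always yield well-formed UTF-8, which is presumably why the workaround helps.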
 
/knr

