Re: FYI: Some pdf files fail to index 100% due to invalid utf8 generated by pdftotext on fedora core 6 (FC6)
- From: "D Bera" <dbera web gmail com>
- To: "Karsten Rasmussen" <frommetoyou comxnet dk>
- Cc: dashboard-hackers gnome org
- Subject: Re: FYI: Some pdf files fail to index 100% due to invalid utf8 generated by pdftotext on fedora core 6 (FC6)
- Date: Mon, 24 Mar 2008 12:44:18 -0400
> I had problems finding some pdf files with beagle-query.
> I think the problem is that pdftotext sometimes returns invalid utf8 data,
> probably in some documents with the Danish letters æøåÆØÅ.
>
> Wrapping pdftotext as below seems to work:
>
> /usr/bin/pdftotext -q -nopgbrk -enc Latin1 "$FILE" - | iconv -t UTF-8 -f iso8859-1
The -enc option is supposed to control the encoding of the text output,
and Beagle calls pdftotext with -enc utf8. If running with -enc Latin1
and then passing the result through iconv yields valid utf8 text, then
this is definitely a bug in pdftotext: pdftotext -enc utf8 should have
produced correct utf8 text itself.

You might want to search the xpdf bugzilla to see whether a related bug
is already open.
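
To confirm whether a given file triggers the problem, and to apply the
Latin1 workaround only when it is needed, something like the following
might help. It is only a rough sketch, not what Beagle does: it assumes
Python 3.7+ and pdftotext on the PATH, the helper names are made up for
illustration, and the "UTF-8" encoding name passed to -enc is an
assumption that may need adjusting for your pdftotext build. "Latin1"
and the other flags are taken from the wrapper quoted above.

```python
import subprocess


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8 (the job iconv does in the wrapper)."""
    return data.decode("latin-1").encode("utf-8")


def run_pdftotext(pdf_path: str, encoding: str) -> bytes:
    """Run pdftotext with the flags from the quoted wrapper; text goes to stdout."""
    cmd = ["pdftotext", "-q", "-nopgbrk", "-enc", encoding, pdf_path, "-"]
    return subprocess.run(cmd, capture_output=True, check=True).stdout


def extract_text(pdf_path: str) -> bytes:
    """Hypothetical helper: prefer direct UTF-8 output, fall back to Latin1."""
    out = run_pdftotext(pdf_path, "UTF-8")  # assumed encoding name
    if is_valid_utf8(out):
        return out
    # The direct output was not valid UTF-8, so use the workaround path.
    return latin1_to_utf8(run_pdftotext(pdf_path, "Latin1"))
```

Note that the fallback can only represent characters that exist in
Latin-1, so it is a stopgap for affected files rather than a fix.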
Thanks,
- dBera
--
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user