Re: UTF-8 problem (XLS)



On Mon, 2007-08-06 at 02:19 -0300, John Coppens wrote:
Hello people.

Using version 1.6.3 of gnumeric, I tried to read an xls file and save it
as csv. There was a problem with the resulting csv file: iconv refused
to convert it into another encoding. I'm not sure where the problem
lies.

The original .xls had the following string in it:

0068 0069 0070 [201A] 0072 ...

I've marked the 201A code, apparently a valid utf-16 code (according to
the xls specs).

Yes, this is U+201A (SINGLE LOW-9 QUOTATION MARK). Someone might have
used it (wrongly) instead of a comma, or it might be the usual style of
quotation mark for some non-English text. If there's just one such
character and you're sure it's a mistake (e.g. it should obviously be a
comma), you can fix it in the spreadsheet and ignore the rest of my
post.

Gnumeric (or ssconvert) saved this in the csv as:

68 69 70 E2 80 9A 72 ...

Again 201A, and it seems to be the shortest UTF-8 sequence that can
represent it. But iconv -f utf8 -t iso-8859-1 chokes on the sequence and
aborts with:

illegal input sequence at position xxxx

I know -c can make iconv skip the error, but that doesn't seem elegant.
Can anyone indicate where to look for a solution?

This is not a Gnumeric problem.

The ISO 8859-1 character set does not include U+201A, so iconv is
objecting because this transformation loses information. If you use the
iconv -c switch this quotation mark will just vanish from the output.
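
The same behaviour can be reproduced outside iconv. The following is a
minimal Python sketch, not anything Gnumeric or iconv runs internally;
it uses only the bytes shown in the post (the excerpt there is
truncated, so nothing past 72 is assumed), and the variable names are
made up for illustration.

```python
# Bytes as they appear in the CSV written by Gnumeric (from the post).
csv_bytes = bytes.fromhex("68 69 70 E2 80 9A 72")

# The sequence is valid UTF-8: E2 80 9A decodes to U+201A.
text = csv_bytes.decode("utf-8")

# ISO 8859-1 has no slot for U+201A, so a strict conversion fails,
# which is what iconv reports as an illegal input sequence.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Roughly what "iconv -c" does: silently drop what cannot be converted.
dropped = text.encode("iso-8859-1", errors="ignore")
```

Here `latin1_ok` ends up False and `dropped` is b"hipr": the quotation
mark is gone without any trace in the output.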

There is no way to do what you're asking: U+201A simply cannot be
written in ISO-8859-1. Either choose a different target encoding or
re-think the whole plan.
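
As a rough illustration of those two options, here is a short Python
sketch. Both choices are assumptions, not something the original data
dictates: Windows-1252 (cp1252) is picked only as one example of an
8-bit encoding that does contain U+201A, and the comma substitution is
only right if the character really was a mistyped comma. The
REPLACEMENTS table and to_latin1 helper are hypothetical names.

```python
text = "hip\u201ar"

# Option 1: a different target encoding that includes U+201A.
# In Windows-1252 the character lives at byte 0x82.
as_cp1252 = text.encode("cp1252")

# Option 2: re-think the plan and substitute the character before
# converting. The mapping below is an assumption about the data.
REPLACEMENTS = {"\u201a": ","}

def to_latin1(s, replacements=REPLACEMENTS):
    """Apply explicit substitutions, then convert strictly to ISO 8859-1."""
    for src, dst in replacements.items():
        s = s.replace(src, dst)
    # Still strict: any other unmappable character raises an error
    # instead of vanishing silently.
    return s.encode("iso-8859-1")

as_latin1 = to_latin1(text)
```

With the sample string, as_cp1252 is b"hip\x82r" and as_latin1 is
b"hip,r". Note that in a CSV file an extra comma inside a field changes
the column layout unless the field is quoted.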

Nick.



