Re: [xml] utf-8 encoding and xmlSAXParseMemory



On Tue, May 02, 2006 at 07:15:07PM +0200, A. Pagaltzis wrote:
* Olivier Sirven <osirven elma fr> [2006-05-02 18:35]:
If you have a solution for correcting every invalid character
into a valid one without losing information I would be really
happy to read it :)

Well, not in the general case; the computer is not a mind reader.
But depending on the assumptions you can make, you can do
something like what I wrote about here:

    Repairing broken documents that mix UTF-8 and ISO-8859-1
    http://plasmasturm.org/log/416/

  The problem is: how do you know it's ISO-8859-1 and not another variant?
You can't guarantee that you won't generate false positives (i.e. corrupt
data), which is why the XML Working Group declared that this had to be a
fatal error. The only sane approach (especially in these days of liability
for software) is to force the error so that the input gets fixed, unless
you have some information which tells you what the encoding really is, in
which case you can still preprocess the input before parsing.
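
To make the "preprocess only when you know the encoding" point concrete,
here is a minimal sketch in Python (not libxml2 code). The function name
to_utf8 and its known_encoding parameter are hypothetical, made up for
this illustration. It passes valid UTF-8 through untouched, refuses to
guess when the bytes are invalid, and transcodes only if the caller
states the real encoding.

```python
from typing import Optional


def to_utf8(raw: bytes, known_encoding: Optional[str] = None) -> bytes:
    """Return raw as UTF-8 bytes, or raise UnicodeDecodeError.

    known_encoding is out-of-band information supplied by the caller
    (hypothetical parameter); it is never guessed from the bytes.
    """
    try:
        # Strict decoding rejects any byte sequence that is not valid UTF-8.
        raw.decode("utf-8", errors="strict")
        return raw
    except UnicodeDecodeError:
        if known_encoding is None:
            # No reliable information: keep the error fatal rather than
            # risk silently corrupting the data.
            raise
    # The caller vouches for the encoding, so transcoding is safe.
    return raw.decode(known_encoding).encode("utf-8")


# Why guessing is unsafe: the same byte is a different character in two
# closely related encodings.
currency_sign = b"\xa4".decode("iso-8859-1")   # U+00A4 CURRENCY SIGN
euro_sign = b"\xa4".decode("iso-8859-15")      # U+20AC EURO SIGN
```

The result of to_utf8 could then be handed to the parser as ordinary
UTF-8. Without known_encoding the caller still gets the error, which
matches the fatal-error behaviour described above.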

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


