Re: [xml] SAX API returns wrong length for characters containing non-ASCII



On Mon, Jun 24, 2013 at 11:40:41PM +0200, Ludwig Weiss wrote:
Hello,

I'm trying to parse an XML document with the SAX API. The document
contains some German umlauts. A short example:
<?xml version="1.0" encoding="UTF-8"?>
<Mediathek>
<X><n>hello from Köln</n><g>http://www.koeln.de</g></X>
<X><n>öhello from Köln</n><g>http://www.koeln.de</g></X>
</Mediathek>

The callback to my charactersSAXFunc tells me that the string inside
<n>...</n> on the first line has length 12, so the string I save for
later use is just "hello from K".

For the second line, however, it returns the correct length of 18, so I
get the complete string. The difference is that it starts with a
non-ASCII character. The same happens, by the way, with French letters.

Possibly I forgot to tell something to the parser?

Thanks for your great effort :)

  I think in the first case you should get two consecutive characters
callbacks, not one; make sure you don't miss events from the parser.

thinkpad:~/XML -> xmllint --sax --debug tst.xml
SAX.setDocumentLocator()
SAX.startDocument()
SAX.startElementNs(Mediathek, NULL, NULL, 0, 0, 0)
SAX.characters(
, 1)
SAX.startElementNs(X, NULL, NULL, 0, 0, 0)
SAX.startElementNs(n, NULL, NULL, 0, 0, 0)
SAX.characters(hello from K, 12)
SAX.characters(öln, 4)
SAX.endElementNs(n, NULL, NULL)

...
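
  As the trace shows, the text of one element may arrive in several
characters callbacks, and each length is a byte count ("öln" is reported
as 4 because "ö" takes two bytes in UTF-8). A handler therefore has to
append each chunk rather than overwrite the previous one. A minimal
sketch of that, not tied to your code: TextBuf, on_characters and
textbuf_reset are made-up names, and xmlChar is typedef'ed locally only
so the snippet compiles without the libxml2 headers. It also assumes the
chunk is not NUL-terminated, so it always copies exactly len bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for libxml2's xmlChar so the sketch compiles on its own;
 * in a real program include <libxml/parser.h> and drop this typedef. */
typedef unsigned char xmlChar;

/* Hypothetical user data: growable buffer for the text of one element. */
typedef struct {
    char  *data;  /* NUL-terminated accumulated text, or NULL if empty */
    size_t len;   /* bytes stored, not counting the terminator */
} TextBuf;

/* Same shape as a charactersSAXFunc: append the chunk to the buffer. */
static void on_characters(void *ctx, const xmlChar *ch, int len)
{
    TextBuf *buf = ctx;

    assert(len >= 0);
    char *grown = realloc(buf->data, buf->len + (size_t)len + 1);
    if (grown == NULL)
        return;  /* out of memory: keep what was collected so far */

    memcpy(grown + buf->len, ch, (size_t)len);  /* copy len bytes only */
    buf->len += (size_t)len;
    grown[buf->len] = '\0';
    buf->data = grown;
}

/* Call from the endElement handler once the text has been used. */
static void textbuf_reset(TextBuf *buf)
{
    free(buf->data);
    buf->data = NULL;
    buf->len = 0;
}
```

  Fed with the two chunks from the trace above, the buffer ends up
holding the full 16 bytes of "hello from Köln"; resetting it at the end
of each element keeps the text of <n> and <g> apart.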

  SAX is not a good API for developers, it is just too easy to get things
wrong. I suggest using the Reader API instead!

  http://xmlsoft.org/xmlreader.html

Daniel

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/

