[xml] Trouble with UTF-8



Greetings,

First a little background: I've been using libxml++ (and have tried a few
other C++ wrappers) but can't get any of them to work properly with UTF-8
(libxml++'s problem actually being a glib bug - #301935).  I also never really
liked all the deps that were needed for libxml++, so I am attempting
to cut out the middleman and finally learn libxml2.

I've written up a simple test case[1] to see if I can get it working
properly but am thoroughly confused.

test xml[2]:
-->
<?xml version="1.0" encoding="UTF-8"?>
<maintainers>
    <maintainer>
        <name>Diego Pettenò</name>
    </maintainer>
    <maintainer>
        <name>Bryan Østergaard</name>
    </maintainer>
</maintainers>
<--

If I run the program and call xmlUTF8Strsub(ch, 0, len) in the characters
callback (using the SAX2 interface, btw), the string is truncated just before
the non-ASCII character.  It also looks like the callback is called twice
whenever the xmlChar* contains a multi-byte UTF-8 character: the first call
gets the text before the character, and the second gets everything after the
UTF-8 character.  To see what I'm talking about, you can view the output of
the program here[3].
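
For reference, the SAX characters callback may deliver the text of one
element in several chunks, its len argument is a byte count, and the chunk is
not NUL-terminated.  A minimal sketch of appending each chunk to a buffer
follows.  It assumes plain C without the libxml2 headers, so xmlChar is
spelled as unsigned char, and text_buf, on_characters and text_buf_reset are
hypothetical names, not part of libxml2 or of parser.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer holding the text of the current element. */
typedef struct {
    char  *data;   /* NUL-terminated UTF-8 bytes, or NULL when empty */
    size_t used;   /* bytes stored, not counting the terminator */
} text_buf;

/* Same shape as a SAX2 characters callback: ctx is the user data pointer,
 * ch points at len bytes of UTF-8 that are NOT NUL-terminated. */
static void on_characters(void *ctx, const unsigned char *ch, int len)
{
    text_buf *buf = ctx;
    char *grown;

    assert(len >= 0);
    grown = realloc(buf->data, buf->used + (size_t)len + 1);
    if (grown == NULL)
        return;                       /* out of memory: keep what we have */

    memcpy(grown + buf->used, ch, (size_t)len);   /* append the raw bytes */
    buf->used += (size_t)len;
    grown[buf->used] = '\0';          /* terminate so it prints as a string */
    buf->data = grown;
}

/* Meant to be called once the element's text has been consumed,
 * for example from an endElement handler. */
static void text_buf_reset(text_buf *buf)
{
    free(buf->data);
    buf->data = NULL;
    buf->used = 0;
}
```

With this shape the complete name is only available once the element ends,
so the buffer would be read and reset in endElement rather than in the
characters callback itself.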

I was able to make a tiny bit of progress by using the length returned by
xmlGetUTF8Char(), but this only worked on the first name (since the non-ASCII
character is the last one in that name).  Am I at least heading in the right
direction here?
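
One thing that may be relevant here: xmlUTF8Strsub() takes its start and
length in UTF-8 characters, while the len passed to the characters callback
counts bytes, so the two numbers differ as soon as a multi-byte character is
present.  The sketch below only illustrates that difference.  It is plain C,
utf8_char_count is a hypothetical helper rather than a libxml2 function, and
it assumes the input is well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Count the UTF-8 characters in the first nbytes bytes of s.
 * Continuation bytes have the bit pattern 10xxxxxx, so every byte that
 * does not match that pattern starts a new character. */
static size_t utf8_char_count(const unsigned char *s, size_t nbytes)
{
    size_t i, count = 0;

    assert(s != NULL || nbytes == 0);
    for (i = 0; i < nbytes; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the second name in the test file, "Bryan Østergaard" occupies 17 bytes
but this helper counts 16 characters, because Ø is encoded as two bytes.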

How can I go about getting the final string I'm looking for?

Thanks

[1] http://butsugenjitemple.org/~ka0ttic/parser.c
    http://butsugenjitemple.org/~ka0ttic/parser.c.html
[2] http://butsugenjitemple.org/~ka0ttic/test.xml
[3] http://butsugenjitemple.org/~ka0ttic/parser-output.txt

-- 
Aaron Walker


