I imagine my problem is due to my own ignorance of how char encodings work and how libxml2 handles them, but I’m growing frustrated with my inability to figure it out so thought to beg advice from the list.

Given this small program:

/*
 * author: Lucas Brasilino <brasilino recife pe gov br>
 * copy: see Copyright for the status of this software
 * hacked up by Fred Smith to illustrate a problem I'm having.
 */
#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int
main(int argc, char **argv)
{
    xmlDocPtr doc = NULL;       /* document pointer */
    xmlNodePtr root_node = NULL, node = NULL, node1 = NULL; /* node pointers */
    xmlDtdPtr dtd = NULL;       /* DTD pointer */
    char buff[256];
    int i, j;
    xmlChar * convstr;
    char tststr[40];
    xmlNodePtr sub;

    LIBXML_TEST_VERSION;

    doc = xmlNewDoc(BAD_CAST "1.0");

    snprintf (tststr, sizeof(tststr), "Test %c Test", 0xC9);
    convstr = xmlEncodeEntitiesReentrant (doc, (xmlChar *)tststr);
    if (convstr)
    {
        printf ("tststr: %s\n", tststr);
        printf ("convstr: %s\n", convstr);
        free (convstr);
    }

    xmlFreeDoc(doc);
    xmlCleanupParser();
    xmlMemoryDump();
    return(0);
}

I get this output:

$ ./tree
tststr: Test � Test
convstr: Test &#x260;Test

hexdump reveals it as:

000000: 73 74 73 74 72 3a 20 20 54 65 73 74 20 c9 20 54  ststr:  Test . T
000010: 65 73 74 0a 63 6f 6e 76 73 74 72 3a 20 54 65 73  est.convstr: Tes
000020: 74 20 26 23 78 32 36 30 3b 54 65 73 74 0a        t &#x260;Test.

Now, I’m puzzled, because the output from xmlEncodeEntitiesReentrant() seems clearly (to me) to be wrong. First of all, it has swallowed not only the 0xC9 but also the character following it (the space). Just as bad, when the app that should be receiving this actually gets it, it is unable to reconstruct the Unicode code point that appeared in the original text (i.e., 0xC9, which represents a capital E with acute accent).

I’m sure I’m doing something wrong here, but I am unable to see it, so your advice will be appreciated.

Thanks in advance!

Fred Smith