[xml] Unicode VS XML: Need help with char encoding in XML



I imagine my problem comes from my own ignorance of how character encodings work and how libxml2 handles them, but I’m growing frustrated with my inability to figure it out, so I thought I’d ask the list for advice.

Given this small program:

/*
 * author:   Lucas Brasilino <brasilino recife pe gov br>
 * copy:     see Copyright for the status of this software
 * hacked up by Fred Smith to illustrate a problem I'm having.
 */

#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int
main(int argc, char **argv)
{
    xmlDocPtr doc = NULL;       /* document pointer */
    xmlNodePtr root_node = NULL, node = NULL, node1 = NULL;/* node pointers */
    xmlDtdPtr dtd = NULL;       /* DTD pointer */
    char buff[256];
    int i, j;
    xmlChar * convstr;
    char tststr[40];
    xmlNodePtr sub;

    LIBXML_TEST_VERSION;
    doc = xmlNewDoc(BAD_CAST "1.0");
    /* 0xC9 is a single raw byte here (capital E acute in ISO-8859-1) */
    snprintf (tststr, sizeof(tststr), "Test %c Test", 0xC9);
    convstr = xmlEncodeEntitiesReentrant (doc, (xmlChar *)tststr);
    if (convstr)
           {
           printf ("tststr:  %s\n", tststr);
           printf ("convstr: %s\n", convstr);
           xmlFree (convstr);   /* strings allocated by libxml2 are released with xmlFree() */
           }
    xmlFreeDoc(doc);
    xmlCleanupParser();
    xmlMemoryDump();
    return(0);
}

I get this output:

$ ./tree
tststr:  Test Test
convstr: Test &#x260;Test

hexdump reveals it as:

000000: 73 74 73 74 72 3a 20 20 54 65 73 74 20 c9 20 54    ststr:  Test . T
000010: 65 73 74 0a 63 6f 6e 76 73 74 72 3a 20 54 65 73    est.convstr: Tes
000020: 74 20 26 23 78 32 36 30 3b 54 65 73 74 0a          t &#x260;Test.

 

Now, I’m puzzled: the output from xmlEncodeEntitiesReentrant() seems clearly (to me) to be wrong. First, it has swallowed not only the 0xC9 but also the character following it (the space, 0x20), folding both into the single reference &#x260;. Just as bad, when the application that should be receiving this actually gets it, it cannot reconstruct the Unicode code point that appeared in the original text (i.e., 0xC9, which represents a capital E with acute accent).
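
For what it’s worth, 0x260 is exactly the value that falls out if the bytes 0xC9 0x20 are read as a two-byte UTF-8 sequence without checking that the second byte is a real continuation byte. That would fit the assumption that xmlEncodeEntitiesReentrant() expects UTF-8 input (libxml2 treats xmlChar strings as UTF-8) while the test string holds a single ISO-8859-1 byte. The sketch below is plain C with no libxml2 calls. Both helper names are hypothetical and only illustrate the arithmetic; they are not libxml2's actual implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode a 2-byte UTF-8 sequence WITHOUT verifying
 * that the second byte has the continuation form 10xxxxxx. */
unsigned int naive_decode_2byte(unsigned char lead, unsigned char next)
{
    /* 110xxxxx 10yyyyyy -> xxxxxyyyyyy */
    return ((unsigned int)(lead & 0x1F) << 6) | (unsigned int)(next & 0x3F);
}

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string to UTF-8.
 * Returns the number of bytes written, not counting the trailing NUL.
 * The caller must supply a large enough buffer (worst case 2x input + 1). */
size_t latin1_to_utf8(const unsigned char *in, unsigned char *out,
                      size_t outsize)
{
    size_t n = 0;

    assert(outsize > 0);
    for (; *in != '\0'; in++) {
        if (*in < 0x80) {
            /* ASCII bytes are identical in both encodings */
            assert(n + 1 < outsize);
            out[n++] = *in;
        } else {
            /* 0x80..0xFF become a two-byte sequence: 110000xx 10xxxxxx */
            assert(n + 2 < outsize);
            out[n++] = (unsigned char)(0xC0 | (*in >> 6));
            out[n++] = (unsigned char)(0x80 | (*in & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

With these, naive_decode_2byte(0xC9, 0x20) gives 0x260, matching the &#x260; in the output, and latin1_to_utf8() turns the 0xC9 byte into the pair 0xC3 0x89, which is the UTF-8 encoding of U+00C9. If this reading is right, converting tststr to UTF-8 before the call should help; libxml2's encoding.h declares isolat1ToUTF8(), which appears to do this conversion.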

 

I’m sure I’m doing something wrong here, but I am unable to see it, so your advice will be appreciated.

Thanks in advance!

Fred Smith


