Re: [xml] setting the default charset ?



On Fri, Jul 27, 2001, at 01:36:57 -0400, Daniel Veillard wrote:

On Fri, Jul 27, 2001 at 06:49:25PM +0200, Cyrille Chepelov wrote:
So, in the worst case, we could pass the older files through iconv() to make
sure they're UTF-8 and let libxml2 handle the result.

  Well, if you use the libxml2 framework, it will be done progressively

this was just a contingency solution ; I prefer to cooperate with my libs,
not circumvent them :-)
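
For the record, the iconv() contingency mentioned above could look roughly
like the sketch below. It is only an illustration: convert_to_utf8() is a
hypothetical helper (it exists in neither dia nor libxml2), and the output
buffer is sized on the assumption that the UTF-8 form never needs more than
four bytes per input byte.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of dia or libxml2: convert in_len bytes
   of `in`, encoded in from_charset, to UTF-8 with iconv().
   Returns a malloc()ed, NUL-terminated string that the caller must
   free(), or NULL on any failure. */
char *
convert_to_utf8(const char *from_charset, const char *in, size_t in_len)
{
    iconv_t cd;
    char *out, *out_p;
    char *in_p = (char *) in;   /* iconv() wants char **, input is not modified */
    size_t in_left = in_len;
    /* Assumption: at most 4 output bytes per input byte, plus the NUL. */
    size_t out_size = in_len * 4 + 1;
    size_t out_left = out_size - 1;

    assert(from_charset != NULL && in != NULL);

    cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t) -1)
        return NULL;            /* charset unknown to this iconv */

    out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }
    out_p = out;

    /* Convert the whole buffer, then flush any pending shift state. */
    if (iconv(cd, &in_p, &in_left, &out_p, &out_left) == (size_t) -1 ||
        iconv(cd, NULL, NULL, &out_p, &out_left) == (size_t) -1) {
        free(out);
        iconv_close(cd);
        return NULL;
    }

    *out_p = '\0';
    iconv_close(cd);
    return out;
}
```

The converted buffer could then be handed to the parser as UTF-8; an input
sequence that is invalid in from_charset simply makes the helper return NULL.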

  Libxml will never look at locales, I guarantee this !

Good ! (and yes, I definitely agree: there are so many ways of looking at
locales... the application knows better).

Something like 
    int xmlSetParserEncoding(xmlParserCtxtPtr ctxt,
                             const char *encoding);
would be nice. (I initially thought that would be what xmlSwitchEncoding()
was supposed to do, but it didn't quite work. And I'm afraid I don't really
understand what the libxml-parserinternals page says on this function).

  xmlSwitchEncoding will put an iconv filter for this encoding between your
source and the parser, more precisely an encoder from this encoding to
UTF-8

So, in theory, that should be OK ?

 in the meantime use xmlSwitchEncoding().

I tried, and it failed. Here's the code snippet I used, raw, comments
included :-) (I waited a bit after writing the comments, because I hoped to
see you in Bordeaux; your badge was there, but I don't know whether you ever
met it.)

/* int get_local_charset(const char **charset) returns TRUE if the local
charset is UTF-8, FALSE otherwise. charset is filled with the correct
character set information (usually from nl_langinfo(CODESET) but also from
what libunicode says when it's not broken). */

xmlDocPtr
xmlDiaParseFile(const char *filename) {
  /* Copied from libxml 2.3.9's xmlSAXParseFile function.
     written by Daniel Veillard w3 org, then modified for dia's purpose 
     by Cyrille Chepelov */

    xmlDocPtr ret;
    xmlParserCtxtPtr ctxt;
    char *directory = NULL;
    const char *local_charset = NULL;

    ctxt = xmlCreateFileParserCtxt(filename);
    if (ctxt == NULL) {
        return(NULL);
    }

    if ((ctxt->directory == NULL) && (directory == NULL))
        directory = xmlParserGetDirectory(filename);
    if ((ctxt->directory == NULL) && (directory != NULL))
        ctxt->directory = (char *) xmlStrdup((xmlChar *) directory);

#ifdef XML2
#if 1 /* This doesn't work. In fact, libxml seems to do just whatever it 
         pleases wrt charsets (it *seems* to do the right thing when loading 
         older 8859-1 diagrams. I really don't know whether it'll load 
         correctly non-8859-1 diagrams !). If it doesn't, I see two courses 
         of action: 
                 1) ask Daniel Veillard for help.
                 2) run a quick and dirty zcat|sed job to put the encoding
                 of the dia file on the fly into a temporary, and then 
                 load that temp. instead of the real file.

         For now I'll hope everything will happen alright   -- CC */

    if (!get_local_charset(&local_charset)) {
      /* local charset is not UTF-8. We switch at first to local encoding,
         libxml will switch back to another encoding if necessary and 
         present in the XML file. */
      xmlCharEncoding enc = xmlParseCharEncoding(local_charset);
      if (enc != XML_CHAR_ENCODING_ERROR) {
        xmlSwitchEncoding(ctxt,enc);
      } else {
        xmlSwitchEncoding(ctxt, XML_CHAR_ENCODING_8859_1);
        g_warning("local encoding %s unsupported by libxml; will use 8859-1 "
                  "as default.", local_charset);
      }
    } else {
      xmlSwitchEncoding(ctxt, XML_CHAR_ENCODING_UTF8);
    }
#endif

#else
#ifdef UNICODE_WORK_IN_PROGRESS
#error "We can't make this work without libxml2."
#endif
#endif

    xmlParseDocument(ctxt);

    if ((ctxt->wellFormed)) ret = ctxt->myDoc;
    else {
       ret = NULL;
       xmlFreeDoc(ctxt->myDoc);
       ctxt->myDoc = NULL;
    }
    xmlFreeParserCtxt(ctxt);
    
    return(ret);
}

(yes, the XML2 symbol is defined).
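
The get_local_charset() helper is only described by the comment at the top
of the snippet. A minimal sketch of it, assuming it relies on
nl_langinfo(CODESET) alone (the libunicode path is left out) and that the
application has already called setlocale(), might look like this. The
ISO-8859-1 fallback is my assumption, not something the real function is
known to do.

```c
#include <assert.h>
#include <langinfo.h>
#include <string.h>
#include <strings.h>

#ifndef TRUE
#define TRUE  1
#define FALSE 0
#endif

/* Sketch only: returns TRUE if the local charset is UTF-8, FALSE
   otherwise, and stores the charset name in *charset.
   Assumes the application already called setlocale(LC_ALL, "");
   without that call, nl_langinfo() reports the "C" locale's codeset. */
int
get_local_charset(const char **charset)
{
    const char *codeset;

    assert(charset != NULL);

    codeset = nl_langinfo(CODESET);
    if (codeset == NULL || *codeset == '\0')
        codeset = "ISO-8859-1";   /* assumed fallback, see lead-in */

    *charset = codeset;

    /* Accept both spellings seen in the wild. */
    if (strcasecmp(codeset, "UTF-8") == 0 ||
        strcasecmp(codeset, "UTF8") == 0)
        return TRUE;
    return FALSE;
}
```

The returned string belongs to the C library and may be overwritten by a
later nl_langinfo() or setlocale() call, so it should be copied if it has to
outlive the parse.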


Should this bit of code work ? If it should, then I'll commit it "as is",
and see whether I hear screams from Eastern Europe...

(well, usually nobody really cares about what's committed in dia's tree.
Only after two releases are made, bug reports begin flowing in).

        -- Cyrille

-- 
Grumpf.




