[xml] UTF8ToHtml entity handling problem



In the process of doing maintenance on the mod_virgule (advogato.org)
codebase, I've run across what I think is a minor bug in the libxml2
function, UTF8ToHtml(). I've checked the bug database and didn't see any
reports on this. I thought it would be a good idea to post some
background on how I'm using UTF8ToHtml() before filing a bug report in
case I'm misunderstanding the intention of this function. 

Mod_virgule accepts HTML form submissions encoded as UTF-8 data. This
data often includes HTML markup as well as various international
characters. The data must be processed by older mod_virgule code
that can handle only ASCII, not UTF-8 data. So the raw UTF-8 data is
passed through UTF8ToHtml() to convert it to ASCII HTML with entity
encoding of non-ASCII characters.

This works great when the input is English or most languages that use a
Latin character set. But when valid HTML markup encoded as UTF-8
contains more exotic characters, such as Han ideographs, it causes
UTF8ToHtml() to fail, returning an error code of -2. This is unexpected
since the input was valid UTF-8.

I examined the code of the UTF8ToHtml() function and discovered that it
fails with error -2 because the input contains UTF-8 characters for
which libxml2 does not know a named entity value (e.g. "&Eacute;").
Since there are tens of thousands of possible UTF-8 characters and
libxml2 only knows names for a couple of hundred, this seems to suggest
UTF8ToHtml() will fail most of the time if the input includes non-Latin
character sets.

By making a trivial change to the code in UTF8ToHtml(), I was able to
correct this behavior. When a named entity value cannot be found in the
internal libxml2 entity table, a numeric entity value (e.g. "&#20833;")
is used instead. 

Here's the original code where the problem lies:

            /*
             * Try to lookup a predefined HTML entity for it
             */

            ent = htmlEntityValueLookup(c);
            if (ent == NULL) {
                /* no chance for this in Ascii */
                *outlen = out - outstart;
                *inlen = processed - instart;
                return(-2);
            }

And here's the same piece of my revised code that seems to have fixed
the problem:

            /*
             * Try to lookup a predefined HTML entity for it
             */
 
            ent = htmlEntityValueLookup(c);
            if (ent == NULL) {
                /* no named entity: fall back to a numeric one, e.g. "#20833" */
                snprintf(nbuf, sizeof(nbuf), "#%u", c);
                cp = nbuf;
            }
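
For anyone who wants to try the numeric fallback outside of libxml2,
here is a minimal standalone sketch of the idea. It is not the libxml2
code and not my patch: utf8_to_ascii_ncr() is a hypothetical helper
name made up for illustration, it always emits numeric entities (it
has no named-entity table at all), and it only does basic UTF-8
validation.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, NOT part of libxml2: copy UTF-8 input into an
 * ASCII output buffer, writing every non-ASCII code point as a numeric
 * character reference such as "&#20833;".
 *
 * Returns the number of bytes written (excluding the trailing NUL), or
 * -1 if the input is malformed or the output buffer is too small.
 * Overlong forms and surrogates are not rejected in this sketch.
 */
static int utf8_to_ascii_ncr(const unsigned char *in, size_t inlen,
                             char *out, size_t outsize)
{
    size_t i = 0, o = 0;
    char nbuf[16];              /* large enough for "&#2097151;" plus NUL */

    while (i < inlen) {
        unsigned int c = in[i];
        int trailing, n;

        if (c < 0x80) {         /* plain ASCII: copy through unchanged */
            if (o + 1 >= outsize)
                return -1;
            out[o++] = (char) c;
            i++;
            continue;
        }

        /* Work out the sequence length from the lead byte. */
        if ((c & 0xE0) == 0xC0)      { c &= 0x1F; trailing = 1; }
        else if ((c & 0xF0) == 0xE0) { c &= 0x0F; trailing = 2; }
        else if ((c & 0xF8) == 0xF0) { c &= 0x07; trailing = 3; }
        else
            return -1;          /* stray continuation or invalid lead byte */

        if (i + 1 + (size_t) trailing > inlen)
            return -1;          /* sequence truncated at end of input */
        i++;
        while (trailing-- > 0) {
            if ((in[i] & 0xC0) != 0x80)
                return -1;      /* expected a continuation byte */
            c = (c << 6) | (in[i] & 0x3F);
            i++;
        }

        /* No named entity is looked up here: always use the numeric form. */
        n = snprintf(nbuf, sizeof(nbuf), "&#%u;", c);
        if (n < 0 || o + (size_t) n >= outsize)
            return -1;
        memcpy(out + o, nbuf, (size_t) n);
        o += (size_t) n;
    }

    if (o >= outsize)
        return -1;
    out[o] = '\0';
    return (int) o;
}
```

Called on the UTF-8 bytes for a paragraph containing one Han
ideograph (U+5161), this writes the ASCII string "<p>&#20833;</p>"
instead of giving up. Inside UTF8ToHtml() itself, the named entity
would still be used whenever the lookup succeeds, with the numeric
form only as the fallback.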


I can file a bug report and attach a full patch for this if desired.
Otherwise, maybe somebody can explain where I've gone wrong. Thanks!

-Steve




