[xml] Problems with xmlCharEncInFunc()



Hello,

I've tested my self-defined character encoding handler for HP-ROMAN8 with
the encoding API of libxml, and it works fine so far.
However, while stress testing with input sizes close to the default XML
buffer size and with unfavourable character combinations, I ran into
problems in xmlCharEncInFunc().
ROMAN-8 includes some characters that have to be transcoded to three
octets in UTF-8, so a worst-case ROMAN-8 string needs three times its
own size in UTF-8.
In 'encoding.c' we have the following code to calculate the needed size
for the output buffer:

encoding.c: 2056,2063
  toconv = in->use;
  if (toconv == 0)
    return (0);
  written = out->size - out->use;
  if (toconv * 2 >= written) {
    xmlBufferGrow(out, out->size + toconv * 2);
    written = out->size - out->use - 1;
  }

So, if twice the in-buffer size (toconv * 2) reaches or exceeds the
available out-buffer space, the out-buffer size is increased by twice the
in-buffer size.
This is always sufficient for ISO-8859-1, and in most cases it also works
for other encodings such as HP-ROMAN8. But one can construct an in-buffer
where it fails, e.g. any buffer that is larger than 1/3 and smaller than
1/2 of the default buffer size (provided that we use the default
out-buffer) and holds characters that are mapped to three UTF-8 octets.
OK, it's a constructed case, but not an impossible one.
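
To make the failing window concrete, here is a small stand-alone sketch of
the arithmetic. It does not call libxml; the helper names and the numbers
in the usage note are made up for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the shape of the quoted check: returns 1 if the out-buffer
 * would be grown for `toconv` input octets and `avail` free octets. */
static int grows_with_factor(size_t toconv, size_t avail, size_t factor)
{
    return toconv * factor >= avail;
}

/* Worst-case output size if every input octet expands to `max_octets`. */
static size_t worst_case_need(size_t toconv, size_t max_octets)
{
    return toconv * max_octets;
}

/* Returns 1 if the check would NOT grow the buffer although the
 * worst-case output does not fit into the free space. */
static int can_overflow(size_t toconv, size_t avail,
                        size_t factor, size_t max_octets)
{
    assert(factor > 0 && max_octets > 0);
    return !grows_with_factor(toconv, avail, factor)
        && worst_case_need(toconv, max_octets) > avail;
}
```

With a hypothetical 4000 free octets and 1500 input octets,
can_overflow(1500, 4000, 2, 3) is true: 3000 < 4000, so nothing grows, but
4500 octets may be needed. With a factor of 3 the same call reports no
overflow.
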
I guess a simple (maybe naive) solution would be to change this behavior
to

  toconv = in->use;
  if (toconv == 0)
    return (0);
  written = out->size - out->use;
  if (toconv * 3 >= written) {
    xmlBufferGrow(out, out->size + toconv * 3);
    written = out->size - out->use - 1;
  }

which should work for any encoding whose characters are covered by
Unicode (UCS-2), since each of them needs at most three UTF-8 octets,
but it's possibly a waste of memory?
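
As a quick sanity check of the factor 3, the following sketch computes the
UTF-8 length of a code point and scans the whole UCS-2 range. The function
names are my own, and the scan simply ignores the fact that surrogate
values are not real characters.

```c
#include <assert.h>
#include <stdint.h>

/* Number of UTF-8 octets needed for one Unicode code point. */
static int utf8_octets(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;   /* ASCII */
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;   /* rest of the UCS-2 range */
    return 4;                     /* beyond UCS-2 */
}

/* Largest UTF-8 length of any value in the UCS-2 range 0..0xFFFF. */
static int max_octets_in_ucs2(void)
{
    int max = 0;
    for (uint32_t cp = 0; cp <= 0xFFFF; cp++) {
        int n = utf8_octets(cp);
        if (n > max)
            max = n;
    }
    return max;
}
```

So a factor of 3 is enough as long as the source encoding is a single-octet
one whose characters all map into UCS-2; code points above 0xFFFF would
need four octets.
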
Maybe it's better to retry when the registered
xmlCharEncodingInputFunc returns -1 (which is the correct semantics for
lack of space, if I got it right)?
xmlCharEncInFunc() returns -1 in that case, which stands for "general
error", so I'm not sure whether we can do a reliable retry at application
level. I don't know if there's a way at all, since we would need the
number of bytes that have already been consumed.
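
Just to illustrate the retry idea, here is a rough stand-alone sketch. It
is not libxml code: conv_func is a hypothetical type that assumes the
converter updates *inlen and *outlen to the octets consumed and produced
and returns -1 on lack of space, and toy_to_utf8 is an invented converter
(not a real ROMAN-8 table) that only forces the three-octet case.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical converter type: on entry *inlen / *outlen hold the sizes
 * available, on exit the octets consumed / produced. -1 = lack of space. */
typedef int (*conv_func)(unsigned char *out, int *outlen,
                         const unsigned char *in, int *inlen);

/* Toy converter: ASCII is copied, every other octet becomes the three
 * octets E2 96 A0 (U+25A0). Stops cleanly when the output is full. */
static int toy_to_utf8(unsigned char *out, int *outlen,
                       const unsigned char *in, int *inlen)
{
    int i = 0, o = 0, ret = 0;
    while (i < *inlen) {
        int need = (in[i] < 0x80) ? 1 : 3;
        if (o + need > *outlen) { ret = -1; break; }  /* lack of space */
        if (need == 1) {
            out[o++] = in[i];
        } else {
            out[o++] = 0xE2; out[o++] = 0x96; out[o++] = 0xA0;
        }
        i++;
    }
    *inlen = i;    /* octets consumed so far */
    *outlen = o;   /* octets produced so far */
    return (ret == 0) ? o : ret;
}

/* Converts all of `in`, doubling the output buffer and resuming at the
 * consumed offset after each -1. Returns a malloc'ed buffer (caller
 * frees) with its length in *total, or NULL on any other failure. */
static unsigned char *convert_with_retry(conv_func f, const unsigned char *in,
                                         int inlen, int initial_cap, int *total)
{
    assert(f != NULL && in != NULL && total != NULL);
    int cap = (initial_cap > 0) ? initial_cap : 16;
    int used = 0, consumed = 0;
    unsigned char *buf = malloc((size_t)cap);
    if (buf == NULL)
        return NULL;

    while (consumed < inlen) {
        int avail_in = inlen - consumed;
        int avail_out = cap - used;
        int ret = f(buf + used, &avail_out, in + consumed, &avail_in);
        used += avail_out;       /* keep what was already written */
        consumed += avail_in;    /* resume after the consumed octets */
        if (ret == -1) {         /* lack of space: grow and retry */
            unsigned char *tmp = realloc(buf, (size_t)cap * 2);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
            cap *= 2;
        } else if (ret < 0) {    /* any other error: give up */
            free(buf);
            return NULL;
        }
    }
    *total = used;
    return buf;
}
```

The point of the sketch is that the retry only works because the converter
reports how much input it consumed; with a bare -1 "general error" from the
wrapper the caller has nothing to resume from.
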




Kind regards
Markus Henke



________________________Addressed by:________________________
 ORDAT GmbH & Co. KG  -  Serversystems / eCom 
 Dipl.-Inf. (FH) Markus Henke  Fon: +49 (641) 7941-0
 Rathenaustr. 1                Fax: +49 (641) 7941-132
 35394 Gießen                  mailto:markus henke ordat com
 See:                          http://www.ordat.com
_____________________________________________________________
              ...this behavior is by design...


