Re: [xml] Important: possible incompatible changes ahead for 2.9.0 !



On Mon, Aug 06, 2012 at 11:39:23PM +0200, Stefan Behnel wrote:
Hi Daniel,

  Hi Stefan,

thanks for the heads-up. I don't care all that much about the global dict
size - 10M entries should be hard enough to reach for normal use cases.
Most users only deal with a very small number of XML formats.

  Okay, the point too is that the dictionary may be used to intern small
strings and while this is unlikely to break for a single document
reusing the same dictionary over and over for many documents may lead to
problem. In any case the dictionary is only limited as part of the
parsing process if you allocate it on your own and override the parser
context one you won't be affected.

But I did run into issues with the buffer changes.

Daniel Veillard, 06.08.2012 09:00:
  The new buffer structure will be ABI compatible with the old ones,
i.e. the old code as compiled wil be able to work with the new one, as
the fields with the same values are in the same place in the new
structures. But the structure are now opaque and the few places where
the code was using it directly will need fixing.
  What I see from the usage there are for example access to xmlOutputBuffers:

  buf = xmlAllocOutputBuffer (NULL);
  ....dump stuff to the buffer...
  use data at buf->buffer->content, of size buf->buffer->use

First okay, that was allowed by the API, but such buffers were supposed
to be used for I/O and encoding conversion, in general accessing
buf->buffer->content and buf->buffer->use directly was not really the
expected way to do things. The fact that xmlOutputBuffer were not
supposed to be used that way is the reason why there is no accessors for
getting the output data, this is now fixed as of commit

  http://git.gnome.org/browse/libxml2/commit/?id=e258adecd0e19a6cfe6afa232b89aa416368820e

 So where there is such use of direct access, check the LIBXML2_NEW_BUFFER
macro and if present then
   - replace buf->buffer->content with xmlOutputBufferGetContent(buf)
   - replace buf->buffer->use with xmlOutputBufferGetSize(buf)

I tested it and found that lxml is affected by this. lxml currently takes
the xmlBuffer* from either the "conv" or "buffer" field of the output
buffer and then calls xmlBufferContent() and xmlBufferLength() to get at
the result. I take it that this isn't how it'll work in the future, because
xmlBufferLength() returns an int and buffers are supposed to be larger than
that, right?

  yes basically it's the problem, the two fields have been replaced by
  xmlBufPtr instead of xmlBufferPtr . But you should just have to
  call xmlBufContent() and xmlBufLength() conditionally to the
  availability of LIBXML2_NEW_BUFFER to compile in the new setup

However, xmlOutputBufferGetContent() only reads the "buffer" field, not the
"conv" field. How should I use the "conv" field now? Can't the new
xmlOutputBufferGetContent() do "the right thing" for me?

  Just don't use xmlOutputBufferGetContent() , use xmlBufContent() and
  xmlBufLength() on the new structure instead and keep you code as-is

Code that uses xmlBuffer directly is here:

https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L31

  Actually that looks just fine. xmlBufferPtr still exists and the
  associated routines and entry points are maintained. Calling

    c_buffer = tree.xmlBufferCreate()
    c_text = tree.xmlBufferContent(c_buffer)

is part of the API, it's not changed in any way.

https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L123

  c_buffer = tree.xmlAllocOutputBuffer(enchandler)
  ...
  tree.xmlOutputBufferFlush(c_buffer)
 
  if c_buffer.conv is not NULL:
      c_result_buffer = c_buffer.conv
  else:
      c_result_buffer = c_buffer.buffer

    I think you can still keep the code initializing c_result_buffer
 the pointer names are kept the same and you're just testing for NULL
 so that should be fine

  if encoding is _unicode:
      result = (<unsigned char*>tree.xmlBufferContent(
              c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)].decode('UTF-8')
  else:
      result = <bytes>(<unsigned char*>tree.xmlBufferContent( 
c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)]

    That will have to be changed. I don't know how you express #ifdef in pxi
but if LIBXML2_NEW_BUFFER is defined then use
      tree.xmlBufContent/tree.xmlBufLenght
if not defined keep using
      tree.xmlBufferContent/tree.xmlBufferLenght
on the pointers.

Another issue I found: xmlDumpNotationTable() still wants an xmlBuffer
instead of the xmlBuf that outbuffer.buffer returns. Is the right fix here
to include buf.h and call xmlBufBackToBuffer()?

  yeah those routines are unlikely to generate more than 2GBytes of
  output which is why using xmlBuffer is still okay. Actually I did not
  change any of the APIs, all the old APIs using xmlBuffer will still
  work as before, it's just that internally I changed
  xmlParserInputBuffer and xmlOutputBuffer to use the new xmlBuf
  internally.

https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L293

     if c_dtd.notations != NULL:
        tree.xmlDumpNotationTable(c_buffer.buffer,
                                  <tree.xmlNotationTable*>c_dtd.notations)

  that will need some fixing, right ... You can't include buf.h because
it is private ... What I would do is create a new xmlBuffer, dump to it
get the resulting string and do tree.xmlOutputBufferWrite() to append it
and then free the buffer. That sounds the simplest and most portable to
old and new versions.
 
(BTW, the reason why the serialisation code is doing so much stuff manually
is IIRC that lxml still supports a couple of libxml2 versions that lack the
newer features of the serialisation/xmlSave API. And also to avoid slight
changes to the serialised XML if it switched to native libxml2 functions
abruptly.)

  yeah, understood... libxslt does a bit of mess around this too :-)


  if in some place the xmlBufferPtr was passed independantly of the
OutputBuffer, it's possible to use xmlBufGetContent(buffer) and
xmlBufUse(buffer) to achieve the same.

I assume you meant xmlBufContent() ?

  yup, right

It seems to me that redefining xmlBufferLength and xmlBufferContent to call
the new xmlBuf functions and using a size_t (or ssize_t?) to store the
result of xmlBufLength would do the trick.

  I can't do that really it would break the API, both data structures
will coexist ... forever.

BTW, is there a reason why there's both an xmlBufLength() and an
xmlBufUse() that do the same thing? Since this is a new API that doesn't
suffer from legacy junk yet, wouldn't one be enough? (And wouldn't
xmlBufLength() be the perfect name?)

  well I wanted the conversion to be as automatic as possible and
since the field which was used was buf->use, I perfer to keep that
'alias' for simpler conversions where needed. But in theory you're
perfectly right it is pure code duplication ...


  I don't plan to make an official release with the changes before
September, so there is a bit of time to get this all cleaned up, and
possibly refine the migration stategy for the few apps affected.

There'll be a new release (3.0) of lxml quite soon, within a few weeks. It
should be doable to get this fixed up by then.

  Okay, tell me if you have problems. I didn't fully finished, as you
can see I'm still commiting fixes and improvements on that part. Let's
synchronize so that people making new build of lxml do it on top of
the new libxml2, otherwise they will have to rely on the ABI
compatibility ! I plan beginning of September, I hope it's not too late
for you :)

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]