Re: [xml] Important: possible incompatible changes ahead for 2.9.0 !



Daniel Veillard, 07.08.2012 10:16:
On Mon, Aug 06, 2012 at 11:39:23PM +0200, Stefan Behnel wrote:
thanks for the heads-up. I don't care all that much about the global dict
size - 10M entries should be hard enough to reach for normal use cases.
Most users only deal with a very small number of XML formats.

  Okay, the point too is that the dictionary may be used to intern small
strings and while this is unlikely to break for a single document
reusing the same dictionary over and over for many documents may lead to
problem.

Ah, right - I remember one user complaining once that DTD IDs were stored
there and lead to a "memory leak". He generated them on the fly, which
meant that each document had a completely distinct set of IDs. That can add
up pretty quickly.

Maybe I should add a parser option that would use a subdict instead of the
global per-thread dict.


In any case the dictionary is only limited as part of the
parsing process if you allocate it on your own and override the parser
context one you won't be affected.

Ok, then lxml won't run into this anyway.


https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L123

  c_buffer = tree.xmlAllocOutputBuffer(enchandler)
  ...
  tree.xmlOutputBufferFlush(c_buffer)
 
  if c_buffer.conv is not NULL:
      c_result_buffer = c_buffer.conv
  else:
      c_result_buffer = c_buffer.buffer

    I think you can still keep the code initializing c_result_buffer
 the pointer names are kept the same and you're just testing for NULL
 so that should be fine

  if encoding is _unicode:
      result = (<unsigned char*>tree.xmlBufferContent(
            c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)].decode('UTF-8')
  else:
      result = <bytes>(<unsigned char*>tree.xmlBufferContent( 
c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)]

    That will have to be changed. I don't know how you express #ifdef in pxi
but if LIBXML2_NEW_BUFFER is defined then use
      tree.xmlBufContent/tree.xmlBufLenght
if not defined keep using
      tree.xmlBufferContent/tree.xmlBufferLenght
on the pointers.

Ok. I can just #define the names conditionally in an external header file.


Another issue I found: xmlDumpNotationTable() still wants an xmlBuffer
instead of the xmlBuf that outbuffer.buffer returns. Is the right fix here
to include buf.h and call xmlBufBackToBuffer()?

  yeah those routines are unlikely to generate more than 2GBytes of
  output which is why using xmlBuffer is still okay. Actually I did not
  change any of the APIs, all the old APIs using xmlBuffer will still
  work as before, it's just that internally I changed
  xmlParserInputBuffer and xmlOutputBuffer to use the new xmlBuf
  internally.

https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L293

     if c_dtd.notations != NULL:
        tree.xmlDumpNotationTable(c_buffer.buffer,
                                  <tree.xmlNotationTable*>c_dtd.notations)

  that will need some fixing, right ... You can't include buf.h because
it is private ...

What I would do is create a new xmlBuffer, dump to it
get the resulting string and do tree.xmlOutputBufferWrite() to append it
and then free the buffer. That sounds the simplest and most portable to
old and new versions.

Ok. Let's assume that internal subsets tend to be small and that this is
good enough.


It seems to me that redefining xmlBufferLength and xmlBufferContent to call
the new xmlBuf functions and using a size_t (or ssize_t?) to store the
result of xmlBufLength would do the trick.

  I can't do that really it would break the API, both data structures
will coexist ... forever.

Sorry - I meant that it would do the trick for me, i.e. doing it on user
side will work. I wasn't suggesting to do it in libxml2's header files.


BTW, is there a reason why there's both an xmlBufLength() and an
xmlBufUse() that do the same thing? Since this is a new API that doesn't
suffer from legacy junk yet, wouldn't one be enough? (And wouldn't
xmlBufLength() be the perfect name?)

  well I wanted the conversion to be as automatic as possible and
since the field which was used was buf->use, I perfer to keep that
'alias' for simpler conversions where needed. But in theory you're
perfectly right it is pure code duplication ...

Your choice - I would have decided against having two ways to do it.


  I don't plan to make an official release with the changes before
September, so there is a bit of time to get this all cleaned up, and
possibly refine the migration stategy for the few apps affected.

There'll be a new release (3.0) of lxml quite soon, within a few weeks. It
should be doable to get this fixed up by then.

  Okay, tell me if you have problems. I didn't fully finished, as you
can see I'm still commiting fixes and improvements on that part. Let's
synchronize so that people making new build of lxml do it on top of
the new libxml2, otherwise they will have to rely on the ABI
compatibility ! I plan beginning of September, I hope it's not too late
for you :)

I'm not all too worried here. Linux users either use the lxml that comes
with their system or freshly build it themselves. In the first case, it's
unlikely that a distribution would switch to libxml2 2.9 behind their back
(distributors tend to be rather conservative), in the latter case, they'll
often take care of the dependencies themselves and just rebuild when they
upgrade something.

MacOS users as well as Windows users typically use builds that link libxml2
and libxslt in statically. On MacOS, because the system libraries tend to
interfere with newer installations in uncontrollable ways and on Windows
because users can't build software themselves and happily click on
redundantly bloated installers they find on the Internet.

So it's actually rare that libxml2 gets updated without users being able to
update lxml as well. And building Python extensions is pretty automatic for
most users who tend to build their own software.

Thanks again!

Stefan




[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]