Re: [xml] libxml2 2.7.1 breaks XML serialisation of HTML trees



Hi,

Martin (gzlist) wrote:
> On 08/09/2008, Stefan Behnel <stefan_ml behnel de> wrote:
>>  there was a change in 2.7.1 (xmlsave.c, around line 760) that prevents HTML
>>  documents from being serialised in XML style...
>>  ...
>>  If the current behaviour is wanted, what's the future way of achieving
>>  this *without* temporarily modifying the document? (i.e. without breaking
>>  thread concurrency)

> I have been eyeing the other 28 bits of xmlSaveOption recently, mostly
> to add an XML_SAVE_XHTML, as a counterpart to the current XML_SAVE_NO_XHTML,
> that would unconditionally turn *on* the Appendix C rules without
> needing one of the XHTML 1.0 doctypes.

Sounds fine.
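
To make the flag proposal above concrete, here is a minimal C sketch of how such a bit could combine with the existing one. The SKETCH_* constants are a local mirror of the four xmlSaveOption flags in 2.7.1, so the code compiles without libxml2. SKETCH_SAVE_XHTML, its bit position, the function names, and the rule that the "off" flag wins when both are set are assumptions made for illustration, not libxml2 behaviour.

```c
#include <assert.h>

/* Local mirror of the four xmlSaveOption flags in libxml2 2.7.1. */
enum {
    SKETCH_SAVE_FORMAT   = 1 << 0, /* XML_SAVE_FORMAT */
    SKETCH_SAVE_NO_DECL  = 1 << 1, /* XML_SAVE_NO_DECL */
    SKETCH_SAVE_NO_EMPTY = 1 << 2, /* XML_SAVE_NO_EMPTY */
    SKETCH_SAVE_NO_XHTML = 1 << 3, /* XML_SAVE_NO_XHTML */
    /* Hypothetical: the proposed flag, placed on the next free bit. */
    SKETCH_SAVE_XHTML    = 1 << 4
};

/* Returns 1 if the XHTML Appendix C rules should apply, 0 otherwise.
 * Assumption: if both flags are set, the explicit "off" flag wins. */
int sketch_use_xhtml_rules(int options, int has_xhtml1_doctype)
{
    if (options & SKETCH_SAVE_NO_XHTML)
        return 0;                   /* forced off */
    if (options & SKETCH_SAVE_XHTML)
        return 1;                   /* forced on, doctype not needed */
    return has_xhtml1_doctype != 0; /* fall back to doctype detection */
}

/* Usage: the three cases discussed in the thread. */
void sketch_flags_self_check(void)
{
    assert(sketch_use_xhtml_rules(SKETCH_SAVE_XHTML, 0) == 1);
    assert(sketch_use_xhtml_rules(SKETCH_SAVE_NO_XHTML, 1) == 0);
    assert(sketch_use_xhtml_rules(SKETCH_SAVE_FORMAT, 1) == 1);
}
```

Without either flag the result follows the doctype, which is how the current detection is described above.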


> Some other tweaks like XML_SAVE_XHTML_NO_META_CHARSET would perhaps
> also be good.

Why only for XHTML? The <meta> entry is either wanted or not, and it changes
the document on output, which is not always desirable. The libxml2 options
should say: "I want it added if it's not there" (which is the current
behaviour anyway) and "I do not want my document modified on output".


> Would an XML_SAVE_TEXT_HTML option to do the old SGML-ish serialisation
> answer your use case?

Doesn't sound like it. The problem is that I need to distinguish between a
serialisation as well-formed XML and a serialisation in HTML style
*independent* of the type of document. And I also need to do so in a way that
produces the same output across libxml2 versions. I wouldn't mind switching to
a different API based on an "#if LIBXML_VERSION ...", but I would still want
to get comparable output. lxml never used the xmlSave* API for exactly that
reason: the output changed heavily across the supported versions.
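
As a rough illustration of that requirement, the following C sketch takes the output style from an explicit caller request and ignores the document type, with a compile-time version switch of the kind mentioned above. All sketch_* names are hypothetical and not part of libxml2 or lxml. The fallback definition of LIBXML_VERSION is a stand-in so the code compiles on its own; real code would get the macro from <libxml/xmlversion.h>.

```c
#include <assert.h>

/* Stand-in so the sketch compiles without libxml2 headers. */
#ifndef LIBXML_VERSION
#define LIBXML_VERSION 20701 /* 2.7.1, the version discussed here */
#endif

/* Hypothetical: the kind of tree being serialised. */
typedef enum { SKETCH_DOC_XML, SKETCH_DOC_HTML } sketch_doc_type;

/* Hypothetical: the output style the caller explicitly asks for. */
typedef enum { SKETCH_STYLE_XML, SKETCH_STYLE_HTML } sketch_style;

/* The style comes from the request alone; the document type is
 * deliberately ignored. */
sketch_style sketch_output_style(sketch_doc_type doc, sketch_style requested)
{
    (void)doc;
    return requested;
}

/* Returns 1 when an HTML tree is to be written as XML on a library
 * version at or after the change described in this thread. */
int sketch_needs_version_workaround(sketch_doc_type doc, sketch_style requested)
{
#if LIBXML_VERSION >= 20701
    return doc == SKETCH_DOC_HTML && requested == SKETCH_STYLE_XML;
#else
    (void)doc;
    (void)requested;
    return 0;
#endif
}

/* Usage: the requested style is returned for either kind of tree. */
void sketch_style_self_check(void)
{
    assert(sketch_output_style(SKETCH_DOC_HTML, SKETCH_STYLE_XML) == SKETCH_STYLE_XML);
    assert(sketch_output_style(SKETCH_DOC_XML, SKETCH_STYLE_HTML) == SKETCH_STYLE_HTML);
}
```

Keeping the version check in one small function confines the "#if" to a single place, so the rest of the serialisation code can aim for the same output on every supported version.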

The change in 2.7.1 broke a whole bunch of doctests for lxml. I fixed some of
those, but users will run into the same problem.

Stefan


