Re: [xml] HTML parser space dropping (with patch)

From: Daniel Veillard <veillard redhat com>
To: David Gatwood <dgatwood mac com>
Cc: xml gnome org
Subject: Re: [xml] HTML parser space dropping (with patch)
Date: Wed, 28 Apr 2004 10:48:22 -0400

On Mon, Apr 26, 2004 at 07:45:09PM -0700, David Gatwood wrote:


I'm running into a problem with the HTML parser dropping lots of 
spaces.  There was a patch for this a while back, and I've verified 
that my source tree still contains that patch, but it isn't solving the 
problem.

Trivial example: feed the following to xmllint --html --htmlout

<html>
<head></head><body>
<pre>
Word1<!-- comment 1 --> <!-- comment2 -->Word2
</pre>
</body></html>

Note that the contents are in a <pre> tag, so all spaces should be 
kept.  However, the space between comments is dropped, resulting in 
very damaged output with words run together.  Imagine a few thousand of 
these in a doc and you see my problem.  :-)

I've also found that there's no way to turn this space dropping off.  
The obvious flags (keepSpaces, for example) don't work.

After looking at the code, it appears that the current behavior is to 
only keep spaces if:

1.  It is the first thing in an element that allows spaces
2.  It follows another container that allows spaces

This second part, of course, fails for comments because they aren't 
really elements, so they can't legitimately allow spaces.

I suspect that there are other cases where spaces are being dropped 
incorrectly as well.  The ones that stand out from looking at the code 
are spaces after <br> and after <img>.
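
The two conditions above can be sketched as a small model. This is only
one reading of the description, written in Python rather than libxml2's
C; the Node class, its allows_spaces flag, and keeps_space_current() are
hypothetical names for illustration and are not taken from the parser
source. Treating <br> as a node that does not allow spaces is likewise
an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a parsed node; not a libxml2 type."""
    kind: str                    # "element" or "comment"
    name: str = ""
    allows_spaces: bool = False  # assumed flag: may this node hold spaces?


def keeps_space_current(parent: Node, prev: Optional[Node]) -> bool:
    """Model of the behavior as described in the message above."""
    # Condition 1: the space is the first thing in an element that allows spaces.
    if prev is None:
        return parent.allows_spaces
    # Condition 2: the space follows another container that allows spaces.
    return prev.kind == "element" and prev.allows_spaces


pre = Node("element", "pre", allows_spaces=True)
bold = Node("element", "b", allows_spaces=True)
comment = Node("comment")        # comments are not elements
br = Node("element", "br")       # assumed not to allow spaces

print(keeps_space_current(pre, None))     # first thing in <pre>
print(keeps_space_current(pre, bold))     # after a container that allows spaces
print(keeps_space_current(pre, comment))  # after a comment
print(keeps_space_current(pre, br))       # after <br>
```

Under this model a space that follows a comment can never satisfy
condition 2, whatever the parent element is, which matches the <pre>
example.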

In any case, it seems to me that the current model is more complicated 
than necessary, and that any space within an element that follows a 
mixed content model should automatically be considered sacred, 
regardless of what element (if any) precedes it.  If an app needs the 
current behavior, it's easy to nuke the spaces in the tree.  By 
contrast, you can never get back spaces that have been dropped during 
the parse.
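
To illustrate the asymmetry, here is a minimal sketch of removing
whitespace-only text nodes from an already parsed tree. It uses Python's
standard xml.dom.minidom on a well-formed fragment, not the libxml2 HTML
parser, purely as a stand-in for "a tree that kept its spaces"; the
helper names strip_whitespace_text() and text_content() are hypothetical.

```python
from xml.dom import minidom

SOURCE = "<pre>Word1<!-- comment 1 --> <!-- comment2 -->Word2</pre>"


def text_content(node):
    """Concatenate the data of all text nodes below node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_content(child) for child in node.childNodes)


def strip_whitespace_text(node):
    """Remove every whitespace-only text node below node, in place."""
    for child in list(node.childNodes):  # copy: the list is mutated below
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_text(child)


doc = minidom.parseString(SOURCE)
root = doc.documentElement

before = text_content(root)   # the space between the comments is in the tree
strip_whitespace_text(root)
after = text_content(root)    # the application chose to drop it

print(repr(before))  # 'Word1 Word2'
print(repr(after))   # 'Word1Word2'
```

Going the other way is not possible: once the parser has discarded the
text node, nothing in the tree records where it was.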

Suggested patch attached.


Thoughts?


  This might be right, but it may also break existing apps relying on
the current behaviour, so this is a potentially difficult issue.
I suggest you file this in bugzilla so we can keep track of it, and attach
the patch there. A first estimate will then be to see how much this
changes the regression tests and, for example, the xmlsoft.org pages,
which are generated as XSLT output from HTML-parsed files.

  thanks,

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

References:
- [xml] HTML parser space dropping (with patch)
  - From: David Gatwood
