
Re: [xml] HTML parser space dropping (with patch)



On Mon, Apr 26, 2004 at 07:45:09PM -0700, David Gatwood wrote:
> 
> I'm running into a problem with the HTML parser dropping lots of 
> spaces.  There was a patch for this a while back, and I've verified 
> that my source tree still contains that patch, but it isn't solving the 
> problem.
> 
> Trivial example: feed the following to xmllint --html --htmlout
> 
> <html>
> <head></head><body>
> <pre>
> Word1<!-- comment 1 --> <!-- comment2 -->Word2
> </pre>
> </body></html>
> 
> Note that the contents are in a <pre> tag, so all spaces should be 
> kept.  However, the space between comments is dropped, resulting in 
> very damaged output with words run together.  Imagine a few thousand of 
> these in a doc and you see my problem.  :-)
> 
> I've also found that there's no way to turn this space dropping off.  
> The obvious flags (keepSpaces, for example) don't work.
> 
> After looking at the code, it appears that the current behavior is to 
> only keep spaces if:
> 
> 1.  It is the first thing in an element that allows spaces
> 2.  It follows another container that allows spaces
> 
> This second part, of course, fails for comments because they aren't 
> really elements, so they can't legitimately allow spaces.
> 
> I suspect that there are other cases where spaces are being dropped 
> incorrectly as well.  The ones that stand out from looking at the code 
> are spaces after <br> and after <img>.
> 
> In any case, it seems to me that the current model is more complicated 
> than necessary, and that any space within an element that follows a 
> mixed content model should automatically be considered sacred, 
> regardless of what element (if any) precedes it.  If an app needs the 
> current behavior, it's easy to nuke the spaces in the tree.  By 
> contrast, you can never get back spaces that have been dropped during 
> the parse.
> 
> Suggested patch attached.
> 
> 
> Thoughts?

  This might be right, but it may also break existing apps relying on
the current behaviour, so it is a potentially difficult issue.
I suggest you file this in bugzilla so we can keep track of it, and
attach the patch there. A first estimation will then be to see how much
this changes the regression tests and, for example, the xmlsoft.org
pages, which are generated from XSLT output of HTML parsed files.
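
  To make the difference concrete when looking at test changes, here is
a toy model in C of the two rules described in the quoted message, next
to the proposed rule. It is only a sketch of the description in this
thread, not the actual HTMLparser.c logic; the enum and the function
names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of whatever precedes a run of whitespace.
 * Illustration-only names, not libxml2 types. */
typedef enum {
    PREV_NONE,            /* whitespace is the first thing in its parent */
    PREV_SPACE_CONTAINER, /* follows an element that allows spaces */
    PREV_COMMENT,         /* follows a comment */
    PREV_OTHER            /* anything else */
} prev_kind;

/* Current behaviour, as described in the quoted message. */
bool keep_space_current(bool parent_allows_spaces, prev_kind prev)
{
    if (prev == PREV_NONE)
        return parent_allows_spaces;       /* rule 1 */
    return prev == PREV_SPACE_CONTAINER;   /* rule 2 */
}

/* Proposed behaviour: only the parent's content model matters;
 * what precedes the whitespace is ignored. */
bool keep_space_proposed(bool parent_is_mixed, prev_kind prev)
{
    (void)prev;
    return parent_is_mixed;
}

/* The <pre> example: the space sits right after a comment. */
void check_pre_example(void)
{
    assert(!keep_space_current(true, PREV_COMMENT));  /* dropped today */
    assert(keep_space_proposed(true, PREV_COMMENT));  /* kept with patch */
}
```

Under this model the two rules only disagree when the space is not the
first child and does not follow a space-allowing container, which is
where differences in the regression output would be expected.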

  thanks,

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


