Re: [xml] HTML parser space dropping (with patch)
- From: Daniel Veillard <veillard redhat com>
- To: David Gatwood <dgatwood mac com>
- Cc: xml gnome org
- Subject: Re: [xml] HTML parser space dropping (with patch)
- Date: Wed, 28 Apr 2004 10:48:22 -0400
On Mon, Apr 26, 2004 at 07:45:09PM -0700, David Gatwood wrote:
>
> I'm running into a problem with the HTML parser dropping lots of
> spaces. There was a patch for this a while back, and I've verified
> that my source tree still contains that patch, but it isn't solving the
> problem.
>
> Trivial example: feed the following to xmllint --html --htmlout
>
> <html>
> <head></head><body>
> <pre>
> Word1<!-- comment 1 --> <!-- comment2 -->Word2
> </pre>
> </body></html>
>
> Note that the contents are in a <pre> tag, so all spaces should be
> kept. However, the space between comments is dropped, resulting in
> very damaged output with words run together. Imagine a few thousand of
> these in a doc and you see my problem. :-)
>
> I've also found that there's no way to turn this space dropping off.
> The obvious flags (keepSpaces, for example) don't work.
>
> After looking at the code, it appears that the current behavior is to
> only keep spaces if:
>
> 1. It is the first thing in an element that allows spaces
> 2. It follows another container that allows spaces
>
> This second part, of course, fails for comments because they aren't
> really elements, so they can't legitimately allow spaces.
>
> I suspect that there are other cases where spaces are being dropped
> incorrectly as well. The ones that stand out from looking at the code
> are spaces after <br> and after <img>.
>
> In any case, it seems to me that the current model is more complicated
> than necessary, and that any space within an element that follows a
> mixed content model should automatically be considered sacred,
> regardless of what element (if any) precedes it. If an app needs the
> current behavior, it's easy to nuke the spaces in the tree. By
> contrast, you can never get back spaces that have been dropped during
> the parse.
>
> Suggested patch attached.
>
>
> Thoughts?
This might be right, but it may also break existing apps that rely on
the current behaviour, so this is a potentially difficult issue.
I suggest you file this in bugzilla so we can keep track of it, and attach
the patch there. A first estimate will then be to see how much
this changes the regression tests and, for example, the xmlsoft.org pages,
which are generated by running XSLT on files parsed with the HTML parser.
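
One possible way to make that estimate is to compare the text inside <pre> elements before and after a round trip through the parser. The sketch below is only an illustration: it uses Python's standard html.parser rather than libxml2, the names PreTextCollector, pre_text and spaces_preserved are hypothetical, and the output string is assumed to come from something like xmllint --html --htmlout.

```python
from html.parser import HTMLParser


class PreTextCollector(HTMLParser):
    """Collect character data found inside <pre> elements.

    Comments are ignored because handle_comment is not overridden.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # nesting depth of currently open <pre> elements
        self.chunks = []  # text pieces seen while inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def pre_text(html):
    """Return the concatenated text content of all <pre> elements."""
    collector = PreTextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


def spaces_preserved(source_html, output_html):
    """True if the <pre> text is identical in the source and the output."""
    return pre_text(source_html) == pre_text(output_html)
```

The comparison is deliberately strict. A serializer may legitimately drop a newline directly after the opening <pre> tag, so a real check would probably need to normalize that case before comparing.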
thanks,
Daniel
--
Daniel Veillard | Red Hat Desktop team http://redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/