Re: [xml] HTML parser space dropping (with patch)
- From: Daniel Veillard <veillard redhat com>
- To: David Gatwood <dgatwood mac com>
- Cc: xml gnome org
- Subject: Re: [xml] HTML parser space dropping (with patch)
- Date: Wed, 28 Apr 2004 10:48:22 -0400
On Mon, Apr 26, 2004 at 07:45:09PM -0700, David Gatwood wrote:
I'm running into a problem with the HTML parser dropping lots of
spaces. There was a patch for this a while back, and I've verified
that my source tree still contains that patch, but it isn't solving the
problem.
Trivial example: feed the following to xmllint --html --htmlout:
<html>
<head></head><body>
<pre>
Word1<!-- comment 1 --> <!-- comment2 -->Word2
</pre>
</body></html>
Note that the contents are in a <pre> tag, so all spaces should be
kept. However, the space between comments is dropped, resulting in
very damaged output with words run together. Imagine a few thousand of
these in a doc and you see my problem. :-)
I've also found that there's no way to turn this space dropping off.
The obvious flags (keepSpaces, for example) don't work.
After looking at the code, it appears that the current behavior is to
only keep spaces if:
1. It is the first thing in an element that allows spaces
2. It follows another container that allows spaces
This second part, of course, fails for comments because they aren't
really elements, so they can't legitimately allow spaces.
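To make the two conditions concrete, here is a minimal toy model of the rule as described in this message. It is not libxml2's actual code: the node representation, the ALLOWS_SPACES set, and the function names are all assumptions made up for illustration.

```python
# Toy model of the whitespace rule described above; NOT libxml2's real code.
# Nodes are (kind, value) tuples: ("elem", name), ("comment", text), ("text", text).

# Hypothetical set of elements that "allow spaces" (illustration only).
ALLOWS_SPACES = {"pre", "p", "b", "i", "span"}

def keep_blank(parent, prev):
    """Decide whether a whitespace-only text node is kept.

    prev is the previous kept sibling, or None if the blank is the first child.
    """
    if prev is None:
        # Condition 1: first thing in an element that allows spaces.
        return parent in ALLOWS_SPACES
    kind, value = prev
    # Condition 2: follows another container that allows spaces.
    # A comment is not an element, so it can never satisfy this.
    return kind == "elem" and value in ALLOWS_SPACES

def filter_children(parent, children):
    """Return the children of `parent` that survive the rule."""
    kept = []
    for node in children:
        kind, value = node
        is_blank = kind == "text" and value.strip() == ""
        prev = kept[-1] if kept else None
        if not is_blank or keep_blank(parent, prev):
            kept.append(node)
    return kept

# The <pre> contents from the example above.
sample = [
    ("text", "\nWord1"),
    ("comment", " comment 1 "),
    ("text", " "),
    ("comment", " comment2 "),
    ("text", "Word2\n"),
]
result = filter_children("pre", sample)
text_only = "".join(v for k, v in result if k == "text")
```

Under this model the single space between the two comments is dropped, so the remaining text reads Word1Word2. A blank after a hypothetical ("elem", "br") sibling is dropped the same way, because br is not in the assumed set.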
I suspect that there are other cases where spaces are being dropped
incorrectly as well. The ones that stand out from looking at the code
are spaces after <br> and after <img>.
In any case, it seems to me that the current model is more complicated
than necessary, and that any space within an element that follows a
mixed content model should automatically be considered sacred,
regardless of what element (if any) precedes it. If an app needs the
current behavior, it's easy to nuke the spaces in the tree. By
contrast, you can never get back spaces that have been dropped during
the parse.
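The asymmetry argued here can be sketched in a few lines. This is only a toy model under stated assumptions, not the attached patch: the MIXED_CONTENT set, the tuple-based node representation, and the function names are hypothetical.

```python
# Toy sketch of the proposed policy; NOT the attached patch or libxml2 code.
# Nodes are (kind, value) tuples: ("elem", name), ("comment", text), ("text", text).

# Hypothetical set of elements with a mixed content model (illustration only).
MIXED_CONTENT = {"pre", "p", "td"}

def is_blank(node):
    """True for a text node made up only of whitespace."""
    kind, value = node
    return kind == "text" and value.strip() == ""

def parse_keep(parent, children):
    """Proposed policy: inside mixed content, never drop whitespace,
    regardless of what precedes it."""
    if parent in MIXED_CONTENT:
        return list(children)
    return [n for n in children if not is_blank(n)]

def strip_blanks(children):
    """Post-parse cleanup for an app that prefers blank text nodes removed."""
    return [n for n in children if not is_blank(n)]

sample = [
    ("text", "\nWord1"),
    ("comment", " comment 1 "),
    ("text", " "),
    ("comment", " comment2 "),
    ("text", "Word2\n"),
]
kept = parse_keep("pre", sample)   # the space between the comments survives
cleaned = strip_blanks(kept)       # an app can still remove it afterwards
```

The point of the sketch is the direction of the operation: strip_blanks can always be applied to a tree that kept its spaces, but nothing can rebuild the spaces from a tree that was parsed without them.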
Suggested patch attached.
Thoughts?
This might be right, but it may also break existing apps relying on
the current behaviour, so this is a potentially difficult issue.
I suggest you file this in bugzilla so we can keep track of it, and
attach the patch there. A first estimate will then be to see how much
this changes the regression tests and, for example, the xmlsoft.org pages,
which are generated by XSLT from files read with the HTML parser.
thanks,
Daniel
--
Daniel Veillard | Red Hat Desktop team http://redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/