Re: [xml] How is ignorableWhitespace defined?

From: Bill Moseley <moseley hank org>
To: xml gnome org
Subject: Re: [xml] How is ignorableWhitespace defined?
Date: Sun, 30 Sep 2001 08:43:06 -0700

At 02:13 PM 09/03/01 -0400, Daniel Veillard wrote:
> On Mon, Sep 03, 2001 at 08:01:25PM +0200, Jonas Borgström wrote:
>> Hi,
>>
>> How is ignorableWhitespace in SAX defined?


Yes, I'm wondering too, as it doesn't really seem ignorable in some cases.
Am I misunderstanding the meaning of ignorableWhitespace?

For example:

<B> Next:</B> <A HREF="node151.html">12 Abbreviations</A>
<B>Up:</B> <A NAME="tex2html2196" HREF="lpg.html">e</A>

The line break should force white space to be rendered between
"Abbreviations" and "Up", but the SAX parser calls that ignorable.  I
also would not call the space between "Next:" and "12" ignorable.

SAX.characters( Next:, 6)
SAX.endElement(b)
SAX.ignorableWhitespace( , 1)
SAX.startElement(a, href='node151.html')
SAX.characters(12 Abbreviations, 16)
SAX.endElement(a)
SAX.ignorableWhitespace(
, 1)
SAX.startElement(b)
SAX.characters(Up:, 3)
SAX.endElement(b)
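
To make the effect concrete, here is a rough sketch of what happens to the
text if those whitespace-only runs really are ignored.  It is not libxml2
code: it uses Python's standard html.parser purely as a stand-in, and the
names TextEvents and text_events are made up for the illustration.
html.parser has no ignorableWhitespace callback, so whitespace-only chunks
delivered to handle_data play that role here.

```python
from html.parser import HTMLParser


class TextEvents(HTMLParser):
    """Record every text chunk, flagging the whitespace-only ones.

    Hypothetical helper: whitespace-only chunks stand in for what the
    libxml2 HTML SAX interface reports as ignorableWhitespace.
    """

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_data(self, data):
        kind = "whitespace" if data.isspace() else "characters"
        self.events.append((kind, data))


def text_events(html):
    """Parse html and return the list of (kind, text) events."""
    parser = TextEvents()
    parser.feed(html)
    parser.close()
    return parser.events


SAMPLE = (
    '<B> Next:</B> <A HREF="node151.html">12 Abbreviations</A>\n'
    '<B>Up:</B> <A NAME="tex2html2196" HREF="lpg.html">e</A>\n'
)

events = text_events(SAMPLE)

# Text as it comes out if the whitespace-only chunks are dropped ...
without_ws = "".join(text for kind, text in events if kind == "characters")
# ... and if they are kept like any other character data.
with_ws = "".join(text for kind, text in events)

print(without_ws.split())
print(with_ws.split())
```

With the whitespace-only chunks dropped, "Abbreviations" and "Up:" run
together into one word; keeping them gives the five words a browser would
show.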

On a related note, I'm using libxml2 to extract words (as they are
typically rendered by a client).  When parsing I use htmlElemDesc.inline to
help decide if I have a word break.  But that doesn't help with:

         foo<br>bar

Which is clearly two words.  Is there anything in libxml2 that could help
me decide what tags would indicate a word break?  Would it be helpful to
extend htmlElemDesc with this information?

It's not altogether clear, of course, as it's not uncommon to run text up
to both sides of an image, as in foo<img...>bar, and it's impossible to know
if that image is separating words or is part of the word.
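
The kind of per-tag information I have in mind could look roughly like the
sketch below.  Again this is not libxml2: it uses Python's html.parser as a
stand-in, and WORD_BREAK_TAGS, WordExtractor and extract_words are
hypothetical names.  The tag list is my own guess, not something taken from
htmlElemDesc, and <img> is deliberately left out of it because of the
ambiguity just described.

```python
from html.parser import HTMLParser

# Assumption: tags taken to end the current word even with no whitespace
# around them.  This list is a guess for illustration, not libxml2 data.
WORD_BREAK_TAGS = frozenset(
    ["br", "p", "div", "hr", "li", "ul", "ol", "table", "tr", "td", "th",
     "h1", "h2", "h3", "h4", "h5", "h6", "pre", "blockquote"]
)


class WordExtractor(HTMLParser):
    """Collect words, splitting on whitespace and on word-breaking tags."""

    def __init__(self, break_tags=WORD_BREAK_TAGS):
        super().__init__()
        self.break_tags = break_tags
        self.words = []
        self._current = []  # characters of the word being built

    def flush(self):
        """Finish the word in progress, if any."""
        if self._current:
            self.words.append("".join(self._current))
            self._current = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lower case.
        if tag in self.break_tags:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.break_tags:
            self.flush()

    def handle_data(self, data):
        # Whitespace always ends a word, including whitespace-only chunks
        # sitting between inline elements.
        for ch in data:
            if ch.isspace():
                self.flush()
            else:
                self._current.append(ch)


def extract_words(html, break_tags=WORD_BREAK_TAGS):
    """Return the list of words found in html."""
    parser = WordExtractor(break_tags)
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.words


print(extract_words("foo<br>bar"))
print(extract_words("foo<b>bar</b>"))
print(extract_words('foo<img src="x.png">bar'))
```

Here foo<br>bar comes out as two words and foo<b>bar</b> as one.  The
<img> case stays a single word unless "img" is added to the set passed as
break_tags, which is exactly the decision a parser cannot make on its own.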

Thanks,


Bill Moseley
mailto:moseley hank org

References:
- [xml] How is ignorableWhitespace defined?
  - From: Jonas Borgström
- Re: [xml] How is ignorableWhitespace defined?
  - From: Daniel Veillard
