Re: [xml] Keeping entity references unchanged with xmlParseFile



On Tue, Dec 09, 2003 at 10:26:19PM +0100, bvh wrote:
I am using libxml2 2.5.11 to parse a DocBook file. Currently I use xmlParseFile(...)
to parse and convert the whole document to a tree. Later I traverse the tree and
output a LaTeX file. I'd like libxml to leave entity references like &agrave;
alone so that I can convert them myself.

  They are left in the tree. Apparently you didn't find them, or you asked
for their replacement, but that is not libxml2's default behaviour.

I already figured out that I had to do

xmlLoadExtDtdDefaultValue = XML_DETECT_IDS;

to get libxml to look up the catalog file instead of complaining about unknown
entity references.

  Well, libxml2 doesn't load the DTD by default. If you reference entities,
the parser emits a *warning* that the entity is not defined.

Although this is suboptimal (the input itself is
generated and known to be valid, well-formed, etc., so I don't really need
libxml to verify that it's valid according to the DTD), I can live with it.

  Well, you know it, so ignore the warning (that can be done programmatically
too, of course!).

However, the entity references seem to be skipped over completely. For
example, for

<para>Foo &agrave; bar</para>

I simply get two text nodes with "Foo " and " bar" as content under the para node.

  I bet you missed the entity reference node between those 2 nodes!
By default libxml2
   1/ does not replace entities by their content (unless you asked for it!)
   2/ coalesces adjacent text nodes.

Is there some documentation that explains how the API is supposed to
work together? The doc on the website leaves me a little frustrated because it
is either too superficial or too incoherent.

  There are 1400+ functions in the API. Using the tree requires being able to
walk the tree and analyze it. If you have trouble doing so, use the xmlReader.

Character entities get through but are converted to UTF-8. Although
not critical, I'd much rather have them as character entities in the character
nodes.

  No way, "character entities" are *not* entities! They are character
*references* and reference a Unicode code point, which has *nothing* to
do with entities; you're confused. Parsers are not supposed to keep that
information, and libxml2 doesn't.

One more thing: could you please make it clearer on the website which
parts of the API are in the latest release and which only in CVS? I spent quite
some time hunting for xmlReadFile before reading in the mailing list archive that
it's a proposal for a new API...

  This has been in the last 3 public releases: 2.6.0, 2.6.1 and 2.6.2.
This information is all over the place, in the archives and in the news
section. If you think it's still not sufficient, we take patches...

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


