Re: [xml] character entity replacements



On Fri, Mar 08, 2002 at 07:00:26AM -0500, David Santamauro wrote:
I have a very unique request and was wondering how I can accomplish this
with libxml2 (2.4.16) (if at all).

I create XML with an external script from a database export and am
currently using

xmllint --valid --noent --format

to validate, format and replace character entities (it's very fast). The
database export contains entities already (some of which are defined in
external *.ent files). Is there a way to intercept the handling of undeclared
(unresolved) entities and replace them with
[entityName] (please don't ask why)?

This would save me from collecting all those reported by xmllint and
declaring them as
<!ENTITY ent "&lsqb;ent&rsqb;">. This wouldn't be a problem but I'm talking
about (possibly) hundreds spread across millions of records.

Any help/advice would be appreciated,

 Okay, first point: if you want to validate your file, then checking that
every entity is declared is IMHO part of the checking one really ought to do.
 That said, the technical change you're suggesting is not trivial and would
require some coding. The problem is the following: when running with --noent
the parser is instructed not to generate any entity reference nodes, so the
tree retains no information that would allow the entity to be represented
when it is reserialized:

paphio:~/XML -> cat tst.xml 
<!DOCTYPE doc SYSTEM "foo.dtd">
<doc>hello this is &foo; some text</doc>
paphio:~/XML -> cat foo.dtd 
<!ELEMENT doc (CDATA)>
paphio:~/XML -> ./xmllint --shell --valid --debug --noent tst.xml 
tst.xml:2: error: Entity 'foo' not defined
<doc>hello this is &foo; some text</doc>
                        ^
tst.xml:2: validity error: Element doc content doesn't follow the DTD
Expecting (CDATA), got (CDATA)
<doc>hello this is &foo; some text</doc>
                                       ^
/ > ls
?--        1 doc
---        1 doc
/ > cd doc
doc > ls
t--       24 hello this is  some text
doc > 

  Actually, when an entity is undeclared, libxml doesn't produce any entity
reference node in the generated tree, whether --noent is given or not:

(gdb) p *doc->children->next->children
$4 = {_private = 0x0, type = XML_TEXT_NODE, name = 0x80a3260 "text", 
  children = 0x0, last = 0x0, parent = 0x80f4ce0, next = 0x0, prev = 0x0, 
  doc = 0x80e4960, ns = 0x0, content = 0x80e4e60 "hello this is  some text", 
  properties = 0x0, nsDef = 0x0}

  This is an error-handling condition, and I don't know what the most
appropriate thing to do is in this case.

Daniel
 
-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


