Re: [xml] Problem with an old SGML



On Tue, Nov 08, 2005 at 12:44:19AM +0100, Kail wrote:
I have a problem with an old SGML file.
It has many format errors; the 2 most annoying are:

1- It has more than 1 element at the root level
     //Start of file
     <reuters> ........ </reuters>
     <reuters> ........ </reuters>
etc.
This file is 7 years old, but I need to parse it :(
Is there a way to parse it without adding a node that wraps the file
from start to end?

  It is not XML.
  Hum, it's not simple, but you can try to use an XML file
which declares that file as an external entity, then makes one
reference to that entity within a top-level element in that file:

<!DOCTYPE doc [
<!ENTITY old_content SYSTEM "old.sgml">
]>
<doc>&old_content;</doc>

  then

  xmllint --noent new.xml > content.xml
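
  If the data is processed from a script rather than with xmllint, the
same wrapping idea can be sketched in memory. This is not the external
entity mechanism above, only a plain string wrap with Python's standard
xml.etree.ElementTree. The function name parse_multi_root, the wrapper
name "doc" and the sample string are assumptions for illustration, and
the sketch assumes the input has no XML declaration or DOCTYPE at the
start and is otherwise well-formed.

```python
import xml.etree.ElementTree as ET


def parse_multi_root(text, wrapper="doc"):
    """Parse text that has several top-level elements.

    The wrapper element exists only in the in-memory string;
    the file on disk is left untouched.
    """
    return ET.fromstring(f"<{wrapper}>{text}</{wrapper}>")


# Hypothetical stand-in for the content of old.sgml.
sample = "<reuters>first</reuters>\n<reuters>second</reuters>\n"

root = parse_multi_root(sample)
texts = [el.text for el in root.findall("reuters")]
print(texts)
```

  The parser still rejects anything else that is not well-formed XML, so
this only addresses the multiple-root problem.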

2- There are also some chars like &#31; that obviously are not
recognised and generate errors... Is there a way to avoid the errors
and make the parser treat them as TEXT, either by skipping the call
to xmlParseCharRef or by making that function not generate an error?
(an option I haven't found ^_^)

  Again, this is not XML, so it can't be parsed as is. You could try
the --recover option of xmllint in addition to --noent, but you have
no guarantee of the result, and it will lose data. This is not XML and
can't be expected to be parsed as such. You could try the HTML parser
too, to see what it gives on it:

  xmllint --html old.sgml > content.html

and process from there.
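
  Another route, if losing those references is not acceptable, is to
preprocess the text before parsing. The sketch below is an assumption
about how that could be done with Python's standard library, not
something libxml2 does for you: it escapes the leading "&" of any numeric
character reference that points outside the XML 1.0 Char range, so that
&#31; survives as literal text. The names escape_bad_char_refs and
is_xml_char and the sample string are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Decimal (&#31;) or hexadecimal (&#x1F;) numeric character references.
CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def is_xml_char(cp):
    """True if the code point is in the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def escape_bad_char_refs(text):
    """Turn illegal references such as &#31; into the literal text '&#31;'."""
    def fix(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if is_xml_char(cp):
            return match.group(0)          # legal reference, keep as is
        return "&amp;" + match.group(0)[1:]  # escape only the ampersand
    return CHAR_REF.sub(fix, text)


# Hypothetical sample: &#31; is illegal in XML 1.0, &#65; is 'A'.
sample = "<reuters>a&#31;b&#65;</reuters>"
cleaned = escape_bad_char_refs(sample)
root = ET.fromstring(cleaned)
print(root.text)
```

  The illegal reference then shows up in the parsed text as the six
characters "&#31;", and the application can decide what to do with it.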

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


