Re: [xml] Tool to convert malformed XML into valid XML



On Sat, Jul 20, 2013 at 11:07:22AM -0000, Subrata Dasgupta wrote:
Respected Sir,

    While working on a project I have run into serious problems with malformed XML files. Most of the 
time a few opening or closing tags are missing from those files, and sometimes the XML is well-formed but 
does not conform to the DTD. It is very hard to fix this by hand because the files are very big, from more 
than 30 MB up to 2 GB.

    So I am looking for an open source tool which can detect and fix malformed XML automatically with the 
help of a DTD or XSD (at least where there is no ambiguity). So far I have been unable to find such a 
tool, but after some searching it seems to me that one could be written using the GNOME libxml open source 
library.

    But I am not sure how to implement this or which API functions I should use. Please help me write 
such an application. I am proficient in C and C++. It would be very helpful if you could give me some 
information on this.

    If there is already a free or open source tool available for this purpose, please let me know as 
well.

  It's very hard, if not impossible. It is my understanding that XML
parsing rules are so drastic (if anything looks seriously wrong, don't
deliver any more data) because more than a decade of accumulated
expertise with SGML led to the belief that it is impossible to correct
without risk of injecting serious errors.
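
  As a small illustration of that rule, here is a sketch in Python using
the standard library's expat bindings (not libxml; the helper name
parse_until_error and the sample document are made up for this example).
A strict parser hands over events up to the first well-formedness error
and then stops, so nothing after the error is delivered:

```python
import xml.parsers.expat


def parse_until_error(chunks):
    """Feed byte chunks to a strict XML parser.

    Returns (events, error). `events` lists the start/end tags seen
    before parsing stopped; `error` is None for a well-formed document,
    else a (line, column, message) tuple for the first fatal error.
    """
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
    parser.EndElementHandler = lambda name: events.append(("end", name))

    error = None
    try:
        for chunk in chunks:
            parser.Parse(chunk, False)   # more data may follow
        parser.Parse(b"", True)          # signal end of document
    except xml.parsers.expat.ExpatError as exc:
        message = xml.parsers.expat.ErrorString(exc.code)
        error = (exc.lineno, exc.offset, message)
    return events, error


if __name__ == "__main__":
    # Hypothetical input: <b> is closed by </a>, so the parser gives up
    # there and the <c> element is never reported.
    broken = [b"<root><a>1</a><b>2</a>", b"<c>3</c></root>"]
    events, error = parse_until_error(broken)
    print(events)
    print(error)
```

  Because the input is fed chunk by chunk, the same loop can be pointed
at a multi-gigabyte file read in blocks to locate the first error, but
it only detects the problem; it makes no attempt to guess a repair.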

  The only open source exception I know of is 'tidy', the tool to correct
HTML, and it is fairly tied to that specific set of DTDs and to knowledge
about that format (specification and use cases).

  Libxml allows you to override the drastic rules of XML (the parser's
recover mode, XML_PARSE_RECOVER), and you could also use the HTML
parser, but in all honesty you may just end up with 30M to 2G of garbage!

  Fix the generator tools: if it's not well-formed, it's not XML,
and that is a violation of the contract that the spec guarantees.
And IMHO 2GB of XML that you just can't fix is ridiculous; there is
a serious design failure there, and it's not XML's fault!

Daniel

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/

