Re: [xml] parsing fragments of a larger file



On Fri, Aug 29, 2003 at 08:55:25PM -0400, Daniel Veillard wrote:

Do you know of any other XML libraries which are designed to fail softly
and provide enough meta information that they can be used intelligently
by an editor? If not, would you advise me to write my own library rather

  Reliably? None. You can hack Perl, or write a minimal parser
front-end, but only deep experience with the spec(s) and a lot of
sweat may give you something which will recover non-trivial cases
in a reasonable fashion. This is precisely the mess that people
tried to get away from after the HTML "experience", and which led to
XML's relatively drastic rules. As a result XML tools are common,
reliable and cheap. What you're trying to do is costly, time
consuming and not rewarding, but you're warned already ...

This solves a big mystery for me: why almost no XML tool
provides useful error messages. I think libxml2 is the only
XML tool which will specify the line/column of the offending
*start* element when start and end mismatch (to understand the
joke here: think of an XML file with 200 elements. One of them
is wrong in such a way that only the last end element fails to match.
Good luck finding the mismatch manually...).
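
To make the point concrete, here is a minimal sketch in Python (not
libxml2 code, and not how libxml2 does it internally) of why keeping
the line of each *start* tag on a stack is enough to report both ends
of a mismatch. The names find_mismatch, TAG_RE and SAMPLE are made up
for this illustration, and the regex only understands plain element
tags (no comments, CDATA, or ">" inside attribute values).

```python
import re

# Start tags, end tags and self-closing tags. Declarations such as
# <?xml ...?> and <!-- ... --> do not match because of the name rule.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)[^<>]*?(/?)>")

def find_mismatch(text):
    """Return (start_name, start_line, end_name, end_line) for the first
    start/end mismatch, or None if every tag pairs up."""
    stack = []  # (name, line) of the elements that are currently open
    for m in TAG_RE.finditer(text):
        closing, name, self_closing = m.groups()
        line = text.count("\n", 0, m.start()) + 1
        if closing:
            if not stack:
                return (None, None, name, line)   # stray end tag
            open_name, open_line = stack.pop()
            if open_name != name:
                return (open_name, open_line, name, line)
        elif not self_closing:
            stack.append((name, line))
    if stack:                                     # element never closed
        open_name, open_line = stack[-1]
        return (open_name, open_line, None, None)
    return None

SAMPLE = """<?xml version="1.0"?>

<root>
    <e1>
        <e2>
            <e3></e3>
        </e2>
    </e1>
    <e1>
        <e2>
            <e3></e4>
        </e2>
    </e1>
</root>
"""

result = find_mismatch(SAMPLE)
if result is not None:
    start_name, start_line, end_name, end_line = result
    print("start <%s> at line %s does not match end </%s> at line %s"
          % (start_name, start_line, end_name, end_line))
```

The stack costs one entry per open element, so reporting the start
position is cheap once the parser tracks line numbers at all.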

My suggestion here would be to create a "fallback": when
libxml2 detects an error, it goes to "HTML" mode. The
idea is to parse the document *again*, read as much of
the document as possible, and then attach the unparsed rest
as a large CDATA section or maybe even as a new ERROR
element.

Example:

    <?xml ...>

    <root>
        <e1>
            <e2>
                <e3></e3>
            </e2>
        </e1>
        <e1>
            <e2>
                <e3></e4>
            </e2>
        </e1>
    </root>

This should return this parse tree:

    <root ERROR="TRUE">
        <e1>
            <e2>
                <e3 />
            </e2>
        </e1>
        <e1 ERROR="TRUE">
            <e2 ERROR="TRUE">
                <e3 ERROR="TRUE"><ERROR line="11">&lt;/e4&gt;
            &lt;/e2&gt;
        &lt;/e1&gt;
    &lt;/root&gt;
                </ERROR></e3>
            </e2>
        </e1>
    </root>

The artificial ERROR attributes (which should probably go into a
separate namespace as well) tag elements whose content you
cannot really trust. In an editor, they should be marked as
faulty.

The end elements are added by the error handler to create
a syntactically correct DOM, so that the editor can process
the result with a standard error handler.

In the case above (start/end element mismatch), the
offending element is always the parent of the ERROR element.
The editor could offer to display the (unparsed) source
of the DOM and jump to the line of the problem.
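
A rough sketch of the fallback idea, again in Python with only the
standard library and not against the libxml2 API: build the tree as
far as the tags match, and on the first mismatch flag every element
that is still open with ERROR="TRUE" and hang an ERROR element with
the unparsed rest of the source under the innermost one. The function
name parse_with_fallback and the regex are assumptions made for this
example. It handles element tags only (no attributes, text nodes,
comments or CDATA are copied into the tree).

```python
import re
import xml.etree.ElementTree as ET

TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)[^<>]*?(/?)>")

def parse_with_fallback(text):
    """Build an element tree; on a start/end mismatch, mark all open
    elements and keep the unparsed remainder in an ERROR element."""
    root = None
    stack = []  # elements that are currently open, outermost first
    for m in TAG_RE.finditer(text):
        closing, name, self_closing = m.groups()
        if not closing:
            elem = ET.SubElement(stack[-1], name) if stack else ET.Element(name)
            if root is None:
                root = elem
            if not self_closing:
                stack.append(elem)
        elif stack and stack[-1].tag == name:
            stack.pop()
        else:
            # Mismatch: nothing from here on can be trusted.
            line = text.count("\n", 0, m.start()) + 1
            for elem in stack:
                elem.set("ERROR", "TRUE")
            parent = stack[-1] if stack else root
            if parent is None:
                err = root = ET.Element("ERROR", line=str(line))
            else:
                err = ET.SubElement(parent, "ERROR", line=str(line))
            err.text = text[m.start():]  # raw source, escaped on output
            return root
    for elem in stack:  # input ended with elements still open
        elem.set("ERROR", "TRUE")
    return root

doc = "<root>\n  <a>\n    <b></c>\n  </a>\n</root>\n"
tree = parse_with_fallback(doc)
print(ET.tostring(tree, encoding="unicode"))
```

Because the serializer closes every element itself, the result is
always well-formed, and an editor can find the problem by looking for
the ERROR element and reading its line attribute.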

-- 
==============================================
Sowatec AG,       CH-8330 Pfäffikon (ZH)
Witzbergstr. 7,   http://www.sowatec.com
Tel: +41-(0)1-952 55 55
Fax: +41-(0)1-952 55 66
----------------------------------------------
Aaron "Optimizer" Digulla, digulla sowatec com
==============================================


