Re: [xml] i'm here to contribute



Jumping back on that old thread, now that I have a bit of time for
the xml mail folder :-)

On Mon, Oct 31, 2011 at 07:27:43PM -0500, Ladar Levison wrote:
On Mon, 10/31/2011 5:48 PM, Stefan Sauer wrote:
On 09/18/2011 10:24 PM, Glen Hein wrote:
[...]
My vote is to add a generic XML sanitizer. Presumably it would
correct syntax problems, escape special characters, etc. Once the
data is syntactically correct, the sanitizer could use a
dtd/schema/xslt to add missing elements, or more importantly strip
unwanted elements. The obvious application is HTML. A web server
could pass untrusted bytes into the sanitizer and get back a result
that is both valid and safe. Different levels/rules would be used to
achieve different results.

  Well, the canonical way is HTML Tidy from Dave Raggett (though
he seems to have stepped down): http://tidy.sourceforge.net/

Of course there are existing solutions, but everything I've found so
far is written in PHP, Perl, Python, Java, et al. And most are
written as standalone command line tools. Launching a command line
tool, particularly an executable that runs atop a virtual machine, is
very inefficient and difficult to scale. Having the functionality
inside libxml2 means daemons that already use the library could
easily sanitize their output, and with relatively little overhead
protect themselves from a number of potential problems.

A secondary goal would be the standardization of the dtd/schema/xslt
rules that are used to sanitize HTML (and other XML formatted
content). Right now, every sanitizer uses a different set of rules,
and looks for a different collection of exploits. If a new trick is
discovered to pass harmful data to clients, presumably by
encapsulating it in a way that might be valid, but which gets parsed
by some clients in a "vendor specific" way, updating the
standardized rules would allow all the sanitizers to adapt without
changing code...

  One of the real development goals that could still make sense
in libxml2 is to make the HTML parser behave like an HTML5 one
(or allow this as an option). There is already shared code for HTML5
parsing, but it's C++ (IIRC) and I can't rely on it. If people start
to agree a bit formally on how to parse "web HTML", i.e. the ignominious
mixtures that most Web parsers are built to process, and handle all
corner cases in a consistent, documented way, then upgrading libxml2
to behave in the same way as much as possible would be *great*. But
that would definitely be a lot of work, and I can't commit to anything
like this :-)
  The interesting point in this approach is that it doesn't have to
be 6 months of continuous work to produce results. It could be achieved
progressively, by adding an HTML_PARSE_HTML5 flag to htmlParserOption
and then adding fixes to the existing HTML parser as we meet them and
decide to fix them.
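
  To make the incremental idea a bit more concrete, here is a rough
sketch of what one such fix behind an opt-in flag could look like. It is
not libxml2 code: the enum, its values and the function name are all
hypothetical, only modelled on the bit-flag style of htmlParserOption.
The one rule shown (a start tag named "image" is handled as "img") comes
from the HTML5 parsing algorithm, and the sketch assumes the tag name
has already been lowercased by the tokenizer.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical option bits, modelled on the style of htmlParserOption.
 * These names and values are NOT part of libxml2. */
typedef enum {
    SKETCH_PARSE_RECOVER = 1 << 0, /* stand-in for an existing option */
    SKETCH_PARSE_HTML5   = 1 << 1  /* the proposed opt-in flag */
} sketchParserOption;

/*
 * Return the element name the tree builder should use for a start tag.
 * Without SKETCH_PARSE_HTML5 the name is passed through untouched, so
 * existing callers see no change. With the flag set, the HTML5 rule
 * that a start tag named "image" is handled as "img" is applied.
 * Assumes `name` is already lowercased.
 */
const char *
sketchStartTagName(const char *name, int options)
{
    assert(name != NULL);

    if ((options & SKETCH_PARSE_HTML5) && strcmp(name, "image") == 0)
        return "img";

    return name;
}
```

  Each further HTML5 rule would become another small branch guarded by
the same bit, so callers that don't pass the flag keep the current
behaviour while the HTML5 mode fills in over time.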

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/


