Re: [xml] i'm here to contribute



On Mon, 10/31/2011 5:48 PM, Stefan Sauer wrote:
On 09/18/2011 10:24 PM, Glen Hein wrote:
Hello,

I'm a software developer and I'd like to contribute to GNOME's XML project. I've used the libxml software for a long time and I'd like to give something back.

I just started a voluntary career break, but I'd like to stay active.

I looked over the TODO file, but I'm not sure which item to tackle. Could you recommend an item for someone new to the project?

Thanks,
Glen Hein


One thing that would be super cool would be multi-threaded XSLT processing (e.g. for chunked document output). Unfortunately, again, this is not trivial at all. But any speedup for XSLT processing would be great. The DocBook XML -> HTML step in gtk-doc is so slow that most developers still turn API doc generation off :/

Stefan

My vote is to add a generic XML sanitizer. It would first correct syntax problems, escape special characters, etc. Once the data is syntactically correct, the sanitizer could use a DTD/schema/XSLT to add missing elements or, more importantly, strip unwanted elements. The obvious application is HTML. A web server could pass untrusted bytes into the sanitizer and get back a result that is both valid and safe. Different levels/rules would be used to achieve different results.
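To make the idea a bit more concrete, here is a rough sketch of the two stages (repair the syntax, then strip whatever the rules don't allow). It is in Python with only the standard library, purely because that is quick to show; the proposal itself is for C inside libxml2. The names ALLOWED, DROP_CONTENT, SAFE_SCHEMES, Sanitizer and sanitize are all made up for illustration, and the rule table is a plain dictionary standing in for a DTD/schema/XSLT. It is not a security-grade filter.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical rule set: element name -> attributes allowed on it.
ALLOWED = {"p": set(), "b": set(), "i": set(), "a": {"href"}}
# Elements whose entire content is discarded, not just the tags.
DROP_CONTENT = {"script", "style"}
# Assumed allowlist of URL schemes for href values.
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class Sanitizer(HTMLParser):
    def __init__(self, rules):
        super().__init__(convert_charrefs=True)
        self.rules = rules
        self.out = []         # sanitized output fragments
        self.open_tags = []   # allowed elements that are currently open
        self.skip_depth = 0   # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in self.rules:
            return  # unknown element: drop the tag, keep its text
        kept = []
        for name, value in attrs:
            if name not in self.rules[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth or tag not in self.open_tags:
            return
        # Close anything left open inside this element so output stays balanced.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(untrusted, rules=ALLOWED):
    parser = Sanitizer(rules)
    parser.feed(untrusted)
    parser.close()
    # Close whatever the input left open.
    closing = [f"</{t}>" for t in reversed(parser.open_tags)]
    return "".join(parser.out + closing)
```

For example, sanitize('<p onclick="x()">Hi <b>there<script>alert(1)</script></p>') gives back '<p>Hi <b>there</b></p>': the event handler and the script are gone, and the unclosed <b> has been closed. Comments are dropped as well, since the parser's default comment handler does nothing.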

Of course there are existing solutions, but everything I've found so far is written in PHP, Perl, Python, Java, etc., and most are written as standalone command line tools. Launching a command line tool, particularly an executable that runs atop a virtual machine, is very inefficient and difficult to scale. Having the functionality inside libxml2 means daemons that already use the library could easily sanitize their output and, with relatively little overhead, protect themselves from a number of potential problems.

A secondary goal would be the standardization of the DTD/schema/XSLT rules that are used to sanitize HTML (and other XML-formatted content). Right now, every sanitizer uses a different set of rules and looks for a different collection of exploits. If a new trick is discovered to pass harmful data to clients, presumably by encapsulating it in a way that might be valid but which gets parsed by some clients in a "vendor specific" way, updating the standardized rules would allow all the sanitizers to adapt without changing code...
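As a small illustration of "adapt without changing code", here is a sketch where the rules live in data and the checking code never changes. The rule format is an assumption on my part: JSON is used only to keep the example short, where the proposal would use a DTD/schema/XSLT, and RULES_V1, RULES_V2, load_rules and is_allowed are hypothetical names.

```python
import json

# Hypothetical shared rule files. Version 2 withdraws "src" on <img>
# after some (imagined) exploit report; only the data changes.
RULES_V1 = '{"version": 1, "elements": {"p": [], "a": ["href"], "img": ["src", "alt"]}}'
RULES_V2 = '{"version": 2, "elements": {"p": [], "a": ["href"], "img": ["alt"]}}'


def load_rules(text):
    """Parse a rule document into {element: frozenset(allowed attributes)}."""
    doc = json.loads(text)
    return {tag: frozenset(attrs) for tag, attrs in doc["elements"].items()}


def is_allowed(rules, tag, attr=None):
    """Return True if the element (and, if given, the attribute) is permitted."""
    if tag not in rules:
        return False
    return attr is None or attr in rules[tag]
```

Any sanitizer that consults is_allowed would start rejecting the img src attribute as soon as it loads RULES_V2, with no rebuild or redeploy of the code itself.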

Just my .02.




