Re: [xml] Applying XSLT to HTML

From: Stefan Behnel <stefan_ml behnel de>
To: Dmitry Dzhus <mail sphinx net ru>
Cc: xml gnome org
Subject: Re: [xml] Applying XSLT to HTML
Date: Mon, 02 Jul 2007 16:11:13 +0200



Dmitry Dzhus wrote:

My aim is to apply XSLT to an HTML document (which may be slightly
malformed).

I'm using standard Python libxml2/libxslt bindings.

My code is:

   mf_extract = libxslt.parseStylesheetFile("mf-extract.xsl")
   
   doc = libxml2.readHtmlFile(url, None, libxml2.HTML_PARSE_RECOVER)
   
   mf_extract.applyStylesheet(doc, None)

Applying the stylesheet behaves as if the `doc` tree had no content at
all. Using `readFile` instead of `readHtmlFile` works as expected.

I tried to `print doc` after using both `readHtmlFile` and `readFile`
and noticed that, when the input document is well-formed, the output
differs only in the XML declaration at the very beginning.

As I understand it (and as `document.type` indicates), `readFile` and
`readHtmlFile` produce different kinds of documents --
`document_xml` and `document_html` -- and XSLT can only be applied to
a `document_xml` one. Is there any way to convert a `document_html`
to a `document_xml`?



Consider using lxml.

http://codespeak.net/lxml/

untested:

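   import lxml.etree as et

   # The HTML parser recovers from broken markup and builds an
   # ordinary tree that XSLT can be applied to.
   parser = et.HTMLParser()
   doc = et.parse(url, parser)

   # xslt() returns the transformed tree; keep the result.
   result = doc.xslt(et.parse("mf-extract.xsl"))

   # If the stylesheet matches elements in the XHTML namespace, run
   # this loop before the xslt() call: parsed HTML has no namespace.
   for el in doc.iter("*"):
       if '{' not in el.tag:
           el.tag = "{http://www.w3.org/1999/xhtml}" + el.tag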
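
The tag-rewriting loop relies on the "{namespace}localname" tag notation that the ElementTree API uses. As a rough illustration of that step on its own, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (an assumption made so the example runs without lxml installed). The helper name add_xhtml_namespace and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def add_xhtml_namespace(root):
    """Move every un-namespaced element under root into the XHTML namespace."""
    for el in root.iter():
        # Element tags are plain strings; a leading "{...}" means the
        # element already belongs to some namespace, so leave it alone.
        if isinstance(el.tag, str) and "{" not in el.tag:
            el.tag = "{%s}%s" % (XHTML_NS, el.tag)
    return root


# Hypothetical sample input: a small, well-formed, namespace-free document.
root = ET.fromstring("<html><body><p class='x'>hi</p></body></html>")
add_xhtml_namespace(root)
tags = [el.tag for el in root.iter()]
```

After the call, a stylesheet or path expression that addresses elements in the XHTML namespace can match them. Attributes are left untouched, which is usually what is wanted, since XHTML attributes are not namespaced.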

Stefan

