Re: [xml] Reliability of htmlElementStatusHere and line number



On Thu, Jun 05, 2008 at 12:31:07PM +1000, Gabriele Bartolini wrote:
Hello,

   I am evaluating the HTML parsing features of the libxml2 library. I
am evaluating the opportunity to embed the library in a C++ open-source
project which I maintain, in order to get rid of my previous parser and
use libxml2 to analyse both HTML and XHTML documents (and possibly
validate them).
[...]
   However, I have run into two problems:

1) I cannot get the line number of the element (I have used the node's
line attribute and even the XML_GET_LINE macro)

  I just tried xmllint --html on a simple file, and I do seem to get line numbers:

Breakpoint 1, xmlFreeDoc (cur=0x1b277d8) at tree.c:1177
1177        xmlDictPtr dict = NULL;
Missing separate debuginfos, use: debuginfo-install glibc.x86_64 zlib.x86_64
(gdb) p cur->children
$1 = (struct _xmlNode *) 0x1b28288
(gdb) p *cur->children
$2 = {_private = 0x0, type = XML_DTD_NODE, name = 0x1b27738 "html", 
  children = 0x0, last = 0x0, parent = 0x1b277d8, next = 0x1b27d18, 
  prev = 0x0, doc = 0x1b277d8, ns = 0x0, content = 0x0, properties = 0x0, 
  nsDef = 0x0, psvi = 0x1b28348, line = 33720, extra = 434}
(gdb) p *cur->children->next
$3 = {_private = 0x0, type = XML_ELEMENT_NODE, name = 0x1b276e8 "html", 
  children = 0x1b27e78, last = 0x1b27e78, parent = 0x1b277d8, next = 0x0, 
  prev = 0x1b28288, doc = 0x1b277d8, ns = 0x0, content = 0x0, 
  properties = 0x0, nsDef = 0x0, psvi = 0x0, line = 1, extra = 0}
(gdb) p *cur->children->next->children
$4 = {_private = 0x0, type = XML_ELEMENT_NODE, name = 0x1b27e28 "body", 
  children = 0x1b27f88, last = 0x1b27f88, parent = 0x1b27d18, next = 0x0, 
  prev = 0x0, doc = 0x1b277d8, ns = 0x0, content = 0x0, properties = 0x0, 
  nsDef = 0x0, psvi = 0x0, line = 2, extra = 0}
(gdb) 

The line values for the element nodes seem correct.

2) I seem to receive false INVALID results from the
htmlElementStatusHere function which I call on each element. For
instance, I get an 'HTML_INVALID' result for an 'a' element within a 'p'
element.

  Dunno, that's not related to parsing; those are extra functions developed
for editing purposes.
  Instead of blindly dropping errors and warnings at parsing time, you should
record them if you want to verify an input HTML tree.

  In any case, the best you can do is use the XHTML1 DTDs to validate the
resulting HTML tree; anything else will be subject to interpretation, as
libxml2 just cannot validate based on SGML HTML rules.

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/


