[xml] HTML push parser fix for repeated start tags



I found that the following document:

  <td><td><!-- <a><b> -->

was not parsed correctly by the HTML push parser, which reported an error:

  $ xmllint --html --push test.html
  test.html:1: HTML parser error : htmlParseStartTag: invalid element name
  <td><td><!-- <a><b> -->
           ^
  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
  <html><body>
  <td></td>
  <td><b> --&gt;
  </b></td>
  </body></html>

Compare with (no --push):

  $ xmllint --html test.html
  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
  <html><body>
  <td></td>
  <td><!-- <a><b> --></td>
  </body></html>

I tracked the problem down to the code around line 4773 of HTMLparser.c.
The if statement there appears to be intended to check whether
htmlParseStartTag() failed. It compares the current tag name and depth
with their values before the call, and assumes that htmlParseStartTag()
failed if both are unchanged. However, both are also unchanged after a
successful call when the second <td> in the document above is parsed.
The name is again "td", and the depth is the same because a td start
tag is defined to close any open td (in htmlStartClose).

The result is that the parser is left in the wrong state for parsing the
comment, and that's why the "invalid element name" error occurs.
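
To illustrate the false failure detection, here is a minimal toy model in
C. It is only a sketch and not libxml2 code: ToyParser, toy_start_tag()
and toy_looks_failed() are hypothetical names, and the only auto-close
rule modelled is "a td start tag closes an open td".

```c
#include <assert.h>
#include <string.h>

/* Toy model only: these names are hypothetical and are not libxml2 code. */
#define MAX_DEPTH 16

typedef struct {
    const char *names[MAX_DEPTH]; /* stack of open element names */
    int depth;                    /* number of open elements */
} ToyParser;

/* Name of the innermost open element, or NULL when the stack is empty. */
static const char *toy_current(const ToyParser *p) {
    return p->depth > 0 ? p->names[p->depth - 1] : NULL;
}

/* Open an element. Mimicking the auto-close rule for td, an open <td>
 * is closed first when another <td> starts. */
static void toy_start_tag(ToyParser *p, const char *name) {
    const char *cur = toy_current(p);
    if (cur != NULL && strcmp(cur, "td") == 0 && strcmp(name, "td") == 0)
        p->depth--;               /* implicit </td> */
    assert(p->depth < MAX_DEPTH);
    p->names[p->depth++] = name;
}

/* The heuristic described above: treat "same name and same depth as
 * before the call" as a failed start tag. Returns 1 for "looks failed". */
static int toy_looks_failed(ToyParser *p, const char *name) {
    const char *before = toy_current(p);
    int depth_before = p->depth;
    toy_start_tag(p, name);
    const char *after = toy_current(p);
    return depth_before == p->depth && before != NULL && after != NULL
        && strcmp(before, after) == 0;
}
```

In this model, calling toy_looks_failed() twice with "td" on an empty
parser reports failure on the second call, even though the second start
tag was handled correctly and the stack holds exactly one open td.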

The attached patch fixes the bug by making htmlParseStartTag() return 0
on success and -1 on error, and by replacing the comparison of tag name
and depth with a check of that return value.
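
As a rough sketch of that return-value convention (not the patch itself):
the functions below are hypothetical stand-ins, and the "starts with a
letter" rule is an assumption made only to have something that can fail.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for a start-tag parser; not the libxml2 function.
 * Returns 0 when `name` begins like an element name, -1 otherwise. */
static int sketch_parse_start_tag(const char *name) {
    if (name == NULL || !isalpha((unsigned char)name[0]))
        return -1;   /* invalid element name */
    return 0;        /* start tag accepted */
}

/* The caller decides on the return value alone, with no before/after
 * comparison of name and depth. Returns 1 if it should go on to parse
 * the element content, 0 if the start tag failed. */
static int sketch_should_parse_content(const char *name) {
    int rc = sketch_parse_start_tag(name);
    assert(rc == 0 || rc == -1);  /* documented contract */
    return rc == 0;
}
```

With an explicit status, a repeated "td" is reported as a success no
matter what the element stack looked like before the call.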

James

Attachment: HTMLparser.patch
Description: Text Data


