Re: [xml] Support of HTML v5 parsing

From: Daniel Veillard <veillard redhat com>
To: Yuriy Ustushenko <yoreek yahoo com>
Cc: libxml gnome <xml gnome org>
Subject: Re: [xml] Support of HTML v5 parsing
Date: Wed, 8 Jul 2015 18:46:43 +0800

On Wed, Jul 08, 2015 at 12:55:31PM +0300, Yuriy Ustushenko wrote:

On 07/08/2015 07:10 AM, Daniel Veillard wrote:

That looks like a very good start; it would have been better if the parser
context didn't need tweaking as well as xmlDtd. Also, I'm not sure about the
way to detect HTML5:

+    if (name != NULL && !xmlStrcasecmp(name, BAD_CAST "HTML")) {
+        if (ExternalID == NULL && ((SystemID == NULL) ||
+            !xmlStrcasecmp(SystemID, BAD_CAST "about:legacy-compat"))) {
+            cur->html_schema = &html5Schema;

 seems a bit too inclusive. It looks like we would default to html5 each time
there is a URI for the systemID, which a lot of HTML4 documents have.


I agree with you, but I have no good idea how to do it.


  I guess we need to infer based on the DOCTYPE, but it's rather ugly

http://www.w3.org/TR/html5/syntax.html#the-doctype
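
As a rough sketch of what DOCTYPE-based inference could look like: the enum,
the helper names and classify_doctype() below are hypothetical and are not
libxml2 API. The sketch assumes that a PUBLIC identifier starting with
"-//W3C//DTD HTML 4" means HTML4, that a DOCTYPE named "html" with no PUBLIC
identifier and either no SYSTEM identifier or "about:legacy-compat" means
HTML5 (compared case-insensitively, as in the patch), and that anything else
is left undecided.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type; not part of libxml2. */
typedef enum {
    DOCTYPE_UNKNOWN = 0,
    DOCTYPE_HTML4,
    DOCTYPE_HTML5
} doctype_kind;

/* ASCII case-insensitive comparison of at most n bytes.
 * Returns 1 if the first n bytes match (or both strings end earlier). */
static int ascii_ncaseeq(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned char ca = (unsigned char) a[i];
        unsigned char cb = (unsigned char) b[i];
        if (tolower(ca) != tolower(cb))
            return 0;
        if (ca == '\0')
            return 1;
    }
    return 1;
}

/* ASCII case-insensitive comparison of two whole strings. */
static int ascii_caseeq(const char *a, const char *b)
{
    size_t la = strlen(a);
    return la == strlen(b) && ascii_ncaseeq(a, b, la);
}

/* Classify a DOCTYPE from its name, PUBLIC id and SYSTEM id.
 * Any argument may be NULL when the corresponding part is absent. */
doctype_kind classify_doctype(const char *name,
                              const char *public_id,
                              const char *system_id)
{
    /* Assumption: every HTML 4.x PUBLIC id shares this prefix. */
    static const char html4_prefix[] = "-//W3C//DTD HTML 4";

    if (name == NULL || !ascii_caseeq(name, "html"))
        return DOCTYPE_UNKNOWN;

    if (public_id != NULL) {
        if (ascii_ncaseeq(public_id, html4_prefix, sizeof(html4_prefix) - 1))
            return DOCTYPE_HTML4;
        return DOCTYPE_UNKNOWN;   /* e.g. XHTML or HTML 3.2: undecided here */
    }

    /* No PUBLIC id: only a bare DOCTYPE or the legacy-compat string counts. */
    if (system_id == NULL || ascii_caseeq(system_id, "about:legacy-compat"))
        return DOCTYPE_HTML5;

    return DOCTYPE_UNKNOWN;
}
```

With this, classify_doctype("html", NULL, NULL) gives DOCTYPE_HTML5, while a
DOCTYPE carrying some other SYSTEM URI and no PUBLIC identifier stays
DOCTYPE_UNKNOWN instead of being treated as HTML5.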

the problem is that

<html>
<body>
 ....
</body>
</html>

can be either, and we can only detect which when we hit a problem.
I would be tempted to parse using html4 when we don't know (unless
we see an HTML4 DOCTYPE SYSTEM or PUBLIC). If we find a problem
using the html4 schemas and it is avoided by the html5 schemas, then
switch.
I.e. late detection, assuming we receive html4; since it's mostly a subset
of html5, that sounds safer.
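
A minimal sketch of that late-detection idea, under stated assumptions: the
types and functions below (schema_kind, late_detect, late_detect_init,
late_detect_on_element) are hypothetical and not libxml2 API, the element
table is a small illustrative sample of elements HTML5 adds over HTML4 rather
than a full schema, element names are assumed to be lowercased already, and an
explicit HTML4 DOCTYPE is read here as pinning the parse to html4.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical schema selector; not part of libxml2. */
typedef enum { SCHEMA_HTML4, SCHEMA_HTML5 } schema_kind;

typedef struct {
    schema_kind schema;  /* schema currently used for validation */
    int pinned;          /* nonzero: explicit HTML4 DOCTYPE seen, never switch */
} late_detect;

/* Illustrative sample only: some elements defined by HTML5 but not HTML4. */
static const char *const html5_added_elements[] = {
    "article", "aside", "audio", "canvas", "figcaption", "figure",
    "footer", "header", "main", "nav", "section", "video"
};

static int is_html5_addition(const char *name)
{
    size_t n = sizeof(html5_added_elements) / sizeof(html5_added_elements[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, html5_added_elements[i]) == 0)
            return 1;
    }
    return 0;
}

/* Start out assuming html4, as suggested above. */
void late_detect_init(late_detect *st, int saw_html4_doctype)
{
    st->schema = SCHEMA_HTML4;
    st->pinned = saw_html4_doctype;
}

/* Feed one start-tag name; returns the schema in effect afterwards.
 * The switch happens only when html4 would reject the element and
 * html5 would accept it, and it is never undone. */
schema_kind late_detect_on_element(late_detect *st, const char *name)
{
    if (st->schema == SCHEMA_HTML4 && !st->pinned && is_html5_addition(name))
        st->schema = SCHEMA_HTML5;
    return st->schema;
}
```

A real implementation would also have to decide what to do with content
already handled under the html4 schemas before the switch, which this sketch
does not address.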

That will require tweaking for sure !


Thanks for review.


 and thanks again for the patch :-)

Daniel
  

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/

