commit 4cf5e24cba905cc5ec3dd33c2c46807e5df838d9
Author: Olli Pottonen
Date:   Sat Jul 4 08:52:08 2015 +1000

    Implement HTML5 encoding detection algorithm.

diff --git a/HTMLparser.c b/HTMLparser.c
index a302500..82545cb 100644
--- a/HTMLparser.c
+++ b/HTMLparser.c
@@ -345,6 +345,342 @@ htmlNodeInfoPop(htmlParserCtxtPtr ctxt)
     if (l == 1) b[i++] = (xmlChar) v;                                  \
     else i += xmlCopyChar(l,&b[i],v)
 
+
+/* HTML 5 REC encoding sniffing algorithm */
+
+typedef struct {
+    const xmlChar *cur;
+    const xmlChar *end;
+    xmlChar *res;
+} _encSniffState;
+
+static inline int isHtmlSpace(xmlChar c) {
+    return((c == '\t') || (c == '\n') || (c == '\f') ||
+           (c == '\r') || (c == ' '));
+}
+
+static inline int isAsciiAlpha(xmlChar c) {
+    return(('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z'));
+}
+
+typedef struct {
+    const xmlChar *name;
+    int nameLen;
+    const xmlChar *value;
+    int valueLen;
+} _encSniffAttribute;
+
+/**
+ * encSniffGetAttribute:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Return attribute contained in a tag, if any.
+ */
+static _encSniffAttribute * encSniffGetAttribute(_encSniffState *state) {
+    static _encSniffAttribute res;
+    res.value = NULL;
+    res.valueLen = 0;
+
+    const xmlChar *cur = state->cur, *end = state->end;
+    while (cur < end &&
+           (isHtmlSpace(cur[0]) || cur[0] == '/'))
+        cur++;
+    if (cur >= end || cur[0] == '>') {
+        state->cur = cur +1;
+        return(NULL);
+    }
+
+    res.name = cur;
+    while (cur < end) {
+        if ((cur[0] == '/') || (cur[0] == '>')) {
+            res.nameLen = cur - res.name;
+            state->cur = cur+1;
+            return(&res);
+        }
+
+        if (isHtmlSpace(cur[0]) || (cur[0] == '=' && cur > res.name))
+            break;
+        cur++;
+    }
+    res.nameLen = cur - res.name;
+
+    if (cur >= end) {
+        state->cur = cur;
+        return(NULL);
+    }
+
+    while (cur < end && isHtmlSpace(cur[0]))
+        cur++;
+    if (cur >= end || cur[0] != '=') {
+        state->cur = cur;
+        return(NULL);
+    }
+    cur++;
+    while (cur < end && isHtmlSpace(cur[0]))
+        cur++;
+
+    if (cur >= end) {
+        state->cur = cur;
+        return(NULL);
+    }
+
+    if ((cur[0] == '\'') || (cur[0] == '"')) {
+        xmlChar quote_char = cur[0];
+        res.value = ++cur;
+
+        while (cur < end && cur[0] != quote_char)
+            cur++;
+
+        res.valueLen = cur - res.value;
+        state->cur = cur +1;
+    } else if(cur[0] != '>') {
+        res.value = cur;
+
+        while (cur < end && !isHtmlSpace(cur[0]) && (cur[0] != '>'))
+            cur++;
+        res.valueLen = cur - res.value;
+        state->cur = cur;
+    }
+    return((cur >= end) ? NULL : &res);
+}
+
+/**
+ * encSniffEncodingFromMeta:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Find encoding from a meta tag such as
+ * "Content-Type: text/html; charset=ascii".
+ */
+static int encSniffEncodingFromMeta(const xmlChar *s, const xmlChar *end,
+                                    const xmlChar **res, int *res_len) {
+    for(;;) {
+        while(s + 7 < end && xmlStrncasecmp(s, BAD_CAST "charset", 7))
+            s++;
+        if (s + 7 >= end)
+            return(0);
+        s += 7;
+
+        while (s < end && isHtmlSpace(s[0]))
+            s++;
+        if (s >= end)
+            return(0);
+
+        if (s[0] == '=')
+            break;
+    }
+    s++;
+
+    while (s < end && isHtmlSpace(s[0]))
+        s++;
+    if (s >= end)
+        return(0);
+
+    const xmlChar *start;
+    if (s[0] == '\'' || s[0] == '"') {
+        xmlChar quote_char = s[0];
+        start = ++s;
+        while (s < end && s[0] != quote_char)
+            s++;
+    } else {
+        start = s;
+        while (s < end && !isHtmlSpace(s[0]) && s[0] != ';')
+            s++;
+    }
+    if (s >= end)
+        return(0);
+    while (start < s && isHtmlSpace(start[0]) )
+        start++;
+    *res = start;
+    *res_len = s - start;
+    return(1);
+}
+
+/**
+ * encSniffScanMeta:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Scan a meta tag and try to find encoding declaration.
+ */
+static int encSniffScanMeta(_encSniffState *state) {
+    const xmlChar *cur = state->cur, *end = state->end;
+    if (cur + 5 >= end ||
+        (cur[0] != '<') ||
+        ((cur[1] != 'm') && (cur[1] != 'M')) ||
+        ((cur[2] != 'e') && (cur[2] != 'E')) ||
+        ((cur[3] != 't') && (cur[3] != 'T')) ||
+        ((cur[4] != 'a') && (cur[4] != 'A')) ||
+        (!isHtmlSpace(cur[5]) && cur[5] != '/'))
+        return 0;
+
+    state->cur += 6;
+    int gotPragma = 0, needPragma = -1;
+    const xmlChar *charset = NULL;
+    int charset_len = 0;
+
+    for(;;) {
+        _encSniffAttribute *attr = encSniffGetAttribute(state);
+
+        if (attr == NULL) {
+            break;
+        } else if ((attr->nameLen == 10) &&
+                   !xmlStrncasecmp(attr->name, BAD_CAST "http-equiv", 10) &&
+                   (attr->valueLen == 12) &&
+                   !xmlStrncasecmp(attr->value, BAD_CAST "content-type", 12)) {
+            gotPragma = 1;
+        } else if ((attr->nameLen == 7) &&
+                   !xmlStrncasecmp(attr->name, BAD_CAST "content", 7) &&
+                   charset == NULL) {
+            if (encSniffEncodingFromMeta(attr->value,
+                                         attr->value + attr->valueLen,
+                                         &charset, &charset_len) ) {
+                needPragma = 1;
+            }
+        } else if ((attr->nameLen == 7) &&
+                   !xmlStrncasecmp(attr->name, BAD_CAST "charset", 7)) {
+            charset = attr->value;
+            charset_len = attr->valueLen;
+            needPragma = 0;
+        }
+    }
+
+    if (needPragma && !gotPragma)
+        charset = NULL;
+
+    if ((charset_len == 6 && !xmlStrncasecmp(charset, BAD_CAST "UTF-16", 6)) ||
+        (charset_len == 8 && !xmlStrncasecmp(charset, BAD_CAST "UTF-16LE", 8)) ||
+        (charset_len == 8 && !xmlStrncasecmp(charset, BAD_CAST "UTF-16BE", 8))) {
+        charset = BAD_CAST "UTF-8";
+        charset_len = 5;
+    }
+
+    state->res = xmlStrndup(charset, charset_len);
+    return(1);
+}
+
+/**
+ * encSniffSkipComment:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Skip comment, if any.
+ */
+static int encSniffSkipComment(_encSniffState *state) {
+    const xmlChar *cur = state->cur, *end = state->end;
+    if ((cur + 3 >= end) || (cur[0] != '<') || (cur[1] != '!') ||
+        (cur[2] != '-') || (cur[3] != '-'))
+        return(0);
+
+    cur += 2;
+    while (cur + 2 < end &&
+           ((cur[0] != '-') || (cur[1] != '-') || (cur[2] != '>')))
+        cur++;
+
+    /* Resume just past the closing "-->". */
+    state->cur = cur + 3;
+    return(1);
+}
+
+/**
+ * encSniffSkipTag:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Skip element tag, if any.
+ */
+static int encSniffSkipTag(_encSniffState *state) {
+    const xmlChar *cur = state->cur, *end = state->end;
+
+    int startfound =
+        ((cur + 1 < end) && (cur[0] == '<') && isAsciiAlpha(cur[1])) ||
+        ((cur + 2 < end) && (cur[0] == '<') && (cur[1] == '/') &&
+         isAsciiAlpha(cur[2]));
+
+    if (!startfound)
+        return(0);
+
+    while (cur < end && !isHtmlSpace(cur[0]) && cur[0] != '>')
+        cur++;
+    state->cur = cur;
+
+
+    while (state->cur < end && encSniffGetAttribute(state))
+        ;
+
+    return(1);
+}
+
+/**
+ * encSniffSkipMisc:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Skip doctype declaration, processing instruction or SGML style
+ * element tag, if any.
+ */
+static int encSniffSkipMisc(_encSniffState *state) {
+    const xmlChar *cur = state->cur, *end = state->end;
+
+    if ((cur + 1 >= end) ||
+        (cur[0] != '<') ||
+        (cur[1] != '!' && cur[1] != '/' && cur[1] != '?')) {
+        return(0);
+    }
+
+    while (cur < end && cur[0] != '>')
+        cur++;
+
+    state->cur = cur;
+    return(1);
+}
+
+/**
+ * encSniffSkipContent:
+ * Auxiliary function for W3C HTML 5 REC encoding sniffing.
+ *
+ * Skip content until the next start tag, comment, PI or
+ * doctype declaration.
+ */
+static void encSniffSkipContent(_encSniffState *state) {
+    ++state->cur;
+    while (state->cur < state->end && state->cur[0] != '<')
+        ++state->cur;
+}
+
+/**
+ * html5FindEncoding:
+ * @ctxt: the HTML parser context
+ *
+ * W3C HTML 5 Recommendation algorithm to prescan a byte stream to
+ * determine its encoding.
+ *
+ * Returns an encoding string or NULL if not found. The string needs to
+ * be freed.
+ */
+static const char *
+html5FindEncoding(xmlParserCtxtPtr ctxt) {
+    if ((ctxt == NULL) || (ctxt->input == NULL) ||
+        (ctxt->input->encoding != NULL) || (ctxt->input->buf == NULL) ||
+        (ctxt->input->buf->encoder != NULL))
+        return(NULL);
+    if ((ctxt->input->cur == NULL) || (ctxt->input->end == NULL))
+        return(NULL);
+
+    const xmlChar *end = ctxt->input->cur + 4096;
+    end = (end < ctxt->input->end) ? end : ctxt->input->end;
+    _encSniffState state = {ctxt->input->cur, end, NULL};
+
+    while (state.cur < end && state.res == NULL) {
+        if (encSniffSkipComment(&state))
+            continue;
+        if (encSniffScanMeta(&state))
+            continue;
+        if (encSniffSkipTag(&state))
+            continue;
+        if (encSniffSkipMisc(&state))
+            continue;
+        encSniffSkipContent(&state);
+    }
+
+    return((const char *) state.res);
+}
+
 /**
  * htmlFindEncoding:
  * @the HTML parser context
@@ -510,10 +846,12 @@ htmlCurrentChar(xmlParserCtxtPtr ctxt, int *len) {
              * Humm this is bad, do an automatic flow conversion
              */
             {
-                xmlChar * guess;
+                xmlChar * guess = NULL;
                 xmlCharEncodingHandlerPtr handler;
 
-                guess = htmlFindEncoding(ctxt);
+                if ((ctxt->options & HTML_HTML5_ENC_SNIFF) == 0) {
+                    guess = htmlFindEncoding(ctxt);
+                }
                 if (guess == NULL) {
                     xmlSwitchEncoding(ctxt, XML_CHAR_ENCODING_8859_1);
                 } else {
@@ -3574,7 +3912,8 @@ htmlCheckMeta(htmlParserCtxtPtr ctxt, const xmlChar **atts) {
     int http = 0;
     const xmlChar *content = NULL;
 
-    if ((ctxt == NULL) || (atts == NULL))
+    if ((ctxt == NULL) || (atts == NULL) ||
+        (ctxt->options & HTML_HTML5_ENC_SNIFF))
         return;
 
     i = 0;
@@ -6595,6 +6934,10 @@ htmlCtxtUseOptions(htmlParserCtxtPtr ctxt, int options)
         ctxt->options |= HTML_PARSE_NOIMPLIED;
         options -= HTML_PARSE_NOIMPLIED;
     }
+    if (options & HTML_HTML5_ENC_SNIFF) {
+        ctxt->options |= HTML_HTML5_ENC_SNIFF;
+        options -= HTML_HTML5_ENC_SNIFF;
+    }
     ctxt->dictNames = 0;
     return (options);
 }
@@ -6619,6 +6962,13 @@ htmlDoRead(htmlParserCtxtPtr ctxt, const char *URL, const char *encoding,
 
     htmlCtxtUseOptions(ctxt, options);
     ctxt->html = 1;
+
+    int free_encoding = 0;
+    if (options & HTML_HTML5_ENC_SNIFF) {
+        encoding = html5FindEncoding(ctxt);
+        free_encoding = 1;
+    }
+
     if (encoding != NULL) {
         xmlCharEncodingHandlerPtr hdlr;
 
@@ -6630,6 +6980,9 @@ htmlDoRead(htmlParserCtxtPtr ctxt, const char *URL, const char *encoding,
                 ctxt->input->encoding = xmlStrdup((xmlChar *)encoding);
         }
     }
+    if (free_encoding && encoding != NULL)
+        xmlFree((xmlChar *) encoding);
+
     if ((URL != NULL) && (ctxt->input != NULL) &&
         (ctxt->input->filename == NULL))
         ctxt->input->filename = (char *) xmlStrdup((const xmlChar *) URL);
diff --git a/include/libxml/HTMLparser.h b/include/libxml/HTMLparser.h
index 551186c..5c06351 100644
--- a/include/libxml/HTMLparser.h
+++ b/include/libxml/HTMLparser.h
@@ -185,7 +185,8 @@ typedef enum {
     HTML_PARSE_NONET    = 1<<11,/* Forbid network access */
     HTML_PARSE_NOIMPLIED= 1<<13,/* Do not add implied html/body... elements */
     HTML_PARSE_COMPACT  = 1<<16,/* compact small text nodes */
-    HTML_PARSE_IGNORE_ENC=1<<21 /* ignore internal document encoding hint */
+    HTML_PARSE_IGNORE_ENC=1<<21,/* ignore internal document encoding hint */
+    HTML_HTML5_ENC_SNIFF= 1<<23 /* use HTML5 encoding sniffing algorithm */
 } htmlParserOption;
 
 XMLPUBFUN void XMLCALL