Re: [xml] Changes which might be required "Not converting names to lower-case in HTML parsing"

From: Daniel Veillard <veillard redhat com>
To: GPN <gpn libxml gmail com>
Cc: xml gnome org
Subject: Re: [xml] Changes which might be required "Not converting names to lower-case in HTML parsing"
Date: Mon, 10 Oct 2005 10:30:54 -0400

On Mon, Oct 10, 2005 at 06:13:19PM +0530, GPN wrote:

Based on your inputs above, I am assuming that you are referring to
the options:
- enum xmlParserOption defined in include/libxml/parser.h
- enum htmlParserOption defined in include/libxml/HTMLparser.h


  the second one, yes

htmlCtxtUseOptions() does the following -
a) Normalize the HTML options to XML options
   Probably to reflect the options in the core parsing engine
b) Sets some members of the context structure
   Probably for ease of condition checking.

"HTML_PARSE_RETAINCASE" could be added as an additional option,
but it need not be reflected as a core XML parsing option.
Do we need to add a member to the context structure (something
like retainCase)?


  no, the remaining options should be kept in ctxt->options

Do these checks have to be made conditional? For example:
  if (options & HTML_PARSE_RETAINCASE) {
    if (!xmlStrcasecmp()) {
      /* Code segment */
    }
  } else {
    if (xmlStrEqual()) {
      /* Code segment */
    }
  }


  yes, if (ctxt->options & HTML_PARSE_RETAINCASE) ... the code segment
should not be duplicated of course, the conditional should be unified.
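
A minimal sketch of what a unified conditional could look like, using
only the C standard library. HTML_PARSE_RETAINCASE is the option
proposed in this thread, not an existing libxml2 flag; its bit value,
the sketchCtxt struct and sketchNameMatches() are hypothetical
stand-ins for illustration. The choice of comparison mode follows the
snippet quoted above (case-insensitive when the flag is set).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical: the option proposed in this thread, with an arbitrary bit. */
#define HTML_PARSE_RETAINCASE (1 << 30)

/* Hypothetical stand-in for the parser context; only options matters here. */
typedef struct {
    int options;
} sketchCtxt;

/*
 * Returns 1 when the two names match under the current options, else 0.
 * The flag is tested once, so the code that acts on a match does not
 * have to be duplicated in two branches.
 */
static int
sketchNameMatches(const sketchCtxt *ctxt, const char *a, const char *b)
{
    int ignoreCase;

    assert(ctxt != NULL && a != NULL && b != NULL);
    ignoreCase = (ctxt->options & HTML_PARSE_RETAINCASE) != 0;

    for (; *a != '\0' && *b != '\0'; a++, b++) {
        unsigned char ca = (unsigned char) *a;
        unsigned char cb = (unsigned char) *b;

        if (ignoreCase) {
            ca = (unsigned char) tolower(ca);
            cb = (unsigned char) tolower(cb);
        }
        if (ca != cb)
            return 0;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == *b;
}
```

A caller would then write a single `if (sketchNameMatches(ctxt, name,
"body")) { ... }` around the shared code segment.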

- In htmlParseName(), the condition which checks if the
current character is upper-case, and which transforms
it needs to be removed. Name can be stored as it is.



 No. That would have to be made conditional on a special parsing
option. There are also a number of tables indexed by the lowercase
name, and those will need to be preserved.

I hope the inclusion of the new option satisfies this comment.
But, I am concerned about which tables might need to be taken care
of, so that the engine is not broken.


  you will have to also generate the lower case version of the name
and use it for lookup in those tables.
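
As a rough illustration of that idea: the name is kept as parsed, and
a lowercase copy is produced once and used only for the table lookup.
The table contents, the 32-byte buffer, sketchLowerName() and
sketchLookupTag() are all hypothetical and do not mirror the real
tables in HTMLparser.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a parser table keyed by lowercase names. */
static const char *const sketchKnownTags[] = { "body", "head", "html", "p" };

/* Copies name into buf in lowercase; returns 0 if buf is too small. */
static int
sketchLowerName(const char *name, char *buf, size_t size)
{
    size_t i;

    assert(name != NULL && buf != NULL);
    if (size == 0)
        return 0;
    for (i = 0; name[i] != '\0'; i++) {
        if (i + 1 >= size)
            return 0;
        buf[i] = (char) tolower((unsigned char) name[i]);
    }
    buf[i] = '\0';
    return 1;
}

/* Returns the table index for name, or -1 if the name is not in the table. */
static int
sketchLookupTag(const char *name)
{
    char lower[32];
    size_t i;

    /* One conversion up front, then cheap exact comparisons. */
    if (!sketchLowerName(name, lower, sizeof(lower)))
        return -1;
    for (i = 0; i < sizeof(sketchKnownTags) / sizeof(sketchKnownTags[0]); i++) {
        if (strcmp(lower, sketchKnownTags[i]) == 0)
            return (int) i;
    }
    return -1;
}
```

With this shape, a name stored as "BODY" still resolves to the "body"
entry, while the original spelling stays available to the caller.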

- In other parts of the code (only in HTMLparser.c), the
comparisons using xmlStrEqual() for names need to be
replaced by xmlStrcaseEqual().



 I.e. it makes a lot of costly calls instead of one costly call and a
number of cheap ones. I disagree with this approach.

I hope this is also answered above. xmlStrcaseEqual() will not be
used.

I did make these changes, and tested once. I found that some tags
go missing during the parse. In particular, the "body" tag seems
to be dropped. Probably, this is because I haven't taken care of
the tables which you have mentioned above.


 Well, I can't tell ..

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
