Re: [xml] Changes which might be required "Not converting names to lower-case in HTML parsing"



Daniel Veillard wrote:
On Thu, Sep 29, 2005 at 01:22:04PM +0530, GPN wrote:

Daniel Veillard wrote:

I am assuming that pretty much all HTML related functionality
is contained within HTMLparser.c, and the core xml functions
need not change to accommodate this enhancement.


This enhancement will have to be conditional on a parsing option.

Daniel, Can you expand on this?


  Sure:
    when you create an HTML parsing context, for example with htmlReadFile,
you pass an options argument, which is a set of OR'ed htmlParserOption
values. There are already a few of them; another one would need to be
added to preserve case.
  Then the places where you want to make the changes should test
ctxt->options for that new option.
  This lets you plug in your new behaviour without disturbing the default
one.
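
A minimal sketch of that pattern, to make the idea concrete. It does not
use the real libxml2 types: demoParserCtxt, DEMO_PARSE_RETAINCASE and
demoStoreName() are hypothetical stand-ins, and the bit value is an
arbitrary assumption, not a value reserved in htmlParserOption.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real libxml2 definitions. */
enum demoParserOption {
    DEMO_PARSE_RETAINCASE = 1 << 20   /* assumed free bit, for illustration */
};

typedef struct {
    int options;                      /* OR'ed demoParserOption values */
} demoParserCtxt;

/*
 * Copy a tag name into out (at most outlen - 1 characters).
 * Default behaviour lowercases the name; the original case is kept
 * only when the caller asked for it through the option bit.
 */
void demoStoreName(const demoParserCtxt *ctxt, const char *in,
                   char *out, size_t outlen)
{
    size_t i = 0;

    if (outlen == 0)
        return;
    for (; in[i] != '\0' && i + 1 < outlen; i++) {
        unsigned char c = (unsigned char) in[i];

        if (ctxt->options & DEMO_PARSE_RETAINCASE)
            out[i] = (char) c;            /* new, opt-in behaviour */
        else
            out[i] = (char) tolower(c);   /* existing default */
    }
    out[i] = '\0';
}
```

With options set to 0 the name "BODY" is stored as "body"; with
DEMO_PARSE_RETAINCASE set it is stored unchanged, so callers that do not
pass the new option see no difference.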

Daniel


I use the following APIs to create a context and subsequently parse
the buffers -
htmlCreatePushParserCtxt()
htmlParseChunk()

Based on your inputs above, I am assuming that you are referring to
the options:
- enum xmlParserOption defined in include/libxml/parser.h
- enum htmlParserOption defined in include/libxml/HTMLparser.h

htmlCtxtUseOptions() does the following -
a) Normalize the HTML options to XML options
   Probably to reflect the options in the core parsing engine
b) Sets some members of the context structure
   Probably for ease of condition checking.

"HTML_PARSE_RETAINCASE" could be added as the additional option,
but need not be reflected as a core XML parsing option.
Do we need to add a member in the context structure (something
like retainCase)?

Reply sent by you earlier -

On Wed, Sep 28, 2005 at 07:25:52PM +0530, GPN wrote:

Hello,
I am working off release 2.6.22, and I am proposing the
following changes to the code.
- A new function xmlStrcaseEqual() might be required in
 xmlstring.c, which can check if the current character
 being parsed is between 'A' and 'Z', and if so compares
 using casemap array as is done in xmlStrcasecmp().


  not needed, just use !xmlStrcasecmp(); the API is too large already

Do these checks have to be made conditional? For example:
  if (options & HTML_PARSE_RETAINCASE) {
    if (!xmlStrcasecmp()) {
      /* Code segment */
    }
  } else {
    if (xmlStrEqual()) {
      /* Code segment */
    }
  }

instead of -

  if (xmlStrEqual())
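
One way to avoid repeating that if/else at every comparison site would be
to wrap it in a single helper. The sketch below is only an illustration
under assumptions: demoNameMatches() and demoNameEqualNoCase() are
hypothetical names, DEMO_PARSE_RETAINCASE is an assumed option bit, and
plain C string functions stand in for xmlStrEqual() and xmlStrcasecmp().

```c
#include <ctype.h>
#include <string.h>

#define DEMO_PARSE_RETAINCASE (1 << 20)   /* hypothetical option bit */

/* Case-insensitive equality; stand-in for !xmlStrcasecmp(a, b). */
int demoNameEqualNoCase(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/*
 * Single decision point for name comparisons.
 * With the option set, names keep their original case, so the
 * comparison must ignore case. Without it, names were already
 * lowercased and the cheaper exact comparison is enough.
 */
int demoNameMatches(int options, const char *name, const char *expected)
{
    if (options & DEMO_PARSE_RETAINCASE)
        return demoNameEqualNoCase(name, expected);
    return strcmp(name, expected) == 0;   /* stand-in for xmlStrEqual() */
}
```

The default path still does only the exact comparison, so the extra cost
is paid only by callers that pass the new option.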

- In htmlParseName(), the condition which checks if the
 current character is upper-case, and which transforms
 it needs to be removed. Name can be stored as it is.


  no. That would have to be conditionalized depending on a special
parsing flag option. There are also a number of tables indexed by
the lowercase name, and those will need to be preserved.

I hope the inclusion of the new option satisfies this comment.
But I am concerned about which tables might need to be taken care
of, so that the engine is not broken.


- In other parts of the code (only in HTMLparser.c), the
 comparisons using xmlStrEqual() for names need to be
 replaced by xmlStrcaseEqual().


  I.e. make a lot of costly calls instead of one costly call and a number
of cheap ones. I disagree with this approach.

I hope this is also answered above. xmlStrcaseEqual() will not be
used.

I did make these changes, and tested once. I found that some tags
go missing during the parse. For example, and in particular,
the "body" tag seems to be dropped. This is probably because
I haven't taken care of the tables which you mentioned above.
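
To illustrate the kind of breakage I suspect: if a table is keyed by
lowercase names and the lookup compares exactly, a retained "BODY" no
longer matches. The sketch below is hypothetical and does not reproduce
the real tables in HTMLparser.c; demoKnownTags and demoTagIndex() are
made-up names. It lowercases only a scratch copy used as the lookup key,
so the stored name can keep its case.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of a table keyed by lowercase element names. */
static const char *const demoKnownTags[] = { "html", "head", "body", "p" };

/*
 * Return the table index for name, or -1 if it is not in the table.
 * The key is lowercased into a local buffer, so the caller's copy of
 * the name is left untouched.
 */
int demoTagIndex(const char *name)
{
    char key[32];
    size_t i, n = strlen(name);

    if (n >= sizeof(key))
        return -1;                  /* too long to be in this table */
    for (i = 0; i < n; i++)
        key[i] = (char) tolower((unsigned char) name[i]);
    key[n] = '\0';

    for (i = 0; i < sizeof(demoKnownTags) / sizeof(demoKnownTags[0]); i++)
        if (strcmp(key, demoKnownTags[i]) == 0)
            return (int) i;
    return -1;
}
```

Here "BODY" and "body" both resolve to the same table entry, which is the
behaviour the lowercase-indexed tables would need to keep.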

Best Regards,
GPN


