Re: [xml] libxml2 equivalents for expat's XML_GetCurrentByteIndex and XML_GetCurrentByteCount

From: Daniel Veillard <veillard redhat com>
To: Graham Leggett <minfrin sharp fm>
Cc: xml gnome org
Subject: Re: [xml] libxml2 equivalents for expat's XML_GetCurrentByteIndex and XML_GetCurrentByteCount
Date: Sat, 20 Oct 2012 09:16:17 +0800

On Fri, Oct 19, 2012 at 04:57:14PM +0200, Graham Leggett wrote:

On 19 Oct 2012, at 5:20 AM, Daniel Veillard <veillard redhat com> wrote:

The docs say "This function provides the current index of the parser relative to the start of the
current entity." When it says "current index of the parser", what exactly does this point to? The start
of the element? The character following the end of the element? Something else?


 That depends on when you ask!


I would be asking in the middle of a SAX event. In other words, I would be saying "this event you just
called for, where does the data start in the original raw stream, and where does it end?"


  Then I can't give you a generic answer. I looked at the element
start callback, because that looked like what you were interested in,
but for text, CDATA, entity, PI and comment callbacks you will have to
do the same analysis on a case-by-case basis.

[...]

Assuming you call the function in a start element SAX callback, you will
get xmlByteConsumed() pointing just after the '>' at the end of the start
tag. You should be able to find the corresponding '<' in
ctxt->input->base by progressing backward from ctxt->input->cur,
which is the current index of the parser. Then you can get the length
of the start tag in UTF-8 encoding, and from there find the length
of the start tag in the original document encoding, and then you can
subtract it from xmlByteConsumed() to get the second value you want.
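
As a rough sketch of the backward scan described above (this is not libxml2
code; the helper names start_tag_utf8_len and utf8_codepoints are made up for
illustration), a start element callback could pass ctxt->input->base and
ctxt->input->cur to something like the following. It assumes cur sits just
past the closing '>' and that no literal '<' appears inside the tag, which
holds for well-formed XML but not necessarily for HTML.

```c
#include <stddef.h>

/*
 * Hypothetical helper: walk backward from cur (assumed to point one byte
 * past the '>' of the start tag) until the nearest '<' is found.
 * Returns the tag length in bytes within the UTF-8 buffer, or 0 if no '<'
 * exists between base and cur.
 */
static size_t start_tag_utf8_len(const unsigned char *base,
                                 const unsigned char *cur)
{
    const unsigned char *p = cur;

    while (p > base) {
        p--;
        if (*p == '<')
            return (size_t)(cur - p);
    }
    return 0;
}

/*
 * Hypothetical helper: count the code points in n bytes of UTF-8 by
 * skipping continuation bytes (those of the form 10xxxxxx).
 * For a single-byte source encoding such as ISO-8859-1, this count is
 * also the tag length in bytes in the original document.
 */
static size_t utf8_codepoints(const unsigned char *s, size_t n)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

With the tag length expressed in the original encoding, the start offset would
be xmlByteConsumed(ctxt) minus that length. The code point count only equals
the original byte length for single-byte encodings; multi-byte or
variable-width encodings need their own per-character widths.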


Am I right in understanding that ctxt->input->base points at the buffer previously passed to
htmlParseChunk(), or can I expect this buffer to have been generated internally by libxml somewhere using
malloc or something else?


  No, there can't be any guarantee about this. The libxml2 API guarantees
you a UTF-8 stream of markup and data; expat SAX doesn't do that for you
and just passes you whatever encoding the document was in.
  libxml2 may convert the encoding, as I said in my first reply, so it's
obviously not the same buffer then!
  In general libxml2 will actually copy data into its own buffer, so you
can be 100% sure you won't get the same area as what you passed in.

  It seems your code depends way too heavily on assumptions of the
expat processing model; nothing there is specific to SAX.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
