Re: [xml] libxml2 equivalents for expat's XML_GetCurrentByteIndex and XML_GetCurrentByteCount

From: Daniel Veillard <veillard redhat com>
To: Graham Leggett <minfrin sharp fm>
Cc: xml gnome org
Subject: Re: [xml] libxml2 equivalents for expat's XML_GetCurrentByteIndex and XML_GetCurrentByteCount
Date: Fri, 19 Oct 2012 11:20:01 +0800

On Thu, Oct 18, 2012 at 06:19:31PM +0200, Graham Leggett wrote:

> On 18 Oct 2012, at 6:07 PM, Daniel Veillard <veillard redhat com> wrote:
>
> >  See xmlByteConsumed() but it's more complex for us than for expat
> > as we convert the initial byte stream to UTF-8 if it was in a different
> > encoding. See the xmlByteConsumed() code.
>
> The docs say "This function provides the current index of the parser
> relative to the start of the current entity.", when it says "current
> index of the parser" what exactly does this point to? The start of the
> element? The character following the end of the element? Something else?


  That depends on when you ask!

> > I don't understand what
> > "the length of the element" is supposed to mean.
>
> The length of the element is the distance from the start of the element,
> to the end of the element. For example, if the element was
> '<body  id="foo">' the length would be 16 (note the extra space between
> body and id). The expat function that gives you this is
> XML_GetCurrentByteCount().


  You seem to have a very perverse definition of what an element is:

  <body  id="foo"> .... </body>

By definition an element ends with the ETag, the end tag, if not empty:
  http://www.w3.org/TR/REC-xml/#NT-element
What you are referencing is actually the start and the end of the
start tag STag
  http://www.w3.org/TR/REC-xml/#NT-STag

Please avoid inventing terms. The spec is out there it defines the
terminology precisely.

Assuming you call the function in a start element SAX callback,
xmlByteConsumed() will point just after the '>' at the end of the start
tag. You should be able to find the corresponding '<' in
ctxt->input->base by progressing backward from ctxt->input->cur,
which is the current index of the parser. Then you can get the length
of the start tag in UTF-8 encoding, and from there find the length
of the start tag in the original document encoding, and then you can
subtract it from xmlByteConsumed() to get the second value you want.
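
A rough sketch of those steps, written against plain pointers so it
compiles without libxml2. The helper names are made up for illustration
and are not libxml2 API; in a real callback `base` and `cur` would be
ctxt->input->base and ctxt->input->cur, and `consumed` the value returned
by xmlByteConsumed(). It assumes `cur` really sits just past the '>' and
that the whole start tag is still in the input buffer. UTF-16 is used
only as an example of an original encoding; other encodings need their
own length computation.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Scan backward from `cur` (assumed to sit just past the '>' closing the
 * start tag) to the nearest '<', never going below `base`.
 * Returns the tag length in bytes of the UTF-8 buffer, or 0 if no '<'
 * is found. A literal '<' is not allowed inside an attribute value in
 * well-formed XML, so the first '<' met going backward opens the tag.
 */
static size_t start_tag_utf8_len(const unsigned char *base,
                                 const unsigned char *cur)
{
    const unsigned char *p = cur;

    assert(base != NULL && cur >= base);
    while (p > base) {
        p--;
        if (*p == '<')
            return (size_t)(cur - p);
    }
    return 0;
}

/*
 * Number of bytes the same characters occupy in UTF-16 (example of an
 * "original document encoding"). Assumes the span is valid UTF-8.
 */
static size_t utf8_span_utf16_bytes(const unsigned char *s, size_t len)
{
    size_t bytes = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char c = s[i];
        if ((c & 0xC0) == 0x80)
            continue;                 /* continuation byte: skip */
        bytes += (c >= 0xF0) ? 4 : 2; /* 4-byte lead => surrogate pair */
    }
    return bytes;
}

/*
 * Offset of the '<' in the original document: the consumed byte count
 * minus the tag length measured in the original encoding.
 */
static long start_tag_offset(long consumed, size_t original_len)
{
    return consumed - (long)original_len;
}
```

For a document that was already UTF-8 the two lengths are the same, so
the result of start_tag_utf8_len() can be passed straight to
start_tag_offset().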

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

