[xml] Line-end Normalisation



Hi there.

I can somehow feel that this theme has lived a thousand lives on this
mailing list. Maybe I am blind or I don't see well, but I haven't found
the answer to the following.

There is an article on the web:
http://www.xml.com/axml/target.html#sec-line-ends and it describes how
line-ends should be handled on input. libxml2 does exactly as stated
there, namely, it converts all two-character literals #xD#xA and all
solitary #xD literals to #xA. 
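
To make the input rule concrete, here is a minimal sketch of it in C. This
is not libxml2's actual code; the function name normalize_line_ends is
made up for illustration, and it simply rewrites a NUL-terminated buffer in
place.

```c
#include <stddef.h>

/* Illustrative only, not taken from libxml2.
 * Rewrites s in place so that "\r\n" and any lone "\r" become "\n".
 * Returns the length of the normalised string. */
size_t normalize_line_ends(char *s)
{
    char *src = s;
    char *dst = s;

    while (*src != '\0') {
        if (*src == '\r') {
            *dst++ = '\n';          /* #xD (alone or in a pair) -> #xA */
            if (src[1] == '\n')
                src++;              /* swallow the #xA of a #xD#xA pair */
            src++;
        } else {
            *dst++ = *src++;        /* everything else is copied as-is */
        }
    }
    *dst = '\0';
    return (size_t)(dst - s);
}
```

The output is never longer than the input, which is why the rewrite can be
done in the same buffer.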

Now, let's assume I generate the following DOM in memory:

  <doctag>
    This is the text,   <-- #xD#xA here
    the text this is.   <-- #xD#xA here
  </doctag>

This <doctag/> element has a text child which contains more than one
line of text. The string in memory has line delimiters represented as
#xD#xA. If I now save this thing to a file, examining the file on the
disk reveals #xD#xD#xA line ends. Now I parse this file back into memory
and, following the input conversion rules, I get line-ends represented
as #xA#xA. Saving it again produces #xD#xA#xD#xA line-ends on the disk.
At this point line ends are duplicated and the phenomenon continues from
the beginning: each #xD#xA pair becomes #xD#xD#xA... and so on.
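
The round trip can be modelled without touching the disk. The sketch below
assumes that saving behaves like a text-mode write (every #xA goes out as
#xD#xA) and that parsing applies the input rule quoted earlier. Both
function names are hypothetical and neither is part of libxml2 or the C
runtime.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a text-mode write: every '\n' in memory is
 * emitted as "\r\n". Returns a malloc'd string with the bytes that
 * would end up on disk, or NULL if allocation fails. */
char *text_mode_write(const char *mem)
{
    size_t len = 0, lfs = 0;
    for (const char *p = mem; *p != '\0'; p++) {
        len++;
        if (*p == '\n')
            lfs++;
    }

    char *disk = malloc(len + lfs + 1);
    if (disk == NULL)
        return NULL;

    char *d = disk;
    for (const char *p = mem; *p != '\0'; p++) {
        if (*p == '\n')
            *d++ = '\r';            /* the extra #xD added on output */
        *d++ = *p;
    }
    *d = '\0';
    return disk;
}

/* Hypothetical model of the parser's input rule, applied in place:
 * "\r\n" and any lone "\r" become "\n". */
void parse_normalize(char *s)
{
    char *src = s;
    char *dst = s;
    while (*src != '\0') {
        if (*src == '\r') {
            *dst++ = '\n';
            if (src[1] == '\n')
                src++;
            src++;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

Starting from "a\r\nb" in memory, this model gives "a\r\r\nb" on disk,
"a\n\nb" after parsing, and "a\r\n\r\nb" after the second save, which is
the sequence described above.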

Now, all this happens on a Win32 machine, with libxml2 built on top of
the MS C runtime. We know that text files have CRLF line ends on disk
under Win32, and the MS C runtime converts those to LF and back (for
streams opened in text mode) in functions such as getc, putc, getchar,
putchar and friends. I don't have a UNIX machine handy at the moment to
try it out, so I would appreciate some feedback on whether this happens
there as well.
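
For comparison, standard C lets a program bypass this translation by opening
the stream in binary mode ("wb"/"rb"). The sketch below is only a way to
check what actually lands on disk. The helper names and the file name used
with them are made up, and it makes no claim about which mode libxml2's own
save routines use.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative helper: write text to path in binary mode, so the C
 * runtime performs no line-end translation. Returns the number of
 * bytes written, or -1 on failure. */
long save_binary(const char *path, const char *text)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;

    size_t len = strlen(text);
    size_t written = fwrite(text, 1, len, f);
    if (fclose(f) != 0 || written != len)
        return -1;
    return (long)written;
}

/* Illustrative helper: count the bytes of a file, reading it in
 * binary mode. Returns -1 if the file cannot be opened. */
long file_size_binary(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    long n = 0;
    while (fgetc(f) != EOF)
        n++;
    fclose(f);
    return n;
}
```

Writing "a\r\nb\r\n" with save_binary and then measuring the file with
file_size_binary should report 6 bytes on any platform, since nothing is
inserted or removed along the way.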

Now, the question is: what is to be done? Does libxml2 require all strings
in memory to have #xA line delimiters? If so, that is good, because then I
only need to change my program, which is easy for me. If not, then perhaps
we have a problem between libxml2 and the MS C runtime, given that they
both convert?

Ciao
Igor



