Re: [xml] Error on parsing HTML with libxml
- From: "Eric S Eberhard" <flash vicsmba com>
- To: 'André Rothe' <andre rothe zks uni-leipzig de>, <xml gnome org>, "'Liam R. E. Quin'" <liam fromoldbooks org>
- Subject: Re: [xml] Error on parsing HTML with libxml
- Date: Tue, 21 Aug 2018 12:55:43 -0700
That would be incorrect behavior for libxml2 -- as Liam and I both said -- you have to encode somehow.
CDATA is one way and entity encoding (e.g. &lt;, &gt;, etc.) is another.
I sent you a link.
https://stackoverflow.com/questions/1398571/html-inside-xml-should-i-use-cdata-or-encode-the-html
Which I believe is the correct answer. If someone else is making the XML then they should fix it. I also
like the "soup" answer and agree.
We have people send invalid XML to our customers all the time ... my customers have chosen to make me fix it
:-) . That is what I get paid for so ...
We pre-process all XML files and fix every mistake we know (and the program slowly grows) before parsing it.
Examples include attributes without a space between the quote and start of next attribute. It would be wrong
for me to ask libxml2 to do this -- not on spec. So I do it.
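As an illustration of that kind of pre-processing pass, here is a minimal sketch in Python. It is not the program described above; the function name fix_attribute_spacing and the regular expressions are assumptions made for this example. It only re-spaces quoted attributes inside start tags and would also touch tag-like text inside comments or CDATA sections.

```python
import re

# One quoted attribute: name = "value" or name = 'value'.
_ATTR_SRC = r'''[\w:.-]+\s*=\s*(?:"[^"]*"|'[^']*')'''
# A start tag (or empty-element tag) that carries at least one attribute.
_TAG = re.compile(r'<([A-Za-z_][\w:.-]*)\s+((?:' + _ATTR_SRC + r'\s*)+)(/?)>')
_ATTR = re.compile(r'''([\w:.-]+)\s*=\s*("[^"]*"|'[^']*')''')


def fix_attribute_spacing(xml_text):
    """Hypothetical helper: put exactly one space between attributes."""
    def repl(match):
        name, attrs, slash = match.groups()
        # Re-tokenize the attribute run and join the pieces with one space.
        pairs = [f"{key}={value}" for key, value in _ATTR.findall(attrs)]
        return f"<{name} {' '.join(pairs)}{slash}>"
    return _TAG.sub(repl, xml_text)


if __name__ == "__main__":
    print(fix_attribute_spacing('<item id="1"name="x"/>'))
```

The output of such a pass can then be handed to the parser as usual; each new kind of mistake gets its own small fix-up function.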
So if I were you and you have to take the files like this -- then pre-process them and fix them with either
CDATA or encoding, because I don't think anyone else would support the kind of change you are asking for ...
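The two options can be sketched in Python as follows. The helper names as_cdata and as_entities are made up for this example; escape() comes from the standard library module xml.sax.saxutils and replaces &, < and >. The snippet being wrapped is only a stand-in for whatever markup has to be embedded.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def as_cdata(text):
    """Wrap text in a CDATA section (hypothetical helper).

    A CDATA section cannot contain "]]>", so that sequence is split
    across two adjacent sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def as_entities(text):
    """Entity-encode the markup characters &, < and > (hypothetical helper)."""
    return escape(text)


if __name__ == "__main__":
    snippet = "if (a < b) document.writeln('</td>');"
    for encoded in (as_cdata(snippet), as_entities(snippet)):
        doc = "<script>" + encoded + "</script>"
        # Both forms parse as XML and give back the original text.
        assert ET.fromstring(doc).text == snippet
        print(doc)
```

Either form gives an XML parser the original text back unchanged; which one is easier depends on who consumes the file afterwards.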
Eric
Eric S Eberhard
VICS (Vertical Integrated Computer Systems)
Voice: 928 567 3529
Cell : 928 301 7537 (not reliable except for text or if not home)
2933 W Middle Verde Rd
Camp Verde, AZ 86322
-----Original Message-----
From: xml [mailto:xml-bounces gnome org] On Behalf Of André Rothe
Sent: Monday, August 20, 2018 12:48 AM
To: xml gnome org; Liam R. E. Quin <liam fromoldbooks org>
Subject: Re: [xml] Error on parsing HTML with libxml
I can't change the source of the HTML page, because the page is generated by another system to which I
don't have access. I only get the pages from there, and our Apache module performs a post-processing step just
before the pages are sent to the user's browser. There I need a parser to change something within the
page.
So I think libxml should handle that by not parsing the content of inline scripts.
There is also a comment on
https://stackoverflow.com/questions/51892455/php-5-4-16-domdocument-removes-parts-of-javascript
which describes your idea with CDATA, but it didn't work.
~André
On 18.08.2018 04:13, Liam R. E. Quin wrote:
On Fri, 2018-08-17 at 14:42 +0200, André Rothe wrote:
https://3v4l.org/O0iEf
Try changing
...writeln('</td>');
to
...writeln('<' + '/td>');
and see if that helps; or use a CDATA section, <script><![CDATA[
//..
]]></script> to escape the </td> markup from the HTML parser.
Although it may depend on what the missing //... lines look like,
assuming this is not the complete source.
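If the page source cannot be edited by hand, the same idea could be applied mechanically before parsing. The following Python sketch is an assumption on my part, not something tested against the page in question: the hypothetical function hide_end_tags_in_scripts rewrites every "</" inside an inline script body to "<\/", which means the same thing inside a JavaScript string literal. It assumes script bodies do not themselves contain "</script>" and that "</" only occurs inside string literals.

```python
import re

# Opening <script ...> tag, body, closing </script> tag.
_SCRIPT = re.compile(r"(<script\b[^>]*>)(.*?)(</script\s*>)",
                     re.IGNORECASE | re.DOTALL)


def hide_end_tags_in_scripts(html):
    """Hypothetical helper: escape "</" inside inline script bodies."""
    def repl(match):
        open_tag, body, close_tag = match.groups()
        # "<\/td>" and "</td>" are the same JavaScript string, but the
        # first no longer looks like an end tag to an HTML parser.
        return open_tag + body.replace("</", "<\\/") + close_tag
    return _SCRIPT.sub(repl, html)


if __name__ == "__main__":
    page = "<td><script>document.writeln('</td>');</script></td>"
    print(hide_end_tags_in_scripts(page))
```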
Better yet, don't use document.write at all, and switch to more modern
practices :)
I'm not sure there's actually a bug here; if you feed the parser tag
soup, expect a mess. Keep PHP, JavaScript, HTML, CSS in separate
files and life will probably be simpler.
_______________________________________________
xml mailing list, project page http://xmlsoft.org/ xml gnome org https://mail.gnome.org/mailman/listinfo/xml