[xml] Parsing a file that I didn't create



Hello,

I'd like to use libxml to parse documents on the web that I didn't
create.  Some of these are malformed according to the standard, and,
unfortunately, I can't do anything about that.  For example, yahoo.com
contains the following piece of code:

<script language=javascript>
if(typeof(YAHOO)!='undefined') {
      document.write('<map name="yodel"><area shape="rect"
coords="209,30,216,39" href="http://www.yahoo.com"
onclick="callYodel();return false;"><area shape="poly"
coords="211,0,222,1,215,26,211,25" href="http://www.yahoo.com"
onclick="callYodel();return false;"></map><div id=l_fl
style="position:absolute"></div>');
      var lr0='http://us.ard.yahoo.com/SIG=12ldjm870/M=386734.8419383.10128039.81613...
      var lcap=0,lncap=0,ad_jsl=0,lnfv=6,ylmap=0;
      var ldir="http://us.i1.yimg.com/us.yimg.com/i/mntl/ww/06q3/";
      var swfl1=ldir+"yodel.swf";
      var swflw=1,swflh=1;
}
...
</script>


libxml mangles this, and by the standard it is right to, because the
closing HTML tags inside the <script> element aren't escaped as
<\/name>.  Is there a
way to use libxml (I'm currently using the SAX parser) without having
it try to fix things for me?  If not, is there another C library that
people know of that can just return each tag to me, one at a time,
without enforcing adherence to the standard?
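
To make that last request concrete, here is a rough sketch, in plain C
without libxml, of the kind of lenient tag-by-tag scanning I mean.  It
is an illustration under stated assumptions, not a real parser: the
names tag_span, starts_with_ci and next_tag are made up for this
example, a tag is assumed to end at the first '>' (so a '>' inside a
quoted attribute value would confuse it), comments are not handled, and
everything after a <script> tag is skipped as raw text up to the next
"</script".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type: one tag in the input, from its '<' up to and
 * including its '>'.  It points into the caller's buffer. */
typedef struct {
    const char *start;
    size_t len;
} tag_span;

/* Case-insensitive check that s begins with the ASCII word w. */
static int starts_with_ci(const char *s, const char *w)
{
    while (*w) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*w))
            return 0;
        s++;
        w++;
    }
    return 1;
}

/* Hypothetical helper: report the next tag at or after *cursor and
 * advance *cursor past it.  Returns 1 if a tag was found, 0 at end of
 * input.  Nothing is validated or repaired. */
int next_tag(const char **cursor, tag_span *out)
{
    const char *p;

    assert(cursor != NULL && *cursor != NULL && out != NULL);
    p = *cursor;

    for (;;) {
        const char *lt = strchr(p, '<');
        const char *gt;

        if (lt == NULL) {                 /* no more tags */
            *cursor = p + strlen(p);
            return 0;
        }
        /* A '<' only opens a tag if a letter, '/' or '!' follows it. */
        if (!(isalpha((unsigned char)lt[1]) || lt[1] == '/' || lt[1] == '!')) {
            p = lt + 1;
            continue;
        }
        gt = strchr(lt, '>');             /* assumption: first '>' ends the tag */
        if (gt == NULL) {                 /* unterminated tag: give up */
            *cursor = lt + strlen(lt);
            return 0;
        }
        out->start = lt;
        out->len = (size_t)(gt - lt) + 1;
        p = gt + 1;

        /* After <script ...>, skip the body as raw text until "</script",
         * so markup inside document.write() strings is never reported. */
        if (starts_with_ci(lt + 1, "script") && !isalnum((unsigned char)lt[7])) {
            while (*p && !(p[0] == '<' && p[1] == '/' &&
                           starts_with_ci(p + 2, "script")))
                p++;
        }
        *cursor = p;
        return 1;
    }
}
```

Called in a loop over the yahoo.com fragment above, this should report
the <script ...> and </script> tags but none of the <map>, <area> or
<div> markup inside the document.write() string.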

Thanks,
Jeff


