[xml] handling href tags in html



I am parsing arbitrary (often noncompliant) HTML pages.  I need to
convert all URLs found into absolute, properly escaped URLs.

The problem I am facing is that libxml2 converts the entire document to
UTF-8, including the href attribute values.  As a result, characters
such as "ñ" do not get escaped properly.  It seems that I need to call
UTF8Toisolat1(...) to convert the URL back to ISO-8859-1 before escaping it.

void startElementCallback(   // libxml callback
    this_t*         p_this,    // XML parser context
    const xmlChar*  p_name,    // element name
    const xmlChar** p_atts)    // attribute name/value pairs for the element
{

...

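// p_atts[i] below is the value of the href attribute
int alength = strlen((const char*)p_atts[i]);
int length = 2*alength;          // more than enough: Latin-1 output is never longer than the UTF-8 input
unsigned char fruit[length + 1]; // +1 so the terminating NUL fits even for an empty string

// fred < 0 means the input was malformed or held a character outside
// ISO-8859-1; length then covers only the bytes converted so far
int fred = UTF8Toisolat1(fruit, &length, p_atts[i], &alength);
fruit[length] = 0;
xmlChar* escURL = xmlURIEscape(fruit);  // release with xmlFree() when done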

...

}

The above, probably braindead, code is my quick hack to get around this.
Does it make sense?
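
To make the idea concrete without depending on libxml2, here is a minimal standalone sketch of the same two steps: decode UTF-8 down to ISO-8859-1, then percent-escape whatever is not printable ASCII.  The function names utf8_to_latin1 and escape_non_ascii are made up for this illustration, and the escaping rule is a deliberate simplification; it is not how UTF8Toisolat1 or xmlURIEscape are implemented, and xmlURIEscape treats the individual URI components more carefully.

```c
#include <stddef.h>

/* Hypothetical helper: decode NUL-terminated UTF-8 into ISO-8859-1.
 * Returns the number of bytes written (excluding the NUL), or -1 if the
 * input is malformed, holds a code point above U+00FF, or does not fit. */
int utf8_to_latin1(const unsigned char *in, unsigned char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0)
        return -1;
    while (*in) {
        unsigned int c = *in++;
        if (c >= 0x80) {
            /* Only lead bytes C2 and C3 encode U+0080..U+00FF. */
            if ((c != 0xC2 && c != 0xC3) || (*in & 0xC0) != 0x80)
                return -1;
            c = ((c & 0x1F) << 6) | (*in++ & 0x3F);
        }
        if (n + 1 >= outsize)
            return -1;
        out[n++] = (unsigned char)c;
    }
    out[n] = 0;
    return (int)n;
}

/* Hypothetical helper: copy printable ASCII unchanged and write every
 * other byte as %XX.  Returns the output length, or -1 if it does not fit. */
int escape_non_ascii(const unsigned char *in, char *out, size_t outsize)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    if (outsize == 0)
        return -1;
    for (; *in; in++) {
        if (*in > 0x20 && *in < 0x7F) {
            if (n + 1 >= outsize)
                return -1;
            out[n++] = (char)*in;
        } else {
            if (n + 3 >= outsize)
                return -1;
            out[n++] = '%';
            out[n++] = hex[*in >> 4];
            out[n++] = hex[*in & 0x0F];
        }
    }
    out[n] = 0;
    return (int)n;
}
```

With these, a UTF-8 "ñ" (bytes C3 B1) becomes the single Latin-1 byte F1 and is then written as %F1.  A character with no Latin-1 equivalent makes utf8_to_latin1 return -1, which is the case the hack above would also have to deal with.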

Joel


