Thanks for your help, guys,
I've done a bit more experimenting and think I've nailed it. The problem seems to occur because the source HTML defines some mark-up before the <meta> tag that declares the UTF-8 charset, and this mark-up contains a UTF-8 character. The parser hits that character before it has seen the charset declaration, and this throws out the rest of the parsing. If I ensure that the first tag in the <head> is the charset <meta> tag, the encoding proceeds as expected. Interesting that the --encode flag of xmllint is ignored under these circumstances, though! Guess I'll have to strip this tag out before I pass the source to xmllint.

Thanks again,
Matt
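
For anyone hitting the same problem, here is a minimal sketch of one possible workaround: instead of stripping anything, move the charset <meta> tag so that it is the first thing inside <head> before the source goes to xmllint. The function name hoist_charset_meta and the two regular expressions are assumptions made up for this illustration, not part of xmllint or libxml2, and a regex approach like this only suits simple, reasonably well-behaved pages.

```python
import re

# Hypothetical helper patterns (not from xmllint/libxml2).
# Matches a <meta> tag that mentions "charset", e.g.
#   <meta charset="utf-8">
#   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
META_CHARSET = re.compile(r"<meta\b[^>]*\bcharset\b[^>]*>", re.IGNORECASE)
# Matches the opening <head> tag, with or without attributes.
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def hoist_charset_meta(html: str) -> str:
    """Return html with the first charset <meta> tag moved to the start of <head>.

    The input is returned unchanged if there is no <head> or no charset <meta>.
    """
    head = HEAD_OPEN.search(html)
    if head is None:
        return html

    # Only look for the charset declaration after the opening <head> tag.
    meta = META_CHARSET.search(html, head.end())
    if meta is None:
        return html

    tag = meta.group(0)
    # Remove the tag from its current position...
    without_meta = html[:meta.start()] + html[meta.end():]
    # ...and re-insert it immediately after <head>. The offset head.end() is
    # still valid because the removed tag came after it.
    return without_meta[:head.end()] + tag + without_meta[head.end():]


if __name__ == "__main__":
    sample = (
        "<html><head><title>Caf\u00e9</title>"
        '<meta charset="utf-8"></head><body></body></html>'
    )
    print(hoist_charset_meta(sample))
```

The output of hoist_charset_meta can then be written to a file or piped to xmllint as usual. Running it on a page that already has the charset tag first leaves the page unchanged.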