Re: How to deal with different encodings ?

From: "Karsten Rasmussen" <frommetoyou comxnet dk>
To: <dashboard-hackers gnome org>
Subject: Re: How to deal with different encodings ?
Date: Tue, 1 Apr 2008 15:49:25 +0200

I have no idea how it determines whether data is in a non-UTF-8 encoding.

I have previously used the function below in PHP (found on php.net: http://dk2.php.net/manual/en/function.utf8-encode.php#39986):


function seems_utf8($Str) {
    for ($i = 0; $i < strlen($Str); $i++) {
        if (ord($Str[$i]) < 0x80) continue;                   # 0bbbbbbb
        elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n = 1;       # 110bbbbb
        elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n = 2;       # 1110bbbb
        elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n = 3;       # 11110bbb
        elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n = 4;       # 111110bb
        elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n = 5;       # 1111110b
        else return false;                                    # Does not match any model
        for ($j = 0; $j < $n; $j++) {                         # n bytes matching 10bbbbbb follow?
            if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}

It returns true if the string is structurally well-formed UTF-8.
If it returns false, I usually assume the data is ISO-8859-1 (in my part of the world).
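
For anyone who wants to try the same "UTF-8 first, otherwise ISO-8859-1" idea outside PHP, here is a rough Python sketch. The function name decode_guess is made up for illustration, and the ISO-8859-1 fallback is only my regional assumption, not something Beagle does as far as I know.

```python
def decode_guess(data):
    """Return (text, encoding_used) for raw bytes of unknown encoding."""
    try:
        # Strict decoding raises UnicodeDecodeError on any malformed sequence.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is ISO-8859-1.
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return data.decode("iso-8859-1"), "iso-8859-1"
```

Note that Python's strict decoder is pickier than seems_utf8 above: it also rejects overlong forms and the old 5- and 6-byte sequences, so the two can disagree on unusual input.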

The disadvantage is that you have to parse the whole string.

Regarding Windows:

I think Windows Notepad writes a UTF-8 byte order mark (the bytes EF BB BF) at the very start of the file when it saves a text file in UTF-8 format. That makes it faster to determine the encoding, but Beagle of course cannot assume the mark is present.
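
A quick check for that leading marker, the UTF-8 byte order mark (bytes EF BB BF), could look like the Python sketch below. The helper name strip_utf8_bom is hypothetical; a missing mark tells you nothing, so this can only be a shortcut in front of a full scan.

```python
import codecs

def strip_utf8_bom(data):
    """Return (has_bom, payload); payload excludes a leading UTF-8 BOM."""
    if data.startswith(codecs.BOM_UTF8):  # b"\xef\xbb\xbf"
        return True, data[len(codecs.BOM_UTF8):]
    # No BOM: the data may still be UTF-8, we just cannot tell from the prefix.
    return False, data
```

Only the first three bytes are inspected, so the cost does not grow with the file size.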

References:
- How to deal with different encodings ?
  - From: Debajyoti Bera
