Re: Glib::Regex and unicode chars



On 19.08.2012 19:10, Jakub Okoński wrote:
> Hello,
> 
> I'm trying to implement syntax highlighting and it works perfectly until
> I input some special chars, at which point the fetch_pos method gives me
> byte offsets rather than character offsets. Let's take this regex object
> for example:
> 
> Glib::Regex::create(R"((?<word>class))",
> Glib::RegexCompileFlags::REGEX_OPTIMIZE);
> 
> fetch_pos method works perfectly on ascii text, but as soon as I prepend
> the string with any multibyte unicode character, fetch_pos gives me
> shifted values. (The shift is equal to test_string.bytes() -
> test_string.length()).
> 
> I could probably fix this manually by adjusting shift by the difference
> of bytes and length, but I'm sure that would not be efficient and I
> would have to look back in the buffer constantly.
> 
> Same goes for capturing keywords that have unicode characters
> themselves, for example capturing "clasś" (note the special character at
> the end) would result in fetch_pos giving range of 6 characters, when
> the word contains 5 characters (but has 6 bytes).
> 
> Maybe I'm not using it correctly, but it was said that Glib::Regex
> supports utf-8.
> 
> Thanks
> 

Hmm, that's unfortunate. Two ideas for working around it:

1. Work on the std::string directly (ustring::raw()) and convert back to
ustring when you are finished.

2. Use g_utf8_pointer_to_offset() to convert each byte position into a
character offset, caching the previous result if necessary so that you
only have to scan forward from the last match.

From looking at the code for ustring, it appears to do basically the
same thing internally when it maps character indices to bytes, so you
should not lose any performance.

(Which, by the way, also means that iterating over a ustring with
operator[] costs O(n^2) overall, because each access scans from the
start of the string. Good to know.)
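
To make idea 2 a bit more concrete, here is a rough sketch of the caching
approach. It does not use GLib at all: the class Utf8OffsetCache and its
method to_char_offset() are hypothetical names invented for this
illustration, and the byte counting is a hand-rolled stand-in for what
g_utf8_pointer_to_offset() would do. It assumes the text is valid UTF-8
and that every byte position passed in sits on a character boundary,
which should be the case for match positions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of glibmm): converts byte offsets into
// character offsets and remembers the last position, so that ascending
// queries only scan the bytes between two matches.
class Utf8OffsetCache {
public:
    // The caller must keep `text` alive for as long as the cache is used.
    explicit Utf8OffsetCache(const std::string& text) : text_(text) {}

    // `byte_pos` must be the start of a character or the end of the text.
    std::size_t to_char_offset(std::size_t byte_pos) {
        if (byte_pos < last_byte_) {
            // Going backwards: drop the cache and rescan from the start.
            last_byte_ = 0;
            last_char_ = 0;
        }
        for (; last_byte_ < byte_pos && last_byte_ < text_.size(); ++last_byte_) {
            // UTF-8 continuation bytes look like 10xxxxxx; every other
            // byte starts a new character.
            const unsigned char c = static_cast<unsigned char>(text_[last_byte_]);
            if ((c & 0xC0) != 0x80) {
                ++last_char_;
            }
        }
        return last_char_;
    }

private:
    const std::string& text_;
    std::size_t last_byte_ = 0;  // byte offset reached by the last query
    std::size_t last_char_ = 0;  // character offset matching last_byte_
};
```

You would feed it the start and end byte positions of each match in the
order the matches are found, and use the returned character offsets for
the highlighting tags. As long as the positions only increase, the whole
buffer is scanned once in total.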

Hope this helps,
Florian Philipp
