Re: Gtk::Text widget



    -> I created a Motif text widget about 5 years ago that uses UTF-16 (aka
    -> UCS-2) internally exclusively.

    Derek> UTF-16 is a multibyte encoding, where characters can have either 2
    Derek> or 4 bytes.  If you only used 2 bytes per char *every* time, then
    Derek> you were using UCS-2.  UTF stands for "UCS Transformation Format",
    Derek> i.e., a way to put the full 32-bit Unicode character set into a
    Derek> multibyte transformation.
	
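UTF-16 is a multi-byte encoding only in the sense that it takes more than one
byte to represent a character: 2 bytes for a character in the BMP, and 4 bytes
(a pair of 2-byte surrogates) for everything else.  Once a surrogate pair is
combined, though, the resulting code point needs only 21 bits, i.e. 3 bytes,
never a full 4.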

If you are ignoring surrogate characters, then yes, it would be proper to say
UCS-2, not UTF-16.  I handle surrogates.

If you combine the surrogates, then you will require an atomic type that
provides 4 bytes, because nobody provides a reasonably efficient 3-byte atomic
type that I know of, unless it is a requirement for the new .NET effort from
M$.

Unicode is not a 32-bit character set; it is 21-bit (when surrogates are
combined).  ISO 10646-[12] is a 32-bit character set.
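
To make the 21-bit figure concrete, here is a minimal sketch of combining a
surrogate pair into a single code point and splitting it back.  The function
names are made up for illustration and are not from any widget code; the
arithmetic is the standard UTF-16 surrogate mapping.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 payload bits; the result starts at U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need surrogates: %#x" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# The largest possible pair yields U+10FFFF, which needs exactly 21 bits.
largest = combine_surrogates(0xDBFF, 0xDFFF)
```

Because the largest combined value is U+10FFFF, 3 bytes would be enough to hold
it, but in practice a 4-byte integer is the natural storage type.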


    Derek> This is what Nedit does, too (although I think it only uses 1 byte
    Derek> per char, i.e. ASCII).  You're limited to about 65 thousand
    Derek> different attributes, but that shouldn't be a problem (it was a
    Derek> problem for me when I was considering the Nedit method, i.e. only
    Derek> 256 possible different attributes)

In actual practice, one particularly pathological case required 481 different
attribute sets, but in general use, it has hovered around 170.  As we have
been adding more attributes (e.g. language info, parts-of-speech), the
number of attribute sets has been going up, and we'll need more than 256 soon.
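
As a rough illustration of the scheme under discussion (a sketch only, not the
actual widget code), each distinct attribute set can be stored once in a table
while every character carries a 16-bit index into that table.  The class name,
method names, and attribute strings below are all hypothetical.

```python
from array import array


class AttributeTable:
    """Stores each distinct attribute set once and hands out small indices."""

    MAX_SETS = 0x10000  # what a 16-bit index can address

    def __init__(self):
        self._index_of = {}   # frozenset of attributes -> index
        self._sets = []       # index -> frozenset of attributes

    def intern(self, attrs):
        """Return the index for this attribute set, adding it if new."""
        key = frozenset(attrs)
        idx = self._index_of.get(key)
        if idx is None:
            if len(self._sets) >= self.MAX_SETS:
                raise OverflowError("more than 65536 distinct attribute sets")
            idx = len(self._sets)
            self._index_of[key] = idx
            self._sets.append(key)
        return idx

    def lookup(self, idx):
        """Return the attribute set stored at this index."""
        return self._sets[idx]


table = AttributeTable()
plain = table.intern([])
bold_en = table.intern(["bold", "lang=en"])

text = "Hello world"
# One unsigned 16-bit index per character, parallel to the text.
attr_index = array("H", [plain] * len(text))
for i in range(5):          # mark "Hello" as bold English text
    attr_index[i] = bold_en
```

With a 1-byte index the table tops out at 256 sets, which is the limit
mentioned above; the 2-byte index raises that to 65536.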

    Derek> Yes, but I've already done the coding for parallel attribute
    Derek> structure (a list of Lines that holds a list of spans), so I'm
    Derek> going to get the "View" working before I make any drastic
    Derek> architecture changes.  Also, I want to store information about
    Derek> lines (so particular line numbers can be marked with "bookmarks",
    Derek> or with an "error"/"warning" icon in a GUI debugger).

I didn't intend to lobby for a change, I was just describing what I did and
why.

    Derek> I think of my widget as structured in layers, like in the Gimp.
    Derek> The bottom layer is the gapped text buffer, the next layer up are
    Derek> the "Style" spans (which are optimal for scanner-based syntax
    Derek> highlighting).  The third layer up will be overlapping tags, in a
    Derek> tree structure, once I've had time to write it.

Overlapping tags are going to be fun :-)  I have them in my widget, but it is a
really poor implementation, and I haven't had time to revisit it.

-> And adding regex wasn't difficult with a gap buffer.

    Derek> (This is why I'm using a gapped buffer :)
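
For anyone following the thread who hasn't met the structure, here is a minimal
gap buffer sketch.  It is an illustration under simplifying assumptions (a
Python list of characters, a made-up class name and growth policy), not the
code of either widget.

```python
class GapBuffer:
    """Text stored in one array with a movable gap at the edit position."""

    def __init__(self, text="", gap_size=16):
        self._buf = list(text) + [None] * gap_size
        self._gap_start = len(text)        # first free slot
        self._gap_end = len(self._buf)     # first used slot after the gap

    def __len__(self):
        return len(self._buf) - (self._gap_end - self._gap_start)

    def _move_gap(self, pos):
        """Slide the gap so that it starts at logical position pos."""
        if not 0 <= pos <= len(self):
            raise IndexError("position out of range")
        while self._gap_start > pos:       # shift characters right, past the gap
            self._gap_start -= 1
            self._gap_end -= 1
            self._buf[self._gap_end] = self._buf[self._gap_start]
        while self._gap_start < pos:       # shift characters left, before the gap
            self._buf[self._gap_start] = self._buf[self._gap_end]
            self._gap_start += 1
            self._gap_end += 1

    def _grow(self, needed):
        """Enlarge the gap by at least `needed` slots."""
        extra = max(needed, len(self._buf))
        self._buf[self._gap_end:self._gap_end] = [None] * extra
        self._gap_end += extra

    def insert(self, pos, text):
        self._move_gap(pos)
        if self._gap_end - self._gap_start < len(text):
            self._grow(len(text))
        for ch in text:
            self._buf[self._gap_start] = ch
            self._gap_start += 1

    def delete(self, pos, count):
        """Delete count characters starting at pos by widening the gap."""
        self._move_gap(pos)
        if count < 0 or self._gap_end + count > len(self._buf):
            raise IndexError("delete range out of bounds")
        self._gap_end += count

    def text(self):
        return "".join(self._buf[:self._gap_start] + self._buf[self._gap_end:])
```

Repeated edits at the same spot cost almost nothing because the gap is already
there.  The simplest way to run a regex over such a buffer is to match against
the two contiguous halves, or against `text()` as above; a real matcher would
more likely read characters through the gap directly.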

Out of curiosity, which regex package are you intending to use?  I whipped up
a freely available DFA package that works in most cases and is almost a real
DFA :-)  I intentionally left it almost complete for some students to whom I
was teaching debugging techniques.  But it's fast, pretty small, and has the
right kind of copyright.

   http://crl.nmsu.edu/~mleisher/
-----------------------------------------------------------------------------
Mark Leisher
Computing Research Lab            Once you fully apprehend the vacuity of a
New Mexico State University       life without struggle, you are equipped
Box 30001, Dept. 3CRL             with the basic means of salvation.
Las Cruces, NM  88003                            -- Tennessee Williams



