Unicode question...



	This is really more of a Unicode question than a Gtk question, but
I want to understand the answer in the context of Owen's new gunicode.h,
so here goes:


	How do C's escape characters relate to Unicode?  I.e., the
string

"Hello World.\n"

	has 13 ASCII characters, the last one of which is \n.  What does
that look like as a wide character?  What does \t look like?  Does it
matter?


	Basically, I need to split UTF-8 string input on the newline
character.  So would I do something like this:

  while ( *utf8_input_string != '\0' )
    {
      if ( *utf8_input_string == '\n' )
          total_lines_detected++;

      utf8_input_string = g_utf8_next_char( utf8_input_string );
    }


	Or would I need to do this:

  gint char_count;
  gunichar *ucs4_input_string;
  gunichar wide_newline;
  
  char_count = g_utf8_strlen(utf8_input_string);
  ucs4_input_string = g_utf8_to_ucs4(utf8_input_string, char_count);

  wide_newline = g_utf8_get_char("\n");

  while ( *ucs4_input_string != 0 )
    {
      if ( *ucs4_input_string == wide_newline )
          total_lines_detected++;

      ucs4_input_string++;
    }


	I'm assuming that C converts '\n' into an 8-bit ASCII value, so
things like 

      if ( *ucs4_input_string == '\n' )
          total_lines_detected++;

	will not work.  Or is there some kind of hidden typecasting that
will let the one-byte \n compare directly to a 4-byte ucs4 character?
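
	To make the question concrete, here is a minimal plain-C sketch
of the two comparisons, written without GLib so it can be compiled on
its own.  The function names are made up for illustration, and
uint32_t stands in for a 4-byte character type.  It assumes an
ASCII-compatible execution character set (so '\n' is 10) and that the
wide string is terminated by a 0 code point.  As far as I understand,
a character constant like '\n' has type int in C, and every byte of a
multi-byte UTF-8 sequence is 0x80 or higher, so both loops should
only ever match a real newline:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count newlines in a NUL-terminated UTF-8 string
 * by scanning bytes.  Bytes belonging to multi-byte UTF-8 sequences are
 * all >= 0x80, so a byte equal to 0x0A is always a real newline. */
size_t count_newlines_utf8(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (*s == '\n')
            n++;

    return n;
}

/* Hypothetical helper: the same count over a 0-terminated array of
 * 32-bit code points.  '\n' is an int constant; the usual arithmetic
 * conversions turn it into the same unsigned value (10) before the
 * comparison, so no explicit cast is needed. */
size_t count_newlines_ucs4(const uint32_t *s)
{
    size_t n = 0;

    for (; *s != 0; s++)
        if (*s == '\n')
            n++;

    return n;
}
```

	If that reasoning holds, the byte scan would be enough for
splitting on newlines, and the conversion to 4-byte characters would
only be needed for other reasons.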

	Any help is greatly appreciated...

Thanks,
Derek Simkowiak
dereks@kd-dev.com

P.S.> It would be helpful if, in gunicode.h, every instance of "gint len"
were replaced with one of these:

gint char_count    [...or...]
gint byte_count
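
	To show why the distinction matters, here is a small plain-C
sketch (again without GLib; utf8_char_count is a made-up name, not
something from gunicode.h).  It assumes the input is valid,
NUL-terminated UTF-8 and counts characters by skipping continuation
bytes, which have the bit pattern 10xxxxxx:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of characters (code points) in a valid,
 * NUL-terminated UTF-8 string.  Every character contributes exactly one
 * byte that is not a continuation byte (10xxxxxx), so counting those
 * bytes counts the characters. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;

    return n;
}
```

	For the UTF-8 encoding of "naïve" (bytes "na\xC3\xAFve"),
strlen() gives a byte count of 6 while utf8_char_count() gives a
character count of 5, so a bare "len" could mean either.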




