Re: Fix for "Need encoding conversion for GIOChannel"

From: HideToshi Tajima <tajima happy eng sun com>
To: otaylor redhat com
Cc: gtk-devel-list gnome org
Subject: Re: Fix for "Need encoding conversion for GIOChannel"
Date: Thu, 26 Apr 2001 17:44:45 -0700 (PDT)
Hi Owen,

Thanks for your comments.

Yes, the current code is not following GLib/GTK+ style. Will fix this
first of all.

I'll get back later for the other parts of the comments, as I'm stuck
at the very 1st one about handling of partial characters.


Regards,
Toshi

>Date: Thu, 26 Apr 2001 19:34:01 -0400
>From: Owen Taylor <otaylor redhat com>
>Subject: Re: Fix for "Need encoding conversion for GIOChannel"
>To: HideToshi Tajima <tajima happy eng sun com>
>Cc: gtk-devel-list gnome org
>MIME-version: 1.0
>X-Loop: gtk-devel-list gnome org
>X-BeenThere: gtk-devel-list gnome org
>Delivered-to: gtk-devel-list gnome org
>User-Agent: Gnus/5.0807 (Gnus v5.8.7) Emacs/20.7
>X-Mailman-Version: 2.0beta5
>List-Id: Development of GTK+ <gtk-devel-list.gnome.org>
>
>
>HideToshi Tajima <tajima happy eng sun com> writes:
>
>> I've posted a patch for one of the remaining 2.0 API bugs:
>> http://bugzilla.gnome.org/show_bug.cgi?id=52811, whose short summary
>> is "Need encoding conversion for GIOChannel".
>> 
>> The patch should give GIOChannel class the features listed below:
>> 
>>        * charset code conversion at read/write, read/write by chars
>>        * buffering 
>>        * blocking vs non-blocking (controled by *set_flags/get_flags)
>>        * Reading a file a line at a time
>>        * Reading a whole file at once
>>        * error handling with more error feedbacks	
>> 
>> A simple testcase is attached to the above bugzilla report. I've run
>> the test with this testcase on two Unix platforms - redhat linux6.2 Intel
>> and Solaris 8 sparc, and the patch seems to work well on both. I have to
>> say that win32 backend code is untested(more precisely unimplemented) - so
>> I wish to get someone who knows win32 have a look, test, and complete the
>> code.
>
>Reading over your this patch, it's clear to me that this is a much
>harder problem than I realized when I first wrote down the API.
>
>There are various issues about the exact semantics of the functions
>that seem minor at first glance, but have fairly large consequences
>for the internal implementation.
>
>So, most of this mail is just listing some of places where your
>code did different things than I expected, or I wasn't sure
>what I expected.
>
>I think once we agree on exactly what the behavior should be for
>these cases, making the code match that shouldn't be hard, even
>it in a few cases it may be a little more complicated than I 
>expected initially.
>
>So, my apologies if the following has more questions in it than
>answers.
>
>Regards,
>                                        Owen
>
> * One issue that came to mind reading your code and thinking about
>   the problems is the handling of partial characters:
>
>    - If you call read() with a buffer of length 100, the first 60
>      characters take 99 bytes, and the 61st 2 bytes, does it put
>      half the last character result buffer?
>
>    - If you call write() with a buffer of length 100, with 66
>      complete characters taking 99 bytes and 1 incomplete character, 
>      does it return a length written of 99, or a length written
>      of 100 and store the partial character internally?
>
>   If the answer is that conversion should take into account partial
>   characters, then the question is, what should happen when the
>   encoding for the stream is UTF-8? What should happen when it is
>   NULL?
>
> * I think this code probably should be using iconv() directly
>   instead of going through g_convert(). The advantage of using
>   iconv() is that we can convert the number of characters that
>   fit into a fixed size output buffer, and we don't have to malloc
>   intermediate buffers; also using one iconv_t converter for
>   the entire stream is a potentially a fairly big efficiency win.
>  
>   A final reason for this is that if you use iconv() directly,
>   then an encoding like UTF-16 will handle the BOM correctly,
>   since it will only reset the converter once for the stream,
>   not for every line.
>
>   This obviously does pose a problem for doing 'fallbacks'
>   in the style of g_convert_with_fallback().
>
> * Changing the line terminator depending on the platform is a little
>   dubious - files don't nicely stick to their on platforms.
>
>   I think what I'd do is have a enum:
>
>    GIOLineTerminator 
>    {
>      G_IO_LINE_AUTO, 
>      G_IO_LINE_DOS,       /* \r\n */
>      G_IO_LINE_UNIX,      /* \n */
>      G_IO_LINE_MACINTOSH  /* \r */
>    };
>
>   And have the default be AUTO - which means, separating on
>   any of \r, \n, or \r\n. Though this requires one character
>   read-ahead, hmmm, which means that using \r termination
>   with AUTO is unsuitable for network protocol type applications
>   where responses to complete lines are needed.
>
>   Related to this is the fact that we can't assume the input
>   encoding will have literal \n, \r characters in it; I'd 
>   certainly like to be able to handle UTF-16 as a file encoding.
>   This means that conversion needs to be done _before_ looking
>   for terminators, not after.
>   
> * I'm not sure that converting the virtual functions in GIOFunc
>   to take GError * arguments is the right approach. The problem
>   is that GError * should only be used for "real" errors - it
>   is fairly expensive to duplicate a string, allocate memory
>   for the GError and so forth.
>   
>   And read() and write() have two non-error conditions that are
>   signalled by an "errno" - EAGAIN and EINTR. With the
>   "io_channel_read_chars()" type signature, there is a clear
>   mapping for EAGAIN that doesn't involve setting the GError * - 
>   return 0 bytes read with err == NULL.
>
>   But there is no corresponding mapping for EINTR. One approach
>   to handing EINTR would be to make g_io_channel_read()/write() 
>   simply always restart on EINTR; that does affect people
>   trying to do, say, read/write timeouts with SIGALRM. But
>   I don't know if anybody really does that with GIOChannel.   
>
> * When a converter is involved, no buffering may not sense. 
>   This is certainly true when reading. If a partial character
>   is read from the stream, I can't convert it and return
>   it to the user, so I have to store it somewhere.
>
>   On the read side, in fact, buffering is basically invisible
>   to the caller and is an implementation detail.
>
>   On the write side, we have the option of returning a short
>   write when partial characters are involved. (See the above
>   question.) Though this may not be particularly convenient
>   for the caller.
>
>   On the write side buffering is not invisible to the caller;
>   the timing of buffer flushing has important consequences.
>
>   Callers will have somewhat different expectations for what
>   happens on a non-buffered stream when they call write()
>   depending on whether the stream is blocking or non-blocking.
>   The naive expecation is:
>
>    For a blocking stream, all complete characters will be written
>    to the network out before the write() operation returns.
>
>    For a non-blocking stream, a short write will occur, with
>    the amount written being the amount of data that fits
>    into the network buffers.
>
>   But the second expectation isn't really implementable with
>   encoding conversion, since we have to convert the data
>   before we try writing it.
>
>   So, I think, instead the correct behavior for a "non-buffered" 
>   non-blocking stream is that as much data as possible will
>   be written to the network before returning, but with the
>   potential for having more written to the internal GIOChannel buffer.
>   After all, the caller can't really tell if the data is
>   being buffered in the GIOChannel or the system's network
>   layer.
>
>   With these ideas in mind, I'm not sure that my original idea
>   of indicating a non-buffered stream with a buffer size of
>   0 is correct. Instead, I think the correct interface is 
>   probably:
>
>    /* A hint to as how much buffering is useful. GIOChannel
>     * may not use exactly this size.
>     */
>    g_io_channel_set_buffer_size (GIOChannel	*channel,
>                                  gint		 size);
>
>    g_io_channel_set_auto_flush  (GIOChannel    *channel,
>                                  gboolean       auto_flush);
>
>   Where auto_flush is the non-buffered behavior above. When
>   an conversion is in effect, GLib would make the buffer at
>   least big enough to store partial characters.
>
> * When I proposed g_io_channel_read_line_string(), I actually
>   meant GString *buffer, not GString **buffer. 
>
>   The idea is, by using a single GString as the buffer while
>   reading a series of lines, constant reallocation of the line
>   buffer is avoided.
>
>   g_io_channel_read_line() can then be implemented as:
>
>    GString *buffer = g_string_new (NULL);
>    g_io_channel_read_line_string (channel, buffer);
>
>    if (length)
>      *length = buffer->len;
>    if (string)
>      *string = g_string_free (string, FALSE);
>    else
>      g_string_free (string, TRUE);
>
> * g_io_channel_read_line() should probably handle lines longer than
>   the buffer size. For blocking streams, this is not a problem;
>   you just keep reading, converting and writing to the output 
>   buffer.
>
>   For non-blocking streams, it is a little more tricky, since 
>   you might want to write something like:
>
>    gboolean
>    my_watch (GIOChannel *channel, GIOCondition cond, gpointer data)
>    {
>      char *string, int len;
>      GError *err = NULL;
>
>      while (g_io_channel_read_line (channel, &string, &len, &err))
>        {
>          printf ("Received line: %s\n", string);
>          g_free (string);
>        }
>
>      if (err)
>        [...]
>    }
>
>   But this involves keeping an arbitrary-sized input buffer.
> 
>   And if you can mix calls to g_io_channel_read_line()
>   and g_io_channel_read_chars() [ though I can't really see why
>   you would want to do this ], this arbitrary-sized input buffer
>   needs to be accessible by all the read_* functions.
>
> * GIOChannel is an interface for bidirectional channels, so it 
>   is probably necessary to keep separate read and write buffers;
>   by only allocating the buffers on demand, the overhead of 
>   this for GIOChannels being used unidirectionally should
>   be minimal.
>
> * A few formatting issues 
>
>   - GLib/GTK+ style is 'function (' not 'function('; 'if (', not 'if'
>
>   - GLib/GTK+ has the '{' on the next line GNU style, indented one
>     level. 
>   
>    if (tmperr)
>      {
>        g_propagate_error (err, tmperr);
>        return 0;
>      }
>
> * Inline doc comments will be needed before the patch goes
>   into GTK+.
>
> * There needs to be a call like g_io_channel_pending() to check
>   if there are any characters buffered, and the IO channel
>   sources need to check this in their pending() check() methods
>   before checking the underlying stream.
>
>_______________________________________________
>gtk-devel-list mailing list
>gtk-devel-list gnome org
>http://mail.gnome.org/mailman/listinfo/gtk-devel-list
[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]