Re: Fix for "Need encoding conversion for GIOChannel"

Hi Owen,

Thanks for your comments.

Yes, the current code is not following GLib/GTK+ style. Will fix this
first of all.

I'll get back later for the other parts of the comments, as I'm stuck
at the very 1st one about handling of partial characters.


>Date: Thu, 26 Apr 2001 19:34:01 -0400
>From: Owen Taylor <otaylor redhat com>
>Subject: Re: Fix for "Need encoding conversion for GIOChannel"
>To: HideToshi Tajima <tajima happy eng sun com>
>Cc: gtk-devel-list gnome org
>MIME-version: 1.0
>X-Loop: gtk-devel-list gnome org
>X-BeenThere: gtk-devel-list gnome org
>Delivered-to: gtk-devel-list gnome org
>User-Agent: Gnus/5.0807 (Gnus v5.8.7) Emacs/20.7
>X-Mailman-Version: 2.0beta5
>List-Id: Development of GTK+ <>
>HideToshi Tajima <tajima happy eng sun com> writes:
>> I've posted a patch for one of the remaining 2.0 API bugs:
>>, whose short summary
>> is "Need encoding conversion for GIOChannel".
>> The patch should give GIOChannel class the features listed below:
>>        * charset code conversion at read/write, read/write by chars
>>        * buffering 
>>        * blocking vs non-blocking (controled by *set_flags/get_flags)
>>        * Reading a file a line at a time
>>        * Reading a whole file at once
>>        * error handling with more error feedbacks	
>> A simple testcase is attached to the above bugzilla report. I've run
>> the test with this testcase on two Unix platforms - redhat linux6.2 Intel
>> and Solaris 8 sparc, and the patch seems to work well on both. I have to
>> say that win32 backend code is untested(more precisely unimplemented) - so
>> I wish to get someone who knows win32 have a look, test, and complete the
>> code.
>Reading over your this patch, it's clear to me that this is a much
>harder problem than I realized when I first wrote down the API.
>There are various issues about the exact semantics of the functions
>that seem minor at first glance, but have fairly large consequences
>for the internal implementation.
>So, most of this mail is just listing some of places where your
>code did different things than I expected, or I wasn't sure
>what I expected.
>I think once we agree on exactly what the behavior should be for
>these cases, making the code match that shouldn't be hard, even
>it in a few cases it may be a little more complicated than I 
>expected initially.
>So, my apologies if the following has more questions in it than
>                                        Owen
> * One issue that came to mind reading your code and thinking about
>   the problems is the handling of partial characters:
>    - If you call read() with a buffer of length 100, the first 60
>      characters take 99 bytes, and the 61st 2 bytes, does it put
>      half the last character result buffer?
>    - If you call write() with a buffer of length 100, with 66
>      complete characters taking 99 bytes and 1 incomplete character, 
>      does it return a length written of 99, or a length written
>      of 100 and store the partial character internally?
>   If the answer is that conversion should take into account partial
>   characters, then the question is, what should happen when the
>   encoding for the stream is UTF-8? What should happen when it is
>   NULL?
> * I think this code probably should be using iconv() directly
>   instead of going through g_convert(). The advantage of using
>   iconv() is that we can convert the number of characters that
>   fit into a fixed size output buffer, and we don't have to malloc
>   intermediate buffers; also using one iconv_t converter for
>   the entire stream is a potentially a fairly big efficiency win.
>   A final reason for this is that if you use iconv() directly,
>   then an encoding like UTF-16 will handle the BOM correctly,
>   since it will only reset the converter once for the stream,
>   not for every line.
>   This obviously does pose a problem for doing 'fallbacks'
>   in the style of g_convert_with_fallback().
> * Changing the line terminator depending on the platform is a little
>   dubious - files don't nicely stick to their on platforms.
>   I think what I'd do is have a enum:
>    GIOLineTerminator 
>    {
>      G_IO_LINE_AUTO, 
>      G_IO_LINE_DOS,       /* \r\n */
>      G_IO_LINE_UNIX,      /* \n */
>      G_IO_LINE_MACINTOSH  /* \r */
>    };
>   And have the default be AUTO - which means, separating on
>   any of \r, \n, or \r\n. Though this requires one character
>   read-ahead, hmmm, which means that using \r termination
>   with AUTO is unsuitable for network protocol type applications
>   where responses to complete lines are needed.
>   Related to this is the fact that we can't assume the input
>   encoding will have literal \n, \r characters in it; I'd 
>   certainly like to be able to handle UTF-16 as a file encoding.
>   This means that conversion needs to be done _before_ looking
>   for terminators, not after.
> * I'm not sure that converting the virtual functions in GIOFunc
>   to take GError * arguments is the right approach. The problem
>   is that GError * should only be used for "real" errors - it
>   is fairly expensive to duplicate a string, allocate memory
>   for the GError and so forth.
>   And read() and write() have two non-error conditions that are
>   signalled by an "errno" - EAGAIN and EINTR. With the
>   "io_channel_read_chars()" type signature, there is a clear
>   mapping for EAGAIN that doesn't involve setting the GError * - 
>   return 0 bytes read with err == NULL.
>   But there is no corresponding mapping for EINTR. One approach
>   to handing EINTR would be to make g_io_channel_read()/write() 
>   simply always restart on EINTR; that does affect people
>   trying to do, say, read/write timeouts with SIGALRM. But
>   I don't know if anybody really does that with GIOChannel.   
> * When a converter is involved, no buffering may not sense. 
>   This is certainly true when reading. If a partial character
>   is read from the stream, I can't convert it and return
>   it to the user, so I have to store it somewhere.
>   On the read side, in fact, buffering is basically invisible
>   to the caller and is an implementation detail.
>   On the write side, we have the option of returning a short
>   write when partial characters are involved. (See the above
>   question.) Though this may not be particularly convenient
>   for the caller.
>   On the write side buffering is not invisible to the caller;
>   the timing of buffer flushing has important consequences.
>   Callers will have somewhat different expectations for what
>   happens on a non-buffered stream when they call write()
>   depending on whether the stream is blocking or non-blocking.
>   The naive expecation is:
>    For a blocking stream, all complete characters will be written
>    to the network out before the write() operation returns.
>    For a non-blocking stream, a short write will occur, with
>    the amount written being the amount of data that fits
>    into the network buffers.
>   But the second expectation isn't really implementable with
>   encoding conversion, since we have to convert the data
>   before we try writing it.
>   So, I think, instead the correct behavior for a "non-buffered" 
>   non-blocking stream is that as much data as possible will
>   be written to the network before returning, but with the
>   potential for having more written to the internal GIOChannel buffer.
>   After all, the caller can't really tell if the data is
>   being buffered in the GIOChannel or the system's network
>   layer.
>   With these ideas in mind, I'm not sure that my original idea
>   of indicating a non-buffered stream with a buffer size of
>   0 is correct. Instead, I think the correct interface is 
>   probably:
>    /* A hint to as how much buffering is useful. GIOChannel
>     * may not use exactly this size.
>     */
>    g_io_channel_set_buffer_size (GIOChannel	*channel,
>                                  gint		 size);
>    g_io_channel_set_auto_flush  (GIOChannel    *channel,
>                                  gboolean       auto_flush);
>   Where auto_flush is the non-buffered behavior above. When
>   an conversion is in effect, GLib would make the buffer at
>   least big enough to store partial characters.
> * When I proposed g_io_channel_read_line_string(), I actually
>   meant GString *buffer, not GString **buffer. 
>   The idea is, by using a single GString as the buffer while
>   reading a series of lines, constant reallocation of the line
>   buffer is avoided.
>   g_io_channel_read_line() can then be implemented as:
>    GString *buffer = g_string_new (NULL);
>    g_io_channel_read_line_string (channel, buffer);
>    if (length)
>      *length = buffer->len;
>    if (string)
>      *string = g_string_free (string, FALSE);
>    else
>      g_string_free (string, TRUE);
> * g_io_channel_read_line() should probably handle lines longer than
>   the buffer size. For blocking streams, this is not a problem;
>   you just keep reading, converting and writing to the output 
>   buffer.
>   For non-blocking streams, it is a little more tricky, since 
>   you might want to write something like:
>    gboolean
>    my_watch (GIOChannel *channel, GIOCondition cond, gpointer data)
>    {
>      char *string, int len;
>      GError *err = NULL;
>      while (g_io_channel_read_line (channel, &string, &len, &err))
>        {
>          printf ("Received line: %s\n", string);
>          g_free (string);
>        }
>      if (err)
>        [...]
>    }
>   But this involves keeping an arbitrary-sized input buffer.
>   And if you can mix calls to g_io_channel_read_line()
>   and g_io_channel_read_chars() [ though I can't really see why
>   you would want to do this ], this arbitrary-sized input buffer
>   needs to be accessible by all the read_* functions.
> * GIOChannel is an interface for bidirectional channels, so it 
>   is probably necessary to keep separate read and write buffers;
>   by only allocating the buffers on demand, the overhead of 
>   this for GIOChannels being used unidirectionally should
>   be minimal.
> * A few formatting issues 
>   - GLib/GTK+ style is 'function (' not 'function('; 'if (', not 'if'
>   - GLib/GTK+ has the '{' on the next line GNU style, indented one
>     level. 
>    if (tmperr)
>      {
>        g_propagate_error (err, tmperr);
>        return 0;
>      }
> * Inline doc comments will be needed before the patch goes
>   into GTK+.
> * There needs to be a call like g_io_channel_pending() to check
>   if there are any characters buffered, and the IO channel
>   sources need to check this in their pending() check() methods
>   before checking the underlying stream.
>gtk-devel-list mailing list
>gtk-devel-list gnome org

[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]