Re: Text editing and UTF-8 conversion

From: Keith Maika <keithm aoeex com>
To: gtk-app-devel-list gnome org
Subject: Re: Text editing and UTF-8 conversion
Date: Fri, 16 Jul 2010 16:12:48 -0400

On 7/16/2010 4:16 AM, Tor Lillqvist wrote:

>> Is there anything that I could do differently to increase the loading speed
>> or is this just something I need to just deal with and move on to the next
>> item on the list?
>
> You need to find out what exactly it is that is taking a lot of time,
> and experiment with different ways to do that. I.e., write separate
> test programs to explore different strategies for:
>
> - conversion. Do you convert a line at a time, some buffer of some
> length at a time, all the file at a time, or what?
>
> - reading in the file data. Do you read the whole file into memory in
> one go? or a line at a time? or what?

Thank you for your response, Tor. I have tried a few things to figure
out where the time is being taken up. This is why I created a simple
test application where all it does is read/convert a file, and I time
that. I tried using a profiler application I found online to see where
exactly the time is spent, but the results I got from it sounded
incorrect to me. According to it, time was spent in
g_uri_list_extract_uris(), which seems to have no relevance in this
code. Perhaps it is not reading symbols correctly from the GLib library
calls.
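When a profiler's symbol names look unreliable, timing each step by hand is a workable fallback. The following is only a minimal sketch in plain C: time_call(), struct byte_span and count_high_bytes() are hypothetical names made up for illustration, not part of the test application or of GLib. On POSIX systems clock() reports processor time, which suits a CPU-bound step such as the conversion but does not include time spent waiting on the disk.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one step of the program to be timed. */
typedef void (*timed_fn)(void *arg);

/* Run fn(arg) once and return the time clock() attributes to it, in seconds. */
double time_call(timed_fn fn, void *arg)
{
    assert(fn != NULL);
    clock_t start = clock();
    fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Example workload: count the bytes that are not plain ASCII, i.e. the
 * bytes that would make a Latin-1 buffer differ from its UTF-8 form. */
struct byte_span {
    const unsigned char *data;
    size_t len;
    size_t high;   /* result: number of bytes >= 0x80 */
};

void count_high_bytes(void *arg)
{
    struct byte_span *s = arg;
    s->high = 0;
    for (size_t i = 0; i < s->len; i++)
        if (s->data[i] >= 0x80)
            s->high++;
}
```

Wrapping the read, the validation and the conversion in separate callbacks and timing each with time_call() gives one number per step, independent of whatever the profiler reports.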

Currently, my code reads and converts the file in small chunks. I've
tried adjusting the chunk size to be anywhere from one megabyte to the
entire file size. The chunk size does not appear to affect the speed
much. GIO reads the file quite fast; the slow part of this app is the
g_locale_to_utf8 call.

For the conversion itself, I've tried a few things as well. Currently,
I call g_utf8_validate on the data received in the last read, to
validate the entire buffer. This function call executes pretty quickly
no matter the buffer size. If this call fails, I attempt to convert
the data using g_locale_to_utf8, executing it as well on the entire
buffer. I attempted changing this to only run on the data from the
failure point (as indicated by g_utf8_validate) and beyond, but that
also appeared to have no real effect on the speed.
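The validate-then-convert-the-tail idea described above can be sketched without GLib as follows. This is a plain C stand-in under stated assumptions, not the code from the test application: utf8_valid_prefix() plays the role of g_utf8_validate's end pointer, latin1_to_utf8() replaces g_locale_to_utf8 and assumes the fallback encoding is ISO-8859-1, and chunk_to_utf8() is a hypothetical wrapper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Return how many leading bytes of s form valid UTF-8. A sequence that
 * is cut off at the end of the buffer counts as invalid here. */
size_t utf8_valid_prefix(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t need, k;          /* need: continuation bytes expected */
        unsigned long cp, min;   /* min: smallest code point for this length */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else break;              /* stray continuation or invalid lead byte */

        if (len - i <= need) break;             /* truncated sequence */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) break;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (k <= need) break;                   /* bad continuation byte */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            break;                              /* overlong, too large, surrogate */
        i += need + 1;
    }
    return i;
}

/* ISO-8859-1 to UTF-8: out must have room for 2 * len bytes. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];
        } else {
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return o;
}

/* Copy the valid prefix unchanged and convert only the rest.
 * Returns a malloc'd, NUL-terminated buffer, or NULL if out of memory. */
unsigned char *chunk_to_utf8(const unsigned char *in, size_t len, size_t *out_len)
{
    assert(in != NULL && out_len != NULL);
    size_t ok = utf8_valid_prefix(in, len);
    unsigned char *out = malloc(2 * len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, in, ok);
    *out_len = ok + latin1_to_utf8(in + ok, len - ok, out + ok);
    out[*out_len] = '\0';
    return out;
}
```

Both passes are a single linear scan over the buffer, which gives a rough baseline for how long a Latin-1 conversion of this size should take. The caller frees the returned buffer with free().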

I've also attempted g_convert(), specifying that it should convert from
ISO-8859-1 to UTF-8, but this change made no difference in the run time
either. I read in the docs that g_iconv() should be used for streaming
conversion and not g_convert(), so I will try that route next time I
work on the application.
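For reference, a streaming conversion loop might look roughly like the sketch below. It uses POSIX iconv() directly rather than GLib so that it stands alone; g_iconv_open(), g_iconv() and g_iconv_close() take their arguments in the same order, so the same loop shape should carry over. The function name stream_to_utf8() and the 4096-byte buffers are arbitrary choices for illustration, and flushing the shift state of stateful source encodings is left out.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Read 'in' to the end, convert from from_charset to UTF-8, write the
 * result to 'out'. Returns 0 on success, -1 on any error. */
int stream_to_utf8(FILE *in, FILE *out, const char *from_charset)
{
    assert(in != NULL && out != NULL && from_charset != NULL);

    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return -1;

    char inbuf[4096], outbuf[4096];
    size_t held = 0;   /* bytes of an incomplete sequence carried over */
    int rc = 0;

    for (;;) {
        size_t got = fread(inbuf + held, 1, sizeof inbuf - held, in);
        if (got == 0) {
            /* Read error, or input ended in the middle of a character. */
            if (ferror(in) || held > 0)
                rc = -1;
            break;
        }

        char *src = inbuf;
        size_t src_left = held + got;

        while (src_left > 0) {
            char *dst = outbuf;
            size_t dst_left = sizeof outbuf;
            size_t r = iconv(cd, &src, &src_left, &dst, &dst_left);
            int err = (r == (size_t)-1) ? errno : 0;

            /* Write whatever was produced, even if iconv stopped early. */
            size_t produced = sizeof outbuf - dst_left;
            if (fwrite(outbuf, 1, produced, out) != produced) {
                rc = -1;
                goto done;
            }
            if (err == EINVAL)
                break;              /* incomplete sequence: wait for more input */
            if (err != 0 && err != E2BIG) {
                rc = -1;            /* EILSEQ or another hard failure */
                goto done;
            }
            /* E2BIG: output buffer was full, loop again with a fresh one. */
        }

        /* Keep any unconsumed tail at the front for the next read. */
        memmove(inbuf, src, src_left);
        held = src_left;
    }

done:
    iconv_close(cd);
    return rc;
}
```

The point of this shape is that the converter is opened once and reused for every chunk, and a multi-byte character split across two reads is carried over instead of being reported as invalid.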

One interesting note: I copied and built my test program on my older
Ubuntu-based computer that I use as my personal web server. On there,
it executed significantly faster than on my Windows machine. I'm not
sure why this is, maybe due to a native iconv rather than libiconv?
The output from the two runs is:


Windows: (pkg-config --modversion glib-2.0) = 2.24.0
-------------------------------------------------------------------
Converted data to UTF8 successfully.
Read 52428800 bytes; 52428800 of 77594624 bytes total           [67.57%]

Converted data to UTF8 successfully.
Read 25165824 bytes; 77594624 of 77594624 bytes total           [100.00%]

Valid UTF8 data read.
Read 0 bytes; 77594624 of 77594624 bytes total          [100.00%]

Read entire file in 10 seconds.
-------------------------------------------------------------------

Ubuntu: (pkg-config --modversion glib-2.0) = 2.20.1
-------------------------------------------------------------------
Converted data to UTF8 successfully.
Read 52428800 bytes; 52428800 of 77594624 bytes total           [67.57%]

Converted data to UTF8 successfully.
Read 25165824 bytes; 77594624 of 77594624 bytes total           [100.00%]

Valid UTF8 data read.
Read 0 bytes; 77594624 of 77594624 bytes total          [100.00%]

Read entire file in 1 seconds.
-------------------------------------------------------------------

I will keep trying a few things for another day or two. If there are
any other suggestions from anyone, I'd be happy to hear them. I have
not dealt with programming for large files much, and even less with
character encodings.


Thanks,
Keith M.

