Re: Text editing and UTF-8 conversion
- From: Keith Maika <keithm aoeex com>
- To: gtk-app-devel-list gnome org
- Subject: Re: Text editing and UTF-8 conversion
- Date: Fri, 16 Jul 2010 16:12:48 -0400
On 7/16/2010 4:16 AM, Tor Lillqvist wrote:
>> Is there anything that I could do differently to increase the loading speed
>> or is this just something I need to just deal with and move on to the next
>> item on the list?
>
> You need to find out what exactly it is that is taking a lot of time,
> and experiment with different ways to do that. I.e., write separate
> test programs to explore different strategies for:
>
> - conversion. Do you convert a line at a time, some buffer of some
>   length at a time, all the file at a time, or what?
>
> - reading in the file data. Do you read the whole file into memory in
>   one go? or a line at a time? or what?
Thank you for your response, Tor.  I have tried a few things to figure
out where the time is being taken up.  This is why I created a simple
test application where all it does is read and convert a file, and I
time that.  I tried using a profiler application I found online to see
exactly where the time is spent, but the results I got from it looked
incorrect to me.  According to it, the time was spent in
g_uri_list_extract_uris(), which seems to have no relevance to this
code.  Perhaps it is not reading symbols correctly from the GLib
library.
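When a profiler gives suspicious results, one low-tech fallback is to
time each stage by hand.  The following is only a rough sketch in plain
C: cpu_seconds() and busy_work() are hypothetical names, and
busy_work() is just a placeholder for the real read or convert step.
In standard C, clock() reports processor time, so this suits a
CPU-bound step such as conversion better than a step that mostly waits
on I/O.

```c
#include <assert.h>
#include <time.h>

/* Run fn(arg) once and return the processor time it used, in seconds. */
double cpu_seconds(void (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    clock_t start = clock();
    fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Placeholder workload (hypothetical): adds 0..999 into *arg.
 * In the real test program this would be the read or convert stage. */
void busy_work(void *arg)
{
    unsigned long *sum = arg;
    for (unsigned long i = 0; i < 1000; i++)
        *sum += i;
}
```

Timing the read stage and the convert stage separately this way gives
sub-second figures for each, which is finer than a whole-second total.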
Currently, my code reads and converts the file in chunks.  I've tried
adjusting the chunk size to be anywhere from 1 MB to the entire file
size.  The chunk size does not appear to noticeably affect the speed.
GIO reads the file quite fast; the slow part of this app is the
g_locale_to_utf8 call.
For the conversion itself, I've tried a few things as well.  Currently,
I call g_utf8_validate on the data received in the last read, to
validate the entire buffer.  This function call executes pretty quickly
no matter the buffer size.  If this call fails, I attempt to convert
the data using g_locale_to_utf8, executing it on the entire buffer as
well.  I tried changing this to run only on the data from the failure
point (as indicated by g_utf8_validate) onward, but that also appeared
to have no real effect on the speed.
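The validate-then-convert-from-the-failure-point approach described
above can be sketched in plain C without GLib.  This is an illustration
under stated assumptions, not the code from the test program:
utf8_valid_prefix() stands in for g_utf8_validate() and its end
pointer, latin1_to_utf8() stands in for the conversion call and assumes
the non-UTF-8 input is ISO-8859-1, and all three function names are
made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the length of the longest prefix of s[0..n) that is valid UTF-8.
 * A sequence truncated at the end of the buffer counts as invalid. */
static size_t utf8_valid_prefix(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t len;
        unsigned long cp, min;

        if (c < 0x80) { i++; continue; }                 /* ASCII */
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return i;                                   /* bad lead byte */

        if (n - i < len) return i;                       /* truncated */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return i;     /* bad continuation */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return i;
        i += len;
    }
    return n;
}

/* Convert ISO-8859-1 bytes to UTF-8; out must hold at least 2*n bytes. */
static size_t latin1_to_utf8(const unsigned char *in, size_t n,
                             unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];
        } else {
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return o;
}

/* Copy the valid UTF-8 prefix unchanged and convert only the rest.
 * out must hold at least 2*n bytes. Returns the number of bytes written. */
size_t to_utf8_from_failure_point(const unsigned char *in, size_t n,
                                  unsigned char *out)
{
    assert(in != NULL && out != NULL);
    size_t ok = utf8_valid_prefix(in, n);
    memcpy(out, in, ok);
    return ok + latin1_to_utf8(in + ok, n - ok, out + ok);
}
```

For the 4-byte input "abc" followed by 0xE9, the valid prefix is 3
bytes and only the last byte is converted.  Note that in this sketch a
multibyte character split across a chunk boundary is reported as
invalid, so a chunked reader would need to carry such trailing bytes
over to the next chunk.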
I've also tried g_convert(), specifying a conversion from ISO-8859-1 to
UTF-8, but this change made no difference in the run time either.  I
read in the docs that g_iconv() should be used for streaming conversion
rather than g_convert(), so I will try that route next time I work on
the application.
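A streaming conversion of the kind mentioned above might look roughly
like the sketch below.  It calls POSIX iconv() directly so that it does
not depend on GLib; g_iconv_open(), g_iconv() and g_iconv_close() take
their arguments in the same order.  The function name convert_stream()
and its parameters are made up for this illustration, the source
encoding is assumed to be ISO-8859-1, and the input is an in-memory
buffer fed in pieces rather than a file.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Convert in[0..in_len) from ISO-8859-1 to UTF-8 into out, feeding the
 * converter at most `chunk` bytes per call.
 * Returns the number of bytes written, or (size_t)-1 on error. */
size_t convert_stream(const char *in, size_t in_len,
                      char *out, size_t out_cap, size_t chunk)
{
    assert(chunk > 0);

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");  /* (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;      /* iconv() wants char **, but only reads */
    char *outp = out;
    size_t remaining = in_len;
    size_t out_left = out_cap;

    while (remaining > 0) {
        size_t fed = remaining < chunk ? remaining : chunk;
        size_t in_left = fed;

        size_t rc = iconv(cd, &inp, &in_left, &outp, &out_left);
        /* EINVAL only means the chunk ended inside a multibyte character;
         * the unconsumed bytes are offered again with the next chunk. */
        if (rc == (size_t)-1 && errno != EINVAL) {
            iconv_close(cd);
            return (size_t)-1;   /* E2BIG (out too small) or EILSEQ */
        }
        if (in_left == fed) {    /* no progress: give up rather than spin */
            iconv_close(cd);
            return (size_t)-1;
        }
        remaining -= fed - in_left;
    }

    iconv(cd, NULL, NULL, &outp, &out_left);  /* flush any shift state */
    iconv_close(cd);
    return (size_t)(outp - out);
}
```

The converter is opened once and reused for every chunk, which is the
main difference from calling a one-shot conversion function on each
buffer.  Whether that helps with the run time here would still need to
be measured.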
One interesting note: I copied and built my test program on the older
Ubuntu-based computer I use as my personal web server.  On there, it
executed significantly faster than on my Windows machine.  I'm not sure
why this is; maybe it is due to a native iconv rather than libiconv?
The output from the two runs is:
Windows: (pkg-config --modversion glib-2.0) = 2.24.0
-------------------------------------------------------------------
Converted data to UTF8 successfully.
Read 52428800 bytes; 52428800 of 77594624 bytes total           [67.57%]
Converted data to UTF8 successfully.
Read 25165824 bytes; 77594624 of 77594624 bytes total           [100.00%]
Valid UTF8 data read.
Read 0 bytes; 77594624 of 77594624 bytes total          [100.00%]
Read entire file in 10 seconds.
-------------------------------------------------------------------
Ubuntu: (pkg-config --modversion glib-2.0) = 2.20.1
-------------------------------------------------------------------
Converted data to UTF8 successfully.
Read 52428800 bytes; 52428800 of 77594624 bytes total           [67.57%]
Converted data to UTF8 successfully.
Read 25165824 bytes; 77594624 of 77594624 bytes total           [100.00%]
Valid UTF8 data read.
Read 0 bytes; 77594624 of 77594624 bytes total          [100.00%]
Read entire file in 1 seconds.
-------------------------------------------------------------------
I will keep trying a few things for another day or two.  If anyone has
other suggestions, I'd be happy to hear them.  I have not done much
programming with large files, and even less with character encodings.
Thanks,
Keith M.