Re: Text editing and UTF-8 conversion



On 7/16/2010 4:16 AM, Tor Lillqvist wrote:
>> Is there anything that I could do differently to increase the loading speed
>> or is this just something I need to just deal with and move on to the next
>> item on the list?
>
> You need to find out what exactly it is that is taking a lot of time,
> and experiment with different ways to do that. I.e., write separate
> test programs to explore different strategies for:
>
> - conversion. Do you convert a line at a time, some buffer of some
> length at a time, all the file at a time, or what?
>
> - reading in the file data. Do you read the whole file into memory in
> one go? or a line at a time? or what?

Thank you for your response, Tor. I have tried a few things to figure out where the time is being spent. That is why I created a simple test application: all it does is read/convert a file while I time it. I also tried a profiler application I found online to see exactly where the time goes, but its results looked incorrect to me. According to it, the time was spent in g_uri_list_extract_uris(), which has no relevance to this code. Perhaps it is not reading the symbols from the GLib library correctly.

Currently, my code reads and converts the file in small chunks. I've tried chunk sizes anywhere from one megabyte up to the entire file, but the chunk size does not appear to affect the speed much. GIO reads the file quite fast; the slow part of this app is the g_locale_to_utf8() call.
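
For reference, here is a simplified sketch of the kind of read loop I'm describing. The chunk size, the GTimer-based timing, and the file handling are just illustrative, and error handling is omitted:

#include <gio/gio.h>

/* Simplified sketch of a chunked GIO read loop.  CHUNK_SIZE and the
 * file path are placeholders; error handling is omitted. */
#define CHUNK_SIZE (1024 * 1024)   /* 1 MB per read, just an example */

static void
read_in_chunks (const char *path)
{
  GFile *file = g_file_new_for_path (path);
  GFileInputStream *in = g_file_read (file, NULL, NULL);
  gchar *buf = g_malloc (CHUNK_SIZE);
  GTimer *timer = g_timer_new ();
  gssize n;

  while ((n = g_input_stream_read (G_INPUT_STREAM (in), buf,
                                   CHUNK_SIZE, NULL, NULL)) > 0)
    {
      /* validation / conversion of buf[0..n) happens here */
    }

  g_print ("Read entire file in %.0f seconds.\n",
           g_timer_elapsed (timer, NULL));

  g_timer_destroy (timer);
  g_free (buf);
  g_object_unref (in);
  g_object_unref (file);
}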

For the conversion itself, I've tried a few things as well. Currently, I call g_utf8_validate() on the data received in the last read to validate the entire buffer; this call executes quickly no matter the buffer size. If validation fails, I convert the whole buffer with g_locale_to_utf8(). I also tried converting only the data from the failure point (as reported by g_utf8_validate()) onward, but that too had no real effect on the speed.
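
The validate/convert step per chunk is roughly the following (simplified; what happens to the converted text afterwards is left out):

#include <glib.h>

/* Rough sketch of the per-chunk validate/convert step described above.
 * "buf" is the chunk just read and "len" its length. */
static gchar *
chunk_to_utf8 (const gchar *buf, gssize len)
{
  const gchar *end = NULL;
  gsize bytes_read = 0, bytes_written = 0;
  GError *error = NULL;
  gchar *utf8;

  if (g_utf8_validate (buf, len, &end))
    {
      g_print ("Valid UTF8 data read.\n");
      return g_strndup (buf, len);
    }

  /* Validation failed somewhere in the chunk: convert the whole
   * buffer from the locale encoding to UTF-8.  This is the call
   * that dominates the run time on Windows. */
  utf8 = g_locale_to_utf8 (buf, len, &bytes_read, &bytes_written, &error);
  if (utf8 == NULL)
    {
      g_warning ("Conversion failed: %s", error->message);
      g_clear_error (&error);
      return NULL;
    }

  g_print ("Converted data to UTF8 successfully.\n");
  return utf8;
}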

I've also tried g_convert(), telling it explicitly to convert from ISO-8859-1 to UTF-8, but that made no difference in the run time either. The docs say that g_iconv(), not g_convert(), should be used for streaming conversion, so I will try that route the next time I work on the application.
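
In case it helps anyone following along, this is the sort of per-chunk g_iconv() loop I'm planning to try. The ISO-8859-1 source encoding and the output buffer sizing are just my assumptions, and a real loop would also have to carry any unconverted trailing bytes over into the next chunk:

#include <glib.h>

/* Sketch of a streaming g_iconv() conversion for one chunk. */
static gchar *
iconv_chunk (GIConv cd, gchar *inbuf, gsize inbytes)
{
  gsize outbytes = inbytes * 4;        /* generous room for UTF-8 expansion */
  gchar *out = g_malloc (outbytes + 1);
  gchar *outp = out;

  if (g_iconv (cd, &inbuf, &inbytes, &outp, &outbytes) == (gsize) -1)
    {
      /* E2BIG / EILSEQ / EINVAL handling would go here */
      g_free (out);
      return NULL;
    }

  *outp = '\0';
  return out;
}

/* Usage: open the converter once and reuse it for every chunk, e.g.
 *
 *   GIConv cd = g_iconv_open ("UTF-8", "ISO-8859-1");
 *   gchar *utf8 = iconv_chunk (cd, buf, n);
 *   ...
 *   g_iconv_close (cd);
 */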

One interesting note: I copied and built my test program on the older Ubuntu-based machine I use as my personal web server. There it executed significantly faster than on my Windows machine. I'm not sure why; maybe it's due to a native iconv rather than libiconv? The output from the two runs is:

Windows: (pkg-config --modversion glib-2.0) = 2.24.0
-------------------------------------------------------------------
Converted data to UTF8 successfully.
Read 52428800 bytes; 52428800 of 77594624 bytes total           [67.57%]

Converted data to UTF8 successfully.
Read 25165824 bytes; 77594624 of 77594624 bytes total           [100.00%]

Valid UTF8 data read.
Read 0 bytes; 77594624 of 77594624 bytes total          [100.00%]

Read entire file in 10 seconds.
-------------------------------------------------------------------

Ubuntu: (pkg-config --modversion glib-2.0) = 2.20.1
-------------------------------------------------------------------
Converted data to UTF8 successfully.
Read 52428800 bytes; 52428800 of 77594624 bytes total           [67.57%]

Converted data to UTF8 successfully.
Read 25165824 bytes; 77594624 of 77594624 bytes total           [100.00%]

Valid UTF8 data read.
Read 0 bytes; 77594624 of 77594624 bytes total          [100.00%]

Read entire file in 1 seconds.
-------------------------------------------------------------------


I will keep trying a few things for another day or two. If anyone has other suggestions, I'd be happy to hear them. I have not done much programming with large files, and even less with character encodings.

Thanks,
Keith M.




