[Tracker] Dropping libunac with custom unaccenting

Hi all,

I sent a patch for GB#619244
(https://bugzilla.gnome.org/show_bug.cgi?id=619244) in order to drop the
libunac dependency in tracker.

The patch provides a new per-parser unaccenting method, using the same
logic as the one in libunac, but not needing the explicit conversion
to/from UTF-16BE.
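
As a rough illustration of the idea (this is not the code from the patch), unaccenting can be sketched in Python as decomposing each word and dropping the combining marks, working directly on the decoded string with no UTF-16BE round trip. The function name `unaccent` and the use of Python's `unicodedata` module are assumptions made for this sketch; the real patch is C code inside each Tracker parser.

```python
import unicodedata


def unaccent(word: str) -> str:
    """Strip combining marks from a word (illustrative sketch only)."""
    # Decompose so precomposed characters (NFC input) become
    # base character + combining marks; NFD input is left as it is.
    decomposed = unicodedata.normalize("NFD", word)
    # Keep only characters whose canonical combining class is 0,
    # i.e. drop the accents that were attached to base characters.
    stripped = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    # Recompose whatever is left into NFC.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(unaccent("\u00e9cole"))   # NFC input: precomposed e-acute
    print(unaccent("e\u0301cole"))  # NFD input: e + combining acute
```

Both calls print the same unaccented word, which is why the accents-big.txt test file mixes NFC and NFD text.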

In brief, the results show that in the best case parsing is up to
73% faster, while in the worst case, where no unaccenting needs to be
done, the results remain pretty much the same.

The times given are the best of 10-15 runs with the same file, and
test files are available at:

File: accents-big.txt
 * unaccenting needed in ALL words
 * contains NFC and NFD text
 * size = 2.7MBytes
libunac + glib/pango           --> 1.383s
libunac + libunistring         --> 2.295s
libunac + libicu               --> 1.877s
custom-unaccent + glib/pango   --> 0.587s (58% faster than libunac)
custom-unaccent + libunistring --> 0.822s (65% faster than libunac)
custom-unaccent + libicu       --> 0.519s (73% faster than libunac)

So, if unaccenting needs to be done in ALL words, the libicu parser with
the custom unaccenting method is the fastest one. This is a corner case
anyway, as unaccenting is never really needed in all words of a given
file, but at least it shows how much faster it can go.
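
For reference, the "faster" percentages can be reproduced as the relative reduction in run time, (libunac time - custom time) / libunac time. The sketch below is only a worked calculation on the timings listed above; the helper name `percent_faster` is made up for this example.

```python
def percent_faster(baseline_s: float, new_s: float) -> float:
    """Relative reduction in run time, as a percentage of the baseline."""
    return 100.0 * (baseline_s - new_s) / baseline_s


# (libunac seconds, custom-unaccent seconds) for accents-big.txt
timings = {
    "glib/pango": (1.383, 0.587),
    "libunistring": (2.295, 0.822),
    "libicu": (1.877, 0.519),
}

if __name__ == "__main__":
    for parser, (libunac_s, custom_s) in timings.items():
        print(f"{parser}: {percent_faster(libunac_s, custom_s):.1f}% less time")
```

This gives roughly 57.6%, 64.2% and 72.3%, so the figures quoted in the list appear to be these values rounded up.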

File: mixed-big.txt
 * unaccenting needed only in some words
 * contains mixed languages
 * size = 2.7MBytes
libunac + glib/pango           --> ...several minutes
libunac + libunistring         --> 0.648s
libunac + libicu               --> 0.929s
custom-unaccent + glib/pango   --> ...several minutes
custom-unaccent + libunistring --> 0.386s (41% faster than libunac)
custom-unaccent + libicu       --> 0.636s (32% faster than libunac)

In this case, where only some words need unaccenting (a more general
case than the previous one), the libunistring parser with the custom
unaccenting method is the fastest one. The glib/pango parser does not
perform acceptably with this file.

File: ascii-big.txt
 * no unaccenting needed
 * contains ASCII only
 * size = 2.7MBytes
libunac + glib/pango           --> 0.545s
libunac + libunistring         --> 0.253s
libunac + libicu               --> 0.630s
custom-unaccent + glib/pango   --> 0.495s (10% faster than libunac)
custom-unaccent + libunistring --> 0.252s (almost same)
custom-unaccent + libicu       --> 0.628s (almost same)

In theory, these tests should give more or less the same times for both
the libunac and custom unaccenting methods, and that's the case for the
libicu and libunistring parsers; but the glib/pango one is about 10%
faster when using the custom unaccenting method. The reason seems to be
that the glib/pango implementation doesn't skip unaccenting when the
string is ASCII-only.
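
The ASCII-only shortcut mentioned above could look like the following sketch: a pure-ASCII word cannot contain accents, so it is returned untouched before any normalization work is done. The function name `unaccent_if_needed` is hypothetical and the code is Python (3.7+ for `str.isascii`), not the C code of any of the Tracker parsers.

```python
import unicodedata


def unaccent_if_needed(word: str) -> str:
    """Unaccent a word, skipping all work for ASCII-only input (sketch)."""
    # Fast path: ASCII-only text has nothing to unaccent.
    if word.isascii():
        return word
    # Slow path: decompose, drop combining marks, recompose.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(unaccent_if_needed("tracker"))    # takes the fast path
    print(unaccent_if_needed("caf\u00e9"))  # takes the slow path
```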

