[Tracker] Collations



Hi hi,

So, I've been playing with using a proper Unicode collation function for
ordering results. As we depend on 3 different Unicode support libraries,
I implemented the collation function with all three, to check which one
performs best. I also implemented 2 ways of using the collation:
 * the first one (A cases) enables the collation by default in all text
columns, so that every ORDER BY search on text uses the collation
function.
 * the second one (B cases) adds a new COLLATE keyword to SPARQL, so
ORDER BY searches use binary comparison unless COLLATE is
specified.

The tests were run over a set of 3000 nmm:MusicPieces, using
direct-access. The times given are the best of ~10 tries.
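
For anyone wanting to repeat the "best of ~10 tries" measurement, a minimal sketch of such a harness could look like this. The helper names `best_of` and `run_query` are made up for illustration, and this measures wall-clock time from Python rather than using the shell's `time`, so absolute numbers will not match the ones below.

```python
import subprocess
import time


def best_of(run, tries=10):
    """Return the smallest wall-clock duration (in seconds) over `tries` calls."""
    best = float("inf")
    for _ in range(tries):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best


def run_query(query):
    # Hypothetical wrapper: assumes tracker-sparql is installed and on PATH.
    # Output is discarded, like the ">/dev/null" in the commands of this mail.
    subprocess.run(
        ["tracker-sparql", "-q", query],
        stdout=subprocess.DEVNULL,
        check=True,
    )


# Example use (needs a populated Tracker store, so it is left commented out):
# query = "SELECT ?title WHERE { ?u a nmm:MusicPiece ; nie:title ?title }"
# print(best_of(lambda: run_query(query)))
```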

A) The collation function is enabled for all TEXT columns in the
database when creating/altering a table. If indexes are created on these
columns, the given collation function will be used for the index. This
is done in the 'collation' remote branch in GNOME git.
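
As a rough sketch of what approach A amounts to at the SQLite level (this is not the code in the branch): the snippet uses Python's sqlite3 module, a made-up `music` table, and a hypothetical case-insensitive comparison `demo_collate` standing in for a real locale-aware Unicode collation from libicu, libunistring or glib.

```python
import sqlite3


def demo_collate(a, b):
    # Stand-in comparison (assumption): case-insensitive ordering only,
    # not a real locale-aware Unicode collation.
    ka, kb = a.casefold(), b.casefold()
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("DEMO_COLLATION", demo_collate)

# The collation is attached to the column itself, so an index created on
# that column is ordered with the same collation.
conn.execute("CREATE TABLE music (title TEXT COLLATE DEMO_COLLATION)")
conn.execute("CREATE INDEX music_title ON music (title)")
conn.executemany(
    "INSERT INTO music VALUES (?)",
    [("cherry",), ("Banana",), ("apple",)],
)

# A plain ORDER BY picks up the column's collation without asking for it.
ordered_titles = [
    row[0] for row in conn.execute("SELECT title FROM music ORDER BY title")
]
print(ordered_titles)
```

With binary ordering "Banana" would sort first, because uppercase letters have lower byte values than lowercase ones.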

A.1) Select all titles unordered:
$> time tracker-sparql -q "SELECT ?title WHERE { ?u a nmm:MusicPiece ;
nie:title ?title }" >/dev/null 
 * libicu: 0.182s
 * libunistring: 0.203s
 * glib: 0.328s

A.2) Select all titles (column with index) ordered:
$> time tracker-sparql -q "SELECT ?title WHERE { ?u a nmm:MusicPiece ;
nie:title ?title } ORDER BY (?title)" >/dev/null 
 * libicu: 0.182s
 * libunistring: 0.204s
 * glib: 0.330s

A.3) Select all genres (column without index) ordered:
$> time tracker-sparql -q "SELECT ?genre WHERE { ?u a nmm:MusicPiece ;
nfo:genre ?genre } ORDER BY (?genre)" >/dev/null 
 * libicu: 0.203s
 * libunistring: 0.252s
 * glib: 0.416s


B) Another implementation is done in the 'collate-keyword' branch in
GNOME git. In this case, the collation function is registered in the
SQLite connection, but not used during table or index creation. It is
used only if explicitly requested with a new COLLATE keyword in the
SPARQL query. So plain "ORDER BY (column)" applies binary order (no
collation), while "ORDER BY (column) COLLATE" applies the collation,
only for that search.
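
A comparable sketch for approach B, again at the SQLite level and not taken from the branch: the collation is only registered on the connection, and the table and index stay in binary order. The `music` table and the case-insensitive `demo_collate` stand-in are assumptions for illustration. Note that in plain SQL the COLLATE clause needs a collation name, unlike the bare SPARQL keyword described above.

```python
import sqlite3


def demo_collate(a, b):
    # Stand-in comparison (assumption): case-insensitive ordering only,
    # not a real locale-aware Unicode collation.
    ka, kb = a.casefold(), b.casefold()
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
# Registered on the connection, but not referenced by any table or index.
conn.create_collation("DEMO_COLLATION", demo_collate)

conn.execute("CREATE TABLE music (title TEXT)")
conn.execute("CREATE INDEX music_title ON music (title)")  # binary order
conn.executemany(
    "INSERT INTO music VALUES (?)",
    [("cherry",), ("Banana",), ("apple",)],
)

# Default: binary comparison.
binary_order = [
    row[0] for row in conn.execute("SELECT title FROM music ORDER BY title")
]
# Collation applied only because this query asks for it.
collated_order = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM music ORDER BY title COLLATE DEMO_COLLATION"
    )
]

print(binary_order)    # ['Banana', 'apple', 'cherry']
print(collated_order)  # ['apple', 'Banana', 'cherry']
```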

B.1) Select all titles unordered:
$> time tracker-sparql -q "SELECT ?title WHERE { ?u a nmm:MusicPiece ;
nie:title ?title }" >/dev/null 
 * libicu: 0.161s
 * libunistring: 0.158s
 * glib: 0.154s

B.2) Select all titles (column with index) ordered with collation:
$> time tracker-sparql -q "SELECT ?title WHERE { ?u a nmm:MusicPiece ;
nie:title ?title } ORDER BY (?title) COLLATE" >/dev/null 
 * libicu: 0.183s
 * libunistring: 0.196s
 * glib: 0.264s

B.3) Select all genres (column without index) ordered with collation:
$> time tracker-sparql -q "SELECT ?genre WHERE { ?u a nmm:MusicPiece ;
nfo:genre ?genre } ORDER BY (?genre) COLLATE" >/dev/null 
 * libicu: 0.184s
 * libunistring: 0.199s
 * glib: 0.241s


Comments:

 * When setting collation in the column (A cases), collation is used
for the indexes, so the collation rules are applied when inserting the
elements in the database. If the user changes her locale settings, the
existing index stays ordered by the old locale's rules, so it would be
wrong for the new locale. This issue doesn't apply to the B cases, where
the collation applies only to the current query.

 * When setting collation in the column (A cases) there seems to be an
impact on the search time, even if ORDER BY is not used (different
values for glib/icu/unistring in case A.1; and case A.1 compared to
B.1). This is pretty strange, and I don't really know why. Someone?

 * When using the COLLATE keyword in B cases, having or not having an
index makes no difference, as the order of the index (binary order) is
not the order requested in the query (collation order).
 
 * Yes, libicu is the fastest in every test except B.1 (where no
collation is involved and the three are within a few ms of each other),
even if it is internally based on UTF-16. I assume this is because
sqlite passes str + len(str) to the collation method, where str is not
NUL-terminated (so len is always needed). The libicu API supports
exactly that; but both glib and libunistring collation functions need
NUL-terminated strings (so a terminated copy has to be built first, and
thus they're slower).

 * Another option would be to enable collation only in the created
indexes, not in all text columns. I didn't do this because then
collation would only be applied when ordering by a column with an index.

 * These tests are all with locale-based Unicode collation; 'title
collations' (removing "The" or "A" from the start of the string before
collating) are not supported.
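
To make the last point concrete, here is a minimal sketch of what a 'title collation' could look like as a comparison function. This is not implemented in either branch: the article list, the `strip_article` helper and the `TITLE_DEMO` collation name are all made up, and the comparison is a plain case-insensitive one rather than a locale-aware collation.

```python
import sqlite3

# Hypothetical, English-only list of leading articles to ignore.
ARTICLES = ("the ", "a ")


def strip_article(title):
    """Drop one leading article ("The ", "A ") from the title, if present."""
    for article in ARTICLES:
        if title[: len(article)].casefold() == article:
            return title[len(article):]
    return title


def title_collate(a, b):
    # Compare the titles after removing the leading article from each.
    ka = strip_article(a).casefold()
    kb = strip_article(b).casefold()
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("TITLE_DEMO", title_collate)
conn.execute("CREATE TABLE music (title TEXT)")
conn.executemany(
    "INSERT INTO music VALUES (?)",
    [("The Wall",), ("Thriller",), ("A Night at the Opera",), ("Abbey Road",)],
)

ordered = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM music ORDER BY title COLLATE TITLE_DEMO"
    )
]
print(ordered)
```

Here "The Wall" sorts under W and "A Night at the Opera" under N, while "Abbey Road" is left alone because it only starts with the letter A, not the article.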


Cheers,

-- 
Aleksander



