[Tracker] Collation implementation



Hi hi JÃrg, Mikael & friends,

I'm writing here the different collation implementations we've been
considering, trying to write down the pros and cons of each solution;
and hopefully we can reach an agreement on what is the best one to get
implemented in Tracker.

The basic list of requirements related to collation would be:
 a) Order of results of queries based on collation, instead of binary
comparison of strings.
 b) Collation is based on locale of the user, so changes of locale must
be supported.
 c) Tracker should support returning collation keys of the result set of
a query, like enabling a new tracker:collate() method in SPARQL.
 d) Something else? 

****************
1) Using a new COLLATE keyword in SPARQL queries.
This solution is probably the slowest one and probably not wanted by
anyone, but anyway, I'm also considering it here. The idea is to have a
new "COLLATE" keyword in SPARQL, so that if you say "ORDER BY (?column)
COLLATE" it will force using collation in the order of the results.

Pros:
+++ Collation only used when needed. Not sure if that's a pro anyway; as
soon as you are doing ORDER BY you are expecting a proper order, and the
one provided by the binary comparison is useless even in English (for
example in small letter vs capital letter).
+++ Doesn't need neither changes in the db tables, nor any special
treatment at database level.
+++ No performance degradation when inserting new elements.
+++ Doesn't need any special change to support locale changes, as every
query is independent, so for each one the locale in that moment will be
used.

Cons: 
--- Slowest method probably, as it means that collation function is used
for comparison for every query which is done to the store.
--- If collation keys requested, they must be computed for each query,
as they are nowhere stored.
--- Order of results may be different if using dbus or direct-access
backends. If using dbus the locale used for collation is the one of the
tracker-store; while if using direct-access the locale used would be the
one of the process making the query.

****************
2) Set the collation method by default in the text columns when creating
the db tables.
We actually already agreed on doing this, before realizing that we also
need to be able to return collation keys (requirement C, above).

Pros:
+++ Fast in columns with indexes, as the collation would be used to
create the indexes properly, and during the query the order is already
implicit.
+++ Integration with sqlite is direct, just enable the collation
function when creating the column, and it will use it directly wherever
needed. Easier to maintain, probably.
+++ Almost no performance degradation when inserting new elements (see
previous emails on this topic).

Cons:
--- If collation keys requested, they must be computed for each query,
as they are nowhere stored.
--- If a query is done using a column without index for ordering,
collation function will be used to order the results, so it will be as
slow as method 1).
--- On locale change, indexes must be re-created, which may be slow if
lots of elements in the tables. Although this is not a big deal at the
end.

****************
3) New column to store the collation keys.
This is the method used in Fremantle. The idea is that for every text
column, we have an additional column where we store the collation key of
the text. Then, when issuing an ORDER BY we would be using that extra
column as input for the ordering, so that a simple binary comparison can
be used. I guess we would need an index in that new column with
collation keys, only for those text columns that have an index.

Pros:
+++ Fast in columns with indexes (as fast as method 2).
+++ Fast in columns without indexes, as binary comparison can be used on
the collation keys column.
+++ Retrieving the collation keys in a query is direct, no need to
compute them on the fly for the query results, so faster than method 2)
for this purpose.

Cons:
--- Inserting new elements will be slowest of the 3 options (don't have
numbers for this); as collation key needs to be computed. If we do this
approach for all text columns, I'm wondering how long it will take to
compute the collation key for plainTextContent... We could always say
that we'll only add the new column on properties configured to be
indexed, but then, what if we want an ORDER BY on a column not indexed?
No collation would be applied in that case.
--- Changes at db level are quite more than case 2). Not only the new
column needs to be populated properly, but also the sparql to sql
translator needs to be patched so that if ORDER BY(text-column) we
really do ORDER BY (collation-key-column) in sql. Right?
--- When locale changes, need to re-compute all collation keys and then
re-create all indexes. This is quite slower than method 2, which only
needs index re-creation).

****************

If requesting the collation keys is going to be done usually, 3rd method
seems the most appropriate. If this is not the case, I would go for the
2nd method as it will be the cleanest one and the easiest to maintain,
as sqlite makes all the work for us (but maybe I'm biased because 2nd
method is almost implemented :-) )

Comments?

-- 
Aleksander




[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]