Re: [Tracker] status of 0.6.90?



Jamie McCracken wrote:
On Tue, 2008-10-14 at 10:07 +0100, Martyn Russell wrote:
Jamie McCracken wrote:
On Mon, 2008-10-13 at 12:43 +0100, Martyn Russell wrote:
Jamie McCracken wrote:
There are also a load of other issues that need correcting:

1) enumerating and crawling directories needs to be done in the indexer
(which passes the directories to watch back to the daemon). The daemon
can then run as nice 0 and normal ionice instead of nice 19, as its only
cpu/io heavy ops will be searches and queries, which need to be as fast
as possible

2) indexing needs to do all files in a directory before indexing the
directory itself to prevent files not getting indexed if daemon is
stopped in mid-index

3) Needs to be fully backward compatible with API and config options.
It's likely that we will have to force a reindex for 0.6.6 as I'm not
sure the db will work with old versions. I would rather have sqlite fts
and xesam db support as well as flattened tables in, to prevent the next
few versions forcing a reindex after each upgrade if that is the case
(IE I would rather do the reindex once if possible)

4) Db connections should use the sqlite3_soft_heap_limit call to limit
heap usage of sqlite (sqlite will run away and eat memory while indexing
if you don't - these are not leaks and will not show up in valgrind!)

see http://sqlite.org/c3ref/soft_heap_limit.html and note following:

A negative or zero value for N in a call to sqlite3_soft_heap_limit(N)
means that there is no soft heap limit and sqlite3_release_memory() will
only be called when memory is completely exhausted. The default value
for the soft heap limit is zero. 

Ergo sqlite will happily eat all memory until you run out before
attempting to free a single byte, unless we set a value for the above

5) the throttle probably needs some fine tuning and its default settings
might need to be adjusted

6) indexing email attachments? Might be a regression if we don't, as
0.6.6 did. I need to think how this fits in with xesam though, as we
don't want to add them to the email or files index as they are of a
different source (probably we will have a separate index/db for each
source - archive, attachment et al - as they are not files or emails as
such)

7) email optimisations - really slow for large mboxes. mbox could be
optimised to store the last known offset and record details to prevent a
full scan when new emails are appended. Needs to be done smartly (IE
verify last record structure and UID at known offset) so we don't screw
up if the mbox was compacted or changed beyond recognition. I will
likely restore the junkemail table to speed up junk checking for mbox
too.
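
To make point 7 concrete, here is a minimal sketch of the resume-from-offset idea, written in Python for brevity rather than Tracker's C. The function names and the saved state (offset plus the "From " separator line of the last indexed message) are assumptions for illustration; the UID check mentioned above is left out, and the sketch assumes body lines starting with "From " are escaped as usual in mbox files.

```python
import os

def find_message_offsets(path, start=0):
    """Return byte offsets of 'From ' separator lines at or after start."""
    offsets = []
    with open(path, "rb") as f:
        f.seek(start)
        pos = start
        for line in f:
            if line.startswith(b"From "):
                offsets.append(pos)
            pos += len(line)
    return offsets

def resume_scan(path, last_offset, last_line):
    """Scan only appended mail if the saved record is still in place.

    last_offset and last_line are the (hypothetical) saved offset and
    separator line of the last message indexed on the previous run.
    Returns (full_rescan, offsets_of_messages_to_index).
    """
    valid = False
    if last_offset is not None and os.path.getsize(path) > last_offset:
        with open(path, "rb") as f:
            f.seek(last_offset)
            # Verify the record we remember is still where we left it.
            valid = f.readline() == last_line
    if not valid:
        # mbox was compacted or rewritten: fall back to a full scan.
        return True, find_message_offsets(path, 0)
    # Skip the already-indexed last message itself.
    new = [o for o in find_message_offsets(path, last_offset) if o != last_offset]
    return False, new
```

If the check fails the caller simply gets every message back, so the worst case is the full scan we do today.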

I will try and do 4, 5 and 7 above over the weekend. I think Martyn said
he will do 1, 2 and 3 soon.
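
For point 2 above (all files in a directory before the directory itself), the ordering can be sketched as a bottom-up walk. This is only an illustration in Python, not the indexer's code; index_order is a hypothetical name. The idea is that a directory is only recorded once everything under it has been handed out, so a run stopped mid-index leaves that directory unrecorded and it gets crawled again.

```python
import os

def index_order(root):
    """Yield paths so that every directory comes after all of its contents."""
    # topdown=False makes os.walk report subdirectories before their parent.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
        # Only now, with its files and subdirectories done, the directory itself.
        yield dirpath
```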
Hi :)

Jamie, can you update us with the status of SQLite FTS3 support? The
reason I ask is we want to really get the MMC support working nicely so
we enable/disable hits based on the MMC being mounted or not and of
course this depends on QDBM which holds statistics for the data we are
searching.
I hope to have fts ready in a week or two (only have time at weekends
atm due to work)

Do you have any ideas on how to do this too? I am guessing the way to do
it is to hinge everything on the Enable column in the services table and
link that to the search we do for _get_all_hits() in the new SQLite QDBM
replacement DB.

we have a volumes table that can store HAL UID against mounted path

each file can have a volume ID in the services table (I think it's
called AuxiliaryID)

the idea is that we do a file move if the volume is mounted against a
new path; otherwise, if the volume is not present, we set enabled = 0
for each file with that volumeID

never got round to getting this to work in trunk but it should be straightforward to do (I think)
Yea it looks straightforward.
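
A rough sketch of the mount/unmount handling described above, using Python's sqlite3 module. The table layout, column names other than AuxiliaryID and Enabled, and the function names are assumptions for illustration and not the actual trunk schema.

```python
import sqlite3

def make_db():
    """Create a toy schema (hypothetical, much smaller than the real one)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Volumes (VolumeID INTEGER PRIMARY KEY,
                              HalUID TEXT UNIQUE, MountPath TEXT);
        CREATE TABLE Services (ID INTEGER PRIMARY KEY, Path TEXT,
                               AuxiliaryID INTEGER, Enabled INTEGER DEFAULT 1);
        CREATE INDEX ServicesAux ON Services (AuxiliaryID);
    """)
    return conn

def volume_changed(conn, hal_uid, mount_path):
    """mount_path is the new mount point, or None when the volume went away."""
    row = conn.execute(
        "SELECT VolumeID, MountPath FROM Volumes WHERE HalUID = ?",
        (hal_uid,)).fetchone()
    if row is None:
        return  # unknown volume, nothing indexed for it
    vol_id, old_path = row
    if mount_path is None:
        conn.execute("UPDATE Services SET Enabled = 0 WHERE AuxiliaryID = ?",
                     (vol_id,))
    else:
        if mount_path != old_path:
            # The "file move": swap the old mount prefix for the new one.
            conn.execute(
                "UPDATE Services SET Path = ? || substr(Path, ?) "
                "WHERE AuxiliaryID = ?",
                (mount_path, len(old_path) + 1, vol_id))
            conn.execute("UPDATE Volumes SET MountPath = ? WHERE VolumeID = ?",
                         (mount_path, vol_id))
        conn.execute("UPDATE Services SET Enabled = 1 WHERE AuxiliaryID = ?",
                     (vol_id,))
    conn.commit()
```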

One question I have is, do we really need the "enabled" column if it
only pertains to the Volumes table? Can we not just have 1 column and 1
row in the Volumes table for "enabled" and do some nice SQL to know if
content is enabled based purely on the auxiliary ID in the Services
table? Or is the "enabled" column needed for other things?

for performance reasons and to simplify queries, the enabled column is a
must

I don't think simplifying the queries holds any water. Speed should
always be the most important consideration, right?

it's very easy to update (update services set enabled = 1 where
AuxiliaryID = blah)

Updating (say 2000) MP3 items in a table of thousands to set them from
enabled = 0 to enabled = 1 should not be faster than setting 1 column in
1 row in a table of maybe 10 entries surely?

As for lookups, I am not entirely sure, but I would think if the SQL
query was constructed properly it should be as fast to check with the
Volumes table if the AuxiliaryID is enabled.
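
To illustrate the alternative being asked about here, a minimal sketch where the flag lives in one row of the Volumes table and lookups join against it. It is in Python with sqlite3; the schema is a toy one and the name enabled_ids is hypothetical. It shows the shape of the query only and says nothing about which approach is faster on a real index; that would need measuring.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Volumes (VolumeID INTEGER PRIMARY KEY, Enabled INTEGER NOT NULL);
    CREATE TABLE Services (ID INTEGER PRIMARY KEY, Path TEXT, AuxiliaryID INTEGER);
    INSERT INTO Volumes VALUES (1, 0);
    INSERT INTO Services VALUES (1, '/home/u/a.txt', NULL);
    INSERT INTO Services VALUES (2, '/media/mmc/b.mp3', 1);
    INSERT INTO Services VALUES (3, '/media/mmc/c.mp3', 1);
""")

def enabled_ids(conn):
    """IDs of services that are on no volume or on an enabled volume."""
    rows = conn.execute("""
        SELECT S.ID
        FROM Services S
        LEFT JOIN Volumes V ON V.VolumeID = S.AuxiliaryID
        WHERE S.AuxiliaryID IS NULL OR V.Enabled = 1
        ORDER BY S.ID
    """)
    return [r[0] for r in rows]

# Mounting the volume is then a single-row update:
#   UPDATE Volumes SET Enabled = 1 WHERE VolumeID = 1
```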

It doesn't feel truly relational to have an AuxiliaryID and an Enabled
field. Is the Enabled field used for anything else other than this right
now?

Also I notice we have a lot of SELECT DISTINCT SQL statements where the
WHERE clause is the ID - which is unique - doesn't that negate the whole
point of DISTINCT and just slow down the statement?

Something else I have noticed. When it comes to deleting a file, it is
really quite nasty because we have to delete in several places. It would
be much nicer to delete the service id and that trigger deletes for
every other place that relationally depends on it (like
content/metadata/etc).
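
Roughly what I mean, as a sketch in Python with sqlite3. The table and column names here are made up for illustration and are not our schema, and all three tables sit in the same database file, which is the easy case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Services (ID INTEGER PRIMARY KEY, Path TEXT);
    CREATE TABLE ServiceMetaData (ServiceID INTEGER, MetaName TEXT, MetaValue TEXT);
    CREATE TABLE ServiceContents (ServiceID INTEGER, Content TEXT);

    -- Deleting a service row removes everything that hangs off it.
    CREATE TRIGGER delete_service AFTER DELETE ON Services
    BEGIN
        DELETE FROM ServiceMetaData WHERE ServiceID = old.ID;
        DELETE FROM ServiceContents WHERE ServiceID = old.ID;
    END;

    INSERT INTO Services VALUES (1, '/home/u/a.txt');
    INSERT INTO Services VALUES (2, '/home/u/b.txt');
    INSERT INTO ServiceMetaData VALUES (1, 'File:Name', 'a.txt');
    INSERT INTO ServiceMetaData VALUES (2, 'File:Name', 'b.txt');
    INSERT INTO ServiceContents VALUES (1, 'hello');
""")

def delete_service(conn, service_id):
    """One DELETE; the trigger cleans up the dependent tables."""
    conn.execute("DELETE FROM Services WHERE ID = ?", (service_id,))
    conn.commit()
```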

triggers only work on the db connection that performs the deletion so
will not work on multiple dbs unless they are joined at the db
connection level

You mean with ATTACH?

we do use triggers in a few places but obviously the above makes it
harder

Yea.

-- 
Regards,
Martyn


