Re: [Tracker] Getting ready for TRUNK: Config options



On Wed, 2008-07-02 at 16:41 +0100, Martyn Russell wrote:
Hi all,

So as part of finishing up the indexer-split branch and getting it into
a state where we can begin merging, I have been going through the config
options we have, checking that we implement them and, where we don't,
finding out whether we need them.

Here are the current config options and some questions/comments about them.


WORKING OPTIONS:
================

* Verbosity
* Initial Sleep
* Enable Indexing
* Min Word Length
* Max Word Length
* Language
* Enable Stemmer
* Max Bucket Count
* Min Bucket Count
* Enable Xesam


NOT WORKING OPTIONS:
====================

* Low Memory Mode

This option currently has no effect in the indexer-split branch.
In TRUNK, it is used to:

  1. Set the cache size to 1/2 of what it is normally when loading DBs
  2. Set the array in update_word_table() to be 1/2 size.
  3. It affects these variables (which are usually 1/2 in low mem mode):

     a) tracker->memory_limit = 16000 * 1024;

     b) tracker->max_process_queue_size = 5000;
     c) tracker->max_extract_queue_size = 5000;

     d) tracker->word_detail_limit = 2000000;
     e) tracker->word_detail_min = 0;
     f) tracker->word_count_limit = 500000;
     g) tracker->word_count_min = 0;
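
To make the halving in item 3 concrete, here is a minimal sketch in C. It assumes the values quoted above are the normal-mode values and that low memory mode simply divides them by two. The IndexerLimits struct and both function names are hypothetical and are not taken from the Tracker sources; only the field names mirror the tracker-> variables in the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the limits listed above. */
typedef struct {
    size_t   memory_limit;
    unsigned max_process_queue_size;
    unsigned max_extract_queue_size;
    unsigned word_detail_limit;
    unsigned word_detail_min;
    unsigned word_count_limit;
    unsigned word_count_min;
} IndexerLimits;

/* Fill in the values quoted in the list, halved when low memory mode is on.
 * Assumption: the quoted numbers are the normal-mode values. */
static void indexer_limits_init(IndexerLimits *limits, bool low_memory)
{
    assert(limits != NULL);
    unsigned divisor = low_memory ? 2 : 1;

    limits->memory_limit           = (size_t) 16000 * 1024 / divisor;
    limits->max_process_queue_size = 5000 / divisor;
    limits->max_extract_queue_size = 5000 / divisor;
    limits->word_detail_limit      = 2000000 / divisor;
    limits->word_detail_min        = 0;
    limits->word_count_limit       = 500000 / divisor;
    limits->word_count_min         = 0;
}

/* Item 3a: the memory limit decides when the word cache gets flushed. */
static bool word_cache_needs_flush(const IndexerLimits *limits, size_t cache_bytes)
{
    return cache_bytes >= limits->memory_limit;
}
```

A caller would initialise the limits once at startup from the Low Memory Mode option and then check word_cache_needs_flush() after each batch of words is added to the cache.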

For #1, I think this makes sense to reimplement
agreed

For #2, I think this is pointless if the array grows
I guess


For #3a, The memory limit is used to know when to flush the word cache.
This needs reimplementing in the indexer.
yes


For #3b, The process queue size is used to know how big the files queue
can get before it should be processed in the database. This is done now
by the indexer and I am not sure it is pertinent any longer.

could get away without this 

For #3c, This is the same as #3b.
this isn't used


For #3d, This is unused in TRUNK.
For #3e, This is unused in TRUNK
For #3f, This is unused in TRUNK.
For #3g, This is unused in TRUNK

we need to limit the number of hits per word if we are to use stack
allocated arrays - however I think this is done elsewhere using a #define
in the code, so those vars are likely no longer needed


* NFS Locking

Do we need this? What is it for - as far as I can see, it is just some
simple locking mechanism using a file on the disk. What needs this? Can
we remove it?

no - we need to make sure on NFS that only one indexer can be launched
at any one time per user (note: each session has a different session
bus, so we can't use dbus locking)
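
For illustration, a lock file of the kind described above could look roughly like the sketch below. It is not the Tracker implementation: the function names are hypothetical, and it relies on open() with O_CREAT | O_EXCL failing when the file already exists. Note that O_EXCL is only reliable over NFS with NFSv3 or later, and that a crashed indexer would leave a stale lock file behind, which this sketch does not handle.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Try to take the lock by creating the file exclusively.
 * Returns true if this process created the file (and so holds the lock),
 * false if the file already exists or could not be created. */
static bool indexer_lock_acquire(const char *lock_path)
{
    assert(lock_path != NULL);

    int fd = open(lock_path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return false;

    close(fd);
    return true;
}

/* Give the lock up again by removing the file. */
static void indexer_lock_release(const char *lock_path)
{
    assert(lock_path != NULL);
    unlink(lock_path);
}
```

Because the lock lives in a file in the user's (possibly NFS-mounted) home directory, it is visible to indexers started from different sessions or machines, which is exactly the case a per-session bus cannot cover.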


* Watch Directory Roots
* Crawl Directory
* No Watch Directory
* No Index File Types

These closely map to the .module files. I would like to rename them to
map exactly so they are obviously an override or addition to the
non-user space config of each module. What are your thoughts here?

that's fine


I would like to rename "WatchDirectoryRoots". Everyone, even GIO, uses
"monitor" instead of "watch", and you can supply a list so it isn't just
one. Also, should we have ANOTHER option, like we do in the module files
right now, to be able to set "MonitorRecursiveDirectories" and
"MonitorDirectories"? We assume they are always recursive right now.

that's fine so long as we provide an upgrade path for all changed options


I would like to rename "CrawlDirectory". This needs integrating with the
.module files.

I would like to rename "NoWatchDirectory". This is currently working.

* Enable Watching

I would like to rename this to "EnableMonitors"

* Throttle

This needs reimplementing in the indexer. Right now, we don't really
need it - at least my machine copes fine without it, but I think it
might be a good idea to add that back.


yes please - laptops can get very hot (and with noisy fans too) so some
scaling is needed
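
One simple way to get that kind of scaling is to pause for a short time between items, with the pause growing with the throttle value. The sketch below assumes exactly that; the range (0 to 20) and the step of 5000 microseconds are made-up numbers for illustration, and the function names are hypothetical rather than taken from the indexer.

```c
#define _POSIX_C_SOURCE 200809L /* for nanosleep() under strict -std= modes */

#include <assert.h>
#include <time.h>

#define THROTTLE_MAX       20   /* assumed upper bound of the option */
#define THROTTLE_STEP_USEC 5000 /* assumed pause per throttle step */

/* Map a throttle setting to a pause in microseconds, clamping the
 * setting to the range 0..THROTTLE_MAX. */
static unsigned throttle_delay_usec(int throttle)
{
    if (throttle < 0)
        throttle = 0;
    if (throttle > THROTTLE_MAX)
        throttle = THROTTLE_MAX;

    return (unsigned) throttle * THROTTLE_STEP_USEC;
}

/* Sleep for the computed pause; meant to be called between indexed items. */
static void throttle_pause(int throttle)
{
    /* The longest pause must stay below one second so tv_nsec is valid. */
    assert(THROTTLE_MAX * THROTTLE_STEP_USEC < 1000000);

    unsigned delay = throttle_delay_usec(throttle);
    if (delay == 0)
        return;

    struct timespec ts = { 0, (long) delay * 1000 };
    nanosleep(&ts, NULL);
}
```

With a throttle of 0 the indexer runs flat out, and higher values trade indexing speed for a cooler, quieter machine.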

* Enable File Content Indexing
* Enable Thumbnails

These need implementing. Plus it would be nicer to rename
"EnableThumbnails" to "EnableThumbnailIndexing", which is more
consistent. I am assuming these will both be implemented in the indexer.

yes - turning the former off disables text indexing of file contents,
so that only metadata is indexed


* Fast Merges

Carlos is currently working on a solution which means we won't need this
option or to write to separate files temporarily before writing to the
main index. How do you feel about removing this option?

dunno - ext3 is so shite with fsync

being able to avoid fsyncs would be nice, but it cannot be done without
hogging the disk when doing large writes



* Battery Index
* Battery Index Initial
* Low Disk Space Limit
* Index Mounted Directories
* Index Removable Media

These need some final testing and fixing up.

* Index Email Client

This has been removed since the .module files mean we don't need this now.

* Max Text To Index

This is not used in trunk, can we remove it?

must be used - we should limit text to 1 MB by default, otherwise large
files could result in gigantic indexes
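
As a rough sketch of what such a cap could look like, the function below reads at most a given number of bytes from a file and ignores the rest. The 1 MB default comes from the suggestion above; the macro and function names are hypothetical. One known gap: cutting at a byte count can split a multi-byte UTF-8 character, which a real implementation would need to handle.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* 1 MB, the default suggested above (hypothetical macro name). */
#define DEFAULT_MAX_TEXT_TO_INDEX (1024 * 1024)

/* Read at most max_bytes of text from the stream.
 * Returns a malloc'd, NUL-terminated buffer (caller frees) and stores the
 * number of bytes read in *len_out, or returns NULL if allocation fails. */
static char *read_text_limited(FILE *stream, size_t max_bytes, size_t *len_out)
{
    assert(stream != NULL && len_out != NULL);

    char *buf = malloc(max_bytes + 1);
    if (buf == NULL)
        return NULL;

    size_t n = fread(buf, 1, max_bytes, stream);
    buf[n] = '\0';
    *len_out = n;
    return buf;
}
```

The indexer would call this with the configured Max Text To Index value (or DEFAULT_MAX_TEXT_TO_INDEX) so that a huge file contributes no more than that much text to the index.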


* Max Words To Index

We should probably use this, it isn't used right now.

as above 


* Optimization Sweep Count

This is not used in trunk, can we remove it?

for now yes


* Divisions

This was used in TRUNK to call dpoptimize(). Is this really necessary as
an option? We don't use it in the indexer-split branch yet.


no, stick with defaults


* Bucket Ratio

We need to re-add this to the indexer-split branch. Unless you think it
is unimportant?

stick with defaults



* Padding

This isn't used in TRUNK, can we remove it?

stick with defaults


* Thread Stack Size

This is not used now because we don't create threads.


CONCLUSION:
===========

The idea is to get these options working or removed and once that's done
we can hopefully merge to TRUNK pending a big review from Jamie of course.

One other option we have considered, is adding a config version number,
so we know if we ever have to upgrade config files the migration path
needed. What are your thoughts on this?

might be needed
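
To show how a version number would help with the renames discussed earlier, here is a minimal sketch of a rename table keyed on the config version. Everything in it is an assumption for illustration: the version numbers, the struct, the function name, and the exact old key spelling "EnableWatching" (the new names "EnableMonitors" and "EnableThumbnailIndexing" are the ones proposed in this mail).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: version 1 is today's format, version 2 has the renames. */
#define CONFIG_VERSION_CURRENT 2

typedef struct {
    int         since_version; /* version in which the new name appeared */
    const char *old_key;
    const char *new_key;
} KeyRename;

static const KeyRename key_renames[] = {
    { 2, "EnableWatching",   "EnableMonitors" },
    { 2, "EnableThumbnails", "EnableThumbnailIndexing" },
};

/* Return the key name to use for a key read from a config file that was
 * written at file_version. Keys that were not renamed are returned as-is. */
static const char *config_migrate_key(int file_version, const char *key)
{
    assert(key != NULL);

    for (size_t i = 0; i < sizeof key_renames / sizeof key_renames[0]; i++) {
        if (file_version < key_renames[i].since_version &&
            strcmp(key, key_renames[i].old_key) == 0)
            return key_renames[i].new_key;
    }
    return key;
}
```

A loader would read the version from the file (treating a missing version as 1), pass every key through config_migrate_key(), and write the file back out at CONFIG_VERSION_CURRENT, which gives the upgrade path asked for above.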

jamie



