Re: [Tracker] Re-index/re-scan on each restart?



On 01/09/10 17:18, Michael Steiner wrote:
Hi,

Hi,

After switching from an old RHEL to Ubuntu Lucid, i finally dumped
Google Desktop and went for tracker (0.8.16).  So far i'm quite happy
with what it does and from the architecture and open apis i'm
confident that it will get even better.

That's great to hear! :)

However, there is one thing which is a bit annoying: Tracker seems to
re-scan/re-index all my files each time i start tracker (e.g., after
reboot or re-login), even if in the previous run it seemed to have
finished the complete scan/index (i.e., tracker-status showed
everything as idle).  As i'm indexing a fair bunch of files this takes
several hours (almost days) even with most aggressive scan settings.

Is this a feature or a bug?

Carlos recently fixed a bug which sounds similar to what you're describing here, see commit:

  9339e32afca110fa08ac89a7161c080a9c70636e

This is in master, but not in 0.8. I will cherry-pick this for tomorrow's release. The difference in start-up time is incredible: for my 20k files on this desktop machine it now takes ~35s, where before it was taking minutes IIRC (that time is just to check+add monitors).

If the former, it is probably in the attempt to not lose any file
modification when tracker is not run? Of course, the best approach
for this would be to have an indexer which runs independently of the
desktop.

Not sure what you mean by that?

Short of that, it would be great to have an option which
allows turning off that feature (i definitely would trade the rather
rare missed modifications against not having a CPU and IO hog after
each UI login)

This is possible in 0.9; there are config options:

  EnableMonitors=false (in 0.8 but will still crawl)
  CrawlingInterval=0 (in 0.9, set to -1 to disable crawling entirely)

The latter option above allows application-specific indexing only, so the crawler doesn't burn any CPU time. However, it isn't the default or recommended, since you then rely on applications to keep data up to date.
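
As a rough sketch of how these two options could be set from a script: the Python helper below edits a key-file style .cfg. The helper name set_options is made up for illustration, and the file name (tracker-miner-fs.cfg) and group names ("Monitors", "Indexing") in the usage note are assumptions, not taken from this thread, so check them against your own file in $HOME/.config/tracker first.

```python
import configparser


def set_options(cfg_path, changes):
    """Apply {(group, key): value} changes to a key-file style config.

    Hypothetical helper: group and key names are supplied by the caller.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, e.g. "EnableMonitors"
    parser.read(cfg_path)     # a missing file is silently treated as empty
    for (group, key), value in changes.items():
        if not parser.has_section(group):
            parser.add_section(group)
        parser.set(group, key, value)
    with open(cfg_path, "w") as handle:
        # Write "Key=value" without spaces around "="
        parser.write(handle, space_around_delimiters=False)
```

It might be called as set_options(path, {("Monitors", "EnableMonitors"): "false", ("Indexing", "CrawlingInterval"): "-1"}), with path pointing at the assumed tracker-miner-fs.cfg. Note that configparser drops any comments when it rewrites the file, so keep a backup.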

If it's a bug, following some observations after looking at the
log-files in ~/.local/share/tracker:

- tracker-store.log is empty

All logs will be empty if Verbosity is < 1 in their respective .cfg files in $HOME/.config/tracker.

- tracker-extract.log contains a warning

       01 Sep 2010, 09:53:06: Tracker-Warning **: Could not load module 'libextract-mplayer.so': /usr/lib/tracker-0.8/extract-modules/libextract-mplayer.so: undefined symbol: tracker_extract_guess_date

    which seems due to libextract-mplayer (and libextract-totem) using
    a non-existing function ``tracker_extract_guess_date'' (rather than
    presumably the existing ``tracker_date_guess'') and is unlikely to
    have an impact here.

Checking the source, it seems these extractors are seldom built and still call that function in master. I will put it on the TODO list before tomorrow's release. Also after Michael Biebl pointed out there are a few other miscellaneous issues with 0.8/0.9 in the build system, I will try to fix those at the same time. I wasn't going to do a 0.8 release tomorrow, but I may well do given these recent findings.

The tracker_date_guess() should be the right one here.

    I also see a few warnings along the lines of

        01 Sep 2010, 09:58:05: Tracker-Warning **: Couldn't convert 14848 bytes from CP1252 to UTF-8: Invalid byte sequence in conversion input

    but this is probably also not relevant for this problem?

Sounds like the file encoding was incorrectly detected, or the file is not encoded in the correct encoding in the first place. It is entirely possible that this is the case for MP3 files, for example. You would need to turn up the verbosity to get more details (like the file involved).
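
To illustrate how such a warning can arise, here is a small Python sketch. It uses Python's own cp1252 codec as a stand-in for the conversion Tracker performs (an assumption; the extractor's actual code path is not shown in this thread), and the sample bytes and the helper name to_utf8 are made up.

```python
def to_utf8(raw, assumed_encoding="cp1252"):
    """Return UTF-8 bytes, or None when raw is not valid in assumed_encoding."""
    try:
        return raw.decode(assumed_encoding).encode("utf-8")
    except UnicodeDecodeError:
        return None


# 0xE9 is "é" in CP1252, so this converts cleanly.
ok = to_utf8(b"caf\xe9")

# 0x81 has no character assigned in CP1252, so the conversion fails,
# which is the kind of "invalid byte sequence" the warning reports.
bad = to_utf8(b"caf\x81")
```

If the data was really in some other encoding (or was not text at all), guessing CP1252 will hit unassigned bytes like this sooner or later.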

- tracker-miner-fs.log has by far the most messages (several
   hundred), half of them along the lines of the one below

     01 Sep 2010, 08:28:13: Tracker-Critical **: Could not execute sparql: Unable to insert multiple values for subject `urn:uuid:0c147350-e9fe-9b16-ced3-2564b21ef9fa' and single valued property `dc:rights' (old_value: 'http://creativecommons.org/licenses/by/2.5/', new value: 'http://www.apache.org/licenses/LICENSE-2.0')

Those should be fixed. Could you turn the verbosity up to 3 and create a new bug report with the file that causes this? (if possible)

    (three quarters of them include http://www.apache.org/licenses/LICENSE-2.0; for the rest i didn't spot a pattern)


    Being marked critical, maybe this is causing the re-index?

This is critical because it means the file didn't get indexed and can mean the ontology and/or the SPARQL is incorrect. It shouldn't cause a reindex at all, it just means the file is skipped and the code should be fixed to handle (usually) a corner case.
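
To look for a pattern in these messages (for example, how many involve a given license URL), a quick tally can help. The Python sketch below is only an illustration: the regular expression is modelled on the single message quoted in this thread, so the general log format is an assumption, and the name tally_conflicts is made up.

```python
import re
from collections import Counter

# Pattern modelled on the one "Unable to insert multiple values" message
# quoted in this thread; \s+ tolerates lines wrapped by a mail client.
PATTERN = re.compile(
    r"single valued property\s+`(?P<prop>[^']+)'\s+"
    r"\(old_value:\s+'(?P<old>[^']*)',\s+new value:\s+'(?P<new>[^']*)'\)"
)


def tally_conflicts(log_text):
    """Count (property, new value) pairs among the multi-value insert errors."""
    return Counter(
        (m.group("prop"), m.group("new")) for m in PATTERN.finditer(log_text)
    )
```

Feeding it the contents of tracker-miner-fs.log from ~/.local/share/tracker and printing tally_conflicts(text).most_common() would list the most frequent property/value combinations first, which may help narrow down which files to attach to a bug report.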

Any insights are welcome. Thanks!

Thanks for reporting these issues, they're most useful for us to look into and fix.

-michael-


PS: when i installed it, i also ran ``make check'' and after i
figured out that i had to do a ``cd `/bin/pwd`'' to please some tests
it all worked fine with the exception of the
``tracker-password-provider-test'' test which didn't run as it
expected some pwd files pre-configured which i didn't have (and
couldn't immediately figure out how to create)

For 0.8? or 0.9? This should be fixed I would say.

--
Regards,
Martyn


