Re: [Tracker] Proposal to improve tracker-miner-fs "up-to-date" check performance

From: Carlos Garnacho <carlos lanedo com>
To: tracker-list gnome org
Subject: Re: [Tracker] Proposal to improve tracker-miner-fs "up-to-date" check performance
Date: Mon, 29 Mar 2010 18:08:16 +0200

Hi!,

On Mon, 2010-03-29 at 22:44 +0800, Chen, Zhenqiang wrote:

When Tracker starts up, it checks whether the entries in the DB are up to date.
The current logic makes at least one D-Bus call from tracker-miner-fs to tracker-store for each file, 
and each call executes a query. 
This is inefficient because both D-Bus calls and queries are expensive. (You can see the traffic with dbus-monitor.)

Here are two proposals to improve the performance.

1) Skip checks for ignored files:

In function crawler_check_directory_cb (tracker-miner-fs.c), there are two checks:
  
  should_check = should_check_file (fs, file, TRUE);
  should_change_index = should_change_index_for_file (fs, file);
  
As I understand it, if "should_check_file" returns FALSE, the result of "should_change_index_for_file" is 
meaningless, since we do not process such files (see function "should_process_file"). So we can apply the 
same logic as "should_process_file" here: 

  if (should_check) {
    should_change_index = should_change_index_for_file (fs, file);
  } else {
    should_change_index = FALSE;
  }
  
With this improvement, we can skip checks for files like ~/.cache/*, ~/.config/*, etc.


You are right here; I updated master with your suggestion. But the
performance improvement is very likely negligible, given the usual
ratio of hidden to legitimate files/dirs...


2) Reduce dbus calls and queries:

(1) At the beginning, execute one query to get all the <url, fileLastModified> pairs and put them in a hash 
table.
(2) For each file, look up the URI in the hash table.
      If it is there, 
          compare the file's modification time with the fileLastModified value from the hash table.
          If the values are equal,
              the entry is up-to-date.
              
    A query is only required when the URI is not in the hash table or the times do not match.
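
Steps (1) and (2) above can be sketched in plain C as follows. This is only
an illustration under stated assumptions, not the tracker-miner-fs code: the
names MtimeCache, mtime_cache_insert, mtime_cache_is_up_to_date and
mtime_cache_clear are hypothetical, and a real patch would more likely use
GLib's GHashTable. A FALSE result from the lookup means "fall back to the
normal query".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MTIME_CACHE_BUCKETS 1024

/* One <url, fileLastModified> pair, chained per bucket. (Hypothetical type.) */
typedef struct MtimeEntry {
    char *uri;
    time_t mtime;
    struct MtimeEntry *next;
} MtimeEntry;

typedef struct {
    MtimeEntry *buckets[MTIME_CACHE_BUCKETS];
} MtimeCache;

/* djb2 string hash, reduced to a bucket index. */
static size_t mtime_cache_bucket(const char *uri)
{
    unsigned long h = 5381;
    while (*uri != '\0')
        h = h * 33 + (unsigned char) *uri++;
    return (size_t) (h % MTIME_CACHE_BUCKETS);
}

/* Step (1): store one pair returned by the single start-up query.
 * Returns false if memory could not be allocated. */
bool mtime_cache_insert(MtimeCache *cache, const char *uri, time_t mtime)
{
    assert(cache != NULL && uri != NULL);
    MtimeEntry *e = malloc(sizeof *e);
    if (e == NULL)
        return false;
    e->uri = malloc(strlen(uri) + 1);
    if (e->uri == NULL) {
        free(e);
        return false;
    }
    strcpy(e->uri, uri);
    e->mtime = mtime;
    size_t b = mtime_cache_bucket(uri);
    e->next = cache->buckets[b];   /* newest entry shadows older ones */
    cache->buckets[b] = e;
    return true;
}

/* Step (2): true only if the URI is cached and the stored time equals the
 * time read from disk. false means a query to the store is still needed. */
bool mtime_cache_is_up_to_date(const MtimeCache *cache, const char *uri,
                               time_t disk_mtime)
{
    const MtimeEntry *e = cache->buckets[mtime_cache_bucket(uri)];
    for (; e != NULL; e = e->next)
        if (strcmp(e->uri, uri) == 0)
            return e->mtime == disk_mtime;
    return false;
}

/* Free every entry and leave the cache empty. */
void mtime_cache_clear(MtimeCache *cache)
{
    for (size_t i = 0; i < MTIME_CACHE_BUCKETS; i++) {
        MtimeEntry *e = cache->buckets[i];
        while (e != NULL) {
            MtimeEntry *next = e->next;
            free(e->uri);
            free(e);
            e = next;
        }
        cache->buckets[i] = NULL;
    }
}
```

A caller would zero-initialise an MtimeCache, insert every pair from the
initial query, and then call mtime_cache_is_up_to_date once per crawled file
with the modification time read from disk.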


As Philip said, we should take memory usage into account as well, and
keeping a hash table entry for every known item is not going to be nice...
TrackerCrawler guarantees that any directory will be processed after its
parent folder, and all the items in a directory will be processed
together, so we can very probably do this on a per-folder basis.
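
A per-folder variant could look roughly like the sketch below. It relies
only on the ordering guarantee described above: because all items of a
directory are processed together, the cache can be dropped and refilled each
time the crawler enters a new directory, so memory is bounded by the largest
single directory. All names (FolderCache, folder_cache_begin and so on) are
hypothetical, and the step that fills the cache from one per-directory query
is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef struct {
    char *uri;
    time_t mtime;
} FolderEntry;

/* Cache for the children of exactly one directory. (Hypothetical type.) */
typedef struct {
    char *dir_uri;        /* directory currently cached, NULL if none */
    FolderEntry *items;
    size_t len;
    size_t cap;
} FolderCache;

static char *dup_string(const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}

/* Forget the current directory and free everything it held. */
void folder_cache_reset(FolderCache *c)
{
    for (size_t i = 0; i < c->len; i++)
        free(c->items[i].uri);
    free(c->items);
    free(c->dir_uri);
    c->items = NULL;
    c->dir_uri = NULL;
    c->len = 0;
    c->cap = 0;
}

/* Called when the crawler enters a new directory: drops the old contents. */
bool folder_cache_begin(FolderCache *c, const char *dir_uri)
{
    assert(c != NULL && dir_uri != NULL);
    folder_cache_reset(c);
    c->dir_uri = dup_string(dir_uri);
    return c->dir_uri != NULL;
}

/* Add one child row from the per-directory query result. */
bool folder_cache_add(FolderCache *c, const char *uri, time_t mtime)
{
    if (c->len == c->cap) {
        size_t cap = c->cap ? c->cap * 2 : 16;
        FolderEntry *p = realloc(c->items, cap * sizeof *p);
        if (p == NULL)
            return false;
        c->items = p;
        c->cap = cap;
    }
    char *copy = dup_string(uri);
    if (copy == NULL)
        return false;
    c->items[c->len].uri = copy;
    c->items[c->len].mtime = mtime;
    c->len++;
    return true;
}

/* True if the cache currently holds the children of dir_uri. */
bool folder_cache_covers(const FolderCache *c, const char *dir_uri)
{
    return c->dir_uri != NULL && strcmp(c->dir_uri, dir_uri) == 0;
}

/* Linear scan; false means the store must still be queried for this file. */
bool folder_cache_is_up_to_date(const FolderCache *c, const char *uri,
                                time_t disk_mtime)
{
    for (size_t i = 0; i < c->len; i++)
        if (strcmp(c->items[i].uri, uri) == 0)
            return c->items[i].mtime == disk_mtime;
    return false;
}
```

The linear scan keeps the sketch short; for very large directories a hash
table scoped to the folder would be the obvious replacement.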


(3) There is another issue in the current implementation: 
the URLs for "Directory" files have a form like "urn:software-category", not "file:///" (see 
"miner_applications_process_file_cb" in tracker-miner-applications.c). So we should convert the URI format 
before searching the hash table.


I suggest you have a look at nie:url, which is meant to hold
application-readable URIs.
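
As a rough illustration of keying the lookup on nie:url, the helper below
builds a query string for the children of one folder. The query shape is an
assumption: nie:url comes from the suggestion above and fileLastModified
from the proposal, while the nfo: prefix and the nfo:belongsToContainer
property are assumed from the ontology and should be checked against it. The
function name build_folder_mtime_query is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes a SPARQL query into buf that selects <url, mtime> for every item
 * whose container has the given nie:url. Returns the length snprintf
 * reports, or -1 if folder_url contains a character that would need
 * escaping inside a SPARQL string literal (this sketch does not escape). */
int build_folder_mtime_query(char *buf, size_t size, const char *folder_url)
{
    assert(buf != NULL && folder_url != NULL);
    if (strpbrk(folder_url, "\"\\\n\r") != NULL)
        return -1;
    return snprintf(buf, size,
                    "SELECT ?url ?mtime WHERE { "
                    "?file nfo:belongsToContainer ?folder ; "
                    "nie:url ?url ; "
                    "nfo:fileLastModified ?mtime . "
                    "?folder nie:url \"%s\" }",
                    folder_url);
}
```

Because the rows come back keyed by nie:url, they can be compared directly
with the file:/// URIs the crawler produces, without converting from the
urn: form first.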

Cheers.
  Carlos
