Re: [Tracker] hierarchical-indexing branch review



On 18/02/10 17:11, Martyn Russell wrote:
Hi all,

Carlos recently worked on the hierarchical-indexing branch to ensure
parent directories are added to the database BEFORE files or child
directories. This avoids additional SPARQL calls and caching that would
otherwise be required when processing the crawled directories. So
essentially this should improve indexing speed but it also fixes file
duplicates which we have seen in some cases. See the bug:

https://bugzilla.gnome.org/show_bug.cgi?id=610360

This is the review of that branch:

http://git.gnome.org/browse/tracker/log/?h=hierarchical-indexing

--

1. The ::finished signal has now changed, the following are broken and
have not been updated yet:

- examples/
- libtracker-miner/tracker-miner-manager
- tracker-control/tracker-control

Carlos has done this.

2. I would use a GSList in DirectoryProcessingData instead of a GList.

Done.

3. I would have named the signal ::crawled-directory since it is more
consistent with the other signal naming.

Agreed with Carlos not to do this.

4. No need for g_slice_new0() in enumerator_data_new(), g_slice_new() is
good enough there.

Done.

5. No need for the cast in file_enumerate_next_cb() here:

+ ed = (EnumeratorData*) user_data;

Done.

6. Doesn't it make sense to close the enumeration before processing more
data with enumerator_process_data()? Not sure here really since it is
all async anyway.

No decision on this yet.

7. In file_enumerator_next_cb() I would set parent AFTER the if (error)
... since it may not ever be used in the case of an error.

Done.

8. Probably doesn't matter so much, but item_queue_get_next_file()
should return a typedef enum not a gint which happens to be an enum anyway.

Done.

9. Can we change the #if 0 in crawler_directory_crawled_cb() to #ifdef
ENABLE_TREE_DEBUGGING or something meaningful? Unless we remove it?

Done.

10. Doesn't G_POST_ORDER make more sense when freeing the GNode tree in
crawled_directory_data_free()?

Done.

11. Shouldn't item_query_exists() use a SPARQL function variable for
performance on the OPTIONAL there?

Done.

--

Since doing these changes and testing, I have noticed 2 regressions:

1. We don't honor the IndexSingleDirectory config option properly.

2. We don't report progress correctly because we don't estimate items processed correctly.

These need fixing before we can merge.

--
Regards,
Martyn



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]