Re: [Tracker] more issues with indexer-split



On Wed, 2008-08-13 at 17:12 +0200, Carlos Garnacho wrote:
> Hi!,
>
> On mar, 2008-08-12 at 14:18 -0400, Jamie McCracken wrote:
>
> <snip>
>
> > that sounds inefficient - trunk only ever checked for existing deleted
> > or junk emails at startup because iterating through all emails in the
> > summary files is expensive.
>
> From what I've read in trunk code, you still iterate through all the
> mails in the summary in check_summary_file(), and you will have to
> iterate over them again later to index new messages, etc...

Yes, but when we are not doing the startup check we skip those messages,
so it's faster - we do not stop at every deleted or junk email to check
it.



> As far as I know, it's quite unavoidable to parse again summaries, since
> under some circumstances Message IDs could be reused, which would leave
> you with inconsistent data in the DBs. Even if it isn't, expunging a
> folder would render any stored offset for the summary file useless (even
> dangerous).

True, but we would get a deletion event from inotify for the summary file
if that were the case. It's not a byte offset but a message count - we
skip x messages to get to the new ones (similar to what Beagle does).
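To make the skip-by-count idea concrete, here is a minimal sketch of the
scheme described above. All names here are illustrative assumptions, not
Tracker code: the summary is modelled as a plain list of already-parsed
message records, and the stored message count from the previous scan is
used to skip straight to the new messages, with a full rescan as the
fallback if the summary shrank (e.g. it was expunged and rewritten).

```python
def parse_new_messages(summary, stored_count):
    """Return only the messages appended since the last scan.

    `summary` is the full list of message records parsed from the
    summary file; `stored_count` is the count saved after the previous
    scan. If the stored count exceeds the current size, the summary was
    rewritten under us, so we fall back to rescanning everything.
    """
    if stored_count > len(summary):
        stored_count = 0  # summary was rewritten; rescan from the start
    return summary[stored_count:]
```

With this approach, when no new messages are present the returned slice is
empty and the scan does no per-message work at all.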



> Besides, when testing summary parsing, I remember it was pretty fast
> (like 2-3 seconds for a ~6500 emails summary), of course without
> inserting to DBs nor doing message body or attachments sniffing, which
> is more or less what should happen if the junk/deleted flag is set.

With 100,000+ emails it's quite noticeable.



> > the use of a separate junk email table meant
> > lookups were confined to that table and not the services table so was
> > faster when number of emails was high
>
> You mean the JunkMails table in email-meta.db? As far as I see, this
> table is just looked up to make sure there aren't duplicates when
> inserting. And in the end, you still have to lookup/modify the Services
> table, even if the junk mail wasn't there.


No - when a junk/deleted email is encountered during the startup scan,
its UID is checked against that table (JunkMails) to see if we already
know about it. If it's not in that table, then we add it and delete it
from our index. Ergo it's more efficient than what you have.
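A hypothetical sketch of that bookkeeping, using SQLite directly (the
table name JunkMails comes from the thread; the UID column and function
names are assumptions): each junk/deleted UID is checked against the
small dedicated table, and only UIDs not seen before are recorded and
reported as needing removal from the main index, so known junk never
touches the Services table again.

```python
import sqlite3

def handle_junk_mail(conn, uid):
    """Return True if this UID is newly seen junk and must be de-indexed."""
    cur = conn.execute("SELECT 1 FROM JunkMails WHERE UID = ?", (uid,))
    if cur.fetchone():
        return False  # already known junk; no further work needed
    conn.execute("INSERT INTO JunkMails (UID) VALUES (?)", (uid,))
    # ...the caller would now delete this mail from the main index...
    return True
```

Because the lookup hits only the small JunkMails table, repeated startup
scans over a large mailbox stay cheap for junk that was already handled.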



> > we should also avoid doing this whenever the summary file changes which
> > is why we stored an offset in trunk so we skip over messages to get to
> > the new ones only when summary files change or do nothing if no new ones
> > are present
>
> As said above, I think there are pretty good reasons to avoid this.


> > the trunk way is faster so i would prefer that restored
>
> If you bear with me, I'd prefer to try a few optimizations before having
> to add special cases.

Well, not doing the junk/deletion check every time the summary file
changes must obviously be faster?

jamie



