Number 9 of the mytest summary store: writing things



Basic terminology:

o. A summary contains n summary blocks
o. A summary block contains ~1000 summary items
o. A summary item has flags, cc, uid, from, subject, etc. (see the
   sketch below)
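
   Roughly, in C (the struct names and the exact field set here are
   illustrative, not the actual layout):

      #include <glib.h>

      typedef struct {
              guint32      flags;     /* message flags */
              const gchar *uid;       /* the strings point into */
              const gchar *from;      /* the mmap()ed data-file */
              const gchar *subject;
              const gchar *cc;
              const gchar *to;
      } SummaryItem;

      typedef struct {
              SummaryItem **items;    /* ~1000 items per block */
              guint         n_items;
      } SummaryBlock;

      typedef struct {
              SummaryBlock **blocks;  /* n blocks per summary */
              guint          n_blocks;
      } Summary;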

New:

o. Writing the data (persisting it)
o. Freezing and thawing
o. Keeping state about flag-changes, expunges and appends

Missing:

o. Defining the filenames of summary blocks. Right now only one
   summary block (number 0) will be created for all items, resulting
   in three files: data_0.mmap, index_0.idx and flags_0.idx.

   In future the idea is to have data_n.mmap, index_n.idx and
   flags_n.idx files (three files per summary block created).

   50,000 items will then effectively result in 50 data_n.mmap files
   being mmap()ed if the entire folder is needed. If a search on the
   summary's data causes hits in only 10 of the 50 files, then only
   those 10 files will be mmap()ed.

   Hence the grouping by sequence number: most searches yield results
   that are close together in time, and sequence numbers on IMAP
   servers are usually also relatively grouped in time (depending on
   various things). A sketch of the naming scheme follows below.
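
   To illustrate, a hypothetical helper that derives the block number
   from a sequence number and builds the three per-block filenames
   (ITEMS_PER_BLOCK and the dir parameter are illustrative):

      #include <glib.h>

      #define ITEMS_PER_BLOCK 1000   /* ~1000 summary items per block */

      static void
      summary_block_filenames (const gchar *dir, guint seq, gchar **data,
                               gchar **index, gchar **flags)
      {
              guint n = seq / ITEMS_PER_BLOCK; /* group on sequence number */

              *data  = g_strdup_printf ("%s/data_%u.mmap", dir, n);
              *index = g_strdup_printf ("%s/index_%u.idx", dir, n);
              *flags = g_strdup_printf ("%s/flags_%u.idx", dir, n);
      }

   With 50,000 items this yields blocks 0 .. 49; a search hitting only
   blocks 3 and 7 needs just data_3.mmap and data_7.mmap mapped.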

o. Error strategy: what if there's not enough space to write the summary
   files? What if a file's gone missing?

o. A flock(): what if a second process tries to access the same mapped
   files? I think that by just flock()-ing the persisting functions we
   are relatively safe already (I just wonder what happens to my
   read-only mapping if a rename()-overwrite happens on a mapped
   file). A sketch of such locking follows below.

   Of course the advice for application developers is to either use a
   new cache-dir per process or to have a service that hands the data
   to both applications over an IPC system (but that misses the point
   of protecting the processes from influencing each other: what if
   the app developer still did it wrong? What can we do about that?)

   -- I know this is hard to cope with; perhaps just a g_critical()
      and an abort() if we detect this situation? (how do we detect
      it?) --
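
   Something like this is what I have in mind for the flock()-ing (the
   lock-file and the callback signature are illustrative):

      #include <sys/file.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <glib.h>

      /* Take an exclusive advisory lock around a persist: a second
       * process using the same cache-dir blocks here instead of
       * interleaving its writes with ours. Note that this does NOT
       * protect a read-only mapping against a rename()-overwrite. */
      static gboolean
      with_summary_lock (const gchar *lockfile,
                         void (*persist) (gpointer), gpointer user_data)
      {
              int fd = open (lockfile, O_CREAT | O_RDWR, 0600);

              if (fd < 0 || flock (fd, LOCK_EX) < 0) {
                      if (fd >= 0)
                              close (fd);
                      return FALSE;
              }

              persist (user_data);

              flock (fd, LOCK_UN);
              close (fd);

              return TRUE;
      }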


Writing strategy:

o. I keep a "has_flagchg", a "has_expunges" and a "has_appends". These
   are the three types of changes that are possible for a summary. I
   keep these booleans per summary block.

o. The functions summary_item_set_flags and summary_add_item, and the
   functions summary_expunge_item_by_uid and summary_expunge_item_by_seq,
   will modify the values of those booleans (sketched below).
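
   For example (the SummaryItem and SummaryBlock layouts are trimmed
   to what this sketch needs):

      #include <glib.h>

      typedef struct { guint32 flags; } SummaryItem;   /* trimmed */

      typedef struct {
              GPtrArray *items;
              gboolean   has_flagchg;    /* a flag was changed  */
              gboolean   has_expunges;   /* an item was removed */
              gboolean   has_appends;    /* an item was added   */
      } SummaryBlock;

      static void
      summary_item_set_flags (SummaryBlock *block, SummaryItem *item,
                              guint32 flags)
      {
              item->flags = flags;
              block->has_flagchg = TRUE;
      }

      static void
      summary_add_item (SummaryBlock *block, SummaryItem *item)
      {
              g_ptr_array_add (block->items, item);
              block->has_appends = TRUE;
      }

   The two expunge functions set has_expunges in the same way.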

o. The summary_freeze function will make a function called
   summary_block_persist refrain from actually writing, for every
   summary block in the summary passed as parameter to summary_freeze.

o. The summary_thaw function will unset the freeze on each summary
   block in the summary passed as parameter to summary_thaw. On top of
   that it will call summary_block_persist for each summary block in
   that summary (see the sketch below).
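
   In sketch form (a freeze count per block; how the freeze is
   actually represented is an implementation detail):

      #include <glib.h>

      typedef struct {
              guint frozen;  /* persist refrains from writing while > 0 */
              /* ... dirty booleans, items, ... */
      } SummaryBlock;

      typedef struct {
              SummaryBlock **blocks;
              guint          n_blocks;
      } Summary;

      static void summary_block_persist (SummaryBlock *block); /* below */

      static void
      summary_freeze (Summary *summary)
      {
              guint i;

              for (i = 0; i < summary->n_blocks; i++)
                      summary->blocks[i]->frozen++;
      }

      static void
      summary_thaw (Summary *summary)
      {
              guint i;

              for (i = 0; i < summary->n_blocks; i++) {
                      if (summary->blocks[i]->frozen > 0)
                              summary->blocks[i]->frozen--;

                      summary_block_persist (summary->blocks[i]);
              }
      }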

o. The summary_block_persist function checks what the best write
   strategy will be by evaluating the booleans has_flagchg,
   has_expunges and has_appends (see the sketch below).

   - If has_flagchg but not has_expunges and not has_appends, then a
     function that just writes the flags-file is used to persist the
     summary block.

   - Else, if either has_expunges or has_appends, then a function that
     writes the flags-file, the index-file and the data-file is used
     to persist the summary block.
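
   Which in code amounts to (write_flags_file and write_all_files are
   illustrative helper names, and resetting the booleans afterwards is
   an assumption of the sketch):

      static void write_flags_file (SummaryBlock *block);
      static void write_all_files  (SummaryBlock *block);

      static void
      summary_block_persist (SummaryBlock *block)
      {
              if (block->frozen > 0)
                      return;   /* summary_freeze is in effect */

              if (block->has_flagchg &&
                  !block->has_expunges && !block->has_appends) {
                      /* only flags changed: rewrite flags_n.idx alone */
                      write_flags_file (block);
              } else if (block->has_expunges || block->has_appends) {
                      /* rewrite flags-, index- and data-file together */
                      write_all_files (block);
              }

              block->has_flagchg  = FALSE;
              block->has_expunges = FALSE;
              block->has_appends  = FALSE;
      }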

o. Persisting a summary block happens by first sorting all strings by
   occurrence and making them unique. The unique strings are then
   written in that sort order to the data-file, and offsets to the
   strings are stored into the summary item pointers using ftell().

   The index-file is written using the pointers of the summary items,
   while the flags-file is written using the flags of the summary
   items.

   The data-file is then mmap()ed and the summary items re-prepared.

   This way the summary block is persisted in a VmRSS-friendly way (a
   condensed sketch follows below).
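
   A condensed sketch of that write path (sorting by occurrence is
   simplified here to a plain sort-and-unique pass, and the on-disk
   layout is illustrative):

      #include <stdio.h>
      #include <string.h>
      #include <glib.h>

      static gint
      cmp_str (gconstpointer a, gconstpointer b)
      {
              return strcmp (*(const gchar * const *) a,
                             *(const gchar * const *) b);
      }

      /* Write each unique string once to the data-file, remembering
       * its offset with ftell(), then write one offset per item
       * string to the index-file, in the items' original order. */
      static void
      write_data_and_index (GPtrArray *strings, FILE *dataf, FILE *indexf)
      {
              GHashTable *offsets = g_hash_table_new (g_str_hash, g_str_equal);
              GPtrArray  *sorted  = g_ptr_array_sized_new (strings->len);
              guint i;

              for (i = 0; i < strings->len; i++)
                      g_ptr_array_add (sorted, g_ptr_array_index (strings, i));
              g_ptr_array_sort (sorted, cmp_str);

              for (i = 0; i < sorted->len; i++) {
                      gchar *s = g_ptr_array_index (sorted, i);

                      if (g_hash_table_lookup_extended (offsets, s, NULL, NULL))
                              continue;   /* duplicate: written already */

                      g_hash_table_insert (offsets, s,
                                           GINT_TO_POINTER (ftell (dataf)));
                      fwrite (s, 1, strlen (s) + 1, dataf);  /* with NUL */
              }

              for (i = 0; i < strings->len; i++) {
                      guint32 off = GPOINTER_TO_INT (g_hash_table_lookup (
                              offsets, g_ptr_array_index (strings, i)));

                      fwrite (&off, sizeof (off), 1, indexf);
              }

              g_ptr_array_free (sorted, TRUE);
              g_hash_table_destroy (offsets);
      }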

o. When adding summary items to the summary (which will select a
   summary block for the item using the requested sequence number),
   the caller must attempt to avoid string duplicates for the CC and
   TO fields of the items by sorting the addresses in the items' comma
   separated strings. Currently the experimental example does this for
   you. This further reduces VmRSS, as more strings become
   byte-identical duplicates that are stored only once (see the sketch
   below).
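
   For instance (illustrative; real-world address lists may need
   smarter parsing than a plain split on commas):

      #include <stdlib.h>
      #include <string.h>
      #include <glib.h>

      static gint
      cmp_addr (gconstpointer a, gconstpointer b)
      {
              return strcmp (*(const gchar * const *) a,
                             *(const gchar * const *) b);
      }

      /* Normalise a comma separated address list so that two
       * semantically equal CC/TO headers become byte-identical and
       * can be deduplicated when the data-file is written. */
      static gchar *
      normalise_addresses (const gchar *list)
      {
              gchar **addrs = g_strsplit (list, ",", -1);
              guint   len   = g_strv_length (addrs);
              gchar  *result;
              guint   i;

              for (i = 0; i < len; i++)
                      g_strstrip (addrs[i]);   /* drop surrounding spaces */

              qsort (addrs, len, sizeof (gchar *), cmp_addr);

              result = g_strjoinv (",", addrs);
              g_strfreev (addrs);

              return result;
      }

   Both "b@x.org, a@x.org" and "a@x.org,b@x.org" then end up as the
   single in-memory string "a@x.org,b@x.org".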




Please test :)



-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be



Attachment: mytest9.tar.gz
Description: application/compressed-tar


