landing g_slice_*() in glib



hey all.

i've committed the new memory allocator to glib HEAD now.
it replaces the old, slow and memory-bloated memchunk implementation
and should provide significant speedups and memory savings over
using the system's malloc()/free() implementation.

gslice.c contains a large comment which outlines the basic
allocator structure. i'm appending it to this email for interested
readers, and the really curious can go on to read the two slab
allocator papers mentioned in the comment, which served as the basis
for the implementation ;)

to sum it up, the new slab allocator uses posix_memalign(3) and free(3)
to optimize allocations of many equally sized chunks, and we now have
per-thread free lists (the magazine layer) to quickly satisfy allocation
requests for already known structure sizes.
this is accompanied by extra caching logic that can keep freed
memory around for approximately 15 seconds before returning it to the system.
memory left over from alignment constraints is used for cache colorization
(random distribution of chunk addresses) to improve processor cache
utilization. to improve SMP scalability, the caching layer can adapt
itself to the high lock contention that can occur with multiple threads on
SMP systems.
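
to give a rough feel for the magazine idea, here's a much simplified
sketch (struct and function names are made up for this illustration and
don't match gslice.c):

#include <glib.h>

/* much simplified per-thread magazine: a small stack of recently freed
 * chunks of one fixed size; names here are invented for illustration */
typedef struct {
  gpointer chunks[64];
  guint    n_chunks;
} Magazine;

static gpointer
magazine_pop (Magazine *mag)
{
  /* satisfy an allocation from the thread-local free list if possible */
  if (mag->n_chunks > 0)
    return mag->chunks[--mag->n_chunks];
  return NULL;  /* empty: refill with a full magazine from the global depot */
}

static gboolean
magazine_push (Magazine *mag, gpointer chunk)
{
  /* stash a freed chunk for quick reuse by the same thread */
  if (mag->n_chunks < G_N_ELEMENTS (mag->chunks))
    {
      mag->chunks[mag->n_chunks++] = chunk;
      return TRUE;
    }
  return FALSE; /* full: hand the magazine back to the depot */
}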

there's also a new test/profiling program now, glib/tests/slice-test,
which allocates lots of randomly sized structures and releases/reallocates
them in random order.
here are the timing comparisons of system malloc vs. just the new slab
allocator (used internally by gslice.c) and then the slab allocator plus
magazine cache as it's available through g_slice_new() and g_slice_free():

g_malloc/g_free:
  Starting 1 threads allocating random blocks <= 512 bytes with seed=123456789 using system malloc
  real    0m39.502s
internal slab allocator:
  Starting 1 threads allocating random blocks <= 512 bytes with seed=123456789 using slab allocator
  real    0m20.064s
g_slice_*() (slab allocator + magazine cache):
  Starting 1 threads allocating random blocks <= 512 bytes with seed=123456789 using slab allocator + magazine cache
  real    0m13.888s

that's single-threaded on a 1.8GHz Athlon with 512KB cache, glibc-2.3.2. malloc
takes almost 3 times as long as gslice.c here.
for greater block sizes, malloc performs a lot worse:

g_malloc/g_free:
  Starting 1 threads allocating random blocks <= 1024 bytes with seed=123456789 using system malloc
  real    1m40.123s
internal slab allocator:
  Starting 1 threads allocating random blocks <= 1024 bytes with seed=123456789 using slab allocator
  real    0m26.040s
g_slice_*() (slab allocator + magazine cache):
  Starting 1 threads allocating random blocks <= 1024 bytes with seed=123456789 using slab allocator + magazine cache
  real    0m16.784s

and here's a test across multiple CPUs (window.gnome.org, quad Xeon, 2.8GHz,
unfortunately tested under load and without support for sched_setaffinity(2)):

  Starting 1 threads allocating random blocks <= 512 bytes with seed=123456789 using slab allocator + magazine cache
  real    0m9.099s
  Starting 2 threads allocating random blocks <= 512 bytes with seed=123456789 using slab allocator + magazine cache
  real    0m9.945s
  Starting 3 threads allocating random blocks <= 512 bytes with seed=123456789 using slab allocator + magazine cache
  real    0m14.043s
  Starting 4 threads allocating random blocks <= 512 bytes with seed=123456789 using slab allocator + magazine cache
  real    0m18.680s

the seed of the random block allocator was fixed to ensure the same allocation
pattern was used to produce comparable runs.
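
for reference, the allocation pattern boils down to roughly the following
loop (a simplified sketch with made-up constants, not the actual slice-test
code):

#include <glib.h>

#define N_SLOTS   4096   /* number of live blocks kept around (made up here) */
#define MAX_SIZE   512   /* maximum random block size in bytes */

static void
random_slice_test (guint32 seed, guint n_rounds)
{
  GRand    *rand = g_rand_new_with_seed (seed);  /* fixed seed => identical pattern per run */
  gpointer  blocks[N_SLOTS] = { NULL, };
  gsize     sizes[N_SLOTS] = { 0, };
  guint     i, j;

  for (i = 0; i < n_rounds; i++)
    {
      j = g_rand_int_range (rand, 0, N_SLOTS);    /* pick a random slot */
      if (blocks[j])
        g_slice_free1 (sizes[j], blocks[j]);      /* release in random order */
      sizes[j] = g_rand_int_range (rand, 1, MAX_SIZE + 1);
      blocks[j] = g_slice_alloc (sizes[j]);       /* reallocate with a random size */
    }
  for (j = 0; j < N_SLOTS; j++)
    if (blocks[j])
      g_slice_free1 (sizes[j], blocks[j]);
  g_rand_free (rand);
}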


for large block sizes, g_slice_new() and g_slice_alloc() will automatically
delegate to the system malloc implementation. so for newly written code, as
long as objects are not resized during their lifetime and the object size
used at allocation time is still available when freeing, it is recommended
to use the new g_slice_*() API instead of g_malloc() and friends.
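
a minimal usage example (FooData is of course just a made-up structure here):

#include <glib.h>

typedef struct {
  guint  id;
  gchar *name;
} FooData;   /* hypothetical example structure */

static void
foo_example (void)
{
  FooData  *foo;
  gpointer  mem;

  /* allocate a zero-initialized chunk sized for FooData */
  foo = g_slice_new0 (FooData);
  foo->id = 42;

  /* the type (and thus the size) must still be known when freeing */
  g_slice_free (FooData, foo);

  /* untyped variants for sizes only known at runtime */
  mem = g_slice_alloc (128);
  g_slice_free1 (128, mem);
}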

the allocator can be considered mostly ready at this point; what's left on
my TODO for it is:
- some additional test/profiling programs that should go into CVS
- look into using __thread instead of g_private_get() to speed up
  the common code path
- do controlled profiling on multi-CPU machines to measure scalability with
  increasing numbers of processors.
- do more realistic profiling, preferably with memory-allocation-intensive
  apps that don't require user interaction.
especially in the latter area, help is appreciated, i.e. if people have
specific memory benchmarking scenarios that can be made to work with glib
HEAD, please speak up ;)


from gslice.c:
/* the GSlice allocator is split up into 4 layers, roughly modelled after the slab
 * allocator and magazine extensions as outlined in:
 * + [Bonwick94] Jeff Bonwick, The slab allocator: An object-caching kernel
 *   memory allocator. USENIX 1994, http://citeseer.ist.psu.edu/bonwick94slab.html
 * + [Bonwick01] Bonwick and Jonathan Adams, Magazines and vmem: Extending the
 *   slab allocator to many CPUs and arbitrary resources.
 *   USENIX 2001, http://citeseer.ist.psu.edu/bonwick01magazines.html
 * the layers are:
 * - the thread magazines. for each (aligned) chunk size, a magazine (a list)
 *   of recently freed and soon to be allocated chunks is maintained per thread.
 *   this way, most alloc/free requests can be quickly satisfied from per-thread
 *   free lists which only require one g_private_get() call to retrieve the
 *   thread handle.
 * - the magazine cache. allocating and freeing chunks to/from threads only
 *   occurs at magazine sizes from a global depot of magazines. the depot
 *   maintains a 15 second working set of allocated magazines, so full
 *   magazines are not allocated and released too often.
 *   the chunk size dependent magazine sizes automatically adapt (within limits,
 *   see [3]) to lock contention to properly scale performance across a variety
 *   of SMP systems.
 * - the slab allocator. this allocator allocates slabs (blocks of memory) close
 *   to the system page size or multiples thereof which have to be page aligned.
 *   the blocks are divided into smaller chunks which are used to satisfy
 *   allocations from the upper layers. the space provided by the remainder of
 *   the chunk size division is used for cache colorization (random distribution
 *   of chunk addresses) to improve processor cache utilization. multiple slabs
 *   with the same chunk size are kept in a partially sorted ring to allow O(1)
 *   freeing and allocation of chunks (as long as the allocation of an entirely
 *   new slab can be avoided).
 * - the page allocator. on most modern systems, posix_memalign(3) or
 *   memalign(3) should be available, so this is used to allocate blocks with
 *   system page size based alignments and sizes or multiples thereof.
 *   if no memalign variant is provided, valloc() is used instead and
 *   block sizes are limited to the system page size (no multiples thereof).
 *   as a fallback, on systems without even valloc(), a malloc(3)-based page
 *   allocator with alloc-only behaviour is used.
 *
 * NOTES:
 * [1] some systems' memalign(3) implementations may rely on boundary tagging for
 *     the handed out memory chunks. to avoid excessive page-wise fragmentation,
 *     we reserve 2 * sizeof (void*) per block size for the system's memalign(3),
 *     specified in NATIVE_MALLOC_PADDING.
 * [2] the slab allocator alone already provides a fast and efficient
 *     allocator, it doesn't properly scale beyond single-threaded use though.
 *     also, the slab allocator implements eager free(3)-ing, i.e. it does not
 *     provide any form of caching or working set maintenance. so if used alone,
 *     it's vulnerable to thrashing for sequences of balanced (alloc, free) pairs
 *     at certain thresholds.
 * [3] magazine sizes are bound by an implementation specific minimum size and
 *     a chunk size specific maximum to limit magazine storage sizes to roughly
 *     16KB.
 * [4] allocating ca. 8 chunks per block/page keeps a good balance between
 *     external and internal fragmentation (<= 12.5%) [Bonwick94]
 */
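
and just to illustrate the lowest layer, here's a rough sketch of the
fallback logic described above (HAVE_POSIX_MEMALIGN and allocate_pages are
placeholders, not the actual gslice.c symbols; the final malloc(3)-based
alloc-only fallback is omitted):

#include <stdlib.h>
#include <unistd.h>

static void*
allocate_pages (size_t size)   /* size: system page size or a multiple thereof */
{
  void *mem = NULL;
#ifdef HAVE_POSIX_MEMALIGN
  /* preferred: page aligned blocks of arbitrary page-multiple sizes */
  if (posix_memalign (&mem, sysconf (_SC_PAGESIZE), size) != 0)
    mem = NULL;
#else
  /* fallback: valloc() returns page aligned memory; in this case block
   * sizes are limited to a single page (no multiples thereof) */
  mem = valloc (size);
#endif
  return mem;
}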

---
ciaoTJ


