Re: Review of the new ringbuffer

From: Stefan Westerfeld <stefan space twc de>
To: Tim Janik <timj gtk org>
Cc: beast gtk org
Subject: Re: Review of the new ringbuffer
Date: Fri, 26 Jan 2007 20:57:44 +0100

   Hi!

(this mail exists twice, because I forgot CCing the beast list the first
fime I sent it - sorry for any inconvenience caused by this)

On Thu, Jan 25, 2007 at 10:23:44AM +0100, Tim Janik wrote:
> On Thu, 25 Jan 2007, Stefan Westerfeld wrote:
> > Here are a few thoughts on Tims new Birnet::Atomic::RingBuffer<T>.
> 
> > |   volatile uint m_wmark, m_rmark;
> 
> > | public:
> > |   uint
> > |   write (uint     length,
> > |          const T *data,
> > |          bool     partial = true)
> > |   {
> > |     const uint orig_length = length;
> > |     const uint rm = Atomic::uint_get (&m_rmark);
> > |     uint wm = Atomic::uint_get (&m_wmark);
> > |     uint space = (m_size - 1 + rm - wm) % m_size;
> > |     if (!partial && length > space)
> > |       return 0;
> > |     while (length)
> > |       {
> > |         if (rm <= wm)
> > |           space = m_size - wm + (rm == 0 ? -1 : 0);
> > |         else
> > |           space = rm - wm -1;
> > |         if (!space)
> > |           break;
> > |         space = MIN (space, length);
> > |         for (int i = 0; i < space; i++)
> > |           m_buffer[wm + i] = data[i];
> >
> > Its better to use std::copy, because for some data types (primitive
> > types), then memmove will be used (template specialization).
> 
> the way the code is written above, GCC can auto-vectorize it,
> is that true for std::copy as well?

I think the only way to find out what is faster is a benchmark, and even
then, the benchmark may result in different results, depending on

 * the version of g++, which affects
   - how std::copy is implemented in the STL
   - how good the auto vectorizer is
 * whether auto vecorization can be performed where the ringbuffer was
   instantiated (for instance, on x86 targets, libbse does not have SSE
   instructions, so auto vectorization is no advantage here)
 * the libc version (as the version of std::copy I examined expands to
   memmove, so the question is really: how does memmove compare with a
   vectorized loop, performance wise)
 * whether the code uses the __restrict__ keyword - my experience is
   that g++ does not vectorize loops when the areas pointed to could
   overlap

I wrote a benchmark which performs a similar loop like the one you use,
and here are execution times on my Debian/AMD64:

g++-4.0 --version
g++-4.0 (GCC) 4.0.4 20060904 (prerelease) (Debian 4.0.3-7)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

========== copy_bench =========
char (loop): 11.337754 sec
char (copy): 0.573089 sec
int (loop): 2.841360 sec
int (copy): 0.575760 sec
float (loop): 2.841312 sec
float (copy): 0.572800 sec
========== copy_bench_tv =========
char (loop): 11.357746 sec
char (copy): 0.567677 sec
int (loop): 1.207976 sec
int (copy): 0.572344 sec
float (loop): 0.969715 sec
float (copy): 0.568255 sec
========== copy_bench_res =========
char (loop): 11.337712 sec
char (copy): 0.810970 sec
int (loop): 2.844015 sec
int (copy): 0.802650 sec
float (loop): 2.840387 sec
float (copy): 0.806874 sec
========== copy_bench_res_tv =========
char (loop): 1.202510 sec
char (copy): 0.575979 sec
int (loop): 1.203514 sec
int (copy): 0.575684 sec
float (loop): 0.966977 sec
float (copy): 0.571858 sec

g++-4.1 --version
g++-4.1 (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

========== copy_bench =========
char (loop): 11.336817 sec
char (copy): 0.567631 sec
int (loop): 3.787092 sec
int (copy): 0.569993 sec
float (loop): 3.786067 sec
float (copy): 0.568548 sec
========== copy_bench_tv =========
char (loop): 11.334109 sec
char (copy): 0.971198 sec
int (loop): 3.785412 sec
int (copy): 0.967551 sec
float (loop): 3.784448 sec
float (copy): 0.974924 sec
========== copy_bench_res =========
char (loop): 11.335676 sec
char (copy): 0.572069 sec
int (loop): 3.785080 sec
int (copy): 0.569166 sec
float (loop): 3.785338 sec
float (copy): 0.569591 sec
========== copy_bench_res_tv =========
char (loop): 1.206806 sec
char (copy): 0.573147 sec
int (loop): 3.440046 sec
int (copy): 0.574105 sec
float (loop): 2.845281 sec
float (copy): 0.575720 sec

/usr/lib/gcc-snapshot/bin/g++ --version
g++ (GCC) 4.3.0 20061022 (experimental)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

========== copy_bench =========
char (loop): 11.333177 sec
char (copy): 0.569332 sec
int (loop): 3.784759 sec
int (copy): 0.571180 sec
float (loop): 3.784947 sec
float (copy): 0.572712 sec
========== copy_bench_tv =========
char (loop): 11.336388 sec
char (copy): 0.979739 sec
int (loop): 3.784161 sec
int (copy): 0.985023 sec
float (loop): 3.784255 sec
float (copy): 0.984372 sec
========== copy_bench_res =========
char (loop): 11.336951 sec
char (copy): 0.968223 sec
int (loop): 3.784592 sec
int (copy): 0.975108 sec
float (loop): 3.785653 sec
float (copy): 0.967514 sec
========== copy_bench_res_tv =========
char (loop): 1.320710 sec
char (copy): 0.578194 sec
int (loop): 2.967469 sec
int (copy): 0.575978 sec
float (loop): 3.318709 sec
float (copy): 0.570503 sec

A _tv suffix indicates that the tree vectorizer is enabled, a _res
suffix indicates that the __restrict__ keyword was used to mark the
blocks in the assignment loop as non overlapping (otherwise the
vectorizer can refuse the loop).

What you see is that on my system, in every case std::copy was faster
than a hand written loop. Vectorization does make a difference, but
still the vectorized loop is slower than std::copy in every case.

I'll attach the benchmark and a Makefile below. Your machine(s) may
behave quite a bit different from mine. Run something like:

for i in g++-4.0 g++-4.1 /usr/lib/gcc-snapshot/bin/g++; do make clean all CXX=$i >/dev/null; make bench CXX=$i; done > copy_bench_results.txt

To test copy bench with a few compilers.

And no, I can not explain why sometimes std::copy only takes 0.57
seconds or so, whereas sometimes it takes 0.98 seconds. I have no idea.

> > |         wm = (wm + space) % m_size;
> > |         data += space;
> > |         length -= space;
> > |       }
> >
> > Systems with memory barriers need a write memory barrier here. In the
> > jack driver code this looks like:
> 
> write/read memory barriers are implemented as part of Atomic::uint_get
> and Atomic::uint_set. also, volatile variables haven't been updated
> here, so there's no need for a barrier at this point.

Are you sure? Suppose Atomic::uint_set looked like this:

void
Atomic::uint_set (volatile int *aint,
                  int           arg)
{
	*aint = arg;
	write_memory_barrier();
}

then setting m_wmark would set the index to the new value in the write
cache of the processor (core) executing the thread. Now if the memory
barrier is encountered, the write cache is written to main memory in
some order. But "in some order" could mean that the new value of w_mark
is written to main memory before the new value of the writes before
arrive in main memory.

http://trac.zeitherrschaft.org/zzub/browser/trunk/src/portaudio/src/hostapi/coreaudio/ringbuffer.c?rev=663

for instance has these barriers (before updating the index). Here is a
mail which briefly justifies their existence:

http://techweb.rfa.org/pipermail/portaudio/2006-May/005651.html

> > |         space = MIN (space, length);
> > I use std::min where I can, because it doesn't have the typical macro
> > problems (double args evaluation).
> 
> double evaluation isn't a problem here (on variables) and
> std::min has other problems (mostly wrg typing).

Yes, I know, so its a tradeoff between two solutions which don't really
exactly do what I think they should do...

   Cu... Stefan
-- 
Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan

#include <algorithm>
#include <sys/time.h>
#include <time.h>

#ifdef USE_RESTRICT
#define RESTRICT __restrict__
#else
#define RESTRICT
#warning "no restrict used, so probably the compiler will not vectorize the code"
#endif

template<typename T, bool USE_COPY>
class Buffer
{
  const int   m_size;
  T* RESTRICT m_data;
public:
  Buffer (int size) :
    m_size (size),
    m_data (new T[size])
  {
  }
  ~Buffer()
  {
    delete[] m_data;
  }
  void
  xcopy (const T* RESTRICT src, T* RESTRICT dest)
  {
    for (int i = 0; i < m_size; i++)
      dest[i] = src[i];
  }
  void
  set_data (const T* new_data)
  {
    if (USE_COPY)
      {
	std::copy (new_data, new_data + m_size, m_data);
      }
    else
      {
	xcopy (new_data, m_data);
      }
  }
};

double
gettime ()
{
  timeval tv;
  gettimeofday (&tv, 0);

  return double(tv.tv_sec) + double(tv.tv_usec) * (1.0 / 1000000.0);
}

template<typename T, bool USE_COPY> void
bench (const char *type)
{
  const int size = 4096 / sizeof (T);
  T	    src1[size];
  T	    src2[size];

  double start = gettime();
  Buffer<T, USE_COPY> buffer (size);
  for (int i = 0; i < 1000000; i++)
    {
      buffer.set_data (src1);
      buffer.set_data (src2);
    }
  double end = gettime();
  printf ("%s: %f sec\n", type, end - start);
}

int
main()
{
  bench<char,  false> ("char (loop)");
  bench<char,  true>  ("char (copy)");
  bench<int,   false> ("int (loop)");
  bench<int,   true>  ("int (copy)");
  bench<float, false> ("float (loop)");
  bench<float, true>  ("float (copy)");
  return 0;
}

CXXFLAGS = -O3
TARGETS = copy_bench copy_bench_tv copy_bench_res copy_bench_res_tv

all: $(TARGETS)

bench:
	$(CXX) --version
	@for i in $(TARGETS); do echo "========== $$i ========="; $$i; done

clean:
	rm -f $(TARGETS)

copy_bench: copy_bench.cc
	$(CXX) $(CXXFLAGS) -o $@ $<

copy_bench_tv: copy_bench.cc
	$(CXX) $(CXXFLAGS) -ftree-vectorize -ftree-vectorizer-verbose=1 -o $@ $<

copy_bench_res: copy_bench.cc
	$(CXX) $(CXXFLAGS) -DUSE_RESTRICT -o $@ $<

copy_bench_res_tv: copy_bench.cc
	$(CXX) $(CXXFLAGS) -ftree-vectorize -ftree-vectorizer-verbose=1 -DUSE_RESTRICT -o $@ $<

Follow-Ups:
- Re: Review of the new ringbuffer
  - From: Tim Janik

References:
- Review of the new ringbuffer
  - From: Stefan Westerfeld
- Re: Review of the new ringbuffer
  - From: Tim Janik

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]