Re: URGENT PATCH: pthreads lockup fix



Hi,

On 2001.08.20 01:25 Ali Akcaagac wrote:
> i went through all the pthreads related code and hung in the
> src/main-window.c source. where you give a gpointer of (void *)0.
> who wants the contents of address 0 ?

Sorry to deflate your balloon some, but the definition of NULL _is_ (void
*)0 in most implementations. You most likely just replaced one apple with
another apple :)
In any case, on a typical 32 bit platform a pointer is, in assembly, just
passed as a 32 bit value. Whatever it looks like in source, if that value
is 0, you have a null pointer.

So, (void *)0, (char *)0, (struct foo *)0 are all just the same: they're
translated into loading a 0 into a register or onto the stack. That, by
definition, is a null pointer. NULL is defined as a (void *) just to make
the compiler happy when passing NULL to a function expecting a typed
pointer, because C will accept a void * anywhere a typed pointer is legal,
except in pointer arithmetic or dereferencing.

In my opinion, the lockup is merely more common on SMP machines because
only SMP can achieve true concurrency; it can occur on UP machines as well.
Assume the following:

void a(...)
{
	lock_mutt();
	gdk_threads_enter();
	/* ... */
	gdk_threads_leave();
	unlock_mutt();
}

void b(...)
{
	gdk_threads_enter();
	lock_mutt();
	/* ... */
	unlock_mutt();
	gdk_threads_leave();
}

On UP machines, the probability of the scheduler interrupting function a
exactly between lock_mutt() and gdk_threads_enter() and running a thread
that calls function b during that window is very low; it may happen once a
week, or maybe never.

On SMP, on the other hand, it may frequently happen that the processor
executing function a is rescheduled to some other thread right after
calling lock_mutt(). In itself, that is not a real problem; however, since
there is more than one processor, another one may pick up a runnable thread
executing function b. Function b now executes gdk_threads_enter(), then
blocks on lock_mutt(). The next time the thread running function a becomes
runnable, it will block in gdk_threads_enter() because function b already
holds that lock. However, function b cannot proceed because a has already
called lock_mutt(). This is the classical deadlock that can only be
prevented by defining a lock hierarchy.
That would look like this:

Never call lock_mutt() unless you have called gdk_threads_enter() first.
Always release locks in the reverse order of acquiring them.

Used consistently, this would ensure that no deadlock between these 2 locks
can ever occur.
Balsa uses 3 different mutex locks I'm aware of: libbalsa has one, mutt has
another, and gdk also has a lock.
Any function using any of these locks must also be able to access gdk, so
gdk_threads_enter must be the outermost lock.
The next level of locking would be libbalsa, followed by mutt.

In the case of Balsa, the lock rules would have to look like this:

Always call gdk_threads_enter before taking any other lock.
Always take the libbalsa lock before locking mutt.
Always release locks in reverse order.

The entire source would have to be examined to determine whether this rule
is followed everywhere.
The easiest way to do this is by redefining these functions as macros
logging the function and line number of the call and checking the status of
the required locks. It would then be possible to determine which functions
do not adhere to these rules and make the proper changes.
Also, this may be a good time to rethink the locking strategy as a whole.
There is really no reason to differentiate between locking libbalsa and
locking libmutt. Both could always be locked at the same time and released
at the same time. That would simplify much of the Balsa locking already.
It may also turn out that no function really needs to hold the gdk lock
and the library locks at the same time. That would make for more frequent
locking and unlocking, but possibly improve concurrency somewhat.
Of course, tight loops that fetch data from a library call requiring the
library lock and display it within the same loop _should_ be reworked such
that the data is acquired in one loop (under the library lock) and stored
into a g_list; then, after releasing the library lock and acquiring the
gdk lock, it is moved from the g_list into the display element. Locking
isn't cheap, so it shouldn't be performed from within tight loops.

I'm reasonably sure that, if these steps are followed, a definite cause for
the lockups can be found.
I also refer to earlier observations that lockups do not occur when all I/O
from the background mail-checking thread is suppressed by checking the
"Quiet background check" option. It appears that the interaction between
the background mail check (requiring the library lock) and the foreground
display task (requiring the gdk lock) may be instrumental in this lockup
scenario.

I really don't have the time to dig into this; I haven't had any lockups
since I implemented "Quiet background check" and enabled it, so I have
found a workaround that keeps Balsa usable for me. But maybe, using the
pointers given above, someone else more familiar with the code involved
can find something...

Melanie




