Re: An interesting deadlock in ORBit2...



Hi Justin,

On Tue, 2003-11-18 at 09:19, Justin Schoeman wrote:
> I have just run into an interesting deadlock in ORBit2... In a fairly 
> heavily multithreaded environment, after about 4 days of operation, the 
> system hangs up in the following state:

	So, I just had a look at this:

> Thread 3 (Thread 2051 (LWP 13362)):
> #0  0x4017a384 in write () from /lib/libc.so.6
> #1  0x402479d4 in __DTOR_END__ () from 
> /home/justin/orbit2/lib/libORBit-2.so.0
> #2  0x4020fa89 in giop_thread_push_recv (ent=0xbffff210) at giop.c:568

	This seems likely to be:

	giop_incoming_signal_T with wakeup_mainloop inlined inside it (I
guess).

	wakeup_mainloop spins trying to write an 'A' to the wakeup pipe of the
'main context'. We should really be using something associated with the
'wake_context' itself here, I think, but that's not the problem.

	Unfortunately, if the pipe buffer is full we spin indefinitely, which is
pretty silly: if we get EAGAIN, we know that a wakeup is already pending
and the mainloop is being woken up :-)

	OTOH, the cause of the problem for you looks to be that you're not
running the glib mainloop in the first thread, either through
CORBA_ORB_run or a g_main_loop_do_foo type thing, so presumably nothing
ever reads the wakeup pipe and it eventually fills up.

> #12 0x0805b313 in ien_init_fn (id=0) at hc_input_core.c:2196
> #13 0x08062a28 in threadpool_init_item (item=0x80b6480, pool=0xbffff580, 
> id=0)
>      at threadpool.c:119
> #14 0x08062ebe in threadpool_retrieve (pool=0xbffff580) at threadpool.c:236
> #15 0x0805c45e in main (argc=1, argv=0xbffff744) at hc_input_core.c:2531
> #16 0x400b4280 in __libc_start_main () from /lib/libc.so.6

	This main thread looks slightly odd to me :-)

> The system deadlocks at this point, with all threads permanently stuck 
> in that state.

	Sure; it's spinning on EAGAIN. I think I'll stop it spinning on EAGAIN
and have it retry only on EINTR; that may fix the immediate problem for
you.
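
	To make the idea concrete, here is a rough sketch of a wakeup write
that retries only on EINTR and treats EAGAIN as "already woken". This is
not the actual ORBit2 code: wakeup_pipe_poke and make_nonblocking_pipe
are hypothetical names, and it assumes the write end of the pipe is in
non-blocking mode.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create a pipe with both ends non-blocking.
 * Returns 0 on success, -1 on failure. */
static int make_nonblocking_pipe(int fds[2])
{
    if (pipe(fds) < 0)
        return -1;

    for (int i = 0; i < 2; i++) {
        int flags = fcntl(fds[i], F_GETFL);
        if (flags < 0 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) < 0) {
            close(fds[0]);
            close(fds[1]);
            return -1;
        }
    }
    return 0;
}

/* Hypothetical helper: queue one wakeup byte on the pipe.
 * Returns 0 if a wakeup is pending afterwards, -1 on a real error. */
static int wakeup_pipe_poke(int write_fd)
{
    const char c = 'A';

    assert(write_fd >= 0);

    for (;;) {
        ssize_t n = write(write_fd, &c, 1);

        if (n == 1)
            return 0;           /* byte queued, reader will wake up */
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;           /* pipe full: a wakeup is already pending */
        return -1;              /* anything else is a genuine failure */
    }
}
```

	With this shape a full pipe no longer wedges the writer; the pipe
still stays full until something reads from it, so the mainloop needs to
run in any case.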

> The one thing I do see is that giop_incoming_signal_T is being called 
> from giop_thread_push_recv, apparently without the lock being held.  I 
> am correct that this is the problem, or does anybody else have any other 
> ideas as to what I should look at?

	I don't think that's an issue - we can signal the condition without
problems and using the wakeup is fairly rare and safe anyhow.

	HTH,

		Michael.

-- 
 michael@ximian.com  <><, Pseudo Engineer, itinerant idiot



