Re: An interesting deadlock in ORBit2...



Michael Meeks wrote:
> Hi Justin,
> 
> On Tue, 2003-11-18 at 09:19, Justin Schoeman wrote:
> 
>>I have just run into an interesting deadlock in ORBit2... In a fairly 
>>heavily multithreaded environment, after about 4 days of operation, the 
>>system hangs up in the following state:
> 
> 
> 	So; I just had a look at this;

Thanks!

> 
>>Thread 3 (Thread 2051 (LWP 13362)):
>>#0  0x4017a384 in write () from /lib/libc.so.6
>>#1  0x402479d4 in __DTOR_END__ () from /home/justin/orbit2/lib/libORBit-2.so.0
>>#2  0x4020fa89 in giop_thread_push_recv (ent=0xbffff210) at giop.c:568
> 
> 
> 	This seems likely to be:
> 
> 	giop_incoming_signal_T with wakeup_mainloop inlined inside it (I
> guess).
> 
> 	wakeup_mainloop spins trying to write an 'A' to the wakeup pipe of the
> 'main context' - we should really be using something associated with the
> 'wake_context' itself here I think; but that's not the problem.
> 
> 	Unfortunately if the buffer is full we spin indefinitely - which is
> pretty silly, since if we get EAGAIN, we know that the mainloop is being
> woken up :-)
> 
> 	OTOH - the cause of the problem for you looks like you're not running
> the glib mainloop in the first thread; either through CORBA_ORB_run or a
> g_main_loop_do_foo type thing.


Hmmm... I did not realise that I needed to start up a glib mainloop in 
the CORBA client... Is CORBA_ORB_init() sufficient to start the main 
loop?  If so, then this is not the problem, as CORBA_ORB_init() is 
called before any threads are started.

>>#12 0x0805b313 in ien_init_fn (id=0) at hc_input_core.c:2196
>>#13 0x08062a28 in threadpool_init_item (item=0x80b6480, pool=0xbffff580, id=0) at threadpool.c:119
>>#14 0x08062ebe in threadpool_retrieve (pool=0xbffff580) at threadpool.c:236
>>#15 0x0805c45e in main (argc=1, argv=0xbffff744) at hc_input_core.c:2531
>>#16 0x400b4280 in __libc_start_main () from /lib/libc.so.6
> 
> 
> 	Looks slightly odd to me this main thread :-)

That is Justin's Whacky Threading(TM) ;-)  All the init functions are 
called from within the parent thread...  There shouldn't be any problem 
with this though, it has been pretty thoroughly tested.

>>The system deadlocks at this point, with all threads permanently stuck 
>>in that state.
> 
> 
> 	Sure; it's spinning on EAGAIN; I think I'll stop it spinning on EAGAIN,
> and only on EINTR - that may fix the immediate problem for you.
> 
> 
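
For the archives, here is a minimal sketch of how I read that change: retry
the write only on EINTR, and treat EAGAIN as success, since a full pipe means
a wakeup is already pending. This is not the actual ORBit2 code;
wakeup_pipe_poke() is a hypothetical name, and I am assuming the write end of
the wakeup pipe has been set non-blocking.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not ORBit2's real function): write one byte to a
 * non-blocking wakeup pipe.
 * Returns 0 if the reader will be woken, -1 on any other error. */
static int wakeup_pipe_poke(int write_fd)
{
    const char c = 'A';
    ssize_t res;

    /* Retry only when the call was interrupted by a signal. */
    do {
        res = write(write_fd, &c, sizeof c);
    } while (res < 0 && errno == EINTR);

    if (res == 1)
        return 0;   /* byte queued, reader will wake up */

    if (res < 0 && errno == EAGAIN)
        return 0;   /* pipe full: earlier wakeups are still pending */

    return -1;      /* real failure, e.g. EBADF or EPIPE */
}
```

With this shape a full pipe can no longer make the caller spin, which is the
hang visible in thread 3 above.
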
>>The one thing I do see is that giop_incoming_signal_T is being called 
>>from giop_thread_push_recv, apparently without the lock being held.  Am 
>>I correct that this is the problem, or does anybody else have any other 
>>ideas as to what I should look at?
> 
> 
> 	I don't think that's an issue - we can signal the condition without
> problems and using the wakeup is fairly rare and safe anyhow.

OK - I just thought that all _T functions were supposed to be called 
with a lock held.  I did not look into it in too much detail.

BTW - I ran into a race in that "last unref in IO thread" patch...  It is 
rather difficult to trigger, but if the main thread adds another 
reference to the connection after the shutdown has been dispatched, then 
it will block indefinitely on g_cond_wait...

Thanks!
-justin



