Re: Serious Bonobo Problem for Sun



Michael:

Thanks for your quick reply!

Note: I am doing my tests using the Gnome 1.4 final tarballs, in other
words Bonobo 0.37 and OAF 0.6.5.  My system is Solaris 8 running on an
Ultra 10.

>         Thanks for your mail, luckily I think the problem is not as acute
> as it may appear on first glance.

That is good news to hear.

> > This problem can be demonstrated using the Bonobo test program
> > test-container which is found in the Bonobo source tree under tests.
> > First I run oaf-slay to make sure no oaf processes are running (and
> > then I verify this by running ps -ef | grep oaf | grep -v "grep oaf").
> > Then I follow these steps:
> >
> > $ test-container
> 
>         test-container is _incredibly_ old, horribly stale code that is
> not intended to be used by the public, I have just disabled it in the CVS
> build and added some horror type warnings about basing new code on it.
>  
>         If you try samples/controls/sample-control-container or
> samples/compound_doc/container/sample-container you will notice that these
> cleanup nicely - as expected.

Thanks for pointing me in the right direction.  I am now using the
samples/compound_doc/container/sample-container program for my tests.

I start this program after running oaf-slay and after verifying that no oaf
processes are running.

In the sample-container program I select File->Add Embeddable and add
Bonobo-Hello, then I select File->Add Embeddable again and add "Test Canvas
Item".  I see these processes when I run ps -ef | grep oaf | grep -v "grep oaf":

 bc99092 29601     1  0 16:40:54 ?        0:01 oafd --ac-activate --ior-output-fd=10
 bc99092 29605     1  0 16:41:06 ?        0:00 bonobo-sample-hello --oaf-activate-iid=OAFIID:Bonobo_Sample_Hello_EmbeddableFac
 bc99092 29609     1  0 16:41:21 ?        0:00 bonobo-sample-canvas-item --oaf-activate-iid=OAFIID:Bonobo_Sample_CanvasItem_Fa

When I select File->Exit and then re-run the command
ps -ef | grep oaf | grep -v "grep oaf", I see the following:

bc99092 29601     1  0 16:40:54 ?        0:01 oafd --ac-activate --ior-output-fd=10

You say that this should clean up nicely.  I do not think that this is a
Solaris-specific problem, because I tested the same thing on RedHat 6.2 using
the same versions of the libraries and it leaves the same process behind.
I have waited over 10 minutes for the process to clean itself up, but it is
still lingering.

When I re-run the same steps, but do a kill -9 <pid of sample-container>, I
see all three processes still running as follows:

 bc99092 29745     1  0 16:43:52 ?        0:00 bonobo-sample-canvas-item --oaf-activate-iid=OAFIID:Bonobo_Sample_CanvasItem_Fa
 bc99092 29737     1  0 16:43:46 ?        0:01 oafd --ac-activate --ior-output-fd=10
 bc99092 29741     1  0 16:43:48 ?        0:00 bonobo-sample-hello --oaf-activate-iid=OAFIID:Bonobo_Sample_Hello_EmbeddableFac

Obviously it would be nice if Bonobo cleaned up such processes in
situations like a crash.

> On Wed, 4 Apr 2001, Brian Cameron wrote:
> > Here at Sun we have serious problems running programs which use the
> > Bonobo architecture (like Evolution and Nautilus).  When a program
> > that uses Bonobo exits (either by exiting normally or by crashing),
> > oaf-processes are left running in the background.   
>  
>         Ok, so, there are several problems here. The first problem is
> built into the referencing scheme. An object has a reference count, but
> this contains no concept of ownership whatsoever. ie. it is impossible to
> tell who owns the 5 references it has. Consequently if someone comes
> along, references the server and then crashes or just leaks the reference,
> there is no possible way to detect this.
>  
>         Looked at from another angle, a process can be serving 10 other
> processes with controls. Should 1 process die, it is not correct to go
> round killing all the processes that it communicated with - even if this
> were possible.
>  
>         Consequently, in the case of pathological component failure, it
> will be the case that we get process leaks. The only solution to this is
> to minimise the likelihood of component failure.

I'm not sure that I completely agree with you on this point.  It seems
to me that Bonobo would be most robust if it did keep track of ownership:
each oafd process could keep track of which processes were using it and
periodically verify that those processes are still running.  If the
processes are gone, oafd would know to terminate properly.  While
"minimising the likelihood of component failure" may be a goal to strive
for, I do not think it is particularly realistic.
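
To illustrate the sort of bookkeeping I have in mind, here is a minimal
C sketch.  It is purely hypothetical: the client registry and function
names below are invented for illustration and are not taken from the
oafd sources.  The idea is simply to prune registered client PIDs with
kill(pid, 0) and exit once none remain:

    /* Hypothetical sketch -- not oafd code. */
    #include <errno.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Invented registry of client PIDs; a real oafd would fill this
     * in as clients register with the activation server. */
    #define MAX_CLIENTS 64
    static pid_t clients[MAX_CLIENTS];
    static int   n_clients = 0;

    /* Drop clients whose processes no longer exist. */
    static int
    prune_dead_clients (void)
    {
            int i, alive = 0;

            for (i = 0; i < n_clients; i++) {
                    /* kill (pid, 0) delivers no signal; it only checks
                     * that the process exists.  ESRCH means it is gone. */
                    if (kill (clients[i], 0) == 0 || errno != ESRCH)
                            clients[alive++] = clients[i];
            }
            n_clients = alive;
            return n_clients;
    }

    /* Run periodically from the daemon's main loop. */
    static void
    liveness_check (void)
    {
            if (prune_dead_clients () == 0)
                    exit (0);       /* no clients left: shut down */
    }

    /* Tiny demo: register our own parent as a "client" and poll once. */
    int
    main (void)
    {
            clients[n_clients++] = getppid ();
            liveness_check ();
            return 0;
    }

Of course this only works for clients on the local machine; a CORBA
client on another host has no PID we can probe, so I can see why
ownership tracking is hard in the general case.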

> >  This may not be a serious problem for single user machines, but on
> > multi-user servers this quickly becomes critical.  We have seen
> > multi-user servers with hundreds and sometimes thousands of oaf
> > processes left behind,
>  
>         Now, I suspect that the more telling problem is that AFAIK oafd
> doesn't time out after a while and shut down - and oafd can chew some
> serious resources ~ 1Mb + per process on my system.
>           
>         So, I suspect that 'oafd' proliferation is the real problem -
> could this be the case? In which case we need to ensure that it shuts
> itself down after a while [ this may already have been done ].

While I am not sure whether this has been done, it certainly sounds like
part of the solution.  It is not as robust as oafd keeping track of the
processes which are using it, but it sounds like it would resolve most of
the problems.  Perhaps I am wrong, but it seems that the following
situation would still be a problem if only such a timeout solution were
implemented:

  1. The user runs a program like Nautilus or Evolution which uses Bonobo.
  2. The program crashes, leaving oafd in a sick state.
  3. The user tries to re-run the program which re-attaches to the sick
     oafd process and causes the program to crash or otherwise not work
     properly.
  4. After waiting a mysterious period of time, oafd times out and then
     the program starts working again.  Or, if the user is experienced
     enough to know that running oaf-slay will help settle things, this
     could be done manually.

The above situation is not ideal, but it is certainly better than the
current state of affairs.  As you say, "oafd proliferation" is the
major issue here.  The fact that programs can sometimes be left in a
sick state is a secondary, but still troublesome, issue.
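
For what it is worth, the timeout half seems easy enough to prototype.
Below is a rough GLib sketch (again hypothetical: active_object_count()
and the two-minute interval are invented for illustration) of a daemon
that polls its own registry from the main loop and quits once it has
nothing left to serve:

    /* Hypothetical sketch -- not oafd code. */
    #include <glib.h>

    /* Invented stand-in for oafd's registry; a real daemon would count
     * the objects still registered with it. */
    static int
    active_object_count (void)
    {
            return 0;       /* pretend the last client just went away */
    }

    #define IDLE_TIMEOUT_MS (2 * 60 * 1000)     /* two minutes */

    static GMainLoop *loop;

    static gboolean
    idle_shutdown_check (gpointer data)
    {
            if (active_object_count () == 0) {
                    g_main_quit (loop);     /* idle: leave the main loop */
                    return FALSE;           /* remove this timeout source */
            }
            return TRUE;                    /* still busy: check again */
    }

    int
    main (void)
    {
            loop = g_main_new (FALSE);
            g_timeout_add (IDLE_TIMEOUT_MS, idle_shutdown_check, NULL);
            g_main_run (loop);
            return 0;
    }

Even then, a timeout only mitigates the leak; it does nothing for a
sick oafd that a restarted client re-attaches to, which is why I still
think some form of liveness tracking is worth having.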

> > We notice that when programs that use the Bonobo architecture crash,
> > the oaf processes are sometimes left in a unusable state.  Trying to
> > re-run the program causes the program to immediately crash or behave
> > strangely.  Running oaf-slay and then restarting corrects this
> > problem.
> 
>         oaf-slay is a program that causes a discontinuity in your
> component world - when you run it you inform the system that although
> idle references may still exist, this is not due to idle or slow
> TCP connections, and that there are no clients currently using their
> services, thus everything can be razed.

Shouldn't the oaf-slay script be used as a tool of last resort?  The
frequency with which a user needs to run this script in order to use
programs like Nautilus and Evolution is disturbing.
  
> > In both cases, the Bonobo architecture should be robust enough to
> > handle the situation.  When the program that launched the oaf process
> > exits (whether by choice or by crash), the oaf processes should
> > recognize this and quit.
> 
>         I agree that oafd should timeout after a while and exit.

Yes, that seems reasonable.

> > It would be very useful if we could ship a version of Bonobo without
> > this problem with the Sun version of Gnome 1.4.  If it is possible to
> > correct this problem in the very near future, then perhaps we could
> > explore the possibility of doing this.  Any other ideas/suggestions
> > would be appreciated.
> 
>         It should be relatively simple in fact, a few hacks in oaf/oafd,
> but perhaps it has been done already and/or is underway; Maciej?

Keep us posted.

> 	Since we'll all be at GUADEC for a while, it will probably look
> like we're ignoring you, but we're really not ... :-)

Understood, and thanks.

Brian




