Re: GDM interprocess communication problems



On Mon, Jan 19, 2004 at 03:31:56PM +0000, Anton Altaparmakov wrote:
> Normally, the GDM slave runs the Greeter and when a user clicks on the
> "Reboot" button, the Greeter exits returning DISPLAY_REBOOT.  The slave
> is waiting on its child (the Greeter) to return, picks up the return
> code (DISPLAY_REBOOT) and reboots the server.
> 
> Our problem is that after a user logs in, then logs out, and then clicks
> on the "Reboot" button, the Greeter exits returning DISPLAY_REBOOT but
> the slave never notices that the Greeter has exited and hence GDM just
> hangs and requires a complete restart with Ctrl+Alt+Backspace or "pkill
> gdm" on the command line.
> 
> Basically using waitpid() et al doesn't work for us because the login
> and logout procedure via PAM causes all sorts of private processes to be
> spawned (including ones that live even after logout) and this seems to
> completely mess up GDM's process handling.  This is because home
> directories are mounted at login time (when PAM authenticates the
> /home/userid directory is mounted) and only unmounted a few minutes
> after the user has logged out.

Hmmm, the way this works is actually not a waitpid, but a fake waitpid.  We
use waitpid ONLY to read out the exit status of the child.  If PAM reaps
children it doesn't own, or if it messes with the SIGCHLD signals, then all
kinds of things are bound to go wrong.  So I think trying to solve it by not
using waitpid is just shifting the problem somewhere else.  GDM's design is
really dependent on being able to reap its children and to receive signals
properly.  If something else, for example, reaps a child for us, there
is an obvious race where GDM could also just figure that things went very
wrong and would start killing pid's it thought it owned, but they were now
already given back to the system, thus it could kill fairly arbitrary stuff
(not a huge problem because of the time it takes for pid's to roll around,
but it's still a bug).  Without being able to KNOW that nobody else is
reaping our children we might have references to processes which no longer
exist in the process table.  It's sort of like someone freeing our memory
without us knowing.

This is one of the crap things in PAM, that its behaviour is not well
defined in these respects.  But more than just GDM will have trouble with
this.  Any library or module (such as PAM) should not mess with
application-global things, such as reaping arbitrary children etc...  I
think PAM modules should be designed with a strict reading of the PAM
documentation in mind.  If extra
things need to happen, then an extra daemon needs to be started which does
the extra magic, and then the pam module just communicates with this daemon.

> I made a patch to the slave which notices that the Greeter has died
> (note that it normally doesn't notice as it just keeps writing to a dead
> file descriptor and it gets EPIPE back but gdm_fdprintf() just throws
> away the error (it returns void!) instead of returning the error
> upstream...)  My patch changes gdm_fdprintf() so it returns the error
> from the write() system call and slave.c::gdm_slave_greeter_ctl() to
> detect the error and to perform a
> "gdm_slave_quick_exit(DISPLAY_REMANAGE);".  This makes GDM no longer
> hang, as when the user clicks "Reboot" GDM restarts and the user can
> again click "Reboot" and the computer will now really reboot.  This is
> obviously not ideal, we would like the computer to just reboot as the
> user asked or in fact do whatever else the user asked (it might have
> been shutdown or something else entirely)...

I don't understand why it would hang.  The slave tries to write, but can't,
so it just returns.  But the next thing that happens is that it tries to read from
the pipe and that should just fail if the greeter is dead and gdm_fdgetc
should return EOF.  In that case the ctl thing will return NULL and
'interrupted' is set.  The SIGCHLD signal should arrive and that should
tickle the slave_waitpid select function, when we get back to it from
whatever we were doing with the ctl.  So what I'm thinking happens is that
the SIGCHLD is somehow blocked and thus the slave_waitpid select never gets
tickled that the child died.  So the PAM module must be either wiping out or
blocking the SIGCHLD (or there is a GDM bug where it screws it up itself).

> So, I want to ask how we should solve this problem?  Can we perhaps
> extend the SUP or SOP protocols or something along those lines to handle
> communication between the slave and the Greeter rather than relying on
> exit status codes and waitpid()?

But you can't rely on this.  What if the greeter crashes?  Then without
SIGCHLD we never find out, because it obviously won't tell us.  There are
other places where some standard libs just exit which is evil, but happens
and is the same as a 'crash' and thus the greeter will again not tell us
anything.
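
To illustrate: the wait status is the only channel that still works when
the child dies without saying anything.  A small sketch of decoding it
(the enum and function names are made up for this example):

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical classification of how a child ended. */
enum child_end {
    CHILD_EXITED,   /* called exit()/_exit(); detail = exit code     */
    CHILD_KILLED,   /* died from a signal (a crash); detail = signal */
    CHILD_OTHER     /* stopped/continued or otherwise not terminated */
};

/* Decode a status value obtained from waitpid(). */
enum child_end classify_status(int status, int *detail)
{
    assert(detail != NULL);

    if (WIFEXITED(status)) {
        *detail = WEXITSTATUS(status);
        return CHILD_EXITED;
    }
    if (WIFSIGNALED(status)) {
        *detail = WTERMSIG(status);
        return CHILD_KILLED;
    }
    *detail = 0;
    return CHILD_OTHER;
}
```

A greeter that segfaults shows up as CHILD_KILLED, and one where some
library called exit() behind its back shows up as CHILD_EXITED with a code
we did not expect.  Neither case would ever send a message over a
slave/greeter protocol.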

> Note I am happy to invest programming time in this to fix it, I would
> just like to know in which way you would like it fixed so that it can be
> merged into the GDM distribution and everyone can benefit rather than us
> fixing it only for ourselves...

I think trying to 'fix' GDM in this case is futile, as if you fix this
'problem' (if it's really what I'm thinking it is), then you will hit another
problem later, since it's not really fixing the fact that PAM is messing with
GDM's idea of its processes, which is private info of GDM and which PAM
shouldn't be messing with.

I think the best way is to fix PAM to not do crazy stuff.  Make it spawn an
independent daemon and then have the PAM module just tell the daemon what to
do.  Perhaps D-BUS would be useful here.

George

-- 
George <jirka 5z com>
   Originality is undetected plagiarism.
                       -- Dean W. R. Inge


