Re: gnome-boxes lockup issue



On Wed, May 09, 2012 at 10:47:22AM +0100, Daniel P. Berrange wrote:
> On Wed, May 09, 2012 at 11:34:14AM +0200, Alexander Larsson wrote:
> > I've been having some hangs in gnome-boxes, they got a lot better with
> > the latest patches to avoid blocking i/o on the main thread, but
> > apparently Jonathan still got them sometimes, so I backed out the latest
> > fixes and set out to track it down. Here is what happens when it hangs:
> > 
> > * gnome-boxes does a blocking libvirt call on the main thread, for
> >   instance virDomainGetXMLDesc()
> > * The libvirt worker thread for the call does a qemu monitor call to get
> >   some info. 
> >   For instance the qemu driver for virDomainGetXMLDesc() calls
> >   qemuMonitorGetBalloonInfo() which formats a json command, sends it and
> >   waits for a reply.
> > * In parallel to the above, the guest did some kind of GUI call which
> >   got into the qxl driver by doing i/o on a hw port tied to qxl. This
> >   exits the cpu emulation and calls into qxl_spice_update_area() ->
> >   red_dispatcher_update_area, which sends a message on a pipe
> >   telling the qxl thread to send updates for the area. Then it waits for
> >   a reply.
> > * The qxl thread gets the message, but before updating the area it
> >   flushes outstanding messages by calling flush_display_commands(),
> >   where it keeps trying to flush the pipe going to the client to make
> >   the data in the pipe < MAX_PIPE_SIZE.
> > * However, the client is blocking in the main thread, so it will never
> >   read from the spice channel, so we have here a 4-thread circular
> >   deadlock, which will not be solved until eventually there is a timeout
> >   somewhere. In the example above that is the QXL thread waiting
> >   DISPLAY_CLIENT_TIMEOUT*10, i.e. 150 seconds, but maybe there are other
> >   timeouts in different deadlock paths.
> > 
> > So, since the spice client in boxes receives data on the main thread we
> > can absolutely never do blocking i/o calls on the main thread that can
> > reach the qemu instance, as that will reproduce this deadlock.
> 
> Urgh, ultimately I think this is a serious SPICE server flaw. The
> spice thread in QEMU must not block itself waiting for the SPICE
> client to do something. If it really must block itself, then it must
> absolutely never block the rest of the QEMU process by holding locks.
> 
> As it stands it looks like an evil spice client can DOS the entire
> operation of the guest, or an evil guest QXL driver can lock up QEMU
> or SPICE client or both.
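
To make the cycle quoted above concrete, here is a minimal Python sketch
that models the four parties as a wait-for graph and follows the edges
until it finds a loop. The node labels and the find_cycle() helper are
made up for this illustration. The edge from the libvirt worker to the
QEMU thread assumes the monitor cannot reply while that thread is stuck
in the QXL call, which is inferred from the description rather than
checked against the code.

```python
# Hypothetical wait-for graph of the hang described above.
# Node names are labels invented for this sketch, not real identifiers.
WAITS_FOR = {
    # blocked in a synchronous libvirt call such as virDomainGetXMLDesc()
    "boxes main thread": "libvirt worker thread",
    # waiting for the monitor reply (assumed stalled by the QXL call)
    "libvirt worker thread": "qemu thread in qxl_spice_update_area",
    # red_dispatcher_update_area sent a message and waits for the reply
    "qemu thread in qxl_spice_update_area": "qxl thread",
    # flush_display_commands() waits for the client to drain the pipe
    "qxl thread": "boxes main thread",
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended, so nobody waits forever
    return path[path.index(node):]


if __name__ == "__main__":
    print(find_cycle(WAITS_FOR, "boxes main thread"))
```

Removing any single edge from WAITS_FOR makes find_cycle() return None,
which is the sense in which breaking one link is enough to avoid the hang.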

For the sake of archiving, on IRC we decided there are multiple flaws
here:

 - F16 has an old QXL driver which does synchronous updates. This
   will be fixed by updating F16 to the F17 QXL driver which is
   fully async
 - The SPICE server / QEMU ought to forbid use of the legacy
   synchronous APIs with QXL
 - QEMU ought to issue a notification when balloon memory changes,
   so libvirt can then avoid needing to call the monitor in this
   scenario
 - libvirt ought to time out gracefully when querying the balloon
   memory level

Fixing any one of these issues would solve the hang, but we should
aim to fix all 4.
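
As a rough illustration of the "time out gracefully" idea, here is a
minimal Python sketch that runs a blocking query on a worker thread and
gives up after a deadline, returning a fallback value instead of hanging.
It is not libvirt code: call_with_timeout() and fake_balloon_query() are
hypothetical names, and the sleep merely stands in for a monitor call
that does not answer in time.

```python
import concurrent.futures
import time

# Small pool so that one stuck call does not prevent later ones from starting.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(blocking_call, timeout_s, fallback=None):
    """Run blocking_call on a worker thread; return fallback if it is too slow."""
    future = _pool.submit(blocking_call)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The worker keeps running in the background; only the caller gives up.
        return fallback


def fake_balloon_query():
    """Hypothetical stand-in for a monitor query that is slow to reply."""
    time.sleep(0.3)
    return "fresh balloon info"


if __name__ == "__main__":
    print(call_with_timeout(fake_balloon_query, 0.05, fallback="cached value"))
```

Note that the caller still waits up to timeout_s, so a GUI client would
want to hand the result to a callback rather than wait on the main thread.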

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

