Re: down

On Tue, 2011-09-06 at 21:57 +0200, Olav Vitters wrote:
> On Tue, Sep 06, 2011 at 02:07:36PM -0400, Owen Taylor wrote:
> > On Fri, 2011-09-02 at 19:17 -0400, Owen Taylor wrote:
> > > ran out of memory this morning, it was looping hard in
> > > the OOM killer. I got Red Hat IT to power cycle the machine a few hours
> > > ago, but when I just looked the load average was at 24 and the machine
> > > looked like it was heading for another OOM death.
> > > 
> > > So, I stopped httpd, so ldap would stay up and we wouldn't have to get
> > > the machine rebooted again. I also stopped puppet, since I think that
> > > would start httpd when run.
> > > 
> > > If anybody wants to dig in and try to figure out what is going on, that
> > > probably would be a good idea.
> > 
> > What was going on is that a user accidentally (I think) uploaded a 97M
> > binary file as the content of a fairly frequently accessed page.
> > 
> > When anybody tried to access that page, it would spin forever eating
> > enormous amounts of memory, which would eventually take the server down.
> > 
> > I reverted the page change, and removed the offending revisions.
> Damn.. I like your investigation skills. How were you able to figure
> this out? I noticed the heavy memory usage and CPU, but couldn't do much
> more :-(

Basically, cross-correlating 'apachectl fullstatus' with the output of
top let me see that the httpd processes that were running away and
getting huge were all serving one particular request URL.
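A rough sketch of that kind of cross-correlation, run here against a
canned copy of the worker table, since the real scoreboard (and its
column layout) varies with configuration:

```shell
# Count requests per URL in a saved copy of the mod_status worker table,
# to spot a single URL monopolizing workers. The file contents and the
# column positions here are invented for illustration.
cat > /tmp/scoreboard.txt <<'EOF'
0-1  12345 W 10.2  GET /wiki/BadPage HTTP/1.1
1-1  12346 W  9.8  GET /wiki/BadPage HTTP/1.1
2-1  12347 _  0.1  GET /wiki/Home HTTP/1.1
EOF
awk '{print $6}' /tmp/scoreboard.txt | sort | uniq -c | sort -rn
```

In the live case the input would come from 'apachectl fullstatus' (or
the mod_status page itself), matched up against the fattest httpd PIDs
shown in top.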

[ Things were actually misconfigured so apachectl fullstatus didn't
  work, but going to the status URL from the public internet did, which
  wasn't intended. I've subsequently fixed those problems. ]

I went into the moin/<pagename> directory for that
page and found the gigantic binary file. Looking in Apache's access_log
let me figure out how it happened and confirm that it looked like an
innocent mistake rather than an attempt to do something malicious.
(The binary file got thoroughly corrupted with encoding and line-ending
conversions on upload.)
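That access_log check can be as simple as grepping for write requests to
the page; the page name, addresses, timestamps, and log path below are
all made up for illustration:

```shell
# A couple of fake combined-format log lines standing in for the real log.
cat > /tmp/access_log <<'EOF'
10.0.0.5 - - [02/Sep/2011:09:14:02 -0400] "POST /wiki/SomePage HTTP/1.1" 200 1234
10.0.0.7 - - [02/Sep/2011:09:20:11 -0400] "GET /wiki/SomePage HTTP/1.1" 200 5678
EOF
# The POST entries show who changed the page and when.
grep '"POST /wiki/SomePage' /tmp/access_log
```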

Basically, 'apachectl fullstatus' is incredibly useful for figuring out
what's going wrong when a webserver is falling over; it's what I usually
turn to first.

> could you also set up 25GB of space on the wiki VM? I wanted to migrate
> it but don't have knowledge on VMs. + other smaller sites
> take up 9.7GB, so 25GB should be ok.

Steps on

# lvcreate VolGroup00 -n wiki-data --size 25G
# mke4fs /dev/VolGroup00/wiki-data
# virsh shutdown wiki-test
# virsh edit wiki-test
<copy other disk configuration to create /dev/vdb for the newly created
logical volume>
# virsh start wiki-test
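For reference, the stanza to copy inside virsh edit is a <disk> element;
adapted for the new volume it would look roughly like this (the driver
and bus settings are assumptions - copy whatever the existing vda disk
in the domain XML actually uses):

```xml
<disk type='block' device='disk'>
  <driver name='qemu' cache='none'/>
  <source dev='/dev/VolGroup00/wiki-data'/>
  <target dev='vdb' bus='virtio'/>
</disk>
```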

Then I fixed the puppet configuration for wiki in the way you can see
in the puppet git logs.

What we've generally been doing for other machines is bind mounting from
/mnt/wiki-data, where the logical volume is mounted, into more meaningful
filesystem locations - /usr/local/www/moin if we want to copy what we had
before on label, or whatever. If you set up bind mounts, make sure to
update $backup_exclude as well so we don't back up the wiki data twice.
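A bind mount like that is set up with something along these lines (the
moin path matches the example above; treat this as a sketch, not the
exact configuration):

```shell
# One-off, as root:
mount --bind /mnt/wiki-data /usr/local/www/moin

# Persistent across reboots, as a line in /etc/fstab:
#   /mnt/wiki-data  /usr/local/www/moin  none  bind  0 0
```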

- Owen
