Downtime Report: container.gnome.org



Team,

As many of you are aware, we had a few hours of downtime this morning.
I wanted to document the what, when, and why so that everyone
understands what the issue was, how it was handled, and what has been
done and is being done to avoid it in the future. For those who were
involved, if I've missed anything, please feel free to append to the
thread so that we get all the details accurate.

Over the last 24 hours each of our Red Hat servers received kernel
updates from RHN. To complete these updates, each box needed to be
rebooted. We discussed (in IRC) which boxes should be rebooted first
based on security concerns. window and container were at the top of
that list, since those boxes have the most direct console interaction
with users.
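For reference, the update cycle on each box amounts to something like
the following; this is a generic sketch, not a transcript of what was
actually run on each host:

    # sketch only -- exact invocations on each host may have differed
    yum update kernel    # pull in the kernel update pushed through RHN
    reboot               # boot into the new kernel
    uname -r             # once the box is back, confirm the running kernel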

window was rebooted first and came back up promptly at about 7:15am (MST).

container was rebooted second at about 7:45am (MST) and did *not* come
back up until 10:00am (MST). This required manual intervention by Owen,
who tried console and KVM access and eventually had to get data-center
staff to manually reboot the machine.

Downtime was not limited to container alone, as it hosts the home
directories for all machines and exports them via NFS. With container
down, many other services were affected and no one was able to log in
to any other server. Known affected services were git, www.gnome.org
(why?), and mail (why?). bugzilla and the wiki were unaffected during
this outage.
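For anyone not familiar with the setup, the dependency looks roughly
like this; the entries below are illustrative examples of the pattern,
not copies of the real configs:

    # on container: export the home directories (illustrative /etc/exports entry)
    /home    *.gnome.org(rw,sync,no_subtree_check)

    # on every other host: mount them at boot (illustrative /etc/fstab entry)
    container.gnome.org:/home    /home    nfs    defaults    0 0

Because every login shell needs the user's home directory, an
unreachable NFS server makes logins hang on the client hosts even
though those machines are otherwise up.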

The reason container didn't come back up was an SELinux issue. Upon
rebooting, the system halted with the error: "Unable to load SELinux
Policy. Machine is in enforcing mode. Halting now." It is unknown why
SELinux was active on container while it is disabled on all other
hosts.
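For future reference, if this happens again and we do have working
console access, the usual workaround is to override SELinux for that
one boot from the GRUB menu; this is a general note, not a description
of what was done today:

    # at the GRUB menu, edit the kernel line for the entry being booted
    # and append ONE of the following for that boot only:
    selinux=0       # disable SELinux entirely for this boot
    enforcing=0     # or boot with SELinux in permissive (non-enforcing) mode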

SELinux has now been disabled on container by appending "selinux=0" to
the kernel line in grub as well as setting "SELINUX=disabled" in
/etc/sysconfig/selinux. It has also been verified that all other hosts
have SELinux disabled, so, again, it is a mystery as to why it was
enabled on container.
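Concretely, the change on container amounts to the following; the grub
kernel line is shown as a trimmed example (the kernel version and root
device will differ):

    # /boot/grub/grub.conf -- selinux=0 appended to the kernel line (example)
    kernel /vmlinuz-<version> ro root=<root-device> selinux=0

    # /etc/sysconfig/selinux
    SELINUX=disabled

Verifying the other hosts is just a matter of running getenforce on
each one and checking that it prints "Disabled".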

This downtime brought two issues to the forefront that I think need to
be addressed.

1) We need new hardware! Other than the server that Jeff donated, just
about everything is out of warranty. We are seriously asking for
trouble running critical services on machines that are no longer
supported. It is only a matter of time until one of these boxes goes
down for good.

2) We need out-of-band access to the hardware. The solution to our
problem today required Owen. Unless I am mistaken, he is the only
admin with console/KVM access to any hardware, and the only one able
to file tickets with Red Hat IT. I propose that out-of-band access to
our servers be configured and granted to more admins. This relieves
Owen of being the single point of contact between the team and Red
Hat IT, lets us connect to hardware by more reliable means, and allows
us to respond more quickly to issues such as this (a rough sketch of
one option follows below). Would we have had a solution to today's
problem if Owen had been unavailable?
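By out-of-band access I mean something like IPMI with serial-over-LAN
on each box, so that whichever admin is on hand can power-cycle a hung
machine and watch its console without going through data-center staff.
Assuming the hardware (or its replacement) has a management controller,
that looks roughly like this; the hostname and credentials below are
made up for illustration:

    # power-cycle a hung machine via its management controller
    ipmitool -I lanplus -H container-mgmt.gnome.org -U admin -P <password> chassis power cycle

    # attach to its serial console to watch the boot
    ipmitool -I lanplus -H container-mgmt.gnome.org -U admin -P <password> sol activate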

I know Paul has been working on a proposal for replacement hardware,
and I think that should be a priority. If anyone can give him any
additional reasons to present to the board, please comment here or
contact him directly. If we plan to move forward and really solidify
the infrastructure, we need new, supported hardware.

I do want to thank Owen for being so prompt in attending to this issue
and for keeping the team updated with regular status reports. I'm glad
we were able to resolve the issues relatively quickly. I hope everyone
can look at this situation not as a failure of service, but as a set of
critical lessons learned toward improving our infrastructure and
shoring up its weak points.

As usual, if you have any questions or comments about today's downtime,
please feel free to contact me.

Christer

