Unexpected outage: 20:00 CET 26-05-2014 to 03:00 CET 27-05-2014

From: Patrick Uiterwijk <puiterwijk gnome org>
To: infrastructure-announce gnome org, foundation-list gnome org, desktop-devel-list gnome org
Subject: Unexpected outage: 20:00 CET 26-05-2014 to 03:00 CET 27-05-2014
Date: Mon, 26 May 2014 21:10:39 -0400 (EDT)

Hello everyone,

As you might have noticed, we had a major issue in the GNOME infrastructure last night, which made almost
every service we provide unavailable.
This was caused by our main file server ceasing to serve the file systems required for home directories and
mailing lists.

The cause of the outage is currently unclear, as the logs do not show anything relevant.
We've sent them to the Gluster engineers and asked for help analyzing them.

When we rebooted the server, something went wrong, and the affected machine required a power cycle.
When we tried this, we hit a bug in the management cards that made us unable to use them to reboot the
server.

Because of this, we requested hands-on service to get the server power cycled, which kept us waiting for
some time.
Within minutes of the server being rebooted, the file systems came back online, and with them all of the
GNOME services.

To prevent all services from going down when the primary file server goes down, we had previously set up a
synchronized secondary file server.
We were unable to make all servers fall back to this one because we weren't able to log in to the affected
servers to update the target IP.

To prevent this problem from pulling down the entire GNOME infrastructure in the future, we have taken some
steps:
- We have added a way for us to log in to any server even if the home directories are down.
- We'll be introducing automatic failover to the other available file server.
- We'll be spreading our documentation off-site so that the relevant documentation does not disappear when
  the machine hosting it is experiencing problems.
- We will be making sure we have access to the power management of our servers, so we can reboot them
  even if the management cards are not functioning.
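
As a rough illustration of the automatic failover idea mentioned in the list above, the following is only a
minimal sketch and not the actual GNOME setup. It assumes a client can decide which file server to use by
checking whether a TCP port answers. The host names, the port number, and the function names are all
hypothetical placeholders.

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_reachable(host: str, port: int = 24007, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    The default port is an assumption for illustration; use whatever port
    the real file service listens on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unresolvable, etc.: treat the server as down.
        return False


def pick_file_server(
    candidates: Iterable[str],
    is_up: Callable[[str], bool] = tcp_reachable,
) -> Optional[str]:
    """Return the first candidate that passes the health check, else None.

    Candidates are tried in order, so the primary should be listed first
    and the synchronized secondary after it.
    """
    for host in candidates:
        if is_up(host):
            return host
    return None


if __name__ == "__main__":
    # Hypothetical host names, primary first.
    servers = ["fs-primary.example", "fs-secondary.example"]
    target = pick_file_server(servers)
    print(target if target is not None else "no file server reachable")
```

The health check is passed in as a parameter so the selection logic can be exercised without touching the
network. A real deployment would also need to remount or repoint the clients, which this sketch leaves out.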

We really hope that this will prevent such drastic failures in the future, and make it easier to recover if
problems do occur.

If you have any additional questions, don't hesitate to contact either of us on IRC (#sysadmin) or by sending
us an email.

With kind regards,
Patrick Uiterwijk and Andrea Veri
System Administrators, GNOME
