Contingency planning for move



Spent a little time thinking about contingency plans if servers don't
survive the move. (Actually, this should be turned into a standing
contingency plan - almost nothing here is specific to the move, I'm
just worried about jostling during a move triggering latent hardware
failures.)

Of particular concern are the four old servers that don't have active
service contracts; if these suffered a failure, we wouldn't have an
easy path to getting them repaired in a timely fashion. We might be able
to cajole someone in Red Hat IT into putting in a replacement drive if
we mailed one out there, but that's about all.

 container.gnome.org (Sep. 2003, HP donation)
 window.gnome.org    (Apr. 2004)
 menubar.gnome.org   (Apr. 2004)
 button.gnome.org    (Apr. 2004)

(Clearly in the near future we need to look into replacing these
machines; it might be possible to recertify them but I doubt it makes
sense.)

The three newer Red Hat donated servers should have active 24x7 onsite
service contracts:

 label.gnome.org     (May  2006)
 vbox.gnome.org      (Dec. 2008)
 drawable.gnome.org  (Dec. 2008)

So the basic contingency plan for these would be to get them repaired
(restore from backups if necessary, but they are all RAID-1 or
RAID-10, so hopefully not). That should be faster than trying to
move stuff around.
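
Since the repair-and-restore plan leans on those RAID arrays surviving,
it would be worth verifying array health before and after the move. A
quick sketch, assuming Linux software (md) RAID - machines with hardware
RAID controllers would need the vendor's tool instead:

```shell
# Check md RAID status; "[UU]" means both mirror halves are up,
# "[U_]" means a member has dropped out and needs attention.
cat /proc/mdstat

# More detail on a specific array (/dev/md0 is an example device name):
mdadm --detail /dev/md0
```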

The main potential problem would be if they got dropped and destroyed
or lost during the move; there is insurance, but it could be
weeks to get them replaced, especially with the holidays.

I'm not sure about the Sun donated server:

 fixed.gnome.org     (2006?)

But it doesn't run any essential services, so I'm less concerned about
it. Diving into detail:

container.gnome.org
===================

What it runs: 
  NFS export of /home/users, /home/admin, and mail archives
  Cobbler
  sysadmin.gnome.org

Contingency plan:
  We have 90G of unallocated disk space on drawable.gnome.org
  (and 40G more that could be used in a pinch). /home/users
  is 35G, /home/admin 1G, so there's no problem putting them
  onto a partition on drawable, and drawable has tons of
  spare IO capacity. Bugzilla isn't stressing it at all.

  Mail archives are 30G; they could also go on drawable.gnome.org
  to keep things simple, or could be exported from vbox,
  where we have lots of unallocated (slow) disk space.
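
  Re-exporting the restored directories from drawable would look roughly
  like this (a sketch, assuming a stock Linux NFS server; the subnet is a
  placeholder and the options would need checking against container's
  actual /etc/exports):

```shell
# On drawable.gnome.org: export the restored directories over NFS.
# 192.168.1.0/24 stands in for the real cluster network.
cat >> /etc/exports <<'EOF'
/home/users  192.168.1.0/24(rw,sync,no_subtree_check)
/home/admin  192.168.1.0/24(rw,sync,no_subtree_check)
EOF
exportfs -ra    # re-read /etc/exports without restarting the NFS server

# On each client: point the mount at drawable instead of container.
mount -t nfs drawable.gnome.org:/home/users /home/users
```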

menubar.gnome.org
=================

What it runs: 
  ns-master.gnome.org
  Cluster email
  Mailman

Contingency:
  Create a VM on vbox.gnome.org, restore to that. Menubar is
  actually not very loaded either for CPU or disk, so I think
  we could get away with running it on vbox.gnome.org without
  impacting mail or the other services on vbox (git, bugzilla)
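
  Standing up the replacement VM could look something like this (a
  sketch, assuming vbox runs KVM/libvirt; the sizes, install tree URL,
  and backup host are placeholders, not measured values):

```shell
# On vbox.gnome.org: create a guest to take over menubar's services.
# RAM/disk sizes and the install location are illustrative.
virt-install \
  --name menubar-tmp \
  --ram 2048 \
  --disk size=40 \
  --location http://mirror.example.com/el5/os/x86_64/ \
  --nographics

# Then restore menubar's configuration and mail spools from backup,
# e.g. with rsync from the backup host (hostname is a placeholder):
rsync -aHv backup.example.org:/backups/menubar/ /
```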

window.gnome.org
================

What it runs: 
  master.gnome.org
  www.gnome.org, planet.gnome.org, art.gnome.org,
    library.gnome.org, other miscellaneous websites

Contingency:
  Create a VM on vbox.gnome.org, restore to that. window is
  pretty heavily loaded, and I wouldn't be happy having it
  put more load on vbox.gnome.org's disks, but it should be
  OK for a short period of time. We could investigate moving
  high-load services (art.gnome.org, planet.gnome.org) to
  fixed.gnome.org, which is basically unused, or scramble to
  find new hardware.

button.gnome.org
================

What it runs: 
  Mango
  Miscellaneous databases:
    blogs.gnome.org, artweb, gnomejournal, rt3, mango

Contingency:
  Create a VM on vbox.gnome.org, restore to that. Migrate databases
  to drawable.gnome.org after getting initial functionality back.
  Mango could stay on a VM on vbox.gnome.org indefinitely.
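
  Migrating the databases over to drawable afterwards is a standard
  dump-and-restore; a sketch assuming MySQL (the database names are
  illustrative, taken from the service list above, and credentials are
  omitted):

```shell
# On the temporary VM: dump each database button was hosting.
for db in blogs artweb gnomejournal rt3 mango; do
    mysqldump --single-transaction "$db" > "$db.sql"
done

# On drawable.gnome.org: create and load each database.
for db in blogs artweb gnomejournal rt3 mango; do
    mysqladmin create "$db"
    mysql "$db" < "$db.sql"
done
```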

label.gnome.org
===============

What it runs: 
  LDAP
  Wikis (live.gnome.org, gnome-db.org, pango.org)
  XMPP server (Openfire)

Contingency:
  Get machine repaired if possible.
  In case of complete loss, temporarily migrate services to
  fixed.gnome.org, which is basically unused, while waiting for
  replacement.

drawable.gnome.org
==================

What it runs: 
  bugzilla.gnome.org database

Contingency:
  Get machine repaired if possible.
  In case of complete loss, get a replacement as fast as possible,
  try to get a loaner machine from Red Hat IT.

  (Maybe could run the database on vbox.gnome.org, but it is
  doing a lot already, and its disks weren't spec'ed for database
  operation.)

vbox.gnome.org
==============

What it runs:
  bugzilla.gnome.org
  git.gnome.org
  puppet

Contingency:
  Get machine repaired if possible.
  In case of complete loss, get a replacement as fast as possible,
  try to get a loaner machine from Red Hat IT.

  (There's not really any machine where we could move stuff; maybe
  could set up git on fixed.gnome.org temporarily.)

fixed.gnome.org
===============

What it runs:
  build.gnome.org (master server for buildbot)
  Mock environment for package builds

Contingency:
  build.gnome.org could be set up in a VM on vbox



