Wrap-up on server move
- From: Owen Taylor <otaylor redhat com>
- To: gnome-infrastructure gnome org
- Subject: Wrap-up on server move
- Date: Sun, 13 Dec 2009 13:48:20 -0500
Since things seem to be working pretty well, thought I'd do a
quick post-mortem brain dump on the move.
What worked well:
* We got things running again well within the advertised window.
(We were planning for up to 35 hours, and had almost everything running
again 17 hours after shutting things down.)
* All 8 servers came back on line in the new location with no
hardware failures.
* Proper planning, lowering the TTL, and contacting the slave DNS
servers in advance made changing the IP address of
ns-master.gnome.org and other servers much less disruptive
than the 3-4 times we've done it previously.
* I tried to identify and update all the places in the machine
configuration that referenced IP addresses before shutting
the machines down; this worked out very well, and most of the
machines were immediately operational when brought up on
the new IP address.
* puppet worked well; all of the puppet-managed machines needed
only a single line changed in the node configuration
rather than changes to a number of different files in /etc.
* I'm pretty confident that we had a good set of backups and
reasonable contingency plans if hardware failure did occur.
(And when validating the backups I didn't find any significant
problems. We were backing up too much data in some cases - in
particular backing up the mail archives twice. And we weren't
backing up vbox.gnome.org, so I added that to the backup system,
but there isn't any data on that system image anyway, so it could
have been rebuilt without issue.)
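The pre-move sweep for hard-coded addresses described above could be
partly automated. The following is a minimal sketch, not the procedure
actually used for the move: it walks a configuration tree and lists
every line that mentions an address starting with a given prefix. The
function name and the 192.0.2. prefix (a documentation-only range) are
assumptions for illustration.

```python
import os
import re


def find_ip_references(root, old_prefix):
    """Return (path, line_number, line) for each line under root that
    mentions an address starting with old_prefix (hypothetical helper)."""
    # Anchor the prefix so that "92.0.2." does not match inside "192.0.2.".
    pattern = re.compile(r"(?<![\d.])" + re.escape(old_prefix))
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip FIFOs, sockets, and dangling symlinks.
            if not os.path.isfile(path):
                continue
            try:
                with open(path, "r", encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                # Unreadable files (for example, permission errors) are skipped.
                continue
    return hits
```

Running something like find_ip_references("/etc", "192.0.2.") on each
server, and on the admin's own workstation, would also flag stale
/etc/hosts entries.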
Problems encountered:
* For a couple of hours I didn't realize that vbox and drawable
were back up: I still had their old IP addresses in
my /etc/hosts, so they seemed unresponsive.
* There were some problems in moving over the network ACLs,
so it took a couple of hours before we got http access to
the machines. (I had a dump of the new network ACLs in advance
but didn't think to actually review it, so didn't notice
that port 80 wasn't included.)
* The planet.gnome.org update scripts have a locking scheme
that causes planet to silently stop updating if the server
is rebooted during an update. (I found and removed the lock file
a few minutes ago when someone mentioned that planet.gnome.org
wasn't updating.)
* Despite our best efforts, we are still getting sporadic reports
today of stale cached DNS entries.
My detailed knowledge of the DNS system isn't good enough to
understand why; if I had to guess, it might be caused by delays
in updating the IP address of ns-master.gnome.org with
the registrar. (Network Solutions warns that this may take 72 hours
to fully propagate.)
* As seen in my earlier mail, we have very poor contact information
for the domains that we host DNS for.
* The backlog of messages when mail.gnome.org came back up seems
to have triggered spam protections on some systems. Both Google
and Yahoo were rejecting our messages for a while. The Google
situation resolved itself quickly, but we're still getting:
Messages from 209.132.180.169 temporarily deferred
due to user complaints - 4.16.56.1;
see http://postmaster.yahoo.com/421-ts02.html
from Yahoo.
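The planet lock-file problem is typical of schemes where the mere
existence of a file means "locked": a reboot leaves the file behind
with no owner. One common alternative, sketched below, is an advisory
flock() lock, which the kernel drops when the holding process exits or
the machine reboots. This is not a description of how the
planet.gnome.org scripts work; the function names and lock path are
hypothetical, and the fcntl module is available on Unix-like systems
only.

```python
import fcntl
import os


def try_lock(path):
    """Return a file descriptor holding an exclusive lock on path,
    or None if another process currently holds the lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release(fd):
    """Drop the lock; the file itself may stay on disk harmlessly."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

With this approach a leftover lock file does not block the next run,
because the lock lives in the kernel rather than in the file's
existence.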
Possible improvements:
(hopefully we won't have to do this again for a while...)
* Aleksi Suhonen at axu.tm pointed out that it's possible to
configure a DNS slave with bind to try multiple IPs for the
master server; he handled the gnome.org transition that way.
That would have been a useful suggestion for us to provide when
contacting the slave DNS servers.
* Since we have multiple servers on multiple continents, we
should ideally host our own secondary DNS.
* It would be blue-sky nice if we had enough redundancy in our system
and sysadmin team volunteer time to provide backup service
during this type of move.
Still, I don't feel *too bad* about telling people to take
a Saturday off two weeks before Christmas. And the fact we
got things up without major snags seems to validate taking the
keep-it-simple approach.
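For reference, the multiple-master setup mentioned above relies on the
fact that the masters list of a BIND slave zone may contain more than
one address. The sketch below only renders such a stanza as text so it
could be pasted into a note to slave operators; the helper name is
hypothetical and the addresses in the usage note are documentation-only
placeholders, not the real old and new ns-master.gnome.org addresses.

```python
def slave_zone_stanza(zone, masters):
    """Render a named.conf slave zone listing every candidate master
    address (hypothetical helper; masters is a list of IP strings)."""
    if not masters:
        raise ValueError("at least one master address is required")
    master_list = " ".join(f"{address};" for address in masters)
    return (
        f'zone "{zone}" {{\n'
        f"    type slave;\n"
        f"    masters {{ {master_list} }};\n"
        f"}};\n"
    )
```

For example, slave_zone_stanza("example.org", ["192.0.2.1",
"198.51.100.1"]) yields a zone block whose masters line names both the
old and the new address, so the slave can keep transferring the zone
through the cutover.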
Continuing issues:
* I'm not convinced that the slave DNS servers for domains other
than gnome.org that we host DNS for are properly updated to the
new master IP address; we probably won't find all the problems
with this until we need to actually change an entry in one
of these domains. (Some of the domains are CNAMEs only and
almost never change.)
* We still need to get a couple more ports opened for buildbot
and jabber.gnome.org; hopefully that will happen today.
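Both the missing port 80 and the ports still to be opened are the kind
of thing a quick reachability check from outside the new network would
catch. A minimal sketch, assuming plain TCP services; the function
names are hypothetical, and the ports to test would have to come from
the real service list, which isn't given here.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within
    timeout seconds, False on refusal, timeout, or lookup failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def closed_ports(host, ports, timeout=3.0):
    """Return the subset of ports on host that could not be reached."""
    return [port for port in ports if not port_is_open(host, port, timeout)]
```

Calling closed_ports(host, [80]) for each server right after the ACLs
are installed would report any port that the filters still block.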