Wrap-up on server move



Since things seem to be working pretty well, thought I'd do a
quick post-mortem brain dump on the move.

What worked well:

 * We got things running again well within the advertised window.
   (We were planning for up to 35 hours, and had almost everything
   running again 17 hours after shutting things down.)

 * All 8 servers came back on line in the new location with no
   hardware failures.

 * Proper planning, lowering the TTL and contacting the slave DNS
   servers in advance made changing the IP address for
   ns-master.gnome.org and other servers much less disruptive
   than the 3-4 times we've done it previously.

 * I tried to identify and update all the places in the machine
   configuration that referenced IP addresses before shutting
   the machines down; this worked out very well, and most of the
   machines were immediately operational when brought up on
   the new IP address.

 * puppet worked well; all of the puppet-managed machines only
   needed a single line changed in the node configuration
   rather than changing a number of different files in /etc.

 * I'm pretty confident that we had a good set of backups and
   reasonable contingency plans if hardware failure did occur.

   (And when validating the backups I didn't find any significant
   problems. We were backing up too much data in some cases - in
   particular backing up the mail archives twice. And we weren't 
   backing up vbox.gnome.org, so I added that to the backup system, 
   but there isn't data on that system image anyways, so it could 
   have been rebuilt without issue.)
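
The hunt for hard-coded IP addresses mentioned above could be partly
automated. A minimal sketch in Python, assuming the old addresses all
fall in one network range; the function name and the 192.0.2.0/24
range are made up for illustration, not what was actually used:

```python
import ipaddress
import re

# Matches dotted-quad candidates; real validation is done by ipaddress below.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_old_addresses(text, old_network):
    """Return (line_number, address) pairs for IPs inside old_network."""
    network = ipaddress.ip_network(old_network)
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for candidate in IPV4_RE.findall(line):
            try:
                address = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # looks like an IP but is not one, e.g. 999.1.1.1
            if address in network:
                hits.append((lineno, candidate))
    return hits
```

Running this over the contents of each file under /etc (and over the
puppet node configuration) would give a list of lines to review by
hand; it won't catch addresses that are built up at runtime or
written in some other notation.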

Problems encountered:

 * For a couple of hours I didn't realize that vbox and drawable
   were back up: I had the old IP address in my /etc/hosts,
   so they seemed unresponsive.

 * There were some problems in moving over the network ACLs,
   so it took a couple of hours before we got http access to
   the machines. (I had a dump of the new network ACLs in advance
   but didn't think to actually review it, so didn't notice 
   that port 80 wasn't included.)

 * The planet.gnome.org update scripts have a locking scheme
   that causes planet to silently stop updating if the server
   is rebooted during an update. (I found and removed the lock file
   a few minutes ago when someone mentioned that planet.gnome.org
   wasn't updating.)

 * Despite our best efforts, we still have some sporadic problems
   with cached old DNS entries being reported today.

   My detailed knowledge of the DNS system isn't good enough to
   understand why; if I had to guess, it might be related to delays
   in updating the IP address of ns-master.gnome.org with
   the registrar. (Network Solutions warns that it may take 72 hours
   to fully propagate.)

 * As seen in my earlier mail, we have very poor contact information
   for the domains that we host DNS for.

 * The backlog of messages when mail.gnome.org came back up seems
   to have triggered spam protections on some systems. Both Google
   and Yahoo were rejecting our messages for a while. The Google
   situation resolved itself quickly, but we're still getting:

    Messages from 209.132.180.169 temporarily deferred 
      due to user complaints - 4.16.56.1;
     see http://postmaster.yahoo.com/421-ts02.html

   From Yahoo.
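
On the planet.gnome.org lock file problem above: one common way to
keep a leftover lock from blocking updates forever is to treat a lock
older than some limit as stale. A rough sketch in Python, not the
actual planet scripts; the function names and the age limit are
assumptions for illustration:

```python
import os
import time


def acquire_lock(path, max_age_seconds, now=None):
    """Try to take the lock file at `path`; return True on success.

    A lock older than `max_age_seconds` is assumed to be left over from
    an interrupted run (for example a reboot mid-update) and is removed.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except FileNotFoundError:
        age = None
    if age is not None and age > max_age_seconds:
        try:
            os.remove(path)  # stale lock: its owner presumably never finished
        except FileNotFoundError:
            pass
    try:
        # O_EXCL makes creation atomic, so only one process can win.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(str(os.getpid()))
    return True


def release_lock(path):
    """Remove the lock file; a missing file is not an error."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

The age limit has to be comfortably longer than the longest normal
update, otherwise a slow run could have its lock taken away. Logging
whenever a stale lock is removed would also make this kind of failure
visible instead of silent.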

Possible improvements:

(hopefully won't have to do this again for a while...)

 * Aleksi Suhonen at axu.tm pointed out that it's possible to
   configure a DNS slave with bind to try multiple IPs for the
   master server; he handled the gnome.org transition that way. 
   That would have been a useful suggestion for us to provide when
   contacting the slave DNS servers.
 
 * Since we have multiple servers on multiple continents,
   ideally we would host our own secondary DNS.

 * It would be blue-sky nice if we had enough redundancy in our system
   and sysadmin team volunteer time to provide backup service
   during this type of move.

   Still, I don't feel *too bad* about telling people to take
   a Saturday off two weeks before Christmas. And the fact we
   got things up without major snags seems to validate taking the
   keep-it-simple approach.
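
For reference, the multiple-master setup that Aleksi described would
look roughly like the following in a slave's named.conf. This is a
sketch only: the two addresses are documentation placeholders standing
in for the old and new master IPs, and the file path is hypothetical.

```
// Hypothetical slave zone stanza; addresses and file path are placeholders.
zone "gnome.org" {
    type slave;
    // Old and new master addresses; as I understand it, the slave
    // tries each listed master when refreshing the zone.
    masters { 192.0.2.10; 198.51.100.10; };
    file "slaves/gnome.org.zone";
};
```

With both addresses listed ahead of time, the slave keeps working
across the cutover, and the old address can be dropped afterwards.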

Continuing issues:

 * I'm not convinced that the slave DNS servers for domains other
   than gnome.org that we host DNS for are properly updated to the
   new master IP address; we probably won't find all the problems
   with this until we need to actually change an entry in one
   of these domains. (Some of the domains are CNAMEs only and
   almost never change.)

 * We still need to get a couple more ports opened for buildbot
   and jabber.gnome.org; hopefully that will happen today.
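
Both the port 80 problem and the ports still pending are the kind of
thing a quick reachability check from outside the network would catch.
A minimal sketch in Python; the function names are made up, and the
list of ports to check is something you would have to fill in
yourself:

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


def closed_ports(host, ports, timeout=3.0):
    """Return the subset of `ports` that could not be reached on `host`."""
    return [port for port in ports if not is_port_open(host, port, timeout)]
```

For example, closed_ports("www.example.org", [80, 443]) returns the
ports in that list that did not accept a connection. Note that this
only tests TCP reachability from wherever it is run, so it needs to
run from a machine outside the ACLs in question.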



