OK, so we just had all services down for a couple of hours because:

 A) label is our LDAP master
 B) label ran out of memory (probably because of live.gnome.org)

We need to avoid this single point of failure. Some things:

 * Shouldn't we move the LDAP master to a box that isn't as liable
   to run out of memory (one that doesn't handle web requests in
   Python)? I think we moved LDAP to label from button because at
   that point label was RHEL 4 and button RHEL 3? But they are all
   RHEL 5 at this point.

 * We seem to have been replicating to box, which is out of service
   at the moment. Should we be replicating to a different machine?
   Can we configure failover to the replica?

 * Can we figure out how to make login work for the wheel group,
   who should be in /etc/passwd and /etc/group on all machines, even
   when LDAP is down? Or is nss-ldap just irretrievably busted?

- Owen

P.S. - Two notes on recovery:

 * When we brought label back up, slapd immediately ran out of file
   descriptors because all the other machines flooded it. I worked
   around this by shutting off the other machines with iptables and
   opening up to them one by one.

 * slapd was complaining:

   Checking configuration files for slapd: bdb_db_open: unclean shutdown detected; attempting recovery.
   bdb_db_open: Recovery skipped in read-only mode. Run manual recovery if errors are encountered.
   config file testing succeeded

   I shut it down, and ran:

   # /usr/sbin/slapd_db_recover -v -h /var/lib/ldap

   (Found via Google), and after two more restarts things were happy,
   but I'm not sure this step was necessary. Maybe it would have
   done the recovery itself if given time.
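
For next time, the staged reopening from the first recovery note could be scripted roughly like this. It is only a sketch, not the exact commands that were run: the port (389), the function name and the client addresses are all assumptions for illustration, and it only prints the iptables commands so they can be reviewed before anyone runs them as root.

```shell
#!/bin/sh
# Sketch only: print the iptables commands for a staged reopening of slapd.
# LDAP_PORT and the client addresses below are assumptions, not values
# taken from the outage itself.
LDAP_PORT=389

staged_reopen_cmds() {
    # First rule: reject all incoming LDAP traffic so slapd can start quietly.
    echo "iptables -I INPUT -p tcp --dport $LDAP_PORT -j REJECT"
    # Then one ACCEPT per client. -I inserts at the top of the chain, so
    # each ACCEPT lands above the REJECT and takes precedence over it.
    for client in "$@"; do
        echo "iptables -I INPUT -p tcp -s $client --dport $LDAP_PORT -j ACCEPT"
    done
}

# Hypothetical client addresses (documentation range).
staged_reopen_cmds 192.0.2.10 192.0.2.11
```

The idea would be to run the REJECT line before starting slapd, then the ACCEPT lines one at a time while watching slapd's file descriptor count, and to delete the REJECT rule once every client is back.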