We need to fix our LDAP setup



OK, so we just had all services down for a couple of hours because:

 A) label is our LDAP master
 B) label ran out of memory (probably because of live.gnome.org)

We need to avoid this single point of failure. Some things:

 * Shouldn't we move the LDAP master to a box that isn't as
   liable to run out of memory (one that doesn't handle web
   requests in Python)? I think we moved LDAP to label from button
   because at that point label was RHEL 4 and button RHEL 3? But
   they are all RHEL 5 at this point.

 * We seem to have been replicating to box, which is out of
   service at the moment. Should we be replicating to a different
   machine? Can we configure failover to the replica?

 * Can we figure out how to make login work for the wheel group,
   who should be in /etc/passwd, /etc/group on all machines,
   even when LDAP is down? Or is nss-ldap just irretrievably
   busted?
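
A sketch of what the last two items might look like on a client, untested
and with assumed names: nss_ldap's /etc/ldap.conf accepts several servers on
one "uri" line and moves on to the next when one is unreachable, and
"bind_policy soft" makes lookups fail instead of retrying when no server
answers. The replica hostname is a placeholder, the timeouts are guesses,
and the wheel account names would need to be added to the ignore list.

```
# /etc/ldap.conf (nss_ldap); hostnames and values are assumptions
uri ldap://label/ ldap://replica-host/
bind_policy soft
bind_timelimit 5
timelimit 10
# skip LDAP group lookups for purely local accounts
nss_initgroups_ignoreusers root

# /etc/nsswitch.conf: consult local files before LDAP
passwd: files ldap
shadow: files ldap
group:  files ldap
```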

- Owen

P.S. - Two notes on recovery:

 * When we brought label back up, slapd immediately ran out of
   file descriptors because all the other machines flooded
   it. I worked around this by shutting off the other machines
   with iptables and opening up to them one by one.

 * slapd was complaining:

    Checking configuration files for slapd:  bdb_db_open: unclean shutdown detected; attempting recovery.
    bdb_db_open: Recovery skipped in read-only mode. Run manual recovery if errors are encountered.
    config file testing succeeded

   I shut it down, and ran:

    # /usr/sbin/slapd_db_recover -v -h /var/lib/ldap

   (found via Google), and after two more restarts things were happy,
   but I'm not sure this step was necessary. Maybe it would have
   done the recovery itself if given time.
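
The iptables workaround from the first note could be scripted roughly as
below. This is a minimal sketch, not the commands that were actually run:
the port (389), the DRY_RUN switch, and the function names are assumptions,
and with DRY_RUN left at its default of 1 it only prints the commands.

```shell
#!/bin/sh
# Sketch: staged re-admission of LDAP clients after a slapd restart.
# DRY_RUN=1 (the default) prints the iptables commands instead of running them.
LDAP_PORT=389
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Drop all incoming LDAP connections (rule goes to the top of INPUT).
block_all() {
    run iptables -I INPUT -p tcp --dport "$LDAP_PORT" -j DROP
}

# Let one client back in; -I puts the ACCEPT rule above the DROP rule.
admit() {
    run iptables -I INPUT -p tcp -s "$1" --dport "$LDAP_PORT" -j ACCEPT
}
```

The idea would be to call block_all before starting slapd, then call admit
once per client host with a pause in between, and remove the rules again
once every machine has reconnected. Note that the DROP rule as written also
blocks connections from localhost until 127.0.0.1 is admitted.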



