So, after the NFS capabilities were added to collectd, we were able to track down the source of the heavy, constant load that was plaguing our NFS server. As you can see from these weekly graphs:

http://vertex.ices.utexas.edu:9999/weekly/Nov13-Nov20/index.html

we have dramatically reduced our load. We went from a sustained load of over 14,000 RPC (NFS) operations per second to just over 1,000 on average.

The culprit turned out to be an older and poorly configured version of "gamin" (the file alteration monitor, see http://www.gnome.org/~veillard/gamin/).

Our solution was to create an RPM of the latest version (0.1.7) and put the following in /etc/gamin/mandatory_rc on all clients:

    fsset nfs none
    fsset autofs none

(The autofs line may not be necessary.)
(/etc/gamin/mandatory_rc was not supported until a recent version of gamin.)

Additionally, the attached client/server Python script I wrote was instrumental in tracking down the hosts that were causing most of the problem. Here is some sample output, which shows the average total NFS activity of each host in ops/sec:

     *** top 20 offenders ***
    3346 fire.ices.utexas.edu
    3080 orinoco.ices.utexas.edu
    1927 antigua.ices.utexas.edu
    1798 super.ices.utexas.edu
    1448 tronix.ices.utexas.edu
    755 cozumel.ices.utexas.edu
    463 reunion.ices.utexas.edu
    305 tobago.ices.utexas.edu
    265 cletus.ices.utexas.edu
    237 otto.ices.utexas.edu
    207 promise.ices.utexas.edu
    161 velma.ices.utexas.edu
    160 sally.ices.utexas.edu
    135 water.ices.utexas.edu
    108 sauron.ices.utexas.edu
    99 nugloo.ices.utexas.edu
    60 cancun.ices.utexas.edu
    46 retina.ices.utexas.edu
    38 carbon.ices.utexas.edu
    27 aussie.ices.utexas.edu

Interestingly, after installing the new version of gamin, having the users log out, kill their gam_server process, and log back in was not enough to fix the problem. Each machine had to be rebooted for the changes to take effect. My guess is that this had something to do with the fact that autofs doesn't appear to actually get restarted until you reboot ("service autofs restart" appears to do nothing significant while you have NFS filesystems automounted).

-jason pepas
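For anyone rolling the gamin change out to many clients, here is a rough sketch of a check that the two fsset lines are present in /etc/gamin/mandatory_rc. It is not part of the attached scripts: the names REQUIRED, missing_directives and check_file are made up for illustration, and it only does plain line matching rather than parsing gamin's configuration syntax.

```python
# Illustrative check only; names are hypothetical and this does simple
# line matching, not a real parse of gamin's configuration format.
REQUIRED = ("fsset nfs none", "fsset autofs none")


def missing_directives(text, required=REQUIRED):
    """Return the required lines that do not appear in text."""
    # Collapse runs of whitespace so "fsset  nfs   none" still matches.
    present = {" ".join(line.split()) for line in text.splitlines()}
    return [directive for directive in required if directive not in present]


def check_file(path="/etc/gamin/mandatory_rc"):
    """Return the directives missing from the file at path."""
    try:
        with open(path) as handle:
            return missing_directives(handle.read())
    except FileNotFoundError:
        # No file at all means every directive is missing.
        return list(REQUIRED)


# A file that only carries the nfs line is reported as missing the autofs one.
print(missing_directives("fsset nfs none\n"))
```

An empty list from check_file() would mean both lines are in place. As noted above, the machine may still need a reboot before the change takes effect.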
Attachment:
client-wrap.sh
Description: Bourne shell script
#!/usr/bin/python3
# see http://gnosis.cx/publish/programming/sockets2.html

import time
import random
import socket
import sys


def randsleep(interval):
    # Sleep for interval plus a random extra of up to one more interval,
    # so that the clients do not all report at the same moment.
    rand_fudge = interval * random.random()
    myinterval = interval + rand_fudge
    then = time.time()
    while True:
        now = time.time()
        elapsed = now - then
        if elapsed >= myinterval:
            break
        else:
            remaining = myinterval - elapsed
            time.sleep(remaining)
    return elapsed


def get_stats():
    # Sum the per-procedure counters on the "proc3" line.
    # fields[1] is the number of counters that follow it.
    for line in open("/proc/net/rpc/nfs"):
        fields = line.split()
        if fields[0] == "proc3":
            numfields = int(fields[1])
            firstfield = 2
            total = 0
            for i in range(firstfield, firstfield + numfields):
                total = total + int(fields[i])
            return total


previous_get_stats_time = time.time()
previous_stats = 0


def get_rates():
    # Return (operations per second since the last call, seconds elapsed).
    global previous_get_stats_time
    global previous_stats
    now = time.time()
    stats = get_stats()
    elapsed = now - previous_get_stats_time
    delta_stats = stats - previous_stats
    stats_rate = delta_stats / elapsed
    previous_get_stats_time = now
    previous_stats = stats
    return (stats_rate, elapsed)


def print_stats():
    (stats_rate, stats_elapsed) = get_rates()
    print(stats_elapsed, stats_rate)


# throw away the first set of values, as they are invalid.
get_rates()

if len(sys.argv) < 2:
    print("Usage: %s (print)|(net hostname)" % sys.argv[0])
    sys.exit(1)

if sys.argv[1] == "print":
    # print the local rate once a second
    while True:
        time.sleep(1)
        print(int(get_rates()[0]))
elif sys.argv[1] == "net" and len(sys.argv) >= 3:
    # send the local rate to the server over UDP roughly every 30-60 seconds
    myhostname = socket.gethostname()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        randsleep(30)
        (stats_rate, stats_elapsed) = get_rates()
        sock.sendto(str(int(stats_rate)).encode(), (sys.argv[2], 1337))
else:
    print("Usage: %s (print)|(net hostname)" % sys.argv[0])
    sys.exit(1)
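To make the client's counter arithmetic easier to follow, here is a small self-contained sketch that works on text instead of reading /proc. The sample lines are invented and shortened (a real proc3 line carries more counters), and sum_proc3 and ops_per_sec are illustrative names that do not appear in the attached script.

```python
def sum_proc3(text):
    """Return the sum of the counters on the "proc3" line, or None if absent."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "proc3":
            # fields[1] says how many counters follow it.
            count = int(fields[1])
            return sum(int(field) for field in fields[2:2 + count])
    return None


def ops_per_sec(before, after, elapsed):
    """Average operations per second between two counter samples."""
    return (after - before) / elapsed


# Hypothetical, shortened samples assumed to be taken 30 seconds apart.
sample_then = "net 0 0 0 0\nproc3 4 10 200 3000 40\n"
sample_now = "net 0 0 0 0\nproc3 4 10 500 5700 40\n"

before = sum_proc3(sample_then)  # 10 + 200 + 3000 + 40 = 3250
after = sum_proc3(sample_now)    # 10 + 500 + 5700 + 40 = 6250
print(int(ops_per_sec(before, after, 30.0)))  # prints 100
```

The counters only ever grow, so the first reading on its own says nothing about the current rate. That is why the client discards its first get_rates() result.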
#!/usr/bin/python3
# see http://gnosis.cx/publish/programming/sockets2.html

import functools
import socket


def rsortfunc(x, y):
    # from http://2701.org/archive/200311230000.html
    # compares (hostname, rate) pairs so that the highest rate sorts first
    if x[1] > y[1]:
        return -1
    elif x[1] == y[1]:
        return 0
    else:
        return 1


sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('', 1337))

# most recent rate reported by each host
rates = {}

while True:
    (data, address) = sock.recvfrom(256)
    rate = int(data)
    hostname = socket.gethostbyaddr(address[0])[0]
    rates[hostname] = rate
    items = list(rates.items())
    items.sort(key=functools.cmp_to_key(rsortfunc))
    print()
    print()
    print()
    print(" *** top 20 offenders ***")
    for i in range(min(len(items), 20)):
        print(items[i][1], items[i][0])
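The ranking step in the server can also be tried without any network traffic. The sketch below is one possible way to write it, using sorted() with a key instead of a comparison function. top_offenders is an illustrative name, and the three rates are copied from the sample output earlier in this message.

```python
def top_offenders(rates, limit=20):
    """Return (hostname, rate) pairs sorted by rate, highest first."""
    ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Rates taken from the sample output in the message.
rates = {
    "antigua.ices.utexas.edu": 1927,
    "fire.ices.utexas.edu": 3346,
    "orinoco.ices.utexas.edu": 3080,
}

for hostname, rate in top_offenders(rates, limit=2):
    print(rate, hostname)
```

With limit=2 this prints the fire and orinoco lines, in that order. Because the server keeps only the latest report per host, a host that stops reporting stays in the list with its last value.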