CE overloaded

From T2B Wiki

When the CE is overloaded, it can have trouble publishing BDII information (at least when the site BDII runs on the CE). If top shows high system CPU usage and vmstat 1 shows a large number of context switches (the cs column), there are probably too many globus-job-manager processes running.
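
The two checks above can be scripted. The following is a minimal sketch, not an established tool: the function names are hypothetical, and it assumes the usual procps layout of vmstat output (a group header line, a line of column names containing cs, then one line per sample, the first of which is the average since boot). Both functions read from stdin, so they can be tried on saved output as well as on a live system.

```shell
#!/bin/sh
# Sketch only: the function names are hypothetical, not part of any middleware.

# Count globus-job-manager processes in a process listing read from stdin
# (one command line per line, e.g. from `ps -eo args=`).
# The [g] bracket keeps a grep process from matching its own pattern.
count_job_managers() {
    grep -c '[g]lobus-job-manager' || true
}

# Average the cs (context switches per second) column of vmstat output
# read from stdin. Assumes line 2 holds the column names; the first
# sample (line 3) is skipped because it is the average since boot.
avg_context_switches() {
    awk '
        NR == 2 { for (i = 1; i <= NF; i++) if ($i == "cs") col = i; next }
        NR > 3 && col { sum += $col; n++ }
        END { if (n) printf "%d\n", sum / n; else print 0 }
    '
}
```

On the CE this could be used as `ps -eo args= | count_job_managers` and `vmstat 1 5 | avg_context_switches`. What counts as too many depends on the hardware, so the numbers are best compared against the same machine when it is healthy.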

Possible causes

  • sendmail + torque: torque sends an email to the pool user when a job completes. These deliveries will (probably) fail, and the repeated delivery retries can bring the system to a halt.
    • solution:
    • disable sendmail completely
    • clean out /var/spool/clientmqueue and /var/spool/mqueue
  • globus-job-manager tracks jobs from the past: /opt/globus/tmp/gram_job_state contains lock files whose names identify the jobs to be checked. Some of them can be quite old.
    • solution: clean it up by removing the old lock files and the corresponding regular files
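cd /opt/globus/tmp/gram_job_state
# list lock files unchanged for more than 20 days, with the .lock suffix stripped
find "$PWD" -ctime +20 -regex '.*\.lock' | sed 's/\.lock$//' > list
# remove each stale state file together with its lock file;
# the grep guards against paths outside the state directory
for i in $(grep /opt/globus/tmp/gram_job_state list); do echo "$i"; rm -f "$i" "$i.lock"; done
rm -f list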
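
For the sendmail cause listed above, the queue cleanup can be sketched in the same way. This is only an illustration: the function name clean_mail_queue is hypothetical, and the service and chkconfig commands in the comments assume a SysV-style init system, which may not match the CE in question.

```shell
#!/bin/sh
# Sketch only: clean_mail_queue is a hypothetical helper name.

# Delete every regular file under the given queue directory, leaving the
# directory itself in place. find is used instead of `rm dir/*` so that a
# very large queue does not overflow the shell's argument list.
clean_mail_queue() {
    dir=$1
    [ -d "$dir" ] || return 0
    find "$dir" -type f -exec rm -f {} +
}

# Assumed usage on the CE, as root (the init tooling is an assumption):
#   service sendmail stop
#   chkconfig sendmail off
#   clean_mail_queue /var/spool/clientmqueue
#   clean_mail_queue /var/spool/mqueue
```

Stopping sendmail before emptying the queues keeps it from retrying deliveries while the files are being removed.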
