ProdAgentProd

From T2B Wiki

Prodagent for LCG5 Production Team

operations

  • summary of all steps
  • subscribe to hn-cms-mcOpsAATTcern.ch
  • login as cmsprod1 on master3.iihe.ac.be (keys should be in place)
    • bash is the default shell
    • at login, ~/.bashrc sets up the environment and changes to the correct directory
  • make sure your environment is OK and that you are using the correct production VOMS role.
  • check the Production Operations page. The Belgian team is LCG5.
    • look for assigned workflows and note their IDs. To copy them to disk, use the get_workflows.sh script in the workflows directory. The script takes two numbers: the IDs of the first and the last workflow. It fetches ALL workflows with IDs between those two numbers, including the ones that are not assigned to LCG5!
  • by clicking on the ID number of the workflow you can check if the workflow is GEN-SIM or reRECO
 1. If it is GEN-SIM use the SubIIHE.py script. This script has the option: --map
    • If you look in SubIIHE.py you will see that the argument of --map is a name that you have defined yourself.
    • For example, to submit test jobs you can run
SubIIHE.py --map=iihe-test
    • in SubIIHE.py you can see that iihe-test sets the workflow, njobs, nevts and sitePref
map['iihe-test']={'workflow':'/beo5/cmsprod1/first-test/test-iihe-fullprod-1-Workflow.xml','njobs':'10','nevts':'10','sitePref':'maite.iihe.ac.be'}
    • If you give an extra option, e.g. --nevts=5, it overrides the value from the map (nevts=10)
    • If you give more than one sitePref, the RB (resource broker) will choose the site the job is sent to
 2. If it is reRECO use the SubSkimIIHE.py script. 
    • detailed explanation: insert the workflows in the PA system: full details
    • the InjectTestLCG.py script is a wrapper for most of the following steps
    • the start script creates an alias publi for python2.4 $PRODAGENT_ROOT/util/publish.py
  publi RequestInjector:SetWorkflow  /some/path/to/MyWorkflow.xml
  publi NewDataset /some/path/to/MyWorkflow.xml
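
The --map behaviour described above for SubIIHE.py (a named preset whose values can be overridden by extra options such as --nevts=5) can be sketched as follows. This is a minimal illustration in modern Python, not the actual SubIIHE.py code: the names PRESETS and resolve_options and the merge rule are assumptions, and only the iihe-test entry is copied from this page.

```python
# Hypothetical sketch of combining a --map preset with explicit options.
# Only the 'iihe-test' entry comes from this page; everything else is illustrative.

PRESETS = {
    'iihe-test': {
        'workflow': '/beo5/cmsprod1/first-test/test-iihe-fullprod-1-Workflow.xml',
        'njobs': '10',
        'nevts': '10',
        'sitePref': 'maite.iihe.ac.be',
    },
}


def resolve_options(map_name, overrides=None):
    """Return the preset named map_name with explicit overrides applied on top."""
    options = dict(PRESETS[map_name])  # copy, so the preset itself is never modified
    for key, value in (overrides or {}).items():
        if value is not None:  # None stands for "option not given on the command line"
            options[key] = value
    return options


if __name__ == '__main__':
    # Equivalent in spirit to: SubIIHE.py --map=iihe-test --nevts=5
    print(resolve_options('iihe-test', {'nevts': '5'}))
```

Copying the preset before applying overrides keeps the map reusable for a later submission with different options.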


  • Phedex the dataset (closing blocks)
    • After a certain number of events has been produced, they should be migrated to Global DBS/DLS and injected into PhEDEX. The required steps are described in: operation steps for injection.
    • In /beo5/cmsprod1/tools you can find a script that does the first three steps for you: phedex_the_dataset.sh. The script takes the dataset path as its argument
    • you can look up the dataset path in DBS.
  phedex_the_dataset.sh /mc-onsel-120_double_chHiggs/FEVT/CMSSW_1_2_0-FEVT-1166726234
  • Closing the dataset
    • When all required events are produced you have to close the dataset
    • The script CloseDataSet.py will do this for you
  CloseDataSet.py --datasetpath=/mc-onsel-120_double_chHiggs/FEVT/CMSSW_1_2_0-FEVT-1166726234
    • Check DBS to see that the status is closed.
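
Both phedex_the_dataset.sh and CloseDataSet.py take a dataset path, so a mistyped path is worth catching before either script runs. The sketch below is an assumption-laden helper, not part of the production tools: it assumes every dataset path has the three-component form of the example above, and the function names are hypothetical.

```python
# Hypothetical helper: sanity-check a dataset path before handing it to the scripts.
# Assumes the three-component form '/<first>/<second>/<third>' seen in the example.


def split_dataset_path(path):
    """Split '/a/b/c' into its three components; raise ValueError otherwise."""
    if not path.startswith('/'):
        raise ValueError('dataset path must start with "/": %r' % path)
    parts = path[1:].split('/')
    if len(parts) != 3 or not all(parts):
        raise ValueError('expected exactly three non-empty components: %r' % path)
    return tuple(parts)


def phedex_command(path):
    """Build the phedex_the_dataset.sh call for a validated dataset path."""
    split_dataset_path(path)  # raises ValueError on a malformed path
    return ['phedex_the_dataset.sh', path]


if __name__ == '__main__':
    example = '/mc-onsel-120_double_chHiggs/FEVT/CMSSW_1_2_0-FEVT-1166726234'
    print(split_dataset_path(example))
    print(' '.join(phedex_command(example)))
```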

finishing a workflow

  • check for failed jobs or check the DBS discovery page if there are enough events in the unmerged dataset
  • check if there are enough events in the merged dataset or see if there are lots of merge failures
    • use the double_check.py script to check the status of the jobs (in the SuccessArchiveDir)
    • use the check_merge.py script to check for current merges and to resubmit failed ones
    • use the force_merge.py script to force all remaining unmerged events to be merged
  • wait until previous jobs are finished
  • use the phedex_the_dataset.sh script to close all fileblocks, migrate to Global DBS and produce the PhEDEX injection files
  • use CloseDataSet.py to close the dataset completely
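
The order of the checks above can be summarised as a small decision function. This is only a sketch of the reasoning: the event counts are assumed to be read by hand from the DBS discovery page, the function name next_step is hypothetical, and the rule that the merged dataset should reach the requested number of events is an assumption.

```python
# Hypothetical decision helper for finishing a workflow.
# Inputs are event counts read manually from the DBS discovery page.


def next_step(requested, unmerged, merged):
    """Suggest the next action from the requested and produced event counts."""
    if unmerged < requested:
        # Production itself is incomplete: look at job status first.
        return 'check failed jobs with double_check.py and wait for running jobs'
    if merged < requested:
        # Enough events exist, but not all of them are merged yet.
        return 'run check_merge.py, then force_merge.py for remaining unmerged events'
    # Everything is produced and merged: the dataset can be published and closed.
    return 'run phedex_the_dataset.sh, then CloseDataSet.py'


if __name__ == '__main__':
    print(next_step(requested=1000, unmerged=800, merged=0))
    print(next_step(requested=1000, unmerged=1000, merged=600))
    print(next_step(requested=1000, unmerged=1000, merged=1000))
```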

using production role

  • destroy the current proxy and create a new one with the production VOMS role
  voms-proxy-destroy
  source /msa3/prodagent/new_proxy_prod.sh
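
After creating the new proxy it is useful to confirm that it really carries the production role. The sketch below assumes that voms-proxy-info -fqan prints one FQAN per line in the usual form /cms/Role=production/Capability=NULL; the function names are hypothetical and the output format should be checked against your VOMS client version.

```python
# Hypothetical check that the current proxy carries the production role.
# Assumes 'voms-proxy-info -fqan' prints one FQAN per line.
import subprocess


def has_production_role(fqan_lines, vo='cms'):
    """Return True if any FQAN of the given VO contains Role=production."""
    prefix = '/%s/' % vo
    for line in fqan_lines:
        fqan = line.strip()
        if fqan.startswith(prefix) and 'Role=production' in fqan.split('/'):
            return True
    return False


def current_fqans():
    """Ask the VOMS client for the FQANs of the current proxy."""
    output = subprocess.check_output(['voms-proxy-info', '-fqan'])
    return output.decode().splitlines()


if __name__ == '__main__':
    if has_production_role(current_fqans()):
        print('production role present')
    else:
        print('no production role: recreate the proxy')
```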

bari monitoring system

  • start/stop it with
  Start the daemon to get the component log messages updated regularly:
  (cd /scratch/PA/cmsprod1/monitoring/apache/htdocs/PA; nohup sh ComponentLog.sh &)

  Start apache:
  /scratch/PA/cmsprod1/monitoring/apache/bin/apachectl start

  Point your browser to http://localhost:8080/PA/index.php

  When necessary, stop the apache server:
  /scratch/PA/cmsprod1/monitoring/apache/bin/apachectl stop
  • remote connection: since the webpage is only accessible from master3 (for security reasons), some special steps need to be taken to use it.
    • SSH tunnel setup on linux
    • standard tunnel setup (for more info see man ssh, -L option). This sets up a tunnel from a single local port <localportnumber> to a single destination <destinationhost>:<destinationport> through the host <machineusedfortunnel>. All your local applications can use it without modification.
  ssh -L <localportnumber>:<destinationhost>:<destinationport> <machineusedfortunnel>
    • e.g. using the following
   ssh -L 8081:localhost:8081 master3.iihe.ac.be
  you can now point your browser to http://localhost:8081/PA/index.php without any problem
    • application forwarding: this forwards all traffic through the tunnel, but needs extra settings in your local application
      • connect to master3 using this command
 ssh -D 10001 master3.iihe.ac.be
      • in firefox, go to Edit, Preferences, General, Connection Settings
      • select manual proxy configuration
      • set it up as in the following picture (SOCKS host localhost, port 10001, matching the ssh -D command above), then click OK and Close.
    Image(firefox-proxy-socks5.png, 25%)
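
The two ssh invocations above differ only in their forwarding option, which the following sketch makes explicit by building both command lines from their parts. The function names are hypothetical; the -L and -D options are the standard OpenSSH ones described in man ssh.

```python
# Hypothetical builders for the two kinds of tunnel described above.


def _check_port(port):
    """Return the port as an int, rejecting values outside 1..65535."""
    number = int(port)
    if not 0 < number < 65536:
        raise ValueError('invalid port: %r' % port)
    return number


def tunnel_command(local_port, dest_host, dest_port, gateway):
    """Single-port tunnel: ssh -L <local>:<desthost>:<destport> <gateway>."""
    spec = '%d:%s:%d' % (_check_port(local_port), dest_host, _check_port(dest_port))
    return ['ssh', '-L', spec, gateway]


def socks_command(local_port, gateway):
    """Dynamic (SOCKS) forwarding: ssh -D <local> <gateway>."""
    return ['ssh', '-D', str(_check_port(local_port)), gateway]


if __name__ == '__main__':
    print(' '.join(tunnel_command(8081, 'localhost', 8081, 'master3.iihe.ac.be')))
    print(' '.join(socks_command(10001, 'master3.iihe.ac.be')))
```

In the -L form, <destinationhost> is resolved on the tunnel machine, so localhost there means master3 itself.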
  • Looking for the reason of a failure/abort
    • Once a job has failed it will be resubmitted automatically. When it has failed ten times it will appear in /beo5/cmsprod1/FailureArchiveDir
    • You can look in the stdout (or stderr) file to see where it went wrong
    • If a job has not yet failed ten times you have to look in the JobCreator directory (but there are a lot of files in there)
    • You can also use the following command, which gives the same information as the stdout file
edg-job-get-logging-info -v 2 https://rb122.cern.ch:9000/j0Eiki7h__eyBcKLpW12Pw
///bossq -taskid 23202 -full
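
Because the archive and JobCreator directories contain many files, a small search for the stdout and stderr files can save time. The sketch below assumes that the relevant files have "stdout" or "stderr" somewhere in their names; the real naming scheme is not documented on this page, so adjust the markers as needed. The function name find_job_logs is hypothetical.

```python
# Hypothetical search for job log files below a directory.
# Assumes log file names contain 'stdout' or 'stderr'.
import os


def find_job_logs(top_dir, markers=('stdout', 'stderr')):
    """Return the sorted paths of all files under top_dir whose name contains a marker."""
    matches = []
    for root, _dirs, files in os.walk(top_dir):
        for name in files:
            if any(marker in name for marker in markers):
                matches.append(os.path.join(root, name))
    return sorted(matches)


if __name__ == '__main__':
    for path in find_job_logs('/beo5/cmsprod1/FailureArchiveDir'):
        print(path)
```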

LCG5 details

  prodAgent-edit-config --component=LocalDBS --parameter=DBSURL --value=https://cmsdbsprod.cern.ch:8443/cms_dbs_prod_local_05_writer/servlet/DBSServlet
  prodAgent-edit-config --component=RequestInjector --parameter=FirstRun --value=40000000
 lcg-infosites --vo cms closeSE 
 Use the closeSE in the --site-pref= option 
    • ingrid.cism.ucl.ac.be
      • closeSE: ingrid-se02.cism.ucl.ac.be
      • contact: gridadmAATTfynu.ucl.ac.be
      • slots: 30-70
    • gridce.iihe.ac.be
      • closeSE: maite.iihe.ac.be
      • contact: grid_adminAATTlistserv.vub.ac.be
      • slots: 40
    • grid10.lal.in2p3.fr
      • closeSE: grid11.lal.in2p3.fr, grid05.lal.in2p3.fr, grid03.lal.in2p3.fr
      • contact: igor.semenioukAATTpoly.in2p3.fr, jouvinAATTlal.in2p3.fr, grid.supportAATTgrif.fr
      • slots: 100
    • cclcgceli02.in2p3.fr
      • closeSE: ccsrm.in2p3.fr
      • contact: cms-supportAATTcc.in2p3.fr
      • slots: 210
    • polgrid1.in2p3.fr
      • closeSE: polgrid2.in2p3.fr, polgrid4.in2p3.fr
      • contact: igor.semenioukAATTpoly.in2p3.fr, grid.supportAATTgrif.fr
      • slots: 20
    • CEA: node07.datagrid.cea.fr
      • closeSE: node12.datagrid.cea.fr
      • contact: grid.supportAATTgrif.fr
    • ce.polgrid.pl
      • closeSE: se.polgrid.pl
      • contact: lcgAATTfuw.edu.pl
      • slots: ??
    • oberon.hep.kbfi.ee
      • closeSE: io.hep.kbfi.ee
      • contact: gridAATTnicpb.ee
      • slots: ?
    • a01-004-128.gridka.de, ce-fzk.gridka.de
      • closeSE:
      • contact: lcg-adminAATTlistserv.fzk.de, Artem.TrunovAATTcern.ch
      • slots: ?
    • desy
      • contact: egeeAATTdesy.de
    • RWTH
      • contact: lcg-contactAATTphysik.rwth-aachen.de
    • All CEs/SEs: CE list for LCG (5)
    • Contact information for (former) LCG2 sites: Gstat monitoring
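
The CE to closeSE pairs listed above can be kept in a small lookup table to build the --site-pref= value. This is a sketch under assumptions: the table holds only the sites for which a closeSE is given on this page, the function name is hypothetical, and the comma used to join several SEs is an assumption that should be checked against the submission script.

```python
# Hypothetical lookup from CE to closeSE, copied from the list on this page.
CLOSE_SE = {
    'ingrid.cism.ucl.ac.be': ['ingrid-se02.cism.ucl.ac.be'],
    'gridce.iihe.ac.be': ['maite.iihe.ac.be'],
    'grid10.lal.in2p3.fr': ['grid11.lal.in2p3.fr', 'grid05.lal.in2p3.fr',
                            'grid03.lal.in2p3.fr'],
    'cclcgceli02.in2p3.fr': ['ccsrm.in2p3.fr'],
    'polgrid1.in2p3.fr': ['polgrid2.in2p3.fr', 'polgrid4.in2p3.fr'],
    'node07.datagrid.cea.fr': ['node12.datagrid.cea.fr'],
    'ce.polgrid.pl': ['se.polgrid.pl'],
    'oberon.hep.kbfi.ee': ['io.hep.kbfi.ee'],
}


def site_pref_option(ce_names, separator=','):
    """Build a --site-pref= option from the closeSEs of the given CEs.

    The separator between several SEs is an assumption, not taken from this page.
    """
    storage_elements = []
    for ce in ce_names:
        for se in CLOSE_SE[ce]:  # KeyError for a CE without a known closeSE
            if se not in storage_elements:
                storage_elements.append(se)
    return '--site-pref=' + separator.join(storage_elements)


if __name__ == '__main__':
    print(site_pref_option(['gridce.iihe.ac.be']))
    print(site_pref_option(['gridce.iihe.ac.be', 'polgrid1.in2p3.fr']))
```

The output of lcg-infosites --vo cms closeSE remains the authoritative source; the table only mirrors what is listed here.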

