DDT

From T2B Wiki
Revision as of 12:28, 26 August 2015 by Maintenance script

Debugging Data Transfers Instructions for the Brussels DDT team

Useful links:

Registrations:

  • Subscribe to the following hypernews lists:
    • hn-cms-facilitiesOpsAATTcern.ch
    • hn-cms-phedexAATTcern.ch
    • hn-cms-phedexOpsAATTcern.ch
    • hn-cms-ddt-tfAATTcern.ch (main ddt hypernews)
    • hn-cms-gridAnnounceAATTcern.ch (announcements for downtimes etc.)
  • Make sure you have the PADA Admin: phedex role, otherwise you can't inject load tests!
  • You should also register on Savannah for troubleshooting (you can use your CERN AFS account).

Enabling new links (disabling links with problems):

  • Log in as ddt on m1.iihe.ac.be (ssh -2X ddt@m1.iihe.ac.be)
  • In the ddt directory you can find the main script for monitoring links: calculate_links.py
    • This script gives the current status of ddt exercising and downloads the transfer rates from the replica phedex database. It then determines whether each link is commissioned or not.
    • The complete information on how to run this script and how to interpret the output can be found in DDTToolInstructions
    • Note: if for some reason you were not able to check the links for more than three days, make sure that you use the options -sy, -sm, -sd to see ALL changes!
    • Execute the script and store the output in a file, e.g.:
./calculate_links.py -sy=yyyy -sm=mm -sd=dd >& yyyymmd2d2_links_sinceyyyymmdd.txt
   where d2d2 stands for the current day.
    • The definition and explanation of the lists in the output can be found on the twiki DDTToolInstructions
    • To find the newly commissioned links over the period you're checking, put the first two lists in separate files of the form yyyymmd2d2_links_sinceyyyymmdd_1list.txt and yyyymmd2d2_links_sinceyyyymmdd_2list.txt.
    • Do a diff to check for the differences, e.g.:
diff yyyymmd2d2_links_sinceyyyymmdd_2list.txt yyyymmd2d2_links_sinceyyyymmdd_1list.txt | grep "<"
    • Remove all links where CAF shows up
    • Remove all T2 --> T2 links as long as CCRC is going on
    • Put the remaining new links in a file with name yyyymmdd_changes.txt, they should have the format:
TX_Source_Site TY_Destination_Site enable
    • Similarly, if a link is still in the problem-rate list 3 days after you tried to commission it, put it in yyyymmdd_changes.txt with "disable" (always send a warning before you disable a link!):
TX_Source_Site TY_Destination_Site disable
    • Attach yyyymmdd_changes.txt to the DDT twiki. If a file with that date already exists, update it instead of creating a new one.
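
The diff-and-filter steps above can also be sketched in Python. This is only an illustration, not part of the DDT tools. It assumes that each list file holds one link per line in the form "source destination" (the real output format of calculate_links.py is described in DDTToolInstructions and may differ). All function names are hypothetical and the site names are made-up sample data. Unlike diff, the comparison ignores line order.

```python
def parse_links(lines):
    """Return (source, destination) pairs from lines of the assumed form 'SRC DST'."""
    links = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            links.append((fields[0], fields[1]))
    return links


def new_links(list1, list2):
    """Links present in list2 but not in list1 (like: diff list2 list1 | grep '<')."""
    seen = set(list1)
    return [link for link in list2 if link not in seen]


def keep_link(link):
    """Drop links where CAF shows up and all T2 --> T2 links."""
    src, dst = link
    if "CAF" in src or "CAF" in dst:
        return False
    if src.startswith("T2_") and dst.startswith("T2_"):
        return False
    return True


def changes_lines(links, action="enable"):
    """Format links as lines for yyyymmdd_changes.txt."""
    return ["%s %s %s" % (src, dst, action) for src, dst in links]


# Made-up sample data, for illustration only.
list1 = parse_links(["T1_UK_RAL T2_BE_IIHE"])
list2 = parse_links([
    "T1_UK_RAL T2_BE_IIHE",
    "T1_ES_PIC T2_BE_IIHE",
    "T2_BE_UCL T2_BE_IIHE",
    "T1_ES_PIC T2_CH_CAF",
])
candidates = [link for link in new_links(list1, list2) if keep_link(link)]
print("\n".join(changes_lines(candidates)))
```

The same changes_lines helper can be called with action="disable" for links that stay in the problem-rate list.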

Checking the output of the previous injections:

  • Go to our twiki DDTLinkExercisingGroup4.
  • Check the links you started the day before:
    • Go to IIHE monitoring page
    • Check if the link is green
    • If yes, it passed the test:
    • Adapt the twiki with "passed"
    • Go to the PhEDEx page, make sure you are logged in with "Debug". Click on "Data" and "LoadTest Injections". Display with "Show Options" and "Nodes shown" the correct SOURCE Tier. Search in the list and select the correct link. Lower the injection rate to a monitoring rate of 0.5 MB/s.
    • Click on "Subscriptions". Display with "Show Options" and "Nodes shown" the correct DESTINATION Tier. Search in the list and select the correct link. Suspend the subscription, by selecting this action from the drop-down box.
    • Check in "Activity" the "Quality Plot" of the link.
    • If the link didn't pass the test, scroll down this page to the section about Troubleshooting

Exercising links:

  • Go to our twiki DDTLinkExercisingGroup4.
  • If needed, start injections for other links:
    • Go to Nicolo's monitoring page
    • Check the color of the link. If it's already green, put it as "passed" on the twiki and choose another link.
    • If not, you have to inject a loadtest:
    • Go to the PhEDEx page, make sure you are logged in with "Debug". Click on "Components" and "Links" to check if PhEDEx is up and running (green). If not, you have to open a savannah ticket (if not already done). If yes, proceed:
    • Click on "Data" and "LoadTest Injections". Display with "Show Options" and "Nodes shown" the correct SOURCE Tier. Search in the list and select the correct link. Raise the injection rate to 30 MB/s for a downlink (T1 -> T2) and 8 MB/s for an uplink (T2 -> T1). Make sure that the link is Active. If this is not the case, select the link and choose the appropriate action (start injections) from the drop-down box.
    • Click on "Subscriptions". Display with "Show Options" and "Nodes shown" the correct DESTINATION Tier. Search in the list and select the correct link. Unsuspend the subscription if needed, by selecting this action from the drop-down box.
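
As a memory aid, the injection rates quoted on this page can be written down as a small lookup. This is a minimal sketch: the function names are hypothetical, and it assumes that node names start with the tier ("T1_...", "T2_..."). The rates themselves are the ones given above (30 MB/s downlink, 8 MB/s uplink) and the 0.5 MB/s monitoring rate used once a link has passed.

```python
MONITORING_RATE_MB_S = 0.5  # rate to fall back to once a link has passed the test


def tier(site):
    """Return the tier prefix of a node name, e.g. 'T1' for 'T1_UK_RAL'."""
    return site.split("_", 1)[0]


def exercising_rate(source, destination):
    """Injection rate in MB/s to use while exercising a link."""
    src, dst = tier(source), tier(destination)
    if src == "T1" and dst == "T2":
        return 30.0  # downlink
    if src == "T2" and dst == "T1":
        return 8.0   # uplink
    raise ValueError("no exercising rate given on this page for %s -> %s" % (src, dst))


print(exercising_rate("T1_UK_RAL", "T2_BE_IIHE"))
print(exercising_rate("T2_BE_IIHE", "T1_UK_RAL"))
```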

Keeping people informed and staying informed:

  • Send a daily list to the ddt hypernews saying what you checked, what the status is, what you will check next and which problems you encountered.
  • The PADA meeting takes place biweekly on Thursday at 16h.
  • The FacOps meeting takes place weekly on Friday at 16h (give a short report if there was a PADA meeting on Thursday, a full report if not).

Troubleshooting (metric is not reached):

  • Go to the PhEDEx page
    • Check everything you did to exercise the link (subscription/injection/component active...)
    • If the link is not active under "Components" and "Links":
    • Check the same link in production
    • Check whether there is a scheduled downtime on Scheduled downtimes of sites
    • Notify the site admins with a savannah ticket asking them to bring the PhEDEx agent up
    • If the status is unchanged after more than 1 month, you should disable the link (always send a warning before you do this)!
    • Check under "Activity" the "Quality Plots": is the quality bad or good?
    • Check also in "Production" mode the Quality
    • Go to "Recent Errors" and check for errors (check also if errors are there in production):
    • Check carefully whether the recent errors are really recent ;-)
    • Try to find out the problem and where it is located (source or destination Tier), this is the most difficult part. Every time you encounter a new problem put it below in the Questions&Answers and try to find the solution for the problem.
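
Checking whether an error is "really recent" comes down to comparing its timestamp with the current time. The sketch below is only an illustration: it assumes the timestamp is given in Unix epoch seconds (the job-log lines quoted further down this page, such as "1213190778 created...", look like that), and the 24-hour window and the name is_recent are assumptions, not a DDT rule.

```python
from datetime import datetime, timedelta, timezone


def is_recent(epoch_seconds, now, max_age=timedelta(hours=24)):
    """True if the timestamp lies within max_age before 'now' (both in UTC)."""
    when = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    age = now - when
    return timedelta(0) <= age <= max_age


# Fixed "now" so the example is reproducible; in practice use datetime.now(timezone.utc).
now = datetime(2008, 6, 11, 14, 0, tzinfo=timezone.utc)
print(datetime.fromtimestamp(1213190778, tz=timezone.utc).isoformat())
print(is_recent(1213190778, now))
```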

Questions&Answers:

  • How do I use Savannah (Link on top of this page)?
    • Login (button left)
    • Go to My Groups (button left)
    • Click on CMS Computing Infrastructure Support
    • Put your mouse on the "Support" button on top of the page and choose one of the items in the drop-down list.
    • When you submit a Savannah ticket (link on top of page) and the site administrator (you get the name from siteDB) is not in the list of people you can assign the ticket to, then you should
      • Assign the ticket to our group: cmscompinfrasup-ddt
      • Write the email address of the site administrator in the box "Mail Notification CC" when submitting the ticket, otherwise there is no guarantee that the site administrator sees the ticket.
     --> For the email address you can check on siteDB (link on top of page)
    • When you have already sent a savannah ticket about a problem, the problem persists and you want to warn the people again, do the following:
      • Find the savannah ticket you previously sent (browse the open tickets on the savannah website)
      • Click on it
      • Use the "post comment" button to reply to your own savannah ticket
  • How do I use GocDB (Link on top of this page)? (i.e. the website where you can see the scheduled downtimes of grid sites)
    • Click on "Downtime overview" at the left side. This gives you the list of all interventions going on at all sites worldwide (not only LHC related!)
    • Check the Start date/time and End date/time in the table and click on the description to see the site name. In this way you should extract the relevant information. Now you can decide which links you can exercise.
    • If you already started transfers for a certain link and you are wondering why they are not going on, you have to check if the downtime is still in progress. The works in progress are shown in red and are marked with "[IN PROGRESS]".
  • Which loadtest should I choose for IIHE and UCL?
    • Depending on the T1, there are sometimes 2 loadtest injections in PhEDEx for the Belgian T2 sites. Both are valid, but make sure you unsuspend the subscription that belongs to the loadtest you injected, otherwise the loadtest will not work.
  • Help, the loadtest injection does not exist! How do I create a new loadtest injection?
    • Go to the PhEDEx "create injections" page
    • Log in to the Debug PhEDEx page
    • Go to Data -> LoadTest Injections.
    • Click on Show Options -> Create Injections.
    • Select source and destination site.
    • Then you need to create the subscription request:
      • Copy the Injection Dataset name (careful not to copy the Source Dataset).
 /PhEDEx_Debug/LoadTest07_UK_RAL/ES_IFCA
 /PhEDEx_Debug/LoadTest07_SourceSite/DestinationSite
      • Go to Requests -> Create Requests -> Transfer Requests.
      • Select DBS "LoadTest".
      • Put the Dataset name in "Data Items".
      • Select the destination in "Destinations".
      • Submit request.
      • On the next page, verify that the correct dataset was selected (remember that the subscription for "/PhEDEx_Debug/LoadTest07_SourceSite/DestinationSite" should go to DestinationSite, not to SourceSite) and confirm. The transfers should start after the site admins approve the subscription.
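
The source/destination check in the last step can be sketched in Python. This is a minimal illustration based only on the dataset name pattern shown above; the function names are hypothetical, and the check assumes that the PhEDEx node name is the dataset's destination part with a tier prefix in front (e.g. "T2_ES_IFCA" for "ES_IFCA"), which may not hold for every node.

```python
def parse_injection_dataset(name):
    """Split '/PhEDEx_Debug/LoadTest07_<Source>/<Destination>' into (source, destination)."""
    parts = name.strip().split("/")
    # The leading '/' produces an empty first element.
    if len(parts) != 4 or parts[0] != "" or parts[1] != "PhEDEx_Debug":
        raise ValueError("not an injection dataset name: %r" % name)
    prefix = "LoadTest07_"
    if not parts[2].startswith(prefix) or not parts[3]:
        raise ValueError("not an injection dataset name: %r" % name)
    return parts[2][len(prefix):], parts[3]


def destination_matches(name, destination_node):
    """True if the chosen node corresponds to the dataset's destination part."""
    _source, destination = parse_injection_dataset(name)
    return destination_node.endswith("_" + destination)


dataset = "/PhEDEx_Debug/LoadTest07_UK_RAL/ES_IFCA"
print(parse_injection_dataset(dataset))
print(destination_matches(dataset, "T2_ES_IFCA"))  # subscribing the destination
print(destination_matches(dataset, "T1_UK_RAL"))   # subscribing the source by mistake
```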
  • Somebody is asking me to check the FTS map for a given site. How should I get that?
    • You can check this here, but maybe it is not the most recent one!
    • Otherwise ask the site admins.
  • Is there another place where I can check transfer rates /status of links / subscriptions / ...?
    • Of course there is, you can find the link to all possible PhEDEx queries here
  • Is there a place where common problems are reported, so that I can check whether a problem is already known?
    • There is the DDT Common Problems site. Here information is gathered about problems that one might see with solutions and a description of the problem. This list is not complete, but it could give you a hint at some point when something is not working.
  • Detail log: transfer expired in the PhEDEx download agent queue after xxx h. What should I do about this?
    • This is not a real error, it simply means that there are already too many files in transfer, so PhEDEx will not even start to transfer new files.
    • The first thing to check in this case is transfer quality:
      • If transfer quality is green anyway, it means that the link is already transferring at the maximum rate possible with the current settings. In this case, it's possible to tweak settings at many levels to get higher rate:
        • The first thing to try is to ask if the destination site can increase the number of parallel file transfers in their DownloadAgent. If the rate per stream is low, you can usually increase the total rate simply by transferring more streams at the same time. This is usually the easiest thing to try, unless the number of parallel files needed to get a good rate becomes too high.
        • The destination T2 can tune their storage system to get higher transfer rates per stream, so they can increase the total rate without increasing the number of parallel files. Some basic instructions are here: https://twiki.cern.ch/twiki/bin/view/CMS/DDTNetworkDebugging . This can have a positive effect for transfers in general.
        • In case of FTS transfers (check, you can see this in the errors), the source T1 can also increase the number of files in the FTS channel.
      • If the transfer quality is bad, it means that the link is slowed down by more fundamental errors which are keeping the transfer slots occupied. In this case, you should check which are the fundamental errors by using the Transfer Code filter "!-1" in the "Recent Errors" page (which filters out the "transfer expired" errors). Usually you will find out that there are other errors where the transfer fails after a very long time, wasting the transfer slot.
  • Detail log: DESTINATION error during PREPARATION phase: [INTERNAL_ERROR] No SRM method factory found for the requested version [2]. What now?
    • Possibly the destination site is publishing its endpoint incorrectly, for example with the wrong SRM version. The FTS servers then do not know how to handle the transfers, which results in the error you see. Ask the destination site to fix this. Once it is fixed, the source site should update the services.xml.
  • Detail log: Could not submit to FTS. What now?
    • Possibility 1: Transfer log:
---------- RAWOUTPUT ---------- 
delegation: glite_delegation_delegate: Failed proxy validation - it is not yet valid. 
---------- JOB-LOG ----------
1213190778 created... 
backend: PHEDEX::Transfer::FTS 
glite-transfer-submit -s https://fts.pic.es:8443/glite-data-transfer-fts/services/FileTransfer -f /home/phedex/state/Debug/incoming/download/work/job.1212414940.412/copyjob

close Submit: id=undefined
    • This means that the destination site is not submitting transfers in the correct way: they try to submit with proxy delegation. They should submit with glite-transfer-submit using a password instead (compare the command under Possibility 2, which passes the -m and -p options). Tell this to the destination site, they should fix it.
    • Possibility 2: Transfer log:
---------- RAWOUTPUT ---------- 
Failed to determine the interface version of the service: getInterfaceVersion: SOAP fault: SOAP-ENV:Client - CGSI-gSOAP: Could NOT load client credentials 
GSS Major Status: General failure 

GSS Minor Status Error Chain: 
globus_gsi_gssapi: Error with GSI credential 
globus_gsi_gssapi: Error with gss credential handle 
globus_credential: Error with credential: The proxy credential: /scratch/phedex/phedex/gridcert/proxy.cert 
with subject: /C=BE/O=BEGRID/OU=VUB/OU=IIHE/CN=Stijn De Weirdt/CN=proxy 
expired 1305 minutes ago. 

---------- JOB-LOG ---------- 
1213860441 created... 
backend: PHEDEX::Transfer::FTS 
glite-transfer-submit -s https://lcgfts.gridpp.rl.ac.uk:8443/glite-data-transfer-fts/services/FileTransfer -m myproxy-fts.cern.ch -p _censored_ -f /scratch/phedex/phedex/Debug_T2_BE_IIHE/state/download-fts-pool2/work/job.1213712760.697/copyjob 

close Submit: id=undefined
    • As you can read, the proxy has expired, so the site admin needs to renew it.
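
The two "Could not submit to FTS" cases above can be told apart by looking for their characteristic messages in the transfer log. The sketch below is only an illustration of that idea: the function name and the returned labels are hypothetical, and it recognises only the two message patterns quoted on this page.

```python
import re


def diagnose_fts_submit_failure(log_text):
    """Return (kind, minutes_expired) for a 'Could not submit to FTS' transfer log."""
    # Possibility 1: submission with proxy delegation.
    if "Failed proxy validation" in log_text:
        return ("delegation", None)
    # Possibility 2: the proxy credential has expired.
    match = re.search(r"expired\s+(\d+)\s+minutes\s+ago", log_text)
    if match:
        return ("proxy_expired", int(match.group(1)))
    return ("unknown", None)


# Lines quoted from the two logs above.
log1 = "delegation: glite_delegation_delegate: Failed proxy validation - it is not yet valid."
log2 = (
    "globus_credential: Error with credential: The proxy credential: "
    "/scratch/phedex/phedex/gridcert/proxy.cert\n"
    "expired 1305 minutes ago.\n"
)
print(diagnose_fts_submit_failure(log1))
print(diagnose_fts_submit_failure(log2))
```

Any log that matches neither pattern is reported as "unknown" and still needs to be read by hand and added to the Questions&Answers.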


