glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

glideinWMS Training @ UCSD

glideinWMS Frontend
Monitoring
by Igor Sfiligoi (UCSD)

UCSD Jan 18th 2012 Frontend Monitoring 1

Overview

● Refresher
● What is available
● What to look for


Refresher – glideinWMS
● A glidein is just a properly configured Condor
execution node submitted as a Grid job
● Frontend drives submission

[Diagram: Frontend node (Frontend: monitors the Condor Submit node and Central manager, requests glideins), Factory node (Factory: submits glideins via Globus or CREAM), and Worker nodes (glidein: configures a Condor Startd execution node, which is then matched to jobs).]

Reminder

Condor is king!
(glideinWMS just a small layer on top)


Refresher – Frontend arch
● Many Groups
● With a “Master” Frontend as an aggregator

[Diagram: Frontend node with the Frontend, which spawns the Groups and feeds a Web server; around it, the Submit nodes, the Central manager, and the Factories with their Entries and glideins.]


Available monitoring
● Condor monitoring
● It is just a Condor pool! (even if a dynamic one)
● Any Condor monitoring tools will work
● VO Frontend monitoring
● The VO Frontend provides some basic Condor monitoring
● Plus the monitoring of its own internal workings
● Glidein Factory monitoring
● You should not need to use it, but it is publicly accessible


Condor monitoring


Condor Monitoring
● Out of the box you get
● Command line tools
● Log parsing

● Several external tools available, e.g.
● CondorView: Condor external package
● CycleServer: commercial tool, (semi-)free for Academia
● Your portal may provide additional monitoring, too


Glidein monitoring
● The glideins will register with the Collector
● Condor command to monitor them
condor_status
● -constraint - To select a subset of them (same syntax as Requirements)
● -total - For a quick summary
● Output formatting options
● No arguments - In use/unused
● -long - Full ClassAds
● -format - Select attributes only
● -xml - XML formatting (easier to machine parse)
http://www.cs.wisc.edu/condor/manual/v7.6/condor_status.html


Example

$ condor_status
Name               OpSys Arch   State     Activity LoadAv Mem   ActvtyTime
glidein_17848@alic LINUX X86_64 Claimed   Busy      7.440 18037 0+01:06:06
glidein_17825@wn89 LINUX X86_64 Unclaimed Idle     11.990 16056 0+00:15:12
glidein_10082@wn91 LINUX X86_64 Claimed   Idle      7.000 16056 0+00:02:46
…
glidein_3964@wp-05 LINUX X86_64 Claimed   Busy     24.000 64464 0+16:00:29
glidein_5861@wp-05 LINUX X86_64 Claimed   Retiring 22.140 64464 0+00:23:18
             Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 23249     0   22697       552       0          0        0
       Total 23249     0   22697       552       0          0        0


Another example
$ condor_status -format '%-50s ' Name -format '%6i\n' GLIDEIN_Max_Walltime \
    -const "GLIDEIN_Max_Walltime>83000"
glidein_10001@we017.grid.hep.ph.ic.ac.uk            86040
glidein_10006@rossmann-a292.rcac.purdue.edu        114840
...
glidein_9990@lxbra6310.cern.ch                     114840
glidein_9993@grid191.lal.in2p3.fr                  114840
$ condor_status -format '%-50s ' Name -format '%6i\n' GLIDEIN_Max_Walltime -xml
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
<c>
<a n="MyType"><s>Machine</s></a>
<a n="TargetType"><s>Job</s></a>
<a n="Name"><s>glidein_10001@we017.grid.hep.ph.ic.ac.uk</s></a>
<a n="GLIDEIN_Max_Walltime"><i>86040</i></a>
<a n="CurrentTime"><e>time()</e></a>
</c>
...
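
Since the -xml output is meant for machine parsing, a short script can turn it into plain dictionaries. The following is a minimal sketch, assuming the output has the structure shown in the example above (one <c> element per ClassAd, one <a n="..."> element per attribute, with a single typed child such as <s>, <i> or <e>). The function name parse_classads_xml and the embedded sample are illustrative, not part of Condor or glideinWMS.

```python
import xml.etree.ElementTree as ET

# Sample copied from the slide; in practice this would be the captured
# output of "condor_status ... -xml".
SAMPLE_XML = """<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
<c>
<a n="MyType"><s>Machine</s></a>
<a n="TargetType"><s>Job</s></a>
<a n="Name"><s>glidein_10001@we017.grid.hep.ph.ic.ac.uk</s></a>
<a n="GLIDEIN_Max_Walltime"><i>86040</i></a>
<a n="CurrentTime"><e>time()</e></a>
</c>
</classads>"""


def parse_classads_xml(text):
    """Return one dict per <c> element, keyed by attribute name."""
    ads = []
    for c in ET.fromstring(text).iter("c"):
        ad = {}
        for a in c.findall("a"):
            if len(a) == 0:
                continue  # attribute without a typed child; skip it
            value = a[0]
            if value.tag == "i":
                # integers are converted; everything else is kept as text
                ad[a.get("n")] = int(value.text)
            else:
                ad[a.get("n")] = value.text
        ads.append(ad)
    return ads


ads = parse_classads_xml(SAMPLE_XML)
for ad in ads:
    print(ad["Name"], ad["GLIDEIN_Max_Walltime"])
```

Expressions (<e>) are kept as unevaluated strings here; evaluating them would need a real ClassAd library.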


Collector log(s)
The place to look when things seem fishy!
● The Collector(s) will log any errors
● The interesting errors will likely be in the leaves of the Collector tree
~condor/glidecondor/condor_local/log/CondorXXXLog
● Logs rotate, so be sure to look in .old as well
● You also get the glidein authentication logs (yes, you will have 100s of them!)
● And log verbosity can be further increased with COLLECTOR_DEBUG
http://www.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:SubsysDebug


Example

01/13/12 17:24:13 ZKM: 1: attempting to map '/DC=org/DC=doegrids/OU=Services/CN=
uscmspilot47/glidein-1.t2.ucsd.edu'
01/13/12 17:24:13 ZKM: 2: mapret: 0 included_voms: 0 canonical_user: glidein47
01/13/12 17:24:13 ZKM: successful mapping to glidein47
...
01/13/12 17:24:19 condor_read() failed: recv() returned -1, errno = 104 Connection reset
by peer, reading 4 bytes from <130.104.133.245:7812>.
01/13/12 17:24:19 DaemonCore: Can't receive command request from 130.104.133.245
(perhaps a timeout?)
...
01/13/12 17:24:41 CCB: rejecting request from SHADOW <169.228.130.26:9615?sock=1
0716_242d_201658> on <169.228.130.26:21142> for ccbid 14018 because no daemon is
currently registered with that id (perhaps it recently disconnected).


Job monitoring
● You can monitor local jobs
● For jobs still in the queue (still waiting or running)
condor_q
● For finished jobs (limited number of jobs preserved)
condor_history
● Similar command-line arguments to condor_status
● Remote condor_q possible with
-name

http://www.cs.wisc.edu/condor/manual/v7.6/condor_q.html
http://www.cs.wisc.edu/condor/manual/v7.6/condor_history.html


Example

$ condor_q

-- Submitter: my.node : <192.168.130.11:9615?sock=9763_cd4c_2> : my.node
 ID          OWNER      SUBMITTED     RUN_TIME ST PRI   SIZE CMD
367788.0     uscms2330  1/6  11:03  0+00:00:00 I  0      0.0 CMSSW.sh 1 1
367788.1     uscms2330  1/6  11:03  0+00:00:00 I  0      0.0 CMSSW.sh 2 1
383995.19    uscms1789  1/11 02:26  2+13:35:38 R  0   1953.1 CMSSW.sh 1118 4
383995.179   uscms1789  1/11 02:26  2+11:29:06 R  0   1464.8 CMSSW.sh 1310 4
383999.32    uscms1789  1/11 02:31  2+09:36:12 R  0   1953.1 CMSSW.sh 299 4
383999.46    uscms1789  1/11 02:31  2+11:00:25 R  0   1953.1 CMSSW.sh 316 4
…
385002.7     uscms3015  1/13 17:31  0+00:01:51 R  0      0.0 CMSSW.sh 70 2
385002.8     uscms3015  1/13 17:31  0+00:01:49 R  0      0.0 CMSSW.sh 89 2
385002.9     uscms3015  1/13 17:31  0+00:01:29 R  0      0.0 CMSSW.sh 91 2
385002.10    uscms3015  1/13 17:31  0+00:01:00 R  0      0.0 CMSSW.sh 97 2
58707 jobs; 39484 idle, 11694 running, 7529 held
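
For periodic checks, the summary line at the end of the condor_q output is often all that is needed. Here is a small sketch that extracts the counters from it. It assumes the summary has exactly the wording shown in the example above; other Condor versions may print a different summary, and the function name parse_condor_q_summary is made up for this illustration.

```python
import re

# Matches the last line of the condor_q output shown on the slide.
SUMMARY_RE = re.compile(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def parse_condor_q_summary(output):
    """Return the job counters from a condor_q summary line, or None."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    total, idle, running, held = (int(g) for g in match.groups())
    return {"total": total, "idle": idle, "running": running, "held": held}


summary = parse_condor_q_summary(
    "58707 jobs; 39484 idle, 11694 running, 7529 held"
)
held_fraction = summary["held"] / summary["total"]
print(summary, round(held_fraction, 3))
```

A held fraction that keeps growing is usually worth a closer look with condor_q -long on a few of the held jobs.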


Job logs
● Users are encouraged to have a log for jobs
● Provides an easy way to monitor the progress without calling condor_q/condor_history
000 (001.000.000) 12/15 12:28:05 Job submitted from host: <127.0.0.1:43569>
...
001 (001.000.000) 12/16 08:30:02 Job executing on host: <169.228.130.213:36422>
...
005 (001.000.000) 12/16 13:30:32 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 01:00:00, Sys 0 00:05:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 01:00:00, Sys 0 00:05:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    217 - Run Bytes Sent By Job
    76 - Run Bytes Received By Job
    217 - Total Bytes Sent By Job
    76 - Total Bytes Received By Job
...
Note: the "..." lines are literally in the log; they separate the events.
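
The event headers in a job log are regular enough to follow a job's progress with a few lines of code. The sketch below assumes the header layout seen in the example above (a three-digit event code, the job ID in parentheses, date, time, and a message) and simply ignores every other line, including the "..." separators. The helper name parse_events is hypothetical.

```python
import re

# Event header: code, (cluster.proc.subproc), date and time, message.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)

SAMPLE_LOG = """\
000 (001.000.000) 12/15 12:28:05 Job submitted from host: <127.0.0.1:43569>
...
001 (001.000.000) 12/16 08:30:02 Job executing on host: <169.228.130.213:36422>
...
005 (001.000.000) 12/16 13:30:32 Job terminated.
    (1) Normal termination (return value 0)
...
"""


def parse_events(text):
    """Return (code, job_id, timestamp, message) for every event header."""
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            code, cluster, proc, when, message = m.groups()
            events.append((code, f"{int(cluster)}.{int(proc)}", when, message))
    return events


events = parse_events(SAMPLE_LOG)
for code, job_id, when, message in events:
    print(job_id, when, code, message)
```

Only the three event codes visible in the example (000, 001, 005) are exercised here; a real log contains other codes as well, which this sketch would report in the same way.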


Condor Daemon logs
● By default
● Schedd writes a log
/opt/glidecondor/condor_local/log/ScheddLog
● Shadows share a common log
/opt/glidecondor/condor_local/log/ShadowLog
● The logs rotate, look for .old files as well
● Lots of interesting info in them
● Quite high verbosity by default


ScheddLog Example
01/12/12 20:38:37 (pid:32035) ZKM: 2: mapret: 0 included_voms: 0 canonical_user: cfeng
01/12/12 20:38:37 (pid:32035) ZKM: successful mapping to cfeng
01/12/12 20:38:37 (pid:28485) GET_JOB_CONNECT_INFO failed: No such job: 157170.4
01/12/12 20:39:27 (pid:32035) ZKM: 1: attempting to map '/DC=org/DC=doegrids/OU=Services/
CN=rokpilot01/osg.ctbp.ucsd.edu'
01/12/12 20:39:27 (pid:32035) ZKM: 2: mapret: 0 included_voms: 0 canonical_user: pilot
01/12/12 20:39:27 (pid:32035) ZKM: successful mapping to pilot
01/12/12 20:39:32 (pid:32035) Shadow pid 11058 for job 157170.82 exited with status 100
...
01/13/12 18:05:02 (pid:32035) Activity on stashed negotiator socket: <169.228.40.37:60824>
01/13/12 18:05:02 (pid:32035) Using negotiation protocol: NEGOTIATE
01/13/12 18:05:02 (pid:32035) Negotiating for owner: cfeng@osg.ctbp.ucsd.edu
01/13/12 18:05:02 (pid:32035) Finished negotiating for cfeng in local pool: 4 matched, 1 rejected
01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd glidein_14642@
cabinet-0-0-30.t2.ucsd.edu <169.228.131.224:55393?CCBID=169.228.40.37:9673#57&noUDP> for cfeng
01/13/12 18:05:04 (pid:32035) Starting add_shadow_birthdate(157177.138)
01/13/12 18:05:04 (pid:32035) Started shadow for job 157177.138 on glidein_14642@
cabinet-0-0-30.t2.ucsd.edu <169.228.131.224:55393?CCBID=169.228.40.37:9673#57&noUDP> for cfeng,
(shadow pid = 5238)


ShadowLog Example

01/12/12 21:52:36 DaemonCore: command socket at <169.228.40.37:47586?noUDP>
01/12/12 21:52:36 DaemonCore: private command socket at <169.228.40.37:47586>
01/12/12 21:52:36 Setting maximum accepts per cycle 4.
01/12/12 21:52:36 Initializing a VANILLA shadow for job 157171.108
01/12/12 21:52:36 (157171.97) (32318): Request to run on
glidein_23261@cabinet-2-2-26.t2.ucsd.edu <169.228.131.154:48495?
CCBID=169.228.40.37:9644#87&noUDP> was ACCEPTED
01/12/12 21:52:36 (157170.39) (10937): Job 157170.39 terminated:
exited with status 0
01/12/12 21:52:36 (157170.39) (10937): **** condor_shadow (condor_SHADOW)
pid 10937 EXITING WITH STATUS 100
…
01/13/12 18:01:08 (157177.52) (4768): DoUpload: (Condor error code 12, subcode 28)
SHADOW at 169.228.40.37 failed to send file(s) to <169.228.131.178:47976>;
STARTER at 169.228.131.178 failed to write to file /data6/condor_local/execute/
dir_13776/glide_J13834/execute/dir_19620/griddatasourceB.tgz:
(errno 28) No space left on device


Submitter ClassAds
● The schedd will advertise two types of
ClassAds to the Collector
● Schedd daemon ClassAds
condor_status -schedd
● Per-user ClassAds
condor_status -submitter
● Can be useful for getting a summary view
of the system


Example

$ condor_status -schedd
Name                 Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs
cmsfnal01.fnal.gov   cmsfnal01.                0             0             0
glidein-2.t2.ucsd.ed glidein-2.            10932         38480          7607
submit-2.t2.ucsd.edu submit-2.t            11103          8955          1667
vocms120.cern.ch     vocms120.c                0          4024             2
                                TotalRunningJobs TotalIdleJobs TotalHeldJobs

               Total                       22035         51459          9276
$ condor_status -schedd -l submit-2.t2.ucsd.edu
Name = "submit-2.t2.ucsd.edu"
MaxJobsRunning = 20000
TotalHeldJobs = 1667
TotalIdleJobs = 9347
…
TotalJobAds = 22096
TransferQueueDownloadWaitTime = 0
MyType = "Scheduler"


Example
$ condor_status -submitter
Name                 Machine    Running IdleJobs HeldJobs
uscms1789@glidein-2. glidein-2.     344        0       20
…
uscms742@submit-2.t2 submit-2.t     405        0        0
cms1279@vocms120.cer vocms120.c       0     4000        0
                     RunningJobs IdleJobs HeldJobs
uscms019@submit-2.t2          11        0        1
uscms1537@glidein-2.           0        0        1
uscms1811@glidein-2.         176     1141        0
uscms1811@submit-2.t         177     3324        0
…
uscms742@glidein-2.t        3107      289       41
               Total       22092    51518     9280


Negotiator Monitoring
● To check user priorities, use
condor_userprio
● -allusers - Without it, only running users are shown
● -all - Provides detailed info
● The Negotiator log is useful for troubleshooting
~/glidecondor/condor_local/log/NegotiatorLog
● Look for errors and monitor the cycle times
● Negotiator also advertises a ClassAd
● Use condor_status -negotiator -long


Example 1/2
$ condor_userprio -all -allusers
Last Priority Update: 1/13 18:33
                               Effective     Real     Priority  Res ...
User Name                       Priority Priority       Factor Used ...
------------------------------ --------- -------- ------------ ---- ...
cmspa0029@submit-2.t2.ucsd.edu    158.01    15.80        10.00    0 ...
cmspa0029@glidein-2.t2.ucsd.ed    205.37    20.54        10.00    0 ...
uscms506@glidein-2.t2.ucsd.edu    559.11     0.56      1000.00    0 ...
uscms2450@glidein-2.t2.ucsd.ed    576.15     0.58      1000.00    0 ...
uscms3501@submit-2.t2.ucsd.edu    775.26     0.78      1000.00    0 ...
shi034@glidein-2.t2.ucsd.edu      827.95     0.83      1000.00    0 ...


Example 2/2
                               …  Total Usage            Usage             Last
User Name                      … (wghted-hrs)       Start Time       Usage Time
------------------------------ …  ----------- ---------------- ----------------
cmspa0029@submit-2.t2.ucsd.edu …     82863.87 10/03/2011 01:41  1/11/2012 07:05
cmspa0029@glidein-2.t2.ucsd.ed …    202430.74 10/31/2011 01:30  1/12/2012 02:00
uscms506@glidein-2.t2.ucsd.edu …    437667.09  7/02/2011 08:06  1/08/2012 07:29
uscms2450@glidein-2.t2.ucsd.ed …     47024.87 10/09/2011 13:26  1/07/2012 01:28
uscms3501@submit-2.t2.ucsd.edu …      3677.14 11/23/2011 08:12  1/10/2012 01:02
shi034@glidein-2.t2.ucsd.edu   …   1309024.85  6/03/2009 00:48  1/07/2012 15:57


NegotiatorLog Example
01/13/12 18:23:05 ---------- Finished Negotiation Cycle ----------
01/13/12 18:24:09 ---------- Started Negotiation Cycle ----------
01/13/12 18:24:09 Phase 1: Obtaining ads from collector ...
01/13/12 18:24:09 Getting all public ads ...
01/13/12 18:24:44 Sorting 23021 ads ...
01/13/12 18:24:46 Getting startd private ads ...
01/13/12 18:24:51 Got ads: 23021 public and 22571 private
01/13/12 18:24:51 Public ads include 38 submitter, 22568 startd
01/13/12 18:24:51 Phase 2: Performing accounting ...
01/13/12 18:25:01 Phase 3: Sorting submitter ads by priority ...
01/13/12 18:25:01 Phase 4.1: Negotiating with schedds ...
01/13/12 18:25:01 Negotiating with sfiligoi@submit-2.t2.ucsd.edu at
<169.228.130.26:9615?sock=10263_1229_2>
01/13/12 18:25:01 0 seconds so far
01/13/12 18:25:02 Request 345869.00000:
01/13/12 18:25:02 Rejected 345869.0 sfiligoi@submit-2.t2.ucsd.edu
<169.228.130.26:9615?sock=10263_1229_2>: no match found
01/13/12 18:25:02 Got NO_MORE_JOBS; done negotiating
…
01/13/12 18:25:06 Request 384970.00170:
01/13/12 18:25:06 Matched 384970.170 uscms2535@glidein-2.t2.ucsd.edu
<169.228.130.11:9615?sock=9763_cd4c_2> preempting none <192.168.3.77:55906?
CCBID=169.228.130.23:9823#13833&noUDP> glidein_15335@lnxfarm177.colorado.edu
01/13/12 18:25:06 Successfully matched with
glidein_15335@lnxfarm177.colorado.edu
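
Monitoring cycle times comes down to pairing the "Started" and "Finished" markers in the NegotiatorLog. The following sketch does just that, assuming the marker lines and the MM/DD/YY timestamp format look like the excerpt above. The function name cycle_durations is hypothetical, and the last line of the sample is invented for the illustration, since the excerpt does not show the end of the cycle.

```python
import re
from datetime import datetime

CYCLE_RE = re.compile(
    r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) -+ (Started|Finished) Negotiation Cycle -+$"
)


def cycle_durations(lines):
    """Return the length in seconds of every complete negotiation cycle."""
    durations = []
    started = None
    for line in lines:
        m = CYCLE_RE.match(line.strip())
        if not m:
            continue
        when = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        if m.group(2) == "Started":
            started = when
        elif started is not None:
            durations.append((when - started).total_seconds())
            started = None
    return durations


sample = [
    "01/13/12 18:23:05 ---------- Finished Negotiation Cycle ----------",
    "01/13/12 18:24:09 ---------- Started Negotiation Cycle ----------",
    "01/13/12 18:24:09 Phase 1: Obtaining ads from collector ...",
    # Hypothetical line, not taken from the slide:
    "01/13/12 18:26:30 ---------- Finished Negotiation Cycle ----------",
]
durations = cycle_durations(sample)
print(durations)
```

A "Finished" marker without a preceding "Started" (as in the first sample line, or after log rotation) is skipped rather than guessed at.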

CycleServer Screenshots
● Can do more than just monitoring
● But the rest is beyond the scope of this talk


Frontend Monitoring


Frontend monitoring
● Helper cmdline tool
● Plus, each Group provides:
● Activity/Error logs
● RRD files with statistics (running, held, etc.)
● XML files with current snapshot
● Resource ClassAds
● Master frontend aggregates RRD and XML files, and writes them in its own area
● Human readable/viewable Web pages available

[Diagram: Frontend node, where the Frontend spawns the Groups.]


Helper cmdline tool
● Wrapper around Condor's condor_status
glideinWMS/tools/glidein_status.py
● Provides useful formatting
~/glideinWMS/tools$ ./glidein_status.py

Name                                 Site    Factory     Entry               State   Activit

glidein_6682@alicegrid26.ba.infn.it  Bari    v1_0@OSGGOC CMS_T2_IT_Bari_ce01 Claimed Busy
glidein_10678@alicegrid32.ba.infn.it Bari    v1_0@OSGGOC CMS_T2_IT_Bari_ce01 Claimed Busy
…
glidein_5861@wp-05-12.pn.pd.infn.it  Legnaro v1_0@OSGGOC CMS_T2_IT_Legnaro_. Claimed Retirin

                                        Total Owner Claimed/Busy Claimed/Retiring Claimed/Other Unclaimed M

CMS_T2_US_Nebraska_Red_gw2@v1_0@OSGGOC     11     0           11                0             0         0
CMS_T2_US_Purdue_hansen@v1_0@OSGGOC       522     0          517                0             0         5
CMS_T2_US_Purdue_osg@v1_0@OSGGOC         1201     0         1182               14             0         5
…
CMS_T2_US_UCSD_gw4@Production_v4_2@UCSD   135     0          132                0             0         3
Total                                   21474     0        19742             1264             0       468


Log files
● Each Frontend group provides 3 types of logs
log/group_XXX/frontend.date.type.log
● info - Progress and warnings
● err - One line warnings
● debug - Multi line error messages
● The master frontend has similar logs
log/frontend/frontend.date.type.log
● But rarely anything interesting there


Example Info Log
:01-07:00 15037] Iteration at Tue Nov 15 10:44:01 2011
:01-07:00 15037] Query condor
:01-07:00 15037] Child processes created
:05-07:00 31633] WARNING: Failed to talk to schedd submit-1.t2.ucsd.edu. See debug log for more details.
:05-07:00 15037] All children terminated
:05-07:00 15037] Jobs found total 4836 idle 1732 (old 1732, voms 1703) running 3104
:05-07:00 15037] Glideins found total 639 idle 8 running 630 limit 800 curb 600
:05-07:00 15037] Using 1 proxies
:05-07:00 15037] Match
:05-07:00 15037] Counting
:05-07:00 15037] Child processes created
:06-07:00 15037] All children terminated
:06-07:00 15037] Total matching idle 1732 (old 1703) running 3104
:06-07:00 15037] Jobs in schedd queues | Glideins | Request
:06-07:00 15037] Idle (match eff old uniq ) Run ( here max ) | Total Idle Run | Idle MaxRun Down Factory
:06-07:00 15037] 171( 1705 170 169 0) 3104( 102 250) | 105 1 103 | 10 3276 Up CMS_T2_US_Nebraska_Red@Produ
:06-07:00 15037] 171( 1705 167 169 0) 3104( 187 250) | 197 4 193 | 10 3276 Up CMS_T2_US_Nebraska_Red_gw1@P
:06-07:00 15037] 171( 1705 171 169 0) 3104( 0 250) | 0 0 0 | 10 3276 Down CMS_T2_US_Nebraska_Red_gw2@P
:06-07:00 15037] 171( 1705 171 169 0) 3104( 62 250) | 62 0 62 | 10 3276 Up CMS_T2_US_Wisconsin_cms01@Pr
:06-07:00 15037] 171( 1705 171 169 0) 3104( 71 250) | 71 0 71 | 10 3276 Up CMS_T2_US_Wisconsin_cms02@Pr
:06-07:00 15037] 171( 1705 169 169 0) 3104( 88 250) | 96 2 94 | 10 3276 Up CMS_T2_US_Nebraska_Red@v1_0@
:06-07:00 15037] 171( 1705 171 169 0) 3104( 1 250) | 1 0 1 | 10 3276 Up CMS_T2_US_Nebraska_Red_gw1@v
:06-07:00 15037] 171( 1705 171 169 0) 3104( 0 250) | 0 0 0 | 10 3276 Down CMS_T2_US_Nebraska_Red_gw2@v
:06-07:00 15037] 171( 1705 171 169 0) 3104( 45 250) | 45 0 45 | 10 3276 Up CMS_T2_US_Wisconsin_cms01@v1
:06-07:00 15037] 171( 1705 170 169 0) 3104( 60 250) | 62 1 61 | 10 3276 Up CMS_T2_US_Wisconsin_cms02@v1
:06-07:00 15037] Jobs in schedd queues | Glideins | Request
:06-07:00 15037] Idle (match eff old uniq ) Run ( here max ) | Total Idle Run | Idle MaxRun Down Factory
:06-07:00 15037] 1368(13640 1360 1352 0) 24832( 616 2000) | 639 8 630 | 80 26208 Up Sum of useful factories
:06-07:00 15037] 342( 3410 342 338 0) 6208( 0 500) | 0 0 0 | 20 6552 Down Sum of down factories
:06-07:00 15037] 27( 27 27 14 27) 0( 0 0) | 0 0 0 | 0 0 Down Unmatched
:06-07:00 15037] Advertizing 10 requests
:07-07:00 15037] Done advertizing
:07-07:00 15037] Advertising 10 glideresource classads to the user pool
:07-07:00 15037] Done advertising glideresource classads
:07-07:00 15037] Writing stats
:07-07:00 15037] Sleep


Example log files

frontend.20120113.err.log
[2012-01-13T18:50:47-07:00 16444] Advertizing failed for 2 requests. See debug log for more details.
[2012-01-13T18:52:59-07:00 19312] Failed to talk to schedd. See debug log for more details.

frontend.20120113.debug.log
[2012-01-13T18:50:47-07:00 16444] Advertizing failed: Error running '/home/frontend/glidecondor/sbin/condor_advertise
-pool glidein-1.t2.ucsd.edu -tcp -multiple UPDATE_MASTER_AD /tmp/gfi_aw_276509240_16444_2'
code 1:failed to send classad to <169.228.130.10:9618>
failed to send classad to <169.228.130.10:9618>
[2012-01-13T18:52:59-07:00 19312] Failed to talk to schedd: Schedd 'vocms120.cern.ch' not found


Web pages 1/3
frontendStatus.html

Historical overview

Fully dynamic,
allows for zooming
and selecting of
elements to plot

Default shows everything,
but can restrict to a group
and/or a Factory


Web pages 2/3

frontendGroupGraphStatusNow.html

Current snapshot in tabular form

Useful for spotting problems


Web pages 3/3
frontendGroupGraphStatusNow.html

Also contains pie charts with the same info


RRDs and XML files
● The Web pages are just renderings of the RRD
and XML files
● Raw data loaded in the browser and rendered
● No server side code
● Other tools could use those data
● Publicly available, if one knows the URL
● No user-identifying data, only summary stats


Resource ClassAds
● Each Frontend Group advertises one ClassAd
for each Factory it is requesting glideins from
● Type glideresource
● They contain pretty much everything the
Frontend Group knows about the Factory:
● Factory attributes used for matchmaking
● Stats about the matching jobs
● What is being requested
● Even what the Factory is doing!


Example query
● Not a Condor native type, must use
● -any
● Then constrain the type
● Remotely queryable
$ condor_status -any -const 'MyType=="glideresource"' -format '%s\n' Name
CMS_T2_US_Caltech_cit2@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Caltech_cit@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Florida_iogw1@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Florida_pg@v1_0@OSGGOC@UCSD-v5_3.main
...
CMS_T2_US_UCSD_gw2@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Wisconsin_cms01@v1_0@OSGGOC@UCSD-v5_3.main


Example ClassAd
$ condor_status -any CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main -l
[Identification]
MyType = "glideresource"
Name = "CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main"
GlideClientName = "UCSD-v5_3.main"
...
[Info about local jobs]
GlideClientMonitorJobsIdle = 210.000000
GlideClientMonitorJobsRunningHere = 213
...
[What is being requested]
GlideClientMonitorGlideinsRequestIdle = 50
GlideClientMonitorGlideinsRequestMaxRun = 445
...
[Factory attributes]
GLIDEIN_Site = "UCSD"
GLEXEC_BIN = "OSG"
...
[Info about registered glideins]
GlideClientMonitorGlideinsRunning = 215
GlideClientMonitorGlideinsTotal = 216
...
[Factory status]
GlideFactoryMonitorStatusRunning = 339
GlideFactoryMonitorStatusPending = 277
GlideFactoryMonitorStatusHeld = 0
...
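
The long ClassAd format is plain "Name = value" text, so a script can load it into a dictionary and compare, for example, what the Frontend sees with what the Factory reports. The sketch below assumes only simple values (quoted strings and numbers), as in the example above; anything else is kept as raw text. The helper name parse_long_classad is hypothetical.

```python
def parse_long_classad(text):
    """Parse 'Name = value' lines into a dict; other lines are ignored."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if not sep:
            continue  # not an attribute line (e.g. "..." or a blank line)
        name, value = name.strip(), value.strip()
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            ad[name] = value[1:-1]
            continue
        try:
            ad[name] = int(value)
        except ValueError:
            try:
                ad[name] = float(value)
            except ValueError:
                ad[name] = value  # expression or unknown type, keep as text
    return ad


sample = """\
MyType = "glideresource"
GlideClientMonitorJobsIdle = 210.000000
...
GlideClientMonitorGlideinsRunning = 215
GlideFactoryMonitorStatusRunning = 339
GlideFactoryMonitorStatusPending = 277
"""
ad = parse_long_classad(sample)
# Glideins the Factory reports as running vs. those seen by the Frontend.
not_yet_registered = (
    ad["GlideFactoryMonitorStatusRunning"] - ad["GlideClientMonitorGlideinsRunning"]
)
print(ad["MyType"], not_yet_registered)
```

The difference computed at the end is only an illustration of the kind of cross-check these ads make possible; how to interpret it is left to the operator.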

Currently more information than you get on the Web

OK, now you know
what's available.

What will you do
with all that information?
(i.e. What to look for)


Monitoring the health of the system
● Six major areas to look after; your goal is
● Few unclaimed glideins
(both globally, and per site)
● No unmatched jobs
● Reasonably low restart rate
(both global, and per site)
● Reasonably low job failure rate
(both global, and per site)
● Negotiation cycle reasonably short
● Schedd node not overloaded


Unclaimed glideins
● Frontend and Negotiator policies are
not identical
● You may end up with glideins that
never run any jobs
● The discrepancy can be big enough to be
noticed on a global scale
● But more often it is just for one (or few) sites
● Short spikes are not a problem
● But long periods are


How do you notice it?
● Historical Web monitoring
[Graphs: a "Bad" and a "Good" example of the historical plots]
● Ask for daily emails from the Factory
● Or write your own scripts
● No Frontend report generators in glideinWMS at this time
● Parse the RRDs
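
If you do write your own script, the key point is to ignore short spikes and report only long periods of unclaimed glideins. A minimal sketch of that idea follows. It works on a plain list of unclaimed fractions, assumed to have been extracted beforehand (for example from the RRDs) at regular intervals; the function name, the threshold, and the minimum run length are all arbitrary choices for this illustration.

```python
def sustained_high_periods(samples, threshold, min_samples):
    """Return (first_index, last_index) for every run of at least
    min_samples consecutive values above threshold."""
    periods = []
    run_start = None
    for i, value in enumerate(samples):
        if value > threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_samples:
                periods.append((run_start, i - 1))
            run_start = None
    # A run may still be open at the end of the series.
    if run_start is not None and len(samples) - run_start >= min_samples:
        periods.append((run_start, len(samples) - 1))
    return periods


# Hypothetical unclaimed fractions: one short spike, then a long bad period.
unclaimed = [0.02, 0.30, 0.03, 0.25, 0.28, 0.31, 0.27, 0.02]
periods = sustained_high_periods(unclaimed, threshold=0.2, min_samples=3)
print(periods)
```

The same function can be applied per site, since the discrepancy is more often limited to one or a few sites than visible globally.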

How do you find the root cause?
● Analyze the latest snapshots
● condor_status/glidein_status
● condor_q
● Frontend Web
● Limit the investigation to a few sites, if possible
● Then start comparing (can be daunting!)
● Job Requirements, with
● Glidein Start expressions
● In theory, there is “condor_q -ana”, but it is usually worthless


Unmatched jobs
● The other side of the problem
● Glideins are never requested for some jobs, so those jobs will never start!

● Two possible reasons
● Wrong Frontend matchmaking policy
● No available Factory entries to serve the job


How do you notice it?
● “Unmatched Factory” in Web monitoring


How do you find the root cause?
● Again, start with the latest snapshot
● condor_q
● condor_status -any -const 'MyType=="glideresource"'
● Get the (python) Match expression from XML
● Start comparing!
Can be daunting!
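
Since the Match expression is Python, the comparison can be partly automated. The sketch below is a simplified model, not the Frontend's own code: it assumes the expression sees a job dictionary and a glidein dictionary with the Factory attributes under "attrs", and the sample expression, jobs and entries are invented.

```python
def unmatched_jobs(match_expr, jobs, entries):
    """Return the jobs for which no Factory entry satisfies match_expr."""
    # eval() is acceptable here only because the expression is
    # admin-written configuration, never user input.
    code = compile(match_expr, "<match_expr>", "eval")
    unmatched = []
    for job in jobs:
        if not any(eval(code, {"job": job, "glidein": entry})
                   for entry in entries):
            unmatched.append(job)
    return unmatched


# Hypothetical match expression and data
MATCH_EXPR = 'glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(",")'
jobs = [
    {"ClusterId": 1, "DESIRED_Sites": "UCSD,SiteB"},
    {"ClusterId": 2, "DESIRED_Sites": "SiteC"},
]
entries = [{"attrs": {"GLIDEIN_Site": "UCSD"}}]

for job in unmatched_jobs(MATCH_EXPR, jobs, entries):
    print("No entry can serve cluster", job["ClusterId"])
```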


Restarted jobs
● Any restart == wasted CPU
● How do you notice it?
● condor_q is your friend here
condor_q -format '%i\n' NumJobStarts
No historical/Web monitoring provided
● Why does it happen?
● Glidein disappears!
● End of lifetime hit
● Preemption policies (not in the default config, but you may set Condor to do it)
● Submit node overload
Condor daemons do not like being resource constrained!
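
As no historical monitoring is provided, a small script can summarize the condor_q output. This sketch assumes the input is the text printed by condor_q -format '%i\n' NumJobStarts (one integer per line); the function name and the sample numbers are made up.

```python
def restart_summary(condor_q_output):
    """Summarize NumJobStarts values, one integer per line."""
    starts = [int(tok) for tok in condor_q_output.split()]
    started = [n for n in starts if n > 0]
    restarted = [n for n in started if n > 1]
    return {
        "jobs": len(starts),
        "started": len(started),
        "restarted": len(restarted),
        # Every start beyond the first one is wasted CPU
        "extra_starts": sum(n - 1 for n in restarted),
    }


# Hypothetical output for five jobs
summary = restart_summary("1\n1\n3\n0\n2\n")
print(summary)
```

Running it from cron and appending the result to a file would give a crude history of the restart rate.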

Why do glideins disappear?
● Three main reasons
● Remote node just died (rare)
● Site preemption policy
Some sites do this; nothing you can do.
Learn who they are and act accordingly.
● Glidein killed by Site because it exceeded slot limits
– Most likely Memory (GLIDEIN_MaxMemMBs, one of 2 limits the OSG factory advertises)
● Why can limits be exceeded?
● Job underestimated resource use
● Frontend matchmaking logic problem
Job told you it needed more resources than the limit!
● Wrong advertised limits
Factory problem!


Wallclock limits
● Main resource limit is time
● The glidein automatically deals with it
– Will go away before the deadline
– … killing/preempting any jobs if needed!
● Limit advertised as
– Factory: GLIDEIN_Max_Walltime (-Δ), in seconds
– Glidein: GLIDEIN_ToDie, as UNIX time
● Why may jobs reach the deadline?
● Like with all other resources
– Job underestimates the time it needs
– Frontend matchmaking logic problems
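
The deadline logic boils down to simple arithmetic on GLIDEIN_ToDie. A minimal sketch, assuming the job carries some estimate of its own runtime; expected_runtime_s and margin_s are hypothetical parameters, not Condor or glideinWMS attributes.

```python
import time


def fits_before_deadline(expected_runtime_s, glidein_to_die,
                         now=None, margin_s=0):
    """True if a job starting now should finish before GLIDEIN_ToDie.

    glidein_to_die is a UNIX timestamp; the other values are seconds.
    """
    if now is None:
        now = time.time()
    return now + expected_runtime_s + margin_s <= glidein_to_die


# A one-hour job on a glidein that dies at t=10000
print(fits_before_deadline(3600, 10000, now=5000))
# Same job, but the glidein dies at t=8000
print(fits_before_deadline(3600, 8000, now=5000))
```

If the job underestimates expected_runtime_s, this check passes and the job still gets killed at the deadline, which is the first failure mode listed above.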


Job failures
● Jobs can fail for many reasons
● You should monitor the ExitCode
condor_history -back -const 'JobStatus==5' -format '%i\n' ExitCode
● Knowing what users run is often needed to interpret
errors
● For common WN errors, the Frontend admin
should create an appropriate validation script
● So glideins fail, not user jobs
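
A histogram of exit codes is usually easier to read than the raw list. This sketch assumes the input is the text printed by a condor_history command like the one above (one ExitCode per line); treating any non-zero code as a failure is an assumption that may not hold for every user application.

```python
from collections import Counter


def exit_code_histogram(history_output):
    """Count how often each ExitCode appears (one integer per line)."""
    return Counter(int(tok) for tok in history_output.split())


def failure_rate(histogram):
    """Fraction of jobs with a non-zero exit code."""
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    return 1 - histogram[0] / total


# Hypothetical output for six jobs
hist = exit_code_histogram("0\n0\n1\n0\n137\n1\n")
for code, count in hist.most_common():
    print(f"ExitCode {code}: {count} jobs")
print("failure rate:", failure_rate(hist))
```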


Negotiation time
● The negotiation time should be << 5mins
● If much longer,
glideins may terminate without running any jobs
● Monitor the NegotiatorLog on CM
● Possible causes
● CPU starvation (e.g. other processes)
● Autocluster explosion
– Condor tries to be smart about Matchmaking
– But if users don't cooperate, cannot do much
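
Negotiation cycle length can be extracted from the NegotiatorLog with a short script. The sketch assumes each line starts with a "MM/DD HH:MM:SS" or "MM/DD/YY HH:MM:SS" timestamp and that cycles are bracketed by "Started Negotiation Cycle" and "Finished Negotiation Cycle" lines; check your own log before relying on it. The sample log is invented.

```python
import re
from datetime import datetime

STAMP = re.compile(r"^(\d\d/\d\d(?:/\d\d)?) (\d\d:\d\d:\d\d)")


def _parse_stamp(line):
    m = STAMP.match(line)
    if not m:
        return None
    date, clock = m.groups()
    if date.count("/") == 2:
        return datetime.strptime(f"{date} {clock}", "%m/%d/%y %H:%M:%S")
    # No year in the log: use a placeholder year, only differences matter
    return datetime.strptime(f"2000/{date} {clock}", "%Y/%m/%d %H:%M:%S")


def cycle_durations(log_text):
    """Return the length in seconds of each completed negotiation cycle."""
    durations = []
    started = None
    for line in log_text.splitlines():
        stamp = _parse_stamp(line)
        if stamp is None:
            continue
        if "Started Negotiation Cycle" in line:
            started = stamp
        elif "Finished Negotiation Cycle" in line and started is not None:
            durations.append((stamp - started).total_seconds())
            started = None
    return durations


SAMPLE_LOG = """01/18 10:00:00 ---------- Started Negotiation Cycle ----------
01/18 10:00:02 Phase 1:  Obtaining ads from collector ...
01/18 10:01:30 ---------- Finished Negotiation Cycle ----------
01/18 10:06:30 ---------- Started Negotiation Cycle ----------
01/18 10:14:00 ---------- Finished Negotiation Cycle ----------
"""

durations = cycle_durations(SAMPLE_LOG)
slow = [d for d in durations if d > 300]  # longer than 5 minutes
print(durations, slow)
```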


Autoclustering
● Condor Schedd will try to group jobs
(matchmaking is much faster if only a few groups exist)
● All “similar jobs” will be matched together!
● What does “similar” mean?
● Similar == Would result in the same match
● How is it implemented?
● Tuple of attributes considered during matchmaking
● E.g. (DESIRED_Sites,ImageSize)
● How can the number of autoclusters explode?
● If an attribute that changes a lot is added
Example of a really bad one: JobID
https://condor-wiki.cs.wisc.edu/index.cgi/attach_get/220/cs739.pdf
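
The effect of a badly chosen attribute is easy to demonstrate. This is a toy model of the grouping idea, not Condor's implementation: jobs are plain dictionaries and an autocluster is just the tuple of values for the attributes considered during matchmaking. The job data is invented.

```python
def count_autoclusters(jobs, significant_attrs):
    """Number of distinct value tuples over the significant attributes."""
    return len({tuple(job.get(attr) for attr in significant_attrs)
                for job in jobs})


# 1000 hypothetical jobs with only two different site lists
jobs = [
    {
        "JobID": i,
        "DESIRED_Sites": "UCSD" if i % 2 else "UCSD,SiteB",
        "ImageSize": 2000000,
    }
    for i in range(1000)
]

# Two groups: the Negotiator has to do the matchmaking work twice
print(count_autoclusters(jobs, ("DESIRED_Sites", "ImageSize")))
# One group per job once a unique attribute is part of the tuple
print(count_autoclusters(jobs, ("DESIRED_Sites", "ImageSize", "JobID")))
```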


Submit node health
● Condor is very sensitive to resource starvation
● If submit node overloaded, expect problems!
● How can we get to resource starvation?
● Poor planning
Trying to run 3k jobs on a 1G RAM node???
● Other processes
May steal CPU/RAM/IO from Condor

● Interactive activity particularly risky
● Due to its unpredictable nature
– Including user errors
● But portals are not immune to resource overuse


Summary
● You have plenty of Monitoring options
● Some prettier, some more powerful
● Most of the time, things just work
● So you don't need to constantly watch over your
installation
● But occasionally things will break
● It is in your interest to notice it
Or the users will tell you!
● Having good monitoring tools will help you there!


The End


Pointers
● The official glideinWMS project Web page is
http://tinyurl.com/glideinWMS
● glideinWMS development team is reachable at
glideinwms-support@fnal.gov
● The OSG glidein factory is reachable at
osg-gfactory-support@physics.ucsd.edu


Acknowledgments
● The glideinWMS is a CMS-led project
developed mostly at FNAL, with contributions
from UCSD and ISI
● The glideinWMS factory operations at UCSD are
sponsored by OSG
● The funding comes from NSF, DOE and the
UC system

