glideinWMS Training @ UCSD



                glideinWMS Frontend
                      Monitoring
                     by Igor Sfiligoi (UCSD)




UCSD Jan 18th 2012        Frontend Monitoring   1
Overview


                     ●   Refresher
                     ●   What is available
                     ●   What to look for




Refresher – glideinWMS
 ●   A glidein is just a properly configured Condor
     execution node submitted as a Grid job
      ●   Frontend drives submission

 [Diagram: glideinWMS components. Frontend node (Frontend); Submit nodes
 and Central manager (Condor); Factory node (Condor, Factory); Globus and
 CREAM; Worker node and Execution nodes (glidein, Startd, Job).
 Arrow labels: Monitor, Match, Request glideins, Submit glideins,
 Configure Condor G.N.]
Reminder




              Condor is king!
             (glideinWMS is just a small layer on top)




Refresher – Frontend arch
 ●   Many Groups
      ●   With a “Master” Frontend as an aggregator

 [Diagram: Frontend node architecture. Labels: Factory, Submit node,
 Central manager, Frontend node, Frontend, Spawn, Group, Entry, ...,
 Group, glidein, Web Server.]



Available monitoring
 ●   Condor monitoring
      ●   It is just a Condor pool! (Even if a dynamic one)
      ●   Any Condor monitoring tools will work
 ●   VO Frontend monitoring
      ●   The VO Frontend provides some basic
          Condor monitoring
      ●   Plus the monitoring of its own internal workings
 ●   Glidein Factory monitoring
      ●   You should not need to use it,
          but it is publicly accessible

Condor monitoring




Condor Monitoring
 ●   Out of the box you get
      ●   Command line tools
      ●   Log parsing

 ●   Several external tools available, e.g.
      ●   CondorView  - Condor external package
      ●   CycleServer - Commercial tool, (semi-)free for Academia
 ●   Your portal may provide additional monitoring, too


Glidein monitoring
 ●   The glideins will register with the Collector
 ●   Condor command to monitor them
     condor_status
      ● -constraint - To select a subset of them
                      (same syntax as Requirements)
      ● -total      - For a quick summary
 ●   Output formatting options
      ●   No arguments - In use/unused
      ●   -long        - Full ClassAds
      ●   -format      - Select attributes only
      ●   -xml         - XML formatting (easier to machine parse)
     http://www.cs.wisc.edu/condor/manual/v7.6/condor_status.html
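
The options above combine freely on one command line. A minimal Python sketch of assembling such a command line follows; the helper name `condor_status_args` and its parameters are hypothetical, and only the flags listed on this slide are used.

```python
def condor_status_args(constraint=None, attributes=(), xml=False,
                       total=False, long=False):
    """Build an argv list for condor_status (hypothetical helper).

    Only the flags named on the slide are used:
    -constraint, -total, -long, -format and -xml.
    """
    args = ["condor_status"]
    if constraint:
        args += ["-constraint", constraint]
    if total:
        args.append("-total")
    if long:
        args.append("-long")
    # Each attribute is a (printf-style format, attribute name) pair.
    for fmt, attr in attributes:
        args += ["-format", fmt, attr]
    if xml:
        args.append("-xml")
    return args


cmd = condor_status_args(
    constraint="GLIDEIN_Max_Walltime>83000",
    attributes=[("%-50s ", "Name"), ("%6i\\n", "GLIDEIN_Max_Walltime")],
)
print(cmd)
# On a node with Condor installed, the list could be passed to
# subprocess.check_output(cmd, text=True) to get the actual output.
```

Passing the arguments as a list avoids shell quoting problems with the constraint expression.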

Example

$ condor_status
Name               OpSys  Arch   State     Activity LoadAv Mem   ActvtyTime
glidein_17848@alic LINUX  X86_64 Claimed   Busy      7.440 18037 0+01:06:06
glidein_15842@alic LINUX  X86_64 Claimed   Busy      7.010 18037 0+00:35:21
glidein_18249@alic LINUX  X86_64 Claimed   Busy      7.510 18037 0+01:24:09
glidein_17825@wn89 LINUX  X86_64 Unclaimed Idle     11.990 16056 0+00:15:12
glidein_10082@wn91 LINUX  X86_64 Claimed   Idle      7.000 16056 0+00:02:46
…
glidein_3964@wp-05 LINUX  X86_64 Claimed   Busy     24.000 64464 0+16:00:29
glidein_5614@wp-05 LINUX  X86_64 Claimed   Busy     23.360 64464 0+16:12:56
glidein_5861@wp-05 LINUX  X86_64 Claimed   Retiring 22.140 64464 0+00:23:18
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX 23249     0   22697       552       0          0        0
               Total 23249     0   22697       552       0          0        0




Another example
$ condor_status -format '%-50s ' Name -format '%6i\n' GLIDEIN_Max_Walltime \
                -const "GLIDEIN_Max_Walltime>83000"
glidein_10001@we017.grid.hep.ph.ic.ac.uk              86040
glidein_10006@rossmann-a292.rcac.purdue.edu          114840
glidein_10007@we033.grid.hep.ph.ic.ac.uk              86040
...
glidein_9990@lxbra6310.cern.ch                       114840
glidein_9990@rossmann-a212.rcac.purdue.edu           114840
glidein_9993@grid191.lal.in2p3.fr                    114840
$ condor_status -format '%-50s ' Name -format '%6i\n' GLIDEIN_Max_Walltime -xml \
                -const "GLIDEIN_Max_Walltime>83000"
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
<c>
    <a n="MyType"><s>Machine</s></a>
    <a n="TargetType"><s>Job</s></a>
    <a n="Name"><s>glidein_10001@we017.grid.hep.ph.ic.ac.uk</s></a>
    <a n="GLIDEIN_Max_Walltime"><i>86040</i></a>
    <a n="CurrentTime"><e>time()</e></a>
</c>
...
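
The XML form is convenient for scripts. As a rough sketch, the following Python turns XML like the sample above into dictionaries. The function name `parse_classads` is hypothetical, the sample is a completed (closed) copy of the truncated output, and only the value tags that appear on the slide (`s`, `i`, `e`) are handled.

```python
import xml.etree.ElementTree as ET

# Completed copy of the slide's truncated XML output (closing tag added).
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
<c>
    <a n="MyType"><s>Machine</s></a>
    <a n="TargetType"><s>Job</s></a>
    <a n="Name"><s>glidein_10001@we017.grid.hep.ph.ic.ac.uk</s></a>
    <a n="GLIDEIN_Max_Walltime"><i>86040</i></a>
    <a n="CurrentTime"><e>time()</e></a>
</c>
</classads>
"""


def parse_classads(xml_text):
    """Return one dict per <c> element, keyed by attribute name."""
    ads = []
    for c in ET.fromstring(xml_text).iter("c"):
        ad = {}
        for a in c.findall("a"):
            value = a[0]  # the single child element carries the typed value
            if value.tag == "i":
                ad[a.get("n")] = int(value.text)
            else:
                # <s> strings and <e> expressions are kept as text
                ad[a.get("n")] = value.text
        ads.append(ad)
    return ads


ads = parse_classads(SAMPLE)
for ad in ads:
    print(ad["Name"], ad["GLIDEIN_Max_Walltime"])
```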

Collector log(s)
     (Place to look when things seem fishy!)
 ●   The Collector(s) will log any errors
      ●   The interesting errors will likely be in the leaves of
          the Collector tree
          ~condor/glidecondor/condor_local/log/CondorXXXLog
      ●   Logs rotate, so be sure to look in .old as well
          (Yes, you will have 100s of them!)
 ●   You also get the glidein
     authentication logs
      ●   And log verbosity can be further increased with
          COLLECTOR_DEBUG
     http://www.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:SubsysDebug


Example

01/13/12 17:24:13 ZKM: 1: attempting to map '/DC=org/DC=doegrids/OU=Services/CN=
uscmspilot47/glidein-1.t2.ucsd.edu'
01/13/12 17:24:13 ZKM: 2: mapret: 0 included_voms: 0 canonical_user: glidein47
01/13/12 17:24:13 ZKM: successful mapping to glidein47
...
01/13/12 17:24:19 condor_read() failed: recv() returned -1, errno = 104 Connection reset
by peer, reading 4 bytes from <130.104.133.245:7812>.
01/13/12 17:24:19 DaemonCore: Can't receive command request from 130.104.133.245
 (perhaps a timeout?)
...
01/13/12 17:24:41 CCB: rejecting request from SHADOW <169.228.130.26:9615?sock=1
0716_242d_201658> on <169.228.130.26:21142> for ccbid 14018 because no daemon is
 currently registered with that id (perhaps it recently disconnected).
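
With hundreds of Collector logs, counting message types is often quicker than reading them. The following is a minimal Python sketch, assuming each log entry sits on one line (the slide wraps them). The labels and the `summarize` helper are hypothetical, and the patterns cover only the three message kinds shown in the example.

```python
import re
from collections import Counter

# Patterns for the message kinds visible in the example log only.
PATTERNS = {
    "auth_ok": re.compile(r"ZKM: successful mapping to \S+"),
    "read_failed": re.compile(r"condor_read\(\) failed"),
    "ccb_rejected": re.compile(r"CCB: rejecting request"),
}

# Lines modeled on the example, joined onto single lines.
SAMPLE = [
    "01/13/12 17:24:13 ZKM: successful mapping to glidein47",
    "01/13/12 17:24:19 condor_read() failed: recv() returned -1, "
    "errno = 104 Connection reset by peer",
    "01/13/12 17:24:19 DaemonCore: Can't receive command request from "
    "130.104.133.245 (perhaps a timeout?)",
    "01/13/12 17:24:41 CCB: rejecting request from SHADOW "
    "<169.228.130.26:9615?sock=10716_242d_201658> on "
    "<169.228.130.26:21142> for ccbid 14018",
]


def summarize(lines):
    """Count how many lines match each known pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


counts = summarize(SAMPLE)
print(dict(counts))
```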




Job monitoring
 ●   You can monitor local jobs
      ●   For jobs still in the queue (still waiting or running)
          condor_q
      ●   For finished jobs (limited number of jobs preserved)
          condor_history
 ●   Similar cmdline args as condor_status
 ●   Remote condor_q possible with
     -name

                     http://www.cs.wisc.edu/condor/manual/v7.6/condor_q.html
                     http://www.cs.wisc.edu/condor/manual/v7.6/condor_history.html


Example

$ condor_q

-- Submitter: my.node : <192.168.130.11:9615?sock=9763_cd4c_2> : my.node
 ID          OWNER          SUBMITTED     RUN_TIME ST PRI SIZE   CMD
367788.0     uscms2330      1/6  11:03   0+00:00:00 I  0     0.0 CMSSW.sh 1 1
367788.1     uscms2330      1/6  11:03   0+00:00:00 I  0     0.0 CMSSW.sh 2 1
383995.19    uscms1789      1/11 02:26   2+13:35:38 R  0  1953.1 CMSSW.sh 1118 4
383995.179   uscms1789      1/11 02:26   2+11:29:06 R  0  1464.8 CMSSW.sh 1310 4
383999.32    uscms1789      1/11 02:31   2+09:36:12 R  0  1953.1 CMSSW.sh 299 4
383999.46    uscms1789      1/11 02:31   2+11:00:25 R  0  1953.1 CMSSW.sh 316 4
…
385002.7     uscms3015      1/13 17:31   0+00:01:51 R  0     0.0 CMSSW.sh 70 2
385002.8     uscms3015      1/13 17:31   0+00:01:49 R  0     0.0 CMSSW.sh 89 2
385002.9     uscms3015      1/13 17:31   0+00:01:29 R  0     0.0 CMSSW.sh 91 2
385002.10    uscms3015      1/13 17:31   0+00:01:00 R  0     0.0 CMSSW.sh 97 2

58707 jobs; 39484 idle, 11694 running, 7529 held
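
The last line of the condor_q output is a handy health indicator. A small Python sketch for extracting the counts follows, assuming the summary format shown in this example (other Condor versions may word it differently); `parse_summary` is a hypothetical helper.

```python
import re

# Matches the summary line format shown in the example above.
SUMMARY_RE = re.compile(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def parse_summary(line):
    """Return the job counts from a condor_q summary line, or None."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    total, idle, running, held = map(int, match.groups())
    return {"total": total, "idle": idle, "running": running, "held": held}


summary = parse_summary("58707 jobs; 39484 idle, 11694 running, 7529 held")
held_fraction = summary["held"] / summary["total"]
print(summary, round(held_fraction, 3))
```

A held fraction that keeps growing is usually worth a look at the hold reasons.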




Job logs
                 ●   Users are encouraged to have a log for jobs
                      ●   Provides easy way to monitor the progress without
                          calling condor_q/condor_history
                  000 (001.000.000) 12/15 12:28:05 Job submitted from host: <127.0.0.1:43569>
                  ...
                  001 (001.000.000) 12/16 08:30:02 Job executing on host: <169.228.130.213:36422>
                  ...
                  005 (001.000.000) 12/16 13:30:32 Job terminated.
Literally ...




                          (1) Normal termination (return value 0)
                                  Usr 0 01:00:00, Sys 0 00:05:00 - Run Remote Usage
                                  Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
                                  Usr 0 01:00:00, Sys 0 00:05:00 - Total Remote Usage
                                  Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
                          217 - Run Bytes Sent By Job
                          76 - Run Bytes Received By Job
                          217 - Total Bytes Sent By Job
                          76 - Total Bytes Received By Job
                  ...
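
Because each event line starts with an event code, a job ID and a timestamp, queue and run times can be derived from the log alone. A minimal Python sketch follows, using the three event codes shown above (000 submitted, 001 executing, 005 terminated). The log carries no year, so the `year` argument is an assumption supplied by the caller, and `parse_events` is a hypothetical helper.

```python
import re
from datetime import datetime

# event code, (cluster.proc.subproc), MM/DD HH:MM:SS
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\d\d/\d\d \d\d:\d\d:\d\d) ")

SAMPLE = """\
000 (001.000.000) 12/15 12:28:05 Job submitted from host: <127.0.0.1:43569>
...
001 (001.000.000) 12/16 08:30:02 Job executing on host: <169.228.130.213:36422>
...
005 (001.000.000) 12/16 13:30:32 Job terminated.
"""


def parse_events(text, year=2011):
    """Return (event_code, job_id, timestamp) tuples; year is assumed."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            code, cluster, proc, stamp = match.groups()
            when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
            events.append((code, f"{int(cluster)}.{int(proc)}", when))
    return events


times = {code: when for code, _job, when in parse_events(SAMPLE)}
queued = times["001"] - times["000"]   # time spent waiting in the queue
ran = times["005"] - times["001"]      # wall-clock time spent running
print("queued:", queued, "ran:", ran)
```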

Condor Daemon logs
 ●   By default
      ●   Schedd writes a log
          /opt/glidecondor/condor_local/log/ScheddLog
      ●   Shadows share a common log
          /opt/glidecondor/condor_local/log/ShadowLog
      ●   The logs rotate, look for .old files as well
 ●   Lots of interesting info in them
      ●   Quite high verbosity by default
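
Since the logs rotate, a script that scans them should read the rotated file first. A minimal Python sketch follows, assuming the rotated copy is the log path with `.old` appended; `read_log_with_rotation` is a hypothetical helper, and the demo writes throwaway files rather than touching a real Condor installation.

```python
import os
import tempfile


def read_log_with_rotation(path):
    """Return lines from path + '.old' (if present) followed by path."""
    lines = []
    for candidate in (path + ".old", path):
        if os.path.exists(candidate):
            with open(candidate) as handle:
                lines.extend(handle.read().splitlines())
    return lines


# Demo on temporary files standing in for ScheddLog and ScheddLog.old.
with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "ScheddLog")
    with open(log + ".old", "w") as handle:
        handle.write("older entry\n")
    with open(log, "w") as handle:
        handle.write("newer entry\n")
    combined = read_log_with_rotation(log)

print(combined)
```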



ScheddLog Example
01/12/12 20:38:37 (pid:32035) ZKM: 2: mapret: 0 included_voms: 0 canonical_user: cfeng
01/12/12 20:38:37 (pid:32035) ZKM: successful mapping to cfeng
01/12/12 20:38:37 (pid:28485) GET_JOB_CONNECT_INFO failed: No such job: 157170.4
01/12/12 20:39:27 (pid:32035) ZKM: 1: attempting to map '/DC=org/DC=doegrids/OU=Services/
CN=rokpilot01/osg.ctbp.ucsd.edu'
01/12/12 20:39:27 (pid:32035) ZKM: 2: mapret: 0 included_voms: 0 canonical_user: pilot
01/12/12 20:39:27 (pid:32035) ZKM: successful mapping to pilot
01/12/12 20:39:32 (pid:32035) Shadow pid 11058 for job 157170.82 exited with status 100
...
01/13/12 18:05:02 (pid:32035) Activity on stashed negotiator socket: <169.228.40.37:60824>
01/13/12 18:05:02 (pid:32035) Using negotiation protocol: NEGOTIATE
01/13/12 18:05:02 (pid:32035) Negotiating for owner: cfeng@osg.ctbp.ucsd.edu
01/13/12 18:05:02 (pid:32035) Finished negotiating for cfeng in local pool: 4 matched, 1 rejected
01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd glidein_14642@
cabinet-0-0-30.t2.ucsd.edu <169.228.131.224:55393?CCBID=169.228.40.37:9673#57&noUDP> for cfeng
01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd glidein_16403@
cabinet-1-1-0.t2.ucsd.edu <169.228.131.217:34929?CCBID=169.228.40.37:9623#108&noUDP> for cfeng
01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd glidein_13015@
cabinet-2-2-24.t2.ucsd.edu <169.228.131.156:33784?CCBID=169.228.40.37:9708#95&noUDP> for cfeng
01/13/12 18:05:04 (pid:32035) Starting add_shadow_birthdate(157177.138)
01/13/12 18:05:04 (pid:32035) Started shadow for job 157177.138 on glidein_14642@
cabinet-0-0-30.t2.ucsd.edu <169.228.131.224:55393?CCBID=169.228.40.37:9673#57&noUDP> for cfeng,
 (shadow pid = 5238)

ShadowLog Example

01/12/12 21:52:36 DaemonCore: command socket at <169.228.40.37:47586?noUDP>
01/12/12 21:52:36 DaemonCore: private command socket at <169.228.40.37:47586>
01/12/12 21:52:36 Setting maximum accepts per cycle 4.
01/12/12 21:52:36 Initializing a VANILLA shadow for job 157171.108
01/12/12 21:52:36 (157171.97) (32318): Request to run on
glidein_23261@cabinet-2-2-26.t2.ucsd.edu <169.228.131.154:48495?
CCBID=169.228.40.37:9644#87&noUDP> was ACCEPTED
01/12/12 21:52:36 (157170.39) (10937): Job 157170.39 terminated:
exited with status 0
01/12/12 21:52:36 (157170.39) (10937): **** condor_shadow (condor_SHADOW)
pid 10937 EXITING WITH STATUS 100
…
01/13/12 18:01:08 (157177.52) (4768): DoUpload: (Condor error code 12, subcode 28)
SHADOW at 169.228.40.37 failed to send file(s) to <169.228.131.178:47976>;
STARTER at 169.228.131.178 failed to write to file /data6/condor_local/execute/
dir_13776/glide_J13834/execute/dir_19620/griddatasourceB.tgz:
(errno 28) No space left on device
01/13/12 18:01:15 (157177.52) (4768): **** condor_shadow (condor_SHADOW)
pid 4768 EXITING WITH STATUS 112
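
Because all shadows share one log, it helps to pull out the per-job exit status lines. A minimal Python sketch follows, assuming each entry is on a single line (the slide wraps them) and has the "EXITING WITH STATUS" form shown above; `shadow_exit_statuses` is a hypothetical helper.

```python
import re

# Captures the job ID in the first parentheses and the shadow exit status.
EXIT_RE = re.compile(
    r"\((\d+\.\d+)\) \(\d+\): \*\*\*\* condor_shadow \(condor_SHADOW\) "
    r"pid \d+ EXITING WITH STATUS (\d+)")

# Lines modeled on the example above, joined onto single lines.
SAMPLE = [
    "01/12/12 21:52:36 (157170.39) (10937): Job 157170.39 terminated: "
    "exited with status 0",
    "01/12/12 21:52:36 (157170.39) (10937): **** condor_shadow "
    "(condor_SHADOW) pid 10937 EXITING WITH STATUS 100",
    "01/13/12 18:01:15 (157177.52) (4768): **** condor_shadow "
    "(condor_SHADOW) pid 4768 EXITING WITH STATUS 112",
]


def shadow_exit_statuses(lines):
    """Map job ID to the last shadow exit status seen for it."""
    statuses = {}
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            statuses[match.group(1)] = int(match.group(2))
    return statuses


statuses = shadow_exit_statuses(SAMPLE)
print(statuses)
```

In the example, the job whose transfer failed with "No space left on device" is the one with the different status, so grouping by status is a quick way to find the jobs worth reading about in full.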




Submitter ClassAds
 ●   The schedd will advertise two types of
     ClassAds to the Collector
      ●   Schedd daemon ClassAds
          condor_status -schedd
      ●   Per-user ClassAds
          condor_status -submitter
 ●   Can be useful for getting a summary view
     of the system



Example

  $ condor_status -schedd
  Name                 Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs
  cmsfnal01.fnal.gov   cmsfnal01.                0             0             0
  glidein-2.t2.ucsd.ed glidein-2.            10932         38480          7607
  submit-2.t2.ucsd.edu submit-2.t            11103          8955          1667
  vocms120.cern.ch     vocms120.c                0          4024             2
                       TotalRunningJobs      TotalIdleJobs      TotalHeldJobs

                 Total            22035              51459               9276
  $ condor_status -schedd -l submit-2.t2.ucsd.edu
  Name = "submit-2.t2.ucsd.edu"
  MaxJobsRunning = 20000
  TotalHeldJobs = 1667
  TotalIdleJobs = 9347
  …
  TotalJobAds = 22096
  TransferQueueDownloadWaitTime = 0
  MyType = "Scheduler"
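
The long form is a list of `Name = value` lines, which is simple to load into a dictionary. A minimal Python sketch follows; `parse_long_classad` is a hypothetical helper that handles only quoted strings and integers, the two value kinds visible in the example, and leaves anything else as text.

```python
SAMPLE = '''\
Name = "submit-2.t2.ucsd.edu"
MaxJobsRunning = 20000
TotalHeldJobs = 1667
TotalIdleJobs = 9347
TotalJobAds = 22096
TransferQueueDownloadWaitTime = 0
MyType = "Scheduler"
'''


def parse_long_classad(text):
    """Parse 'Name = value' lines into a dict (strings and ints only)."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if not sep:
            continue  # not an attribute line
        name, value = name.strip(), value.strip()
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            ad[name] = value[1:-1]
        else:
            try:
                ad[name] = int(value)
            except ValueError:
                ad[name] = value  # keep expressions and other types as text
    return ad


ad = parse_long_classad(SAMPLE)
print(ad["Name"], ad["TotalHeldJobs"], "held of", ad["TotalJobAds"], "job ads")
```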


Example
$ condor_status -submitter
Name                 Machine    Running IdleJobs HeldJobs
uscms1789@glidein-2. glidein-2.     344        0       20
uscms1811@glidein-2. glidein-2.     176     1141        0
uscms1976@glidein-2. glidein-2.     629        0        7
…
uscms742@submit-2.t2 submit-2.t     405        0        0
cms1279@vocms120.cer vocms120.c       0     4000        0
                     RunningJobs   IdleJobs   HeldJobs
uscms019@submit-2.t2          11          0          1
uscms1537@glidein-2.           0          0          1
uscms1811@glidein-2.         176       1141          0
uscms1811@submit-2.t         177       3324          0
…
uscms742@glidein-2.t        3107        289         41
uscms742@submit-2.t2         405          0          0
uscms911@glidein-2.t           0          0         42
               Total       22092      51518       9280
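
Submitter ads are per user and per schedd, so the same user can appear more than once. A minimal Python sketch of summing them per user follows, with rows copied from the second table above; `per_user_totals` is a hypothetical helper, and splitting the name at `@` to get the user is an assumption about the naming scheme.

```python
from collections import defaultdict

# (submitter name, running, idle, held) rows from the example above.
ROWS = [
    ("uscms019@submit-2.t2", 11, 0, 1),
    ("uscms1537@glidein-2.", 0, 0, 1),
    ("uscms1811@glidein-2.", 176, 1141, 0),
    ("uscms1811@submit-2.t", 177, 3324, 0),
    ("uscms742@glidein-2.t", 3107, 289, 41),
    ("uscms742@submit-2.t2", 405, 0, 0),
    ("uscms911@glidein-2.t", 0, 0, 42),
]


def per_user_totals(rows):
    """Sum running/idle/held counts over all schedds for each user."""
    totals = defaultdict(lambda: [0, 0, 0])
    for name, running, idle, held in rows:
        user = name.split("@", 1)[0]  # part before '@' taken as the user
        for index, value in enumerate((running, idle, held)):
            totals[user][index] += value
    return {user: tuple(values) for user, values in totals.items()}


totals = per_user_totals(ROWS)
for user, (running, idle, held) in sorted(totals.items()):
    print(f"{user:12s} running={running:5d} idle={idle:5d} held={held:3d}")
```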


Negotiator Monitoring
 ●   To check for user priorities, use
     condor_userprio
      ●   -allusers - Without it, only running users
      ●   -all        - Provides detailed info
 ●   Negotiator Log useful to troubleshoot
     ~/glidecondor/condor_local/log/NegotiatorLog
      ●   Look for errors and to monitor cycle times
 ●   Negotiator also advertises a ClassAd
      ●   Use condor_status -negotiator -long

Example 1/2
$ condor_userprio -all -allusers
Last Priority Update: 1/13 18:33
                                      Effective   Real     Priority    Res ...
User Name                             Priority  Priority    Factor     Used ...
------------------------------        --------- -------- ------------ ---- ...
cmspa0029@submit-2.t2.ucsd.edu           158.01    15.80        10.00    0 ...
cmspa0029@glidein-2.t2.ucsd.ed           205.37    20.54        10.00    0 ...
uscms506@glidein-2.t2.ucsd.edu           559.11     0.56      1000.00    0 ...
uscms2450@glidein-2.t2.ucsd.ed           576.15     0.58      1000.00    0 ...
uscms3501@submit-2.t2.ucsd.edu           775.26     0.78      1000.00    0 ...
shi034@glidein-2.t2.ucsd.edu             827.95     0.83      1000.00    0 ...
uscms2450@submit-2.t2.ucsd.edu          1455.42     1.46      1000.00    0 ...
uscms4043@glidein-2.t2.ucsd.ed          1677.00     1.68      1000.00    0 ...
uscms2336@glidein-2.t2.ucsd.ed          2113.44     2.11      1000.00    0 ...
uscms2330@glidein-2.t2.ucsd.ed          2493.31     2.49      1000.00    0 ...
uscms4084@glidein-2.t2.ucsd.ed          2506.61     2.51      1000.00    0 ...
uscms2330@submit-2.t2.ucsd.edu          2771.17     2.77      1000.00    0 ...
uscms4043@submit-2.t2.ucsd.edu          5150.52     5.15      1000.00    0 ...
uscms2535@glidein-2.t2.ucsd.ed          5357.76     5.36      1000.00  176 ...
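
In this table the effective priority is the real priority multiplied by the priority factor, up to the rounding of the printed values. A small Python sketch that checks this on three rows copied from the table follows; `is_consistent` is a hypothetical helper, and the tolerance reflects the assumption that both printed numbers are rounded to two decimals.

```python
# (effective priority, real priority, priority factor) from the table above.
ROWS = [
    (158.01, 15.80, 10.0),
    (559.11, 0.56, 1000.0),
    (5357.76, 5.36, 1000.0),
]


def is_consistent(effective, real, factor):
    """Check effective == real * factor within two-decimal rounding error."""
    # The printed real priority may be off by 0.005, which the factor
    # amplifies; the printed effective priority may be off by 0.005 too.
    tolerance = 0.005 * factor + 0.005
    return abs(effective - real * factor) <= tolerance


results = [is_consistent(*row) for row in ROWS]
print(results)
```

This is why users with the same real priority but different factors (10 versus 1000 here) end up far apart in the effective ordering.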




Example 2/2
$ condor_userprio -all -allusers
Last Priority Update: 1/13 18:33
                               …  Total Usage       Usage            Last
User Name                      … (wghted-hrs)    Start Time       Usage Time
------------------------------ …  ----------- ---------------- ----------------
cmspa0029@submit-2.t2.ucsd.edu …     82863.87 10/03/2011 01:41 1/11/2012 07:05
cmspa0029@glidein-2.t2.ucsd.ed …    202430.74 10/31/2011 01:30 1/12/2012 02:00
uscms506@glidein-2.t2.ucsd.edu …    437667.09 7/02/2011 08:06 1/08/2012 07:29
uscms2450@glidein-2.t2.ucsd.ed …     47024.87 10/09/2011 13:26 1/07/2012 01:28
uscms3501@submit-2.t2.ucsd.edu …      3677.14 11/23/2011 08:12 1/10/2012 01:02
shi034@glidein-2.t2.ucsd.edu   …   1309024.85 6/03/2009 00:48 1/07/2012 15:57
uscms2450@submit-2.t2.ucsd.edu …     81864.63 9/26/2011 15:22 1/07/2012 05:46
uscms4043@glidein-2.t2.ucsd.ed …      6966.57 10/10/2011 22:48 1/09/2012 17:35
uscms2336@glidein-2.t2.ucsd.ed …     57125.01 5/27/2011 02:00 1/09/2012 21:13
uscms2330@glidein-2.t2.ucsd.ed …     85581.04 8/06/2011 12:45 1/09/2012 07:45
uscms4084@glidein-2.t2.ucsd.ed …    158894.51 10/11/2011 11:11 1/08/2012 17:17
uscms2330@submit-2.t2.ucsd.edu …     13528.66 9/05/2011 02:15 1/09/2012 23:46
uscms4043@submit-2.t2.ucsd.edu …     10824.76 9/28/2011 05:02 1/09/2012 03:27
uscms2535@glidein-2.t2.ucsd.ed …    304430.61 11/17/2009 11:04 1/13/2012 18:33




NegotiatorLog Example
  01/13/12 18:23:05 ---------- Finished Negotiation Cycle ----------
  01/13/12 18:24:09 ---------- Started Negotiation Cycle ----------
  01/13/12 18:24:09 Phase 1: Obtaining ads from collector ...
  01/13/12 18:24:09   Getting all public ads ...
  01/13/12 18:24:44   Sorting 23021 ads ...
  01/13/12 18:24:46   Getting startd private ads ...
  01/13/12 18:24:51 Got ads: 23021 public and 22571 private
  01/13/12 18:24:51 Public ads include 38 submitter, 22568 startd
  01/13/12 18:24:51 Phase 2: Performing accounting ...
  01/13/12 18:25:01 Phase 3: Sorting submitter ads by priority ...
  01/13/12 18:25:01 Phase 4.1: Negotiating with schedds ...
  01/13/12 18:25:01   Negotiating with sfiligoi@submit-2.t2.ucsd.edu at
  <169.228.130.26:9615?sock=10263_1229_2>
  01/13/12 18:25:01 0 seconds so far
  01/13/12 18:25:02     Request 345869.00000:
  01/13/12 18:25:02       Rejected 345869.0 sfiligoi@submit-2.t2.ucsd.edu
  <169.228.130.26:9615?sock=10263_1229_2>: no match found
  01/13/12 18:25:02     Got NO_MORE_JOBS; done negotiating
  …
  01/13/12 18:25:06     Request 384970.00170:
  01/13/12 18:25:06       Matched 384970.170 uscms2535@glidein-2.t2.ucsd.edu
  <169.228.130.11:9615?sock=9763_cd4c_2> preempting none <192.168.3.77:55906?
  CCBID=169.228.130.23:9823#13833&noUDP> glidein_15335@lnxfarm177.colorado.edu
  01/13/12 18:25:06       Successfully matched with
  glidein_15335@lnxfarm177.colorado.edu
CycleServer Screenshots
 ●   Can do more than just monitoring
      ●   But the rest is beyond the scope of this talk




Frontend monitoring
 ●   Helper cmdline tool
 ●   Plus, each Group provides:
      ●   Activity/Error logs
      ●   RRD files with statistics (running, held, etc.)
      ●   XML files with current snapshot
      ●   Resource ClassAds
 ●   Master frontend aggregates RRD and XML
     files, and writes them in its own area
      ●   Human readable/viewable Web pages available
 [Diagram, Frontend node: the Frontend spawns the Group processes
  (labels: Frontend, Spawn, Group ... Group, Entry)]

Helper cmdline tool
   ●    Wrapper around Condor's condor_status
        glideinWMS/tools/glidein_status.py
        ●   Provides useful formatting
~/glideinWMS/tools$ ./glidein_status.py

Name                                     Site       Factory               Entry              State       Activit

glidein_6682@alicegrid26.ba.infn.it      Bari       v1_0@OSGGOC           CMS_T2_IT_Bari_ce01 Claimed    Busy
glidein_10678@alicegrid32.ba.infn.it     Bari       v1_0@OSGGOC           CMS_T2_IT_Bari_ce01 Claimed    Busy
…
glidein_5861@wp-05-12.pn.pd.infn.it      Legnaro    v1_0@OSGGOC           CMS_T2_IT_Legnaro_. Claimed    Retirin

                                         Total Owner Claimed/Busy Claimed/Retiring Claimed/Other Unclaimed M

  CMS_T2_US_Nebraska_Red_gw2@v1_0@OSGGOC    11     0           11                0             0         0
     CMS_T2_US_Purdue_hansen@v1_0@OSGGOC   522     0          517                0             0         5
        CMS_T2_US_Purdue_osg@v1_0@OSGGOC  1201     0         1182               14             0         5
…
 CMS_T2_US_UCSD_gw4@Production_v4_2@UCSD   135     0          132                0             0         3
                                   Total 21474     0        19742             1264             0       468

Log files
 ●   Each Frontend group provides 3 types of logs
     log/group_XXX/frontend.date.type.log
      ●   info       - Progress and warnings
      ●   err        - One line warnings
      ●   debug      - Multi line error messages
 ●   The master frontend has similar logs
     log/frontend/frontend.date.type.log
      ●   But rarely anything interesting there
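
The naming scheme above can be turned into a small path helper. This is only a sketch: the function name frontend_log_path, the base directory argument and the group name "main" are assumptions made for illustration, and the YYYYMMDD date layout is inferred from file names such as frontend.20120113.err.log.

```python
import datetime
import os

LOG_TYPES = ("info", "err", "debug")

def frontend_log_path(log_dir, day, log_type, group=None):
    """Return the path of a Frontend log file.

    group=None selects the master frontend logs (log/frontend/...),
    otherwise the logs of the named group (log/group_<name>/...).
    """
    if log_type not in LOG_TYPES:
        raise ValueError("unknown log type: %r" % (log_type,))
    subdir = "frontend" if group is None else "group_%s" % group
    fname = "frontend.%s.%s.log" % (day.strftime("%Y%m%d"), log_type)
    return os.path.join(log_dir, subdir, fname)

# Hypothetical group name "main"; the date matches the example logs.
print(frontend_log_path("log", datetime.date(2012, 1, 13), "err", group="main"))
```

A monitoring script could use such a helper to pick up the current err log of every group and raise an alert when it is not empty.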



Example Info Log
:01-07:00   15037]   Iteration at Tue Nov 15 10:44:01 2011
:01-07:00   15037]   Query condor
:01-07:00   15037]   Child processes created
:05-07:00   31633]   WARNING: Failed to talk to schedd submit-1.t2.ucsd.edu. See debug log       for more details.
:05-07:00   15037]   All children terminated
:05-07:00   15037]   Jobs found total 4836 idle 1732 (old 1732, voms 1703) running 3104
:05-07:00   15037]   Glideins found total 639 idle 8 running 630 limit 800 curb 600
:05-07:00   15037]   Using 1 proxies
:05-07:00   15037]   Match
:05-07:00   15037]   Counting
:05-07:00   15037]   Child processes created
:06-07:00   15037]   All children terminated
:06-07:00   15037]   Total matching idle 1732 (old 1703) running 3104
:06-07:00   15037]                Jobs in schedd queues                |        Glideins         |   Request
:06-07:00   15037]   Idle (match eff     old uniq ) Run ( here max ) | Total Idle        Run     | Idle MaxRun Down   Factory
:06-07:00   15037]     171( 1705    170   169     0) 3104( 102    250) |    105      1    103    |    10 3276 Up      CMS_T2_US_Nebraska_Red@Produ
:06-07:00   15037]     171( 1705    167   169     0) 3104( 187    250) |    197      4    193    |    10 3276 Up      CMS_T2_US_Nebraska_Red_gw1@P
:06-07:00   15037]     171( 1705    171   169     0) 3104(    0   250) |      0      0       0   |    10 3276 Down    CMS_T2_US_Nebraska_Red_gw2@P
:06-07:00   15037]     171( 1705    171   169     0) 3104(   62   250) |     62      0     62    |    10 3276 Up      CMS_T2_US_Wisconsin_cms01@Pr
:06-07:00   15037]     171( 1705    171   169     0) 3104(   71   250) |     71      0     71    |    10 3276 Up      CMS_T2_US_Wisconsin_cms02@Pr
:06-07:00   15037]     171( 1705    169   169     0) 3104(   88   250) |     96      2     94    |    10 3276 Up      CMS_T2_US_Nebraska_Red@v1_0@
:06-07:00   15037]     171( 1705    171   169     0) 3104(    1   250) |      1      0       1   |    10 3276 Up      CMS_T2_US_Nebraska_Red_gw1@v
:06-07:00   15037]     171( 1705    171   169     0) 3104(    0   250) |      0      0       0   |    10 3276 Down    CMS_T2_US_Nebraska_Red_gw2@v
:06-07:00   15037]     171( 1705    171   169     0) 3104(   45   250) |     45      0     45    |    10 3276 Up      CMS_T2_US_Wisconsin_cms01@v1
:06-07:00   15037]     171( 1705    170   169     0) 3104(   60   250) |     62      1     61    |    10 3276 Up      CMS_T2_US_Wisconsin_cms02@v1
:06-07:00   15037]                Jobs in schedd queues                |        Glideins         |   Request
:06-07:00   15037]   Idle (match eff     old uniq ) Run ( here max ) | Total Idle        Run     | Idle MaxRun Down   Factory
:06-07:00   15037]    1368(13640 1360 1352        0) 24832( 616 2000) |     639      8    630    |    80 26208 Up     Sum of useful factories
:06-07:00   15037]     342( 3410    342   338     0) 6208(    0   500) |      0      0       0   |    20 6552 Down    Sum of down factories
:06-07:00   15037]      27(   27     27    14    27)     0(   0     0) |      0      0       0   |     0     0 Down   Unmatched
:06-07:00   15037]   Advertizing 10 requests
:07-07:00   15037]   Done advertizing
:07-07:00   15037]   Advertising 10 glideresource classads to the user pool
:07-07:00   15037]   Done advertising glideresource classads
:07-07:00   15037]   Writing stats
:07-07:00   15037]   Sleep
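
The summary lines of the info log are regular enough to be scraped by a script. The sketch below extracts the counters from a "Glideins found" line; the regular expression is written against the single sample above and is an assumption about the format, not a documented interface.

```python
import re

# Pattern derived from the sample line shown above (assumed stable).
GLIDEINS_RE = re.compile(
    r"Glideins found total (\d+) idle (\d+) running (\d+) limit (\d+) curb (\d+)"
)

def parse_glideins_found(line):
    """Return the counters of a 'Glideins found' line, or None if no match."""
    m = GLIDEINS_RE.search(line)
    if m is None:
        return None
    keys = ("total", "idle", "running", "limit", "curb")
    return dict(zip(keys, (int(v) for v in m.groups())))

sample = ":05-07:00   15037]   Glideins found total 639 idle 8 running 630 limit 800 curb 600"
stats = parse_glideins_found(sample)
print(stats)
# Fraction of the registered glideins that are idle in this iteration.
print("idle fraction: %.3f" % (stats["idle"] / float(stats["total"])))
```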

Example log files

frontend.20120113.err.log
[2012-01-13T18:50:47-07:00 16444] Advertizing failed for 2 requests. See debug log for more details.
[2012-01-13T18:52:59-07:00 19312] Failed to talk to schedd. See debug log for more details.

frontend.20120113.debug.log
[2012-01-13T18:50:47-07:00 16444] Advertizing failed: Error running '/home/frontend/glidecondor/sbin/condor_advertise
-pool glidein-1.t2.ucsd.edu -tcp -multiple UPDATE_MASTER_AD /tmp/gfi_aw_276509240_16444_2'
code 1:failed to send classad to <169.228.130.10:9618>
failed to send classad to <169.228.130.10:9618>
failed to send classad to <169.228.130.10:9618>
failed to send classad to <169.228.130.10:9618>
[2012-01-13T18:52:59-07:00 19312] Failed to talk to schedd: Schedd 'vocms120.cern.ch' not found




Web pages 1/3
 ●   frontendStatus.html
      ●   Historical overview
      ●   Fully dynamic, allows for zooming
          and selecting of elements to plot
      ●   Default shows everything, but can be
          restricted to a group and/or a Factory



Web pages 2/3

 ●   frontendGroupGraphStatusNow.html
      ●   Current snapshot in tabular form
      ●   Useful for spotting problems


Web pages 3/3
 ●   frontendGroupGraphStatusNow.html
      ●   Also contains pie charts with the same info

RRDs and XML files
 ●   The Web pages are just renderings of the RRDs
     and XML files
      ●   Raw data loaded in the browser and rendered
      ●   No server side code
 ●   Other tools could use those data
      ●   Publicly available, if one knows the URL
      ●   No user-identifying data, only summary stats




Resource ClassAds
 ●   Each Frontend Group advertises one ClassAd
     for each Factory it is requesting glideins from
      ●   Type glideresource
 ●   They contain pretty much everything the
     Frontend Group knows about the Factory:
      ●   Factory attributes used for matchmaking
      ●   Stats about the matching jobs
      ●   What is being requested
      ●   Even what the Factory is doing!

Example query
 ●   Not a Condor native type, must use
      ●   -any
      ●   Then constrain the type
     $ condor_status -any -const 'MyType=="glideresource"' -format '%s\n' Name
     CMS_T2_US_Caltech_cit2@v1_0@OSGGOC@UCSD-v5_3.main
     CMS_T2_US_Caltech_cit@v1_0@OSGGOC@UCSD-v5_3.main
     CMS_T2_US_Florida_iogw1@v1_0@OSGGOC@UCSD-v5_3.main
     CMS_T2_US_Florida_pg@v1_0@OSGGOC@UCSD-v5_3.main
     ...
     CMS_T2_US_UCSD_gw2@v1_0@OSGGOC@UCSD-v5_3.main
     CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main
     CMS_T2_US_Wisconsin_cms01@v1_0@OSGGOC@UCSD-v5_3.main
     CMS_T2_US_Wisconsin_cms02@v1_0@OSGGOC@UCSD-v5_3.main
 ●   Remotely queryable
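
The names returned by the query appear to follow the pattern entry@factory@frontend-group, judging from the Entry and Factory columns printed by glidein_status.py. That reading is an inference from the examples in this talk, not a documented format, so treat the following splitter as a sketch; the function name is made up for illustration.

```python
def split_glideresource_name(name):
    """Split '<entry>@<factory>@<frontend group>' into its three parts.

    The factory part itself contains an '@' (e.g. 'v1_0@OSGGOC'), so the
    entry is taken from the left and the Frontend group from the right.
    """
    entry, rest = name.split("@", 1)
    factory, client = rest.rsplit("@", 1)
    return entry, factory, client

print(split_glideresource_name("CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main"))
```

Splitting the name this way makes it easy to group the ads per site or per Factory in a custom report.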


Example ClassAd
$ condor_status -any CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main -l
[Identification]
MyType = "glideresource"
Name = "CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main"
GlideClientName = "UCSD-v5_3.main"
...
[Info about local jobs]
GlideClientMonitorJobsIdle = 210.000000
GlideClientMonitorJobsRunningHere = 213
...
[What is being requested]
GlideClientMonitorGlideinsRequestIdle = 50
GlideClientMonitorGlideinsRequestMaxRun = 445
...
[Factory attributes]
GLIDEIN_Site = "UCSD"
GLEXEC_BIN = "OSG"
...
[Info about registered glideins]
GlideClientMonitorGlideinsRunning = 215
GlideClientMonitorGlideinsTotal = 216
...
[Factory status]
GlideFactoryMonitorStatusRunning = 339
GlideFactoryMonitorStatusPending = 277
GlideFactoryMonitorStatusHeld = 0
...

 ●   Currently more information than you get on the Web
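
Since the long listing is plain "Attribute = value" text, a script can load it into a dictionary and compare attributes. The parser below is a minimal sketch that only handles quoted strings and plain numbers; the function name is an assumption, and the sample reuses values from the ClassAd shown above.

```python
def parse_long_classad(text):
    """Parse 'Attr = value' lines into a dict (strings, ints or floats)."""
    ad = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if not sep:
            continue  # skip blank lines and '...' elisions
        key, value = key.strip(), value.strip()
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            ad[key] = value[1:-1]
        else:
            try:
                ad[key] = int(value)
            except ValueError:
                try:
                    ad[key] = float(value)
                except ValueError:
                    ad[key] = value  # leave other expressions untouched
    return ad

sample = '''MyType = "glideresource"
GlideClientMonitorJobsIdle = 210.000000
GlideClientMonitorGlideinsRunning = 215
GlideFactoryMonitorStatusRunning = 339
'''
ad = parse_long_classad(sample)
print("Frontend sees running:", ad["GlideClientMonitorGlideinsRunning"])
print("Factory reports running:", ad["GlideFactoryMonitorStatusRunning"])
```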
OK, now you know what's available.
What will you do with all that information?
(i.e. What to look for)




Monitoring the health of the system
 ●   Six major areas to look after; your goal is
      ●   Few unclaimed glideins
          (both globally, and per site)
      ●   No unmatched jobs
      ●   Reasonably low restart rate
          (both global, and per site)
      ●   Reasonably low job failure rate
          (both global, and per site)
      ●   Negotiation cycle reasonably short
      ●   Schedd node not overloaded

Unclaimed glideins
 ●   Frontend and Negotiator policies are
     not identical
      ●   You may end up with glideins that
          never run any jobs
 ●   The discrepancy can be big enough to be
     noticed on a global scale
      ●   But more often it is just for one (or few) sites
 ●   Short spikes are not a problem
      ●   But long periods are
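
The "short spikes are fine, long periods are not" rule can be expressed as a small filter over a time series of glidein counts. This is a sketch under stated assumptions: the samples would come from whatever source you have (for example values exported from the RRDs), and the 10% threshold and the minimum length of 6 samples are arbitrary illustration values, not recommendations from glideinWMS.

```python
def sustained_periods(samples, max_fraction=0.1, min_length=6):
    """Return (start, end) index pairs of long runs of high unclaimed fraction.

    samples: list of (unclaimed, total) glidein counts, one per interval.
    A run is reported only if it lasts at least min_length samples, so
    short spikes are ignored.
    """
    periods = []
    start = None
    for i, (unclaimed, total) in enumerate(samples):
        high = total > 0 and unclaimed / float(total) > max_fraction
        if high and start is None:
            start = i
        elif not high and start is not None:
            if i - start >= min_length:
                periods.append((start, i - 1))
            start = None
    # Handle a run that is still ongoing at the end of the series.
    if start is not None and len(samples) - start >= min_length:
        periods.append((start, len(samples) - 1))
    return periods

# One short spike (ignored) followed by a long bad period (reported).
history = [(5, 100)] * 3 + [(40, 100)] * 2 + [(5, 100)] * 3 + [(30, 100)] * 8
print(sustained_periods(history))
```

Running the same filter per site rather than globally helps catch the more common case where only one or a few sites are affected.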

How do you notice it?
 ●   Historical Web monitoring
      ●   [Screenshots: a "Good" and a "Bad" historical plot]
 ●   Ask for daily emails from the Factory
      ●   Or write your own scripts that parse the RRDs
      ●   No Frontend report generators
          in glideinWMS at this time
How do you find the root cause?
 ●   Analyze the latest snapshots
      ●   condor_status/glidein_status
      ●   condor_q
      ●   Frontend Web
 ●   Limit the investigation to a few sites, if possible
 ●   Then start comparing (can be daunting!)
      ●   Job Requirements, with
      ●   Glidein Start expressions
 ●   In theory, there is “condor_q -ana”, but it is usually worthless


Unmatched jobs
 ●   The other side of the problem
      ●   Glideins are never requested for some jobs
           –   Those jobs will never start!
 ●   Two possible reasons
      ●   Wrong Frontend matchmaking policy
      ●   No available Factory entries to serve the job




How do you notice it?
 ●   “Unmatched Factory” in Web monitoring




How do you find the root cause?
 ●   Again, start with the latest snapshot
      ●   condor_q
      ●   condor_status -any -const 'MyType=="glideresource"'
 ●   Get the (python) Match expression from XML
      ●   Start comparing!
      ●   Can be daunting!




Restarted jobs
 ●   Any restart == wasted CPU
 ●   How do you notice it?
      ●   condor_q is your friend here
          condor_q -format '%i\n' NumJobStarts
      ●   No historical/Web monitoring provided
 ●   Why does it happen?
      ●   Glidein disappears!
      ●   End of lifetime hit
      ●   Preemption policies
           –   Not in the default config,
               but you may set Condor to do it
      ●   Submit node overload
           –   Condor daemons do not like being resource constrained!
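
Because no historical monitoring of restarts is provided, a small script over the NumJobStarts values can fill the gap. The sketch below summarizes a list of such values; the function name and the sample numbers are made up, and in practice the input would be the one-integer-per-line output of the condor_q command above.

```python
def restart_summary(num_job_starts):
    """Summarize NumJobStarts values, one integer per job.

    A job that started N times (N >= 1) was restarted N-1 times.
    """
    started = [n for n in num_job_starts if n > 0]
    restarted = [n for n in started if n > 1]
    return {
        "jobs_started": len(started),
        "jobs_restarted": len(restarted),
        "extra_starts": sum(n - 1 for n in restarted),
    }

# Stand-in for the lines printed by condor_q; the values are invented.
output = "1\n1\n3\n0\n2\n"
values = [int(tok) for tok in output.split()]
print(restart_summary(values))
```

Recording this summary periodically gives a restart rate over time, which can then be broken down per site.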
Why do glideins disappear?
 ●   Three main reasons
      ●   Remote node just died (rare)
      ●   Site preemption policy
           –   Some sites do this; nothing you can do.
               Learn who they are and act accordingly.
      ●   Glidein killed by Site because it exceeded slot limits
           –   Most likely Memory, one of 2 limits the OSG factory
               advertises (GLIDEIN_MaxMemMBs)
 ●   Why can limits be exceeded?
      ●   Job underestimated resource use
      ●   Frontend matchmaking logic problem
           –   Job told you it needed more resources than the limit!
      ●   Wrong advertised limits
           –   Factory problem!

Wallclock limits
 ●   Main resource limit is time
      ●   The glidein automatically deals with it
           –   Will go away before the deadline
           –   … killing/preempting any jobs if needed!
      ●   Limit advertised as
           –   Factory: GLIDEIN_Max_Walltime (-Δ), in seconds
           –   Glidein: GLIDEIN_ToDie, as UNIX time
 ●   Why may jobs reach the deadline?
      ●   Like with all other resources
           –   Job underestimates time it needs
           –   Frontend matchmaking logic problems
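
The deadline logic boils down to simple arithmetic on UNIX timestamps. The following sketch checks whether a job would finish before a glidein's GLIDEIN_ToDie; the function name, the expected_runtime input and the margin parameter are assumptions for illustration, since how a job declares its expected run time depends on the local setup.

```python
import time

def fits_before_deadline(glidein_to_die, expected_runtime, now=None, margin=0):
    """True if a job of expected_runtime seconds should end before the deadline.

    glidein_to_die: the glidein's GLIDEIN_ToDie value (UNIX time).
    expected_runtime, margin: seconds; both are illustrative inputs.
    """
    if now is None:
        now = time.time()
    return now + expected_runtime + margin <= glidein_to_die

now = 1326500000            # arbitrary "current" UNIX time for the example
to_die = now + 4 * 3600     # this glidein has 4 hours left
print(fits_before_deadline(to_die, 3 * 3600, now=now))   # 3 hour job
print(fits_before_deadline(to_die, 6 * 3600, now=now))   # 6 hour job
```

A job that underestimates its run time passes this check and is later killed, which is exactly the failure mode listed above.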

Job failures
 ●   Jobs can fail for many reasons
 ●   You should monitor the ExitCode
     condor_history -back -const 'JobStatus==4' -format '%i\n' ExitCode
      ●   Knowing what users run is often needed to interpret
          errors
 ●   For common WN errors, the Frontend admin
     should create an appropriate validation script
      ●   So glideins fail, not user jobs
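
The exit codes printed by the condor_history command above are easiest to read as a tally. This is a minimal sketch with invented sample values; treating every non-zero code as a failure is an assumption, since the meaning of an exit code depends on what the users run.

```python
from collections import Counter

def tally_exit_codes(output):
    """Count exit codes from text with one integer per line."""
    return Counter(int(tok) for tok in output.split())

# Stand-in for the condor_history output; the values are made up.
sample = "0\n0\n1\n0\n134\n1\n"
counts = tally_exit_codes(sample)
for code, n in counts.most_common():
    print("ExitCode %d: %d jobs" % (code, n))

# Assumption: any non-zero exit code counts as a failure.
failed = sum(n for code, n in counts.items() if code != 0)
print("failure rate: %.2f" % (failed / float(sum(counts.values()))))
```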



Negotiation time
 ●   The negotiation time should be << 5 mins
      ●   If much longer,
          glideins may terminate without running any jobs
      ●   Monitor the NegotiatorLog on the CM
 ●   Possible causes
      ●   CPU starvation (e.g. other processes)
      ●   Autocluster explosion
           –   Condor tries to be smart about Matchmaking
           –   But if users don't cooperate, it cannot do much
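
Cycle lengths can be extracted from the NegotiatorLog by pairing the "Started" and "Finished" marker lines. The sketch below assumes the timestamp layout seen in the NegotiatorLog example of this talk (MM/DD/YY HH:MM:SS at the start of the line); the "Finished" timestamp in the sample is invented so that the snippet has a complete cycle to measure.

```python
import datetime

STAMP = "%m/%d/%y %H:%M:%S"   # e.g. "01/13/12 18:24:09", 17 characters

def cycle_durations(lines):
    """Return the length in seconds of each completed negotiation cycle."""
    durations = []
    started = None
    for line in lines:
        line = line.strip()
        if "Started Negotiation Cycle" in line:
            started = datetime.datetime.strptime(line[:17], STAMP)
        elif "Finished Negotiation Cycle" in line and started is not None:
            finished = datetime.datetime.strptime(line[:17], STAMP)
            durations.append((finished - started).total_seconds())
            started = None
    return durations

# The last timestamp is made up; the other lines mirror the log example.
log = """01/13/12 18:24:09 ---------- Started Negotiation Cycle ----------
01/13/12 18:24:09 Phase 1: Obtaining ads from collector ...
01/13/12 18:25:01 Phase 4.1: Negotiating with schedds ...
01/13/12 18:26:40 ---------- Finished Negotiation Cycle ----------
"""
print(cycle_durations(log.splitlines()))
```

Any value approaching 300 seconds would be a reason to look for CPU starvation or an autocluster explosion.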


Autoclustering
 ●   Condor Schedd will try to group jobs
      ●   All “similar jobs” will be matched together!
      ●   Much faster if only few groups exist
 ●   What does “similar” mean?
      ●   Similar == Would result in the same match
 ●   How is it implemented?
      ●   Tuple of attributes considered during matchmaking
      ●   E.g. (DESIRED_Sites,ImageSize)
 ●   How can the number of autoclusters explode?
      ●   If an attribute that changes a lot is added
      ●   Example of a really bad one: JobID
 ●   https://condor-wiki.cs.wisc.edu/index.cgi/attach_get/220/cs739.pdf
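
The effect of a badly chosen attribute is easy to see in a toy model. The sketch below is a simplification of the idea, not Condor's actual implementation: jobs are plain dictionaries, an autocluster is one distinct tuple of attribute values, and all job data is invented.

```python
def count_autoclusters(jobs, significant_attrs):
    """Number of distinct value tuples over the significant attributes."""
    return len({tuple(job.get(a) for a in significant_attrs) for job in jobs})

# 1000 made-up jobs: 2 site lists x 2 image sizes, every JobID unique.
jobs = [
    {"JobID": i,
     "DESIRED_Sites": "UCSD,Nebraska" if i % 2 else "Purdue",
     "ImageSize": 1000 if i % 4 < 2 else 2000}
    for i in range(1000)
]
print(count_autoclusters(jobs, ("DESIRED_Sites", "ImageSize")))
print(count_autoclusters(jobs, ("DESIRED_Sites", "ImageSize", "JobID")))
```

With only the first two attributes the jobs collapse into a handful of groups; once JobID is part of the tuple, every job becomes its own group and has to be matched individually.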


Submit node health
 ●   Condor is very sensitive to resource starvation
      ●   If the submit node is overloaded, expect problems!
 ●   How can we get to resource starvation?
      ●   Poor planning
           –   Trying to run 3k jobs on a 1G RAM node???
      ●   Other processes
           –   May steal CPU/RAM/IO from Condor
 ●   Interactive activity is particularly risky
      ●   Due to its unpredictable nature
           –   Including user errors
      ●   But portals are not immune to resource overuse

Summary
 ●   You have plenty of Monitoring options
      ●   Some prettier, some more powerful
 ●   Most of the time, things just work
      ●   So you don't need to constantly watch over your
          installation
 ●   But occasionally things will break
      ●   It is in your interest to notice it
           –   Or the users will tell you!
      ●   Having good monitoring tools will help you there!


The End




Pointers
 ●   The official glideinWMS project Web page is
     http://tinyurl.com/glideinWMS
 ●   glideinWMS development team is reachable at
     glideinwms-support@fnal.gov
 ●   The OSG glidein factory is reachable at
     osg-gfactory-support@physics.ucsd.edu




Acknowledgments
 ●   The glideinWMS is a CMS-led project
     developed mostly at FNAL, with contributions
     from UCSD and ISI
 ●   The glideinWMS factory operations at UCSD are
     sponsored by OSG
 ●   The funding comes from NSF, DOE and the
     UC system




UCSD Jan 18th 2012        Frontend Monitoring       60

Contenu connexe

En vedette

Add-On Development: EE Expects that Every Developer will do his Duty
Add-On Development: EE Expects that Every Developer will do his DutyAdd-On Development: EE Expects that Every Developer will do his Duty
Add-On Development: EE Expects that Every Developer will do his Dutyreedmaniac
 
17) 11 (may, 2003) squid master this proxy server
17) 11 (may, 2003)   squid master this proxy server17) 11 (may, 2003)   squid master this proxy server
17) 11 (may, 2003) squid master this proxy serverswarup1435
 
Nagios 3
Nagios 3Nagios 3
Nagios 3zmoly
 
Using Q4M - a message queue storage engine for MySQL
Using Q4M - a message queue storage engine for MySQLUsing Q4M - a message queue storage engine for MySQL
Using Q4M - a message queue storage engine for MySQLKazuho Oku
 

En vedette (6)

Doctrine in FLOW3
Doctrine in FLOW3Doctrine in FLOW3
Doctrine in FLOW3
 
Add-On Development: EE Expects that Every Developer will do his Duty
Add-On Development: EE Expects that Every Developer will do his DutyAdd-On Development: EE Expects that Every Developer will do his Duty
Add-On Development: EE Expects that Every Developer will do his Duty
 
17) 11 (may, 2003) squid master this proxy server
17) 11 (may, 2003)   squid master this proxy server17) 11 (may, 2003)   squid master this proxy server
17) 11 (may, 2003) squid master this proxy server
 
Nagios 3
Nagios 3Nagios 3
Nagios 3
 
This is from spr
This is from sprThis is from spr
This is from spr
 
Using Q4M - a message queue storage engine for MySQL
Using Q4M - a message queue storage engine for MySQLUsing Q4M - a message queue storage engine for MySQL
Using Q4M - a message queue storage engine for MySQL
 

glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

  • 1. glideinWMS Training @ UCSD
      glideinWMS Frontend Monitoring
      by Igor Sfiligoi (UCSD)
      UCSD Jan 18th 2012
  • 2. Overview
      ● Refresher
      ● What is available
      ● What to look for
  • 3. Refresher – glideinWMS
      ● A glidein is just a properly configured Condor execution node submitted as a Grid job
      ● Frontend drives submission
      [Diagram. Labels: Frontend node (Frontend) monitors Condor and requests glideins; Factory node (Condor, Factory) submits glideins through Globus / CREAM; worker or execution nodes run the glidein, which configures Condor and starts a Startd that runs the Job; submit nodes and the central manager do the match.]
  • 4. Reminder
      Condor is king!
      (glideinWMS just a small layer on top)
  • 5. Refresher – Frontend arch
      ● Many Groups
      ● With a “Master” Frontend as an aggregator
      [Diagram. Labels: Frontend node (Frontend, which spawns the Groups, plus a Web Server); submit nodes; central manager; Factories with their Entries and glideins.]
  • 6. Available monitoring
      ● Condor monitoring
          ● It is just a Condor pool! (Even if a dynamic one)
          ● Any Condor monitoring tools will work
      ● VO Frontend monitoring
          ● The VO Frontend provides some basic Condor monitoring
          ● Plus the monitoring of its own internal workings
      ● Glidein Factory monitoring
          (You should not need to use it, but it is publicly accessible)
  • 7. Condor monitoring
  • 8. Condor Monitoring
      ● Out of the box you get
          ● Command line tools
          ● Log parsing
      ● Several external tools available, e.g.
          ● CondorView (Condor external package)
          ● CycleServer (Commercial tool, (semi-)free for Academia)
      Your portal may provide additional monitoring, too
  • 9. Glidein monitoring
      ● The glideins will register with the Collector
      ● Condor command to monitor them: condor_status
          ● -constraint - To select a subset of them (same syntax as Requirements)
          ● -total - For a quick summary
      ● Output formatting options
          ● No arguments - In use/unused
          ● -long - Full ClassAds
          ● -format - Select attributes only
          ● -xml - xml formatting (easier to machine parse)
      http://www.cs.wisc.edu/condor/manual/v7.6/condor_status.html
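The options listed on this slide can be combined in a script. The following is a minimal sketch, assuming Python 3 is available on the node; `build_status_cmd` and `run_status` are hypothetical helper names, not part of Condor or glideinWMS. It only assembles the `condor_status` flags named on the slide and runs the command if the binary is present.

```python
import shutil
import subprocess


def build_status_cmd(constraint=None, formats=(), xml=False, total=False):
    """Assemble a condor_status argument list (hypothetical helper)."""
    cmd = ["condor_status"]
    if constraint:
        # Same expression syntax as a job's Requirements
        cmd += ["-constraint", constraint]
    for fmt, attr in formats:
        # Each -format takes a printf-style format and one attribute name
        cmd += ["-format", fmt, attr]
    if xml:
        cmd.append("-xml")
    if total:
        cmd.append("-total")
    return cmd


def run_status(cmd):
    """Run the command; return its stdout, or None if Condor is not installed."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


cmd = build_status_cmd(
    constraint="GLIDEIN_Max_Walltime>83000",
    formats=[("%-50s ", "Name"), ("%6i\n", "GLIDEIN_Max_Walltime")],
)
output = run_status(cmd)  # None on a machine without condor_status
```

Passing the arguments as a list avoids shell quoting issues with the format strings and the constraint expression.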
  • 10. Example
      $ condor_status
      Name               OpSys  Arch   State     Activity  LoadAv  Mem    ActvtyTime
      glidein_17848@alic LINUX  X86_64 Claimed   Busy       7.440  18037  0+01:06:06
      glidein_15842@alic LINUX  X86_64 Claimed   Busy       7.010  18037  0+00:35:21
      glidein_18249@alic LINUX  X86_64 Claimed   Busy       7.510  18037  0+01:24:09
      glidein_17825@wn89 LINUX  X86_64 Unclaimed Idle      11.990  16056  0+00:15:12
      glidein_10082@wn91 LINUX  X86_64 Claimed   Idle       7.000  16056  0+00:02:46
      …
      glidein_3964@wp-05 LINUX  X86_64 Claimed   Busy      24.000  64464  0+16:00:29
      glidein_5614@wp-05 LINUX  X86_64 Claimed   Busy      23.360  64464  0+16:12:56
      glidein_5861@wp-05 LINUX  X86_64 Claimed   Retiring  22.140  64464  0+00:23:18

                     Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
      X86_64/LINUX   23249      0    22697        552        0           0         0
      Total          23249      0    22697        552        0           0         0
  • 11. Another example
      $ condor_status -format '%-50s ' Name -format '%6i\n' GLIDEIN_Max_Walltime -const "GLIDEIN_Max_Walltime>83000"
      glidein_10001@we017.grid.hep.ph.ic.ac.uk            86040
      glidein_10006@rossmann-a292.rcac.purdue.edu        114840
      glidein_10007@we033.grid.hep.ph.ic.ac.uk            86040
      ...
      glidein_9990@lxbra6310.cern.ch                     114840
      glidein_9990@rossmann-a212.rcac.purdue.edu         114840
      glidein_9993@grid191.lal.in2p3.fr                  114840

      $ condor_status -format '%-50s ' Name -format '%6i\n' GLIDEIN_Max_Walltime -xml -const "GLIDEIN_Max_Walltime>83000"
      <?xml version="1.0"?>
      <!DOCTYPE classads SYSTEM "classads.dtd">
      <classads>
      <c>
          <a n="MyType"><s>Machine</s></a>
          <a n="TargetType"><s>Job</s></a>
          <a n="Name"><s>glidein_10001@we017.grid.hep.ph.ic.ac.uk</s></a>
          <a n="GLIDEIN_Max_Walltime"><i>86040</i></a>
          <a n="CurrentTime"><e>time()</e></a>
      </c>
      ...
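Since the slides recommend `-xml` as the easier format to machine parse, here is a minimal sketch of reading that output with the Python standard library. The sample document is the fragment from the slide, closed with `</classads>` so it is well formed; `parse_classads` is a hypothetical helper name, and only the `<s>` (string), `<i>` (integer) and `<e>` (expression) value tags seen on the slide are handled specifically.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the slide, with the closing tag added
SAMPLE_XML = """<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
<c>
    <a n="MyType"><s>Machine</s></a>
    <a n="TargetType"><s>Job</s></a>
    <a n="Name"><s>glidein_10001@we017.grid.hep.ph.ic.ac.uk</s></a>
    <a n="GLIDEIN_Max_Walltime"><i>86040</i></a>
    <a n="CurrentTime"><e>time()</e></a>
</c>
</classads>
"""


def parse_classads(xml_text):
    """Turn each <c> element into a dict of attribute name -> value."""
    root = ET.fromstring(xml_text)
    ads = []
    for c in root.findall("c"):
        ad = {}
        for a in c.findall("a"):
            if len(a) == 0:
                continue  # attribute without a typed value element
            value = a[0]
            text = value.text or ""
            # Integers are converted; everything else is kept as text
            ad[a.get("n")] = int(text) if value.tag == "i" else text
        ads.append(ad)
    return ads


ads = parse_classads(SAMPLE_XML)
long_lived = [ad["Name"] for ad in ads if ad["GLIDEIN_Max_Walltime"] > 83000]
```

In practice the XML text would come from the standard output of `condor_status -xml` rather than from an embedded string.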
  • 12. Collector log(s)
      Place to look when things seem fishy!
      ● The Collector(s) will log any errors
      ● The interesting errors will likely be in the leaves of the Collector tree
          ~condor/glidecondor/condor_local/log/CondorXXXLog
          (Yes, you will have 100s of them!)
      ● Logs rotate, so be sure to look in .old as well
      ● You also get the glidein authentication logs
      ● And log verbosity can be further increased with COLLECTOR_DEBUG
      http://www.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:SubsysDebug
  • 13. Example
      01/13/12 17:24:13 ZKM: 1: attempting to map '/DC=org/DC=doegrids/OU=Services/CN=uscmspilot47/glidein-1.t2.ucsd.edu'
      01/13/12 17:24:13 ZKM: 2: mapret: 0 included_voms: 0 canonical_user: glidein47
      01/13/12 17:24:13 ZKM: successful mapping to glidein47
      ...
      01/13/12 17:24:19 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 4 bytes from <130.104.133.245:7812>.
      01/13/12 17:24:19 DaemonCore: Can't receive command request from 130.104.133.245 (perhaps a timeout?)
      ...
      01/13/12 17:24:41 CCB: rejecting request from SHADOW <169.228.130.26:9615?sock=10716_242d_201658> on <169.228.130.26:21142> for ccbid 14018 because no daemon is currently registered with that id (perhaps it recently disconnected).
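With hundreds of Collector logs to check, a quick tally of message types can help decide where to look first. The sketch below is an illustration only: the sample lines are transcribed from the slide (rejoined where the slide wrapped them), the three substrings in `PATTERNS` are simply the ones visible in that sample, and `summarize` is a hypothetical helper name, not a Condor tool.

```python
import re
from collections import Counter

# Timestamp format as it appears on the slide: MM/DD/YY HH:MM:SS
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# Label -> substring to look for (chosen from the sample, not exhaustive)
PATTERNS = {
    "auth_mapped": "ZKM: successful mapping",
    "read_failed": "condor_read() failed",
    "ccb_rejected": "CCB: rejecting request",
}

SAMPLE_LOG = [
    "01/13/12 17:24:13 ZKM: successful mapping to glidein47",
    "...",
    "01/13/12 17:24:19 condor_read() failed: recv() returned -1, "
    "errno = 104 Connection reset by peer, reading 4 bytes from "
    "<130.104.133.245:7812>.",
    "01/13/12 17:24:19 DaemonCore: Can't receive command request from "
    "130.104.133.245 (perhaps a timeout?)",
    "01/13/12 17:24:41 CCB: rejecting request from SHADOW "
    "<169.228.130.26:9615?sock=10716_242d_201658> on "
    "<169.228.130.26:21142> for ccbid 14018 because no daemon is "
    "currently registered with that id (perhaps it recently disconnected).",
]


def summarize(lines):
    """Count how many timestamped lines match each known pattern."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # skip lines that do not start with a timestamp
        message = match.group(2)
        for label, needle in PATTERNS.items():
            if needle in message:
                counts[label] += 1
    return counts


counts = summarize(SAMPLE_LOG)
```

To cover rotated logs as well, the same function could be fed the lines of both the current log file and its `.old` companion.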
  • 14. Job monitoring ● You can monitor local jobs ● For jobs still in the queue (still waiting or running): condor_q ● For finished jobs: condor_history (only a limited number of jobs is preserved) ● Both take command-line arguments similar to those of condor_status ● Remote condor_q is possible with -name http://www.cs.wisc.edu/condor/manual/v7.6/condor_q.html http://www.cs.wisc.edu/condor/manual/v7.6/condor_history.html
  • 15. Example
    $ condor_q
    -- Submitter: my.node : <192.168.130.11:9615?sock=9763_cd4c_2> : my.node
     ID          OWNER      SUBMITTED    RUN_TIME   ST PRI SIZE   CMD
    367788.0     uscms2330  1/6  11:03  0+00:00:00 I  0      0.0 CMSSW.sh 1 1
    367788.1     uscms2330  1/6  11:03  0+00:00:00 I  0      0.0 CMSSW.sh 2 1
    383995.19    uscms1789  1/11 02:26  2+13:35:38 R  0   1953.1 CMSSW.sh 1118 4
    383995.179   uscms1789  1/11 02:26  2+11:29:06 R  0   1464.8 CMSSW.sh 1310 4
    383999.32    uscms1789  1/11 02:31  2+09:36:12 R  0   1953.1 CMSSW.sh 299 4
    383999.46    uscms1789  1/11 02:31  2+11:00:25 R  0   1953.1 CMSSW.sh 316 4
    …
    385002.7     uscms3015  1/13 17:31  0+00:01:51 R  0      0.0 CMSSW.sh 70 2
    385002.8     uscms3015  1/13 17:31  0+00:01:49 R  0      0.0 CMSSW.sh 89 2
    385002.9     uscms3015  1/13 17:31  0+00:01:29 R  0      0.0 CMSSW.sh 91 2
    385002.10    uscms3015  1/13 17:31  0+00:01:00 R  0      0.0 CMSSW.sh 97 2
    58707 jobs; 39484 idle, 11694 running, 7529 held
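For a quick health check, the totals on the last line of the condor_q output are often all that is needed. A minimal sketch of extracting them follows; it assumes the summary format shown on this slide (Condor 7.6), and the function name is hypothetical.

```python
import re

# Matches the summary line as printed in the example above.
SUMMARY_RE = re.compile(
    r"(?P<jobs>\d+) jobs; (?P<idle>\d+) idle, "
    r"(?P<running>\d+) running, (?P<held>\d+) held"
)

def parse_condor_q_summary(text):
    """Return the totals from condor_q output as integers, or None if absent."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}

totals = parse_condor_q_summary("58707 jobs; 39484 idle, 11694 running, 7529 held")
held_fraction = totals["held"] / totals["jobs"]
print(totals, round(held_fraction, 3))
```

Other Condor versions may print the summary differently, so the regular expression would need adjusting there.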
  • 16. Job logs ● Users are encouraged to have a log for their jobs ● It provides an easy way to monitor progress without calling condor_q/condor_history ● The "..." separator lines are literally part of the log
    000 (001.000.000) 12/15 12:28:05 Job submitted from host: <127.0.0.1:43569>
    ...
    001 (001.000.000) 12/16 08:30:02 Job executing on host: <169.228.130.213:36422>
    ...
    005 (001.000.000) 12/16 13:30:32 Job terminated.
        (1) Normal termination (return value 0)
            Usr 0 01:00:00, Sys 0 00:05:00 - Run Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
            Usr 0 01:00:00, Sys 0 00:05:00 - Total Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
        217 - Run Bytes Sent By Job
        76 - Run Bytes Received By Job
        217 - Total Bytes Sent By Job
        76 - Total Bytes Received By Job
    ...
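As an illustration of what the job log makes possible, here is a rough sketch that derives the queue wait and the run time from the three events in the example. The log lines carry no year, so the `year` parameter is an assumption, as is the function name; a job that restarts would log event 001 more than once, and this sketch simply keeps the last one.

```python
import re
from datetime import datetime

# Event header: 3-digit event code, (cluster.proc.subproc), MM/DD HH:MM:SS
EVENT_RE = re.compile(r"^(\d{3}) \(\d+\.\d+\.\d+\) (\d\d/\d\d \d\d:\d\d:\d\d) ")

def job_timeline(log_text, year=2011):
    """Map event code -> timestamp of the last such event in a user job log.

    The year is not recorded in the log, so it is supplied by the caller.
    """
    events = {}
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group(2)}", "%Y %m/%d %H:%M:%S"
            )
            events[match.group(1)] = stamp
    return events

log = """\
000 (001.000.000) 12/15 12:28:05 Job submitted from host: <127.0.0.1:43569>
...
001 (001.000.000) 12/16 08:30:02 Job executing on host: <169.228.130.213:36422>
...
005 (001.000.000) 12/16 13:30:32 Job terminated.
    (1) Normal termination (return value 0)
...
"""
events = job_timeline(log)
wait = events["001"] - events["000"]   # time spent idle in the queue
run = events["005"] - events["001"]    # wallclock time on the glidein
print("waited", wait, "ran", run)
```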
  • 17. Condor Daemon logs ● By default ● Schedd writes a log /opt/glidecondor/condor_local/log/ScheddLog ● Shadows share a common log /opt/glidecondor/condor_local/log/ShadowLog ● The logs rotate, look for .old files as well ● Lots of interesting info in them ● Quite high verbosity by default
  • 18. ScheddLog Example
    01/12/12 20:38:37 (pid:32035) ZKM: 2: mapret: 0 included_voms: 0 canonical_user: cfeng
    01/12/12 20:38:37 (pid:32035) ZKM: successful mapping to cfeng
    01/12/12 20:38:37 (pid:28485) GET_JOB_CONNECT_INFO failed: No such job: 157170.4
    01/12/12 20:39:27 (pid:32035) ZKM: 1: attempting to map '/DC=org/DC=doegrids/OU=Services/CN=rokpilot01/osg.ctbp.ucsd.edu'
    01/12/12 20:39:27 (pid:32035) ZKM: 2: mapret: 0 included_voms: 0 canonical_user: pilot
    01/12/12 20:39:27 (pid:32035) ZKM: successful mapping to pilot
    01/12/12 20:39:32 (pid:32035) Shadow pid 11058 for job 157170.82 exited with status 100
    ...
    01/13/12 18:05:02 (pid:32035) Activity on stashed negotiator socket: <169.228.40.37:60824>
    01/13/12 18:05:02 (pid:32035) Using negotiation protocol: NEGOTIATE
    01/13/12 18:05:02 (pid:32035) Negotiating for owner: cfeng@osg.ctbp.ucsd.edu
    01/13/12 18:05:02 (pid:32035) Finished negotiating for cfeng in local pool: 4 matched, 1 rejected
    01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd glidein_14642@cabinet-0-0-30.t2.ucsd.edu <169.228.131.224:55393?CCBID=169.228.40.37:9673#57&noUDP> for cfeng
    01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd glidein_16403@cabinet-1-1-0.t2.ucsd.edu <169.228.131.217:34929?CCBID=169.228.40.37:9623#108&noUDP> for cfeng
    01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd glidein_13015@cabinet-2-2-24.t2.ucsd.edu <169.228.131.156:33784?CCBID=169.228.40.37:9708#95&noUDP> for cfeng
    01/13/12 18:05:04 (pid:32035) Starting add_shadow_birthdate(157177.138)
    01/13/12 18:05:04 (pid:32035) Started shadow for job 157177.138 on glidein_14642@cabinet-0-0-30.t2.ucsd.edu <169.228.131.224:55393?CCBID=169.228.40.37:9673#57&noUDP> for cfeng, (shadow pid = 5238)
  • 19. ShadowLog Example
    01/12/12 21:52:36 DaemonCore: command socket at <169.228.40.37:47586?noUDP>
    01/12/12 21:52:36 DaemonCore: private command socket at <169.228.40.37:47586>
    01/12/12 21:52:36 Setting maximum accepts per cycle 4.
    01/12/12 21:52:36 Initializing a VANILLA shadow for job 157171.108
    01/12/12 21:52:36 (157171.97) (32318): Request to run on glidein_23261@cabinet-2-2-26.t2.ucsd.edu <169.228.131.154:48495?CCBID=169.228.40.37:9644#87&noUDP> was ACCEPTED
    01/12/12 21:52:36 (157170.39) (10937): Job 157170.39 terminated: exited with status 0
    01/12/12 21:52:36 (157170.39) (10937): **** condor_shadow (condor_SHADOW) pid 10937 EXITING WITH STATUS 100
    …
    01/13/12 18:01:08 (157177.52) (4768): DoUpload: (Condor error code 12, subcode 28) SHADOW at 169.228.40.37 failed to send file(s) to <169.228.131.178:47976>; STARTER at 169.228.131.178 failed to write to file /data6/condor_local/execute/dir_13776/glide_J13834/execute/dir_19620/griddatasourceB.tgz: (errno 28) No space left on device
    01/13/12 18:01:15 (157177.52) (4768): **** condor_shadow (condor_SHADOW) pid 4768 EXITING WITH STATUS 112
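Because all shadows share one log, a per-job view of the shadow exit status helps to spot the unusual ones. The sketch below is one possible approach, assuming each log record sits on a single line; treating status 100 as the normal case is an assumption drawn only from the sample above, where it accompanies a job that exited with status 0.

```python
import re

# Matches the "EXITING WITH STATUS" record written when a shadow ends.
EXIT_RE = re.compile(
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): \*\*\*\* condor_shadow "
    r"\(condor_SHADOW\) pid \d+ EXITING WITH STATUS (?P<status>\d+)"
)

NORMAL_STATUS = 100  # assumption: the value seen for a clean job exit in the sample

def shadow_exit_statuses(lines):
    """Return {job id: last shadow exit status seen for that job}."""
    statuses = {}
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            statuses[match.group("job")] = int(match.group("status"))
    return statuses

sample = [
    "01/12/12 21:52:36 (157170.39) (10937): Job 157170.39 terminated: exited with status 0",
    "01/12/12 21:52:36 (157170.39) (10937): **** condor_shadow (condor_SHADOW) "
    "pid 10937 EXITING WITH STATUS 100",
    "01/13/12 18:01:15 (157177.52) (4768): **** condor_shadow (condor_SHADOW) "
    "pid 4768 EXITING WITH STATUS 112",
]
statuses = shadow_exit_statuses(sample)
unusual = {job: s for job, s in statuses.items() if s != NORMAL_STATUS}
print(unusual)
```

The jobs listed in `unusual` are the ones whose surrounding log lines (such as the DoUpload error above) are worth reading.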
  • 20. Submitter ClassAds ● The schedd will advertise two types of ClassAds to the Collector ● Schedd daemon ClassAds condor_status -schedd ● Per-user ClassAds condor_status -submitter ● Can be useful for getting a summary view of the system
  • 21. Example
    $ condor_status -schedd
    Name                 Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs
    cmsfnal01.fnal.gov   cmsfnal01.                0             0             0
    glidein-2.t2.ucsd.ed glidein-2.            10932         38480          7607
    submit-2.t2.ucsd.edu submit-2.t            11103          8955          1667
    vocms120.cern.ch     vocms120.c                0          4024             2
                                TotalRunningJobs TotalIdleJobs TotalHeldJobs
                          Total            22035         51459          9276
    $ condor_status -schedd -l submit-2.t2.ucsd.edu
    Name = "submit-2.t2.ucsd.edu"
    MaxJobsRunning = 20000
    TotalHeldJobs = 1667
    TotalIdleJobs = 9347
    …
    TotalJobAds = 22096
    TransferQueueDownloadWaitTime = 0
    MyType = "Scheduler"
  • 22. Example
    $ condor_status -submitter
    Name                 Machine    Running IdleJobs HeldJobs
    uscms1789@glidein-2. glidein-2.     344        0       20
    uscms1811@glidein-2. glidein-2.     176     1141        0
    uscms1976@glidein-2. glidein-2.     629        0        7
    …
    uscms742@submit-2.t2 submit-2.t     405        0        0
    cms1279@vocms120.cer vocms120.c       0     4000        0
                         RunningJobs IdleJobs HeldJobs
    uscms019@submit-2.t2          11        0        1
    uscms1537@glidein-2.           0        0        1
    uscms1811@glidein-2.         176     1141        0
    uscms1811@submit-2.t         177     3324        0
    …
    uscms742@glidein-2.t        3107      289       41
    uscms742@submit-2.t2         405        0        0
    uscms911@glidein-2.t           0        0       42
                   Total       22092    51518     9280
  • 23. Negotiator Monitoring ● To check user priorities, use condor_userprio ● -allusers - Without it, only active (running) users are shown ● -all - Provides detailed info ● The Negotiator log is useful for troubleshooting: ~/glidecondor/condor_local/log/NegotiatorLog ● Look for errors, and monitor the cycle times ● The Negotiator also advertises a ClassAd ● Use condor_status -negotiator -long
  • 24. Example 1/2
    $ condor_userprio -all -allusers
    Last Priority Update: 1/13 18:33
                                   Effective     Real     Priority  Res ...
    User Name                       Priority Priority       Factor Used ...
    ------------------------------ --------- -------- ------------ ---- ...
    cmspa0029@submit-2.t2.ucsd.edu    158.01    15.80        10.00    0 ...
    cmspa0029@glidein-2.t2.ucsd.ed    205.37    20.54        10.00    0 ...
    uscms506@glidein-2.t2.ucsd.edu    559.11     0.56      1000.00    0 ...
    uscms2450@glidein-2.t2.ucsd.ed    576.15     0.58      1000.00    0 ...
    uscms3501@submit-2.t2.ucsd.edu    775.26     0.78      1000.00    0 ...
    shi034@glidein-2.t2.ucsd.edu      827.95     0.83      1000.00    0 ...
    uscms2450@submit-2.t2.ucsd.edu   1455.42     1.46      1000.00    0 ...
    uscms4043@glidein-2.t2.ucsd.ed   1677.00     1.68      1000.00    0 ...
    uscms2336@glidein-2.t2.ucsd.ed   2113.44     2.11      1000.00    0 ...
    uscms2330@glidein-2.t2.ucsd.ed   2493.31     2.49      1000.00    0 ...
    uscms4084@glidein-2.t2.ucsd.ed   2506.61     2.51      1000.00    0 ...
    uscms2330@submit-2.t2.ucsd.edu   2771.17     2.77      1000.00    0 ...
    uscms4043@submit-2.t2.ucsd.edu   5150.52     5.15      1000.00    0 ...
    uscms2535@glidein-2.t2.ucsd.ed   5357.76     5.36      1000.00  176 ...
  • 25. Example 2/2
    $ condor_userprio -all -allusers
    Last Priority Update: 1/13 18:33
                                   … Total Usage       Usage             Last
    User Name                      … (wghted-hrs)    Start Time       Usage Time
    ------------------------------ … ----------- ---------------- ----------------
    cmspa0029@submit-2.t2.ucsd.edu …    82863.87 10/03/2011 01:41  1/11/2012 07:05
    cmspa0029@glidein-2.t2.ucsd.ed …   202430.74 10/31/2011 01:30  1/12/2012 02:00
    uscms506@glidein-2.t2.ucsd.edu …   437667.09  7/02/2011 08:06  1/08/2012 07:29
    uscms2450@glidein-2.t2.ucsd.ed …    47024.87 10/09/2011 13:26  1/07/2012 01:28
    uscms3501@submit-2.t2.ucsd.edu …     3677.14 11/23/2011 08:12  1/10/2012 01:02
    shi034@glidein-2.t2.ucsd.edu   …  1309024.85  6/03/2009 00:48  1/07/2012 15:57
    uscms2450@submit-2.t2.ucsd.edu …    81864.63  9/26/2011 15:22  1/07/2012 05:46
    uscms4043@glidein-2.t2.ucsd.ed …     6966.57 10/10/2011 22:48  1/09/2012 17:35
    uscms2336@glidein-2.t2.ucsd.ed …    57125.01  5/27/2011 02:00  1/09/2012 21:13
    uscms2330@glidein-2.t2.ucsd.ed …    85581.04  8/06/2011 12:45  1/09/2012 07:45
    uscms4084@glidein-2.t2.ucsd.ed …   158894.51 10/11/2011 11:11  1/08/2012 17:17
    uscms2330@submit-2.t2.ucsd.edu …    13528.66  9/05/2011 02:15  1/09/2012 23:46
    uscms4043@submit-2.t2.ucsd.edu …    10824.76  9/28/2011 05:02  1/09/2012 03:27
    uscms2535@glidein-2.t2.ucsd.ed …   304430.61 11/17/2009 11:04  1/13/2012 18:33
  • 26. NegotiatorLog Example
    01/13/12 18:23:05 ---------- Finished Negotiation Cycle ----------
    01/13/12 18:24:09 ---------- Started Negotiation Cycle ----------
    01/13/12 18:24:09 Phase 1: Obtaining ads from collector ...
    01/13/12 18:24:09 Getting all public ads ...
    01/13/12 18:24:44 Sorting 23021 ads ...
    01/13/12 18:24:46 Getting startd private ads ...
    01/13/12 18:24:51 Got ads: 23021 public and 22571 private
    01/13/12 18:24:51 Public ads include 38 submitter, 22568 startd
    01/13/12 18:24:51 Phase 2: Performing accounting ...
    01/13/12 18:25:01 Phase 3: Sorting submitter ads by priority ...
    01/13/12 18:25:01 Phase 4.1: Negotiating with schedds ...
    01/13/12 18:25:01 Negotiating with sfiligoi@submit-2.t2.ucsd.edu at <169.228.130.26:9615?sock=10263_1229_2>
    01/13/12 18:25:01 0 seconds so far
    01/13/12 18:25:02 Request 345869.00000:
    01/13/12 18:25:02 Rejected 345869.0 sfiligoi@submit-2.t2.ucsd.edu <169.228.130.26:9615?sock=10263_1229_2>: no match found
    01/13/12 18:25:02 Got NO_MORE_JOBS; done negotiating
    …
    01/13/12 18:25:06 Request 384970.00170:
    01/13/12 18:25:06 Matched 384970.170 uscms2535@glidein-2.t2.ucsd.edu <169.228.130.11:9615?sock=9763_cd4c_2> preempting none <192.168.3.77:55906?CCBID=169.228.130.23:9823#13833&noUDP> glidein_15335@lnxfarm177.colorado.edu
    01/13/12 18:25:06 Successfully matched with glidein_15335@lnxfarm177.colorado.edu
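Since cycle time is one of the things to watch in the NegotiatorLog, a small script can pair the "Started" and "Finished" markers. This is only a sketch under the assumption that the markers look exactly as in the example; the closing "Finished" line in the sample data is made up for illustration, because the slide cuts off before the cycle ends.

```python
import re
from datetime import datetime

MARKER_RE = re.compile(
    r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) -+ (Started|Finished) Negotiation Cycle -+"
)

def cycle_durations(lines):
    """Return the length in seconds of each complete negotiation cycle."""
    durations = []
    started = None
    for line in lines:
        match = MARKER_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        if match.group(2) == "Started":
            started = stamp
        elif started is not None:
            # A "Finished" marker closes the cycle opened by the last "Started".
            durations.append((stamp - started).total_seconds())
            started = None
    return durations

sample = [
    "01/13/12 18:23:05 ---------- Finished Negotiation Cycle ----------",
    "01/13/12 18:24:09 ---------- Started Negotiation Cycle ----------",
    "01/13/12 18:24:09 Phase 1: Obtaining ads from collector ...",
    "01/13/12 18:25:01 Phase 4.1: Negotiating with schedds ...",
    # Hypothetical end of the cycle, not taken from the slide:
    "01/13/12 18:26:30 ---------- Finished Negotiation Cycle ----------",
]
print(cycle_durations(sample))
```

A "Finished" marker with no preceding "Started" (as at the top of a rotated log) is ignored rather than guessed at.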
  • 27. CycleServer Screenshots ● Can do more than just monitoring ● But the rest is beyond the scope of this talk
  • 28. Frontend Monitoring
  • 29. Frontend monitoring [Diagram: Frontend node, where the Frontend spawns the Groups] ● Helper cmdline tool ● Plus, each Group provides: ● Activity/Error logs ● RRD files with statistics (running, held, etc.) ● XML files with current snapshot ● Resource ClassAds ● Master frontend aggregates RRD and XML files, and writes them in its own area ● Human readable/viewable Web pages available
  • 30. Helper cmdline tool ● Wrapper around condor_status: glideinWMS/tools/glidein_status.py ● Provides useful formatting
    ~/glideinWMS/tools$ ./glidein_status.py
    Name                                 Site    Factory     Entry               State   Activity
    glidein_6682@alicegrid26.ba.infn.it  Bari    v1_0@OSGGOC CMS_T2_IT_Bari_ce01 Claimed Busy
    glidein_10678@alicegrid32.ba.infn.it Bari    v1_0@OSGGOC CMS_T2_IT_Bari_ce01 Claimed Busy
    …
    glidein_5861@wp-05-12.pn.pd.infn.it  Legnaro v1_0@OSGGOC CMS_T2_IT_Legnaro_. Claimed Retiring

                                            Total Owner Claimed/Busy Claimed/Retiring Claimed/Other Unclaimed
    CMS_T2_US_Nebraska_Red_gw2@v1_0@OSGGOC     11     0           11                0             0         0
    CMS_T2_US_Purdue_hansen@v1_0@OSGGOC       522     0          517                0             0         5
    CMS_T2_US_Purdue_osg@v1_0@OSGGOC         1201     0         1182               14             0         5
    …
    CMS_T2_US_UCSD_gw4@Production_v4_2@UCSD   135     0          132                0             0         3
    Total                                   21474     0        19742             1264             0       468
  • 31. Log files ● Each Frontend group provides 3 types of logs: log/group_XXX/frontend.date.type.log ● info - Progress and warnings ● err - One-line warnings ● debug - Multi-line error messages ● The master frontend has similar logs: log/frontend/frontend.date.type.log ● But there is rarely anything interesting there
  • 32. Example Info Log
    :01-07:00 15037] Iteration at Tue Nov 15 10:44:01 2011
    :01-07:00 15037] Query condor
    :01-07:00 15037] Child processes created
    :05-07:00 31633] WARNING: Failed to talk to schedd submit-1.t2.ucsd.edu. See debug log for more details.
    :05-07:00 15037] All children terminated
    :05-07:00 15037] Jobs found total 4836 idle 1732 (old 1732, voms 1703) running 3104
    :05-07:00 15037] Glideins found total 639 idle 8 running 630 limit 800 curb 600
    :05-07:00 15037] Using 1 proxies
    :05-07:00 15037] Match
    :05-07:00 15037] Counting
    :05-07:00 15037] Child processes created
    :06-07:00 15037] All children terminated
    :06-07:00 15037] Total matching idle 1732 (old 1703) running 3104
    :06-07:00 15037] Jobs in schedd queues | Glideins | Request
    :06-07:00 15037] Idle (match eff old uniq ) Run ( here max ) | Total Idle Run | Idle MaxRun Down Factory
    :06-07:00 15037] 171( 1705 170 169 0) 3104( 102 250) | 105 1 103 | 10 3276 Up CMS_T2_US_Nebraska_Red@Produ
    :06-07:00 15037] 171( 1705 167 169 0) 3104( 187 250) | 197 4 193 | 10 3276 Up CMS_T2_US_Nebraska_Red_gw1@P
    :06-07:00 15037] 171( 1705 171 169 0) 3104( 0 250) | 0 0 0 | 10 3276 Down CMS_T2_US_Nebraska_Red_gw2@P
    :06-07:00 15037] 171( 1705 171 169 0) 3104( 62 250) | 62 0 62 | 10 3276 Up CMS_T2_US_Wisconsin_cms01@Pr
    :06-07:00 15037] 171( 1705 171 169 0) 3104( 71 250) | 71 0 71 | 10 3276 Up CMS_T2_US_Wisconsin_cms02@Pr
    :06-07:00 15037] 171( 1705 169 169 0) 3104( 88 250) | 96 2 94 | 10 3276 Up CMS_T2_US_Nebraska_Red@v1_0@
    :06-07:00 15037] 171( 1705 171 169 0) 3104( 1 250) | 1 0 1 | 10 3276 Up CMS_T2_US_Nebraska_Red_gw1@v
    :06-07:00 15037] 171( 1705 171 169 0) 3104( 0 250) | 0 0 0 | 10 3276 Down CMS_T2_US_Nebraska_Red_gw2@v
    :06-07:00 15037] 171( 1705 171 169 0) 3104( 45 250) | 45 0 45 | 10 3276 Up CMS_T2_US_Wisconsin_cms01@v1
    :06-07:00 15037] 171( 1705 170 169 0) 3104( 60 250) | 62 1 61 | 10 3276 Up CMS_T2_US_Wisconsin_cms02@v1
    :06-07:00 15037] Jobs in schedd queues | Glideins | Request
    :06-07:00 15037] Idle (match eff old uniq ) Run ( here max ) | Total Idle Run | Idle MaxRun Down Factory
    :06-07:00 15037] 1368(13640 1360 1352 0) 24832( 616 2000) | 639 8 630 | 80 26208 Up Sum of useful factories
    :06-07:00 15037] 342( 3410 342 338 0) 6208( 0 500) | 0 0 0 | 20 6552 Down Sum of down factories
    :06-07:00 15037] 27( 27 27 14 27) 0( 0 0) | 0 0 0 | 0 0 Down Unmatched
    :06-07:00 15037] Advertizing 10 requests
    :07-07:00 15037] Done advertizing
    :07-07:00 15037] Advertising 10 glideresource classads to the user pool
    :07-07:00 15037] Done advertising glideresource classads
    :07-07:00 15037] Writing stats
    :07-07:00 15037] Sleep
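The two "found" lines of each iteration already summarize the state of the group. As a hedged illustration, the sketch below pulls those numbers out of an info log; the regular expressions follow the wording of the example only, and the function and key names are hypothetical.

```python
import re

JOBS_RE = re.compile(
    r"Jobs found total (\d+) idle (\d+) \(old (\d+), voms (\d+)\) running (\d+)"
)
GLIDEINS_RE = re.compile(
    r"Glideins found total (\d+) idle (\d+) running (\d+) limit (\d+) curb (\d+)"
)

def parse_iteration(lines):
    """Extract the job and glidein totals logged in one Frontend iteration."""
    summary = {}
    for line in lines:
        match = JOBS_RE.search(line)
        if match:
            keys = ("total", "idle", "old", "voms", "running")
            summary["jobs"] = dict(zip(keys, map(int, match.groups())))
        match = GLIDEINS_RE.search(line)
        if match:
            keys = ("total", "idle", "running", "limit", "curb")
            summary["glideins"] = dict(zip(keys, map(int, match.groups())))
    return summary

sample = [
    ":05-07:00 15037] All children terminated",
    ":05-07:00 15037] Jobs found total 4836 idle 1732 (old 1732, voms 1703) running 3104",
    ":05-07:00 15037] Glideins found total 639 idle 8 running 630 limit 800 curb 600",
]
summary = parse_iteration(sample)
glideins = summary["glideins"]
print("idle glideins:", glideins["idle"], "of", glideins["total"])
```

Tracked over many iterations, the idle count relative to the total gives a first hint of unclaimed glideins.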
  • 33. Example log files
    frontend.20120113.err.log
    [2012-01-13T18:50:47-07:00 16444] Advertizing failed for 2 requests. See debug log for more details.
    [2012-01-13T18:52:59-07:00 19312] Failed to talk to schedd. See debug log for more details.
    frontend.20120113.debug.log
    [2012-01-13T18:50:47-07:00 16444] Advertizing failed: Error running '/home/frontend/glidecondor/sbin/condor_advertise -pool glidein-1.t2.ucsd.edu -tcp -multiple UPDATE_MASTER_AD /tmp/gfi_aw_276509240_16444_2'
    code 1:failed to send classad to <169.228.130.10:9618>
    failed to send classad to <169.228.130.10:9618>
    failed to send classad to <169.228.130.10:9618>
    failed to send classad to <169.228.130.10:9618>
    [2012-01-13T18:52:59-07:00 19312] Failed to talk to schedd: Schedd 'vocms120.cern.ch' not found
  • 34. Web pages 1/3 ● frontendStatus.html ● Historical overview ● Fully dynamic; allows zooming and selecting which elements to plot ● Shows everything by default, but can be restricted to a group and/or a Factory
  • 35. Web pages 2/3 ● frontendGroupGraphStatusNow.html ● Current snapshot in tabular form ● Useful for spotting problems
  • 36. Web pages 3/3 ● frontendGroupGraphStatusNow.html ● Also contains pie charts with the same info
  • 37. RRDs and XML files ● The Web pages are just a rendering of the RRD and XML files ● Raw data is loaded in the browser and rendered there ● No server-side code ● Other tools could use the same data ● Publicly available, if one knows the URL ● No user-identifying data, only summary stats
  • 38. Resource ClassAds ● Each Frontend Group advertises one ClassAd for each Factory it is requesting glideins from ● Type glideresource ● They contain pretty much everything the Frontend Group knows about the Factory: ● Factory attributes used for matchmaking ● Stats about the matching jobs ● What is being requested ● Even what the Factory is doing!
  • 39. Example query ● Not a Condor native type, so you must use -any ● Then constrain the type ● Remotely queryable
    $ condor_status -any -const 'MyType=="glideresource"' -format '%s\n' Name
    CMS_T2_US_Caltech_cit2@v1_0@OSGGOC@UCSD-v5_3.main
    CMS_T2_US_Caltech_cit@v1_0@OSGGOC@UCSD-v5_3.main
    CMS_T2_US_Florida_iogw1@v1_0@OSGGOC@UCSD-v5_3.main
    CMS_T2_US_Florida_pg@v1_0@OSGGOC@UCSD-v5_3.main
    ...
    CMS_T2_US_UCSD_gw2@v1_0@OSGGOC@UCSD-v5_3.main
    CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main
    CMS_T2_US_Wisconsin_cms01@v1_0@OSGGOC@UCSD-v5_3.main
    CMS_T2_US_Wisconsin_cms02@v1_0@OSGGOC@UCSD-v5_3.main
  • 40. Example ClassAd ● Currently more information than you get on the Web
    $ condor_status -any CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main -l
    [Identification]
    MyType = "glideresource"
    Name = "CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main"
    GlideClientName = "UCSD-v5_3.main"
    ...
    [Info about local jobs]
    GlideClientMonitorJobsIdle = 210.000000
    GlideClientMonitorJobsRunningHere = 213
    ...
    [What is being requested]
    GlideClientMonitorGlideinsRequestIdle = 50
    GlideClientMonitorGlideinsRequestMaxRun = 445
    ...
    [Factory attributes]
    GLIDEIN_Site = "UCSD"
    GLEXEC_BIN = "OSG"
    ...
    [Info about registered glideins]
    GlideClientMonitorGlideinsRunning = 215
    GlideClientMonitorGlideinsTotal = 216
    ...
    [Factory status]
    GlideFactoryMonitorStatusRunning = 339
    GlideFactoryMonitorStatusPending = 277
    GlideFactoryMonitorStatusHeld = 0
    ...
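Because the glideresource ClassAd carries both the Frontend's and the Factory's view, a script can compare the two. The following is a rough sketch that parses the long-format output into a dictionary; the parser handles only quoted strings and plain numbers, and reading the gap between the Factory's running count and the registered glideins as "worth a look" is an assumption, not a rule from the talk.

```python
def parse_classad(text):
    """Parse 'Attr = value' lines (condor_status -l style) into a dict.

    Quoted values become str, plain numbers become int or float,
    and anything else is kept as unparsed text.
    """
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if not sep:
            continue  # blank or elided ("...") line
        name, value = name.strip(), value.strip()
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            ad[name] = value[1:-1]
            continue
        try:
            ad[name] = float(value) if "." in value else int(value)
        except ValueError:
            ad[name] = value  # expression or boolean, left as text
    return ad

sample = """\
MyType = "glideresource"
GlideClientMonitorJobsIdle = 210.000000
...
GLIDEIN_Site = "UCSD"
GlideClientMonitorGlideinsTotal = 216
GlideFactoryMonitorStatusRunning = 339
"""
ad = parse_classad(sample)
gap = ad["GlideFactoryMonitorStatusRunning"] - ad["GlideClientMonitorGlideinsTotal"]
print(ad["GLIDEIN_Site"], "factory running minus registered glideins:", gap)
```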
  • 41. OK, now you know what's available. What will you do with all that information? (i.e. What to look for)
  • 42. Monitoring the health of the system ● Six major areas to look after; your goal is ● Few unclaimed glideins (both globally, and per site) ● No unmatched jobs ● Reasonably low restart rate (both global, and per site) ● Reasonably low job failure rate (both global, and per site) ● Negotiation cycle reasonably short ● Schedd node not overloaded
  • 43. Unclaimed glideins ● Frontend and Negotiator policies are not identical ● You may end up with glideins that never run any jobs ● The discrepancy can be big enough to be noticed on a global scale ● But more often it is just for one (or few) sites ● Short spikes are not a problem ● But long periods are
  • 44. How do you notice it? ● Historical Web monitoring [Screenshots: one "Bad" and one "Good" example plot] ● Ask for daily emails from the Factory ● Or write your own scripts that parse the RRDs (there are no Frontend report generators in glideinWMS at this time)
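If you do write your own script, the core check is telling a short spike from a long period of unclaimed glideins. A minimal sketch follows, assuming the (unclaimed, total) samples have already been extracted from the RRDs by other means; the thresholds and the function name are hypothetical and would need tuning per site.

```python
def sustained_unclaimed(samples, max_fraction=0.1, min_consecutive=3):
    """Return True if the unclaimed fraction exceeds max_fraction for at
    least min_consecutive consecutive samples.

    samples: iterable of (unclaimed, total) glidein counts, oldest first.
    """
    streak = 0
    for unclaimed, total in samples:
        if total > 0 and unclaimed / total > max_fraction:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # a good sample ends the current run
    return False

# Made-up data: one short spike versus a long stretch of unclaimed glideins.
spike = [(5, 100), (50, 100), (4, 100), (3, 100)]
long_period = [(5, 100), (30, 100), (40, 100), (35, 100)]
print(sustained_unclaimed(spike), sustained_unclaimed(long_period))
```

Running the same check per site as well as globally matches the advice to look at both scales.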
  • 45. How do you find the root cause? ● Analyze the latest snapshots ● condor_status/glidein_status ● condor_q ● Frontend Web ● Limit the search to a few sites, if possible ● Then start comparing (can be daunting!) ● Job Requirements, with ● Glidein Start expressions ● In theory, there is “condor_q -ana”, but it is usually worthless
  • 46. Unmatched jobs ● The other side of the problem ● Glideins are never requested for some jobs, so those jobs will never start! ● Two possible reasons ● Wrong Frontend matchmaking policy ● No available Factory entries to serve the job
  • 47. How do you notice it? ● “Unmatched Factory” in Web monitoring
  • 48. How do you find the root cause?
    ● Again, start with the latest snapshot
      ● condor_q
      ● condor_status -any -const 'MyType=="glideresource"'
    ● Get the (python) Match expression from XML
    ● Start comparing! (can be daunting!)
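    The comparison can be partly automated by evaluating the Match expression against a job and each glideresource ad. The following is only a sketch: it assumes the expression is a Python expression that refers to a job dictionary and a glidein dictionary, and the dictionary layout, entry names and expression shown here are illustrative, not taken from a real configuration.

```python
def entries_matching(match_expr, job, glideins):
    """Return names of glidein entries for which match_expr is true for this job."""
    code = compile(match_expr, "<match_expr>", "eval")
    matched = []
    for glidein in glideins:
        # Evaluate with only 'job' and 'glidein' visible (no builtins).
        if eval(code, {"__builtins__": {}}, {"job": job, "glidein": glidein}):
            matched.append(glidein["name"])
    return matched


# Hypothetical expression and ads, for illustration only.
match_expr = 'glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(",")'
glideins = [
    {"name": "entry_a", "attrs": {"GLIDEIN_Site": "UCSD"}},
    {"name": "entry_b", "attrs": {"GLIDEIN_Site": "Elsewhere"}},
]

print(entries_matching(match_expr, {"DESIRED_Sites": "UCSD,Nebraska"}, glideins))
print(entries_matching(match_expr, {"DESIRED_Sites": "Nowhere"}, glideins))
```

    An empty result for a job reproduces the "unmatched" situation: no entry would ever be asked for glideins on its behalf. Only evaluate expressions that come from your own configuration, since eval runs whatever it is given.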
  • 49. Restarted jobs
    ● Any restart == wasted CPU
    ● How do you notice it?
      ● condor_q is your friend here: condor_q -format '%i\n' NumJobStarts
      ● No historical/Web monitoring provided
    ● Why does it happen?
      ● Glidein disappears!
      ● End of lifetime hit
      ● Preemption policies (not in the default config, but you may set Condor to do it)
      ● Submit node overload (Condor daemons do not like being resource constrained!)
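    Since no historical monitoring is provided for restarts, the one-number-per-line output of the condor_q command above can be summarized with a few lines of script. This is a minimal sketch; the function name and the sample output are made up for illustration.

```python
def restart_summary(output):
    """Summarize one-integer-per-line NumJobStarts output."""
    starts = [int(tok) for tok in output.split()]
    restarted = [n for n in starts if n > 1]
    return {
        "jobs": len(starts),
        "restarted_jobs": len(restarted),
        # Every start beyond the first one is wasted CPU.
        "extra_starts": sum(n - 1 for n in restarted),
    }


# Hypothetical output of: condor_q -format '%i\n' NumJobStarts
sample_output = "0\n1\n1\n3\n2\n"
summary = restart_summary(sample_output)
print(summary)
```

    Saving this summary periodically (for example from cron) gives the historical view that the Web monitoring lacks.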
  • 50. Why do glideins disappear?
    ● Three main reasons
      ● Remote node just died (rare)
      ● Site preemption policy (some sites do this; nothing you can do. Learn who they are and act accordingly.)
      ● Glidein killed by Site because it exceeded slot limits
        – Most likely Memory (one of 2 limits the OSG factory advertises: GLIDEIN_MaxMemMBs)
    ● Why can limits be exceeded?
      ● Job underestimated resource use
      ● Frontend matchmaking logic problem (job told you it needed more resources than the limit!)
      ● Wrong advertised limits (Factory problem!)
  • 51. Wallclock limits
    ● Main resource limit is time
    ● The glidein automatically deals with it
      – Will go away before the deadline
      – … killing/preempting any jobs if needed!
    ● Limit advertised as
      – Factory: GLIDEIN_Max_Walltime (-Δ), in seconds
      – Glidein: GLIDEIN_ToDie, UNIX time
    ● Why may jobs reach the deadline?
      ● Like with all other resources
        – Job underestimates the time it needs
        – Frontend matchmaking logic problems
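    The arithmetic behind the deadline is simple: a job is safe only if the time it needs fits into what is left before GLIDEIN_ToDie. The sketch below illustrates that check in plain Python; in a real setup this logic lives in the Condor matchmaking expressions, and the function names, the requested_walltime value and the safety margin are assumptions made for illustration.

```python
def remaining_lifetime(glidein_to_die, now):
    """Seconds left before the glidein deadline (GLIDEIN_ToDie is UNIX time)."""
    return max(0, glidein_to_die - now)


def job_fits(requested_walltime, glidein_to_die, now, safety_margin=0):
    """True if the job's requested walltime (seconds) fits before the deadline."""
    return requested_walltime + safety_margin <= remaining_lifetime(glidein_to_die, now)


# Hypothetical numbers: the glidein dies two hours from "now".
now = 1_326_844_800
to_die = now + 7200

print(job_fits(3600, to_die, now))        # one-hour job
print(job_fits(7000, to_die, now, 600))   # needs 7000 s plus a 600 s margin
```

    If a job underestimates requested_walltime, this check passes and the job is still killed at the deadline, which is the first failure mode listed above.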
  • 52. Job failures
    ● Jobs can fail for many reasons
    ● You should monitor the ExitCode: condor_history -back -const 'JobStatus==5' -format '%i\n' ExitCode
    ● Knowing what users run is often needed to interpret errors
    ● For common WN errors, the Frontend admin should create an appropriate validation script
      ● So glideins fail, not user jobs
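    The exit codes printed by the condor_history command above can be tallied to get a failure rate and the most frequent error codes. A minimal sketch follows; the function names and the sample output are illustrative, and treating every non-zero exit code as a failure is an assumption that depends on what the users run.

```python
from collections import Counter


def exit_code_counts(output):
    """Count occurrences of each exit code in one-integer-per-line output."""
    return Counter(int(tok) for tok in output.split())


def failure_rate(counts):
    """Fraction of jobs with a non-zero exit code (assumed to mean failure)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - counts[0]) / total


# Hypothetical output of the condor_history command.
sample_output = "0\n0\n1\n127\n0\n"
counts = exit_code_counts(sample_output)
print(counts.most_common(), failure_rate(counts))
```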
  • 53. Negotiation time
    ● The negotiation time should be << 5 min
      ● If it is much longer, glideins may terminate without running any jobs
    ● Monitor the NegotiatorLog on the CM
    ● Possible causes
      ● CPU starvation (e.g. other processes)
      ● Autocluster explosion
        – Condor tries to be smart about matchmaking
        – But if users don't cooperate, it cannot do much
  • 54. Autoclustering
    ● Condor Schedd will try to group jobs (much faster if only few groups exist)
      ● All “similar jobs” will be matched together!
    ● What does “similar” mean?
      ● Similar == Would result in the same match
    ● How is it implemented?
      ● Tuple of attributes considered during matchmaking
      ● E.g. (DESIRED_Sites,ImageSize)
    ● How can the number of autoclusters explode?
      ● If an attribute that changes a lot is added
      ● Example of a really bad one: JobID
    https://condor-wiki.cs.wisc.edu/index.cgi/attach_get/220/cs739.pdf
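    The effect of a bad attribute is easy to see with a toy model that groups jobs by a tuple of attribute values. This is only an illustration of the idea, not Condor's actual implementation; the job list and the function name are made up.

```python
def autoclusters(jobs, significant_attrs):
    """Group jobs by the tuple of values of the significant attributes."""
    clusters = {}
    for job in jobs:
        key = tuple(job.get(attr) for attr in significant_attrs)
        clusters.setdefault(key, []).append(job["JobID"])
    return clusters


# 1000 hypothetical jobs that differ only in their desired site (two choices).
jobs = [
    {"JobID": i, "DESIRED_Sites": "UCSD" if i % 2 else "FNAL", "ImageSize": 2000}
    for i in range(1000)
]

few = autoclusters(jobs, ("DESIRED_Sites", "ImageSize"))
many = autoclusters(jobs, ("DESIRED_Sites", "ImageSize", "JobID"))
print(len(few), len(many))
```

    With JobID in the tuple, every job becomes its own group, so the negotiator has to consider each job separately instead of once per group.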
  • 55. Submit node health
    ● Condor is very sensitive to resource starvation
      ● If the submit node is overloaded, expect problems!
    ● How can we get to resource starvation?
      ● Poor planning (trying to run 3k jobs on a 1G RAM node???)
      ● Other processes (may steal CPU/RAM/IO from Condor)
    ● Interactive activity is particularly risky
      ● Due to its unpredictable nature
        – Including user errors
      ● But portals are not immune to resource overuse
  • 56. Summary
  • 57. Summary
    ● You have plenty of monitoring options
      ● Some prettier, some more powerful
    ● Most of the time, things just work
      ● So you don't need to constantly watch over your installation
    ● But occasionally things will break
      ● It is in your interest to notice it (or the users will tell you!)
      ● Having good monitoring tools will help you there!
  • 58. The End
  • 59. Pointers
    ● The official glideinWMS project Web page is http://tinyurl.com/glideinWMS
    ● The glideinWMS development team is reachable at glideinwms-support@fnal.gov
    ● The OSG glidein factory is reachable at osg-gfactory-support@physics.ucsd.edu
  • 60. Acknowledgments
    ● glideinWMS is a CMS-led project developed mostly at FNAL, with contributions from UCSD and ISI
    ● The glideinWMS factory operations at UCSD are sponsored by OSG
    ● The funding comes from NSF, DOE and the UC system