Standard Universe

In the standard universe, Condor provides checkpointing and remote system calls. These features make a job more reliable and give it uniform access to resources from anywhere in the pool. To run in the standard universe, a program must be relinked with condor_compile. Most programs can be prepared this way, but there are a few restrictions.

Condor checkpoints a job at regular intervals. A checkpoint image is essentially a snapshot of the current state of a job. If a job must be migrated from one machine to another, Condor makes a checkpoint image, copies the image to the new machine, and restarts the job from where it left off. If a machine crashes or fails while it is running a job, Condor can restart the job on a new machine using the most recent checkpoint image. In this way, jobs can run for months or years even in the face of occasional computer failures.
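To make the idea of "restart from the most recent checkpoint" concrete, here is a minimal sketch in C of checkpointing done by hand at the application level. It is only an illustration of the concept: in the standard universe Condor saves the state of the whole process for you, and your program contains no code like this. The names job_state, save_checkpoint, load_checkpoint and run_job, and the one-field file layout, are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical job state: only the loop counter has to survive a restart. */
struct job_state {
    long next_step;
};

/* Write the state to `path`. Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, const struct job_state *st)
{
    FILE *f = fopen(path, "wb");
    size_t written;

    if (f == NULL)
        return -1;
    written = fwrite(st, sizeof *st, 1, f);
    if (fclose(f) != 0 || written != 1)
        return -1;
    return 0;
}

/* Read the state back. Returns 0 on success, -1 if there is no usable checkpoint. */
int load_checkpoint(const char *path, struct job_state *st)
{
    FILE *f = fopen(path, "rb");
    size_t got;

    if (f == NULL)
        return -1;
    got = fread(st, sizeof *st, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}

/* Run the steps that are still pending, saving a checkpoint every
   `interval` steps. Returns the number of steps executed in this call. */
long run_job(const char *path, long total_steps, long interval)
{
    struct job_state st;
    long executed = 0;

    assert(interval > 0);
    if (load_checkpoint(path, &st) != 0)
        st.next_step = 0;               /* no checkpoint: start from scratch */

    while (st.next_step < total_steps) {
        /* ... one unit of real work would go here ... */
        st.next_step++;
        executed++;
        if (st.next_step % interval == 0)
            save_checkpoint(path, &st); /* at most `interval` steps are lost */
    }
    return executed;
}
```

If the program is killed and started again with the same checkpoint file, run_job repeats only the steps made after the last saved checkpoint. A more careful version would write the checkpoint to a temporary file and rename it, so that a crash in the middle of a write cannot corrupt the only copy.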

To convert your program into a standard universe job, you must use condor_compile to relink it with the Condor libraries. Put condor_compile in front of your usual link command. You do not need to modify the program's source code, but you do need access to the unlinked object files. A commercial program that is packaged as a single executable file cannot be converted into a standard universe job.

For example, if you would normally link the job by executing:

% cc main.o tools.o -o program

then relink the job for Condor with:

% condor_compile cc main.o tools.o -o program

There are a few restrictions on standard universe jobs. Before you run a standard universe job, check these restrictions in section 2.4.1.1 of the manual:

http://research.iac.es/sieinvens/SINFIN/Condor/v6.6/2_4Road_map_Running.html.

At the IAC, we have opted for a partial install of condor_compile. Because of this, you can only use condor_compile with one of these programs:

Example

Our very useful program!

This program does nothing but loop, printing a progress line from time to time. On a fast machine it should take about three hours to finish.

#include <stdio.h>

int main (int argc, char *argv[])
{
  long this_number, other_number;

  this_number = 1;

  /* Outer loop: about ten million passes. */
  while (this_number < 10000000) {
    other_number = 1;

    /* Inner loop: busy work; print a progress line once per 1000 outer passes. */
    while (other_number < 100000) {
      if (!(this_number % 1000) && (other_number == 1))
        printf("%ld\n", this_number);
      other_number = other_number + 1;
    }
    this_number = this_number + 1;
  }
  return 0;
}
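
As a rough check on the size of this job, the loop bounds can be turned into numbers. The sketch below uses two hypothetical helpers, total_iterations and lines_printed, which are not part of the example program. For the bounds used above they give about 10^12 inner iterations and 9999 lines of output (1000, 2000, and so on up to 9999000). A three-hour run would therefore correspond to roughly 9 x 10^7 inner iterations per second.

```c
#include <assert.h>

/* Inner-loop iterations for two nested loops that both start at 1 and
   run while the counter is below its limit, as in the example program. */
long long total_iterations(long long outer_limit, long long inner_limit)
{
    assert(outer_limit >= 1 && inner_limit >= 1);
    return (outer_limit - 1) * (inner_limit - 1);
}

/* Lines printed: one for each multiple of `step` in the range [1, outer_limit). */
long long lines_printed(long long outer_limit, long long step)
{
    assert(outer_limit >= 1 && step >= 1);
    return (outer_limit - 1) / step;
}
```

Calling total_iterations(10000000LL, 100000LL) returns 999989900001, and lines_printed(10000000LL, 1000LL) returns 9999, which is the line count to expect in the output file once the job has finished.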

Submission file

########################################################
##
## Example Standard Universe
##
## File: submit_looping_std
##
########################################################

executable = looping_std_solaris_stripped
universe = standard
Requirements = Arch == "SUN4u" && OpSys == "SOLARIS29"

Initialdir = /net/guinda/scratch/angelv/Condor-Course/
output = std_universe.out
error =  std_universe.err
log =    std_universe.log
queue

Running the code

naranja(97)~/SCRIPTS/CONDOR/> condor_compile cc -o looping_std_solaris looping.c 
LINKING FOR CONDOR : /usr/ccs/bin/ld
/opt/SUNWspro/SC5.0/lib/crti.o /usr/pkg/condor/condor-6.6.3/lib/condor_rt0.o
/opt/SUNWspro/SC5.0/lib/values-xa.o -o looping_std_solaris looping.o -Y
P,/opt/SUNWspro/SC5.0/lib:/usr/ccs/lib: /usr/lib -Qy
/usr/pkg/condor/condor-6.6.3/lib/libcondorzsyscall.a
/usr/pkg/condor/condor-6.6.3/lib/libz.a -Bdynamic -lsocket -lnsl -lc
/opt/SUNWspro/SC5.0/lib/crtn.o
/usr/pkg/condor/condor-6.6.3/lib/libcondorc++support.a


naranja(102)~/SCRIPTS/CONDOR/> cp looping_std_solaris looping_std_solaris_stripped

naranja(103)~/SCRIPTS/CONDOR/> strip looping_std_solaris_stripped

naranja(107)~/SCRIPTS/CONDOR/> ls -l
total 41728
-rwxr-xr-x   1 angelv   other       4673 Sep 29 17:48 looping
-rw-r--r--   1 angelv   other        382 Sep 29 17:47 looping.c
-rwxr-xr-x   1 angelv   other    12327189 Sep 29 17:53 looping_std_linux
-rwxr-xr-x   1 angelv   other    1333784 Sep 29 17:56 looping_std_linux_stripped
-rwxr-xr-x   1 angelv   other    5678624 Sep 29 17:54 looping_std_solaris
-rwxr-xr-x   1 angelv   other     678676 Sep 29 17:55 looping_std_solaris_stripped
-rw-r--r--   1 angelv   other        138 May 26 16:52 submit_looping_std
naranja(108)~/SCRIPTS/CONDOR/>

naranja(106)~/SCRIPTS/CONDOR/> ./looping_std_solaris_stripped
Condor: Notice: Will checkpoint to ./looping_std_solaris_stripped.ckpt
Condor: Notice: Remote system calls disabled.
1000
2000
[...]
naranja(107)~/SCRIPTS/CONDOR/>

-------------------------------------------------------------------------

naranja(107)~/SCRIPTS/CONDOR/> cat /scratch/angelv/Condor-Course/std_universe.log

000 (188.000.000) 09/29 18:24:30 Job submitted from host: <161.72.81.187:51962>
...
001 (188.000.000) 09/29 18:25:16 Job executing on host: <161.72.65.35:37169>
...
006 (188.000.000) 09/29 18:27:22 Image size of job updated: 2961
...
004 (188.000.000) 09/29 18:27:30 Job was evicted.
        (1) Job was checkpointed.
                Usr 0 00:02:03, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        2353744  -  Run Bytes Sent By Job
        680030  -  Run Bytes Received By Job
...
001 (188.000.000) 09/29 18:31:21 Job executing on host: <161.72.65.35:37169>
...
004 (188.000.000) 09/29 18:33:31 Job was evicted.
        (1) Job was checkpointed.
                Usr 0 00:01:56, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        2353400  -  Run Bytes Sent By Job
        3032210  -  Run Bytes Received By Job
...
001 (188.000.000) 09/29 18:42:26 Job executing on host: <161.72.65.11:44853>
...
004 (188.000.000) 09/29 20:05:02 Job was evicted.
        (1) Job was checkpointed.
                Usr 0 01:20:57, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        2353400  -  Run Bytes Sent By Job
        3032146  -  Run Bytes Received By Job
...
001 (188.000.000) 09/29 20:13:26 Job executing on host: <161.72.69.18:45267>
...
006 (188.000.000) 09/30 02:13:41 Image size of job updated: 2993
...
003 (188.000.000) 09/30 02:14:19 Job was checkpointed.
        Usr 0 05:57:16, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
...
003 (188.000.000) 09/30 08:14:11 Job was checkpointed.
        Usr 0 11:53:43, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
...

[...]

...
001 (188.000.000) 09/30 11:58:17 Job executing on host: <161.72.66.25:61440>
...
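
Each event in this log starts with a header line made of a three-digit event code, the job identifier in parentheses, and a time stamp. In the listing above, 000 is a submission, 001 the start of execution, 003 a checkpoint, 004 an eviction and 006 an image size update. The following is a minimal sketch in C of how such header lines could be picked apart to count, say, the evictions of a job. The names log_event, parse_event_header and count_events are hypothetical, and the reading of the identifier as cluster.proc.subproc is an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Header fields of one event, e.g.
   "004 (188.000.000) 09/29 18:27:30 Job was evicted." */
struct log_event {
    int code;                   /* 0, 1, 3, 4, 6 in the listing above */
    int cluster, proc, subproc; /* assumed meaning of the three id fields */
    int month, day, hour, minute, second;
};

/* Parse an event header line. Returns 1 on success and 0 for any other
   line (indented detail lines and "..." separators end up here). */
int parse_event_header(const char *line, struct log_event *ev)
{
    int n;

    assert(line != NULL && ev != NULL);
    n = sscanf(line, "%3d (%d.%d.%d) %d/%d %d:%d:%d",
               &ev->code, &ev->cluster, &ev->proc, &ev->subproc,
               &ev->month, &ev->day, &ev->hour, &ev->minute, &ev->second);
    return n == 9;
}

/* Count the events with the given code in an already opened log file. */
int count_events(FILE *log, int code)
{
    char line[512];
    struct log_event ev;
    int count = 0;

    assert(log != NULL);
    while (fgets(line, sizeof line, log) != NULL)
        if (parse_event_header(line, &ev) && ev.code == code)
            count++;
    return count;
}
```

With the log opened through fopen, count_events(log, 4) gives the number of evictions recorded so far and count_events(log, 3) the number of periodic checkpoints.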


[angelv@guinda Condor-Course]$ condor_q

-- Submitter: guinda.iac.es : <161.72.81.187:51962> : guinda.iac.es
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 188.0   angelv          9/29 18:24   0+18:26:51 R  0   2.9  looping_std_solari

1 jobs; 0 idle, 1 running, 0 held


[angelv@guinda Condor-Course]$ condor_q -l 188.0
-- Submitter: guinda.iac.es : <161.72.81.187:51962> : guinda.iac.es
MyType = "Job"
TargetType = "Machine"
ClusterId = 188
QDate = 1096478669
[...]
Iwd = "/net/guinda/scratch/angelv/Condor-Course/"
JobUniverse = 1
Cmd = "/home/angelv/Condor-Course/Standard_Universe/looping_std_solaris_stripped"
[...]
Requirements = (Arch == "SUN4u" && OpSys == "SOLARIS29") && 
               ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && 
               ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED)) && 
               (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
[...]
TotalSuspensions = 6
CumulativeSuspensionTime = 2853
[...]
NumCkpts = 8
NumRestarts = 13
CkptArch = "SUN4u"
CkptOpSys = "SOLARIS29"
RemoteWallClockTime = 61103.000000
LastRemoteHost = "avestruz.ll.iac.es"
[...]
RemoteHost = "gata.ll.iac.es"
RemoteVirtualMachineID = 1
ShadowBday = 1096541885
JobLastStartDate = 1096539606
JobCurrentStartDate = 1096541885
JobRunCount = 7
WallClockCheckpoint = 4242
ServerTime = 1096547198

[angelv@guinda Condor-Course]$
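
The time attributes in the long listing are in seconds: QDate, JobCurrentStartDate and ServerTime are seconds since the Unix epoch, and RemoteWallClockTime is a duration. The sketch below, in C, shows one way to relate them to the RUN_TIME column of condor_q. It assumes that RUN_TIME is the wall-clock time of the earlier runs plus the time spent in the current run. This is an inference from the numbers shown here, not a statement of how condor_q computes the column, and the names run_time, total_run_seconds and split_run_time are hypothetical.

```c
#include <assert.h>

/* A duration split the way condor_q prints RUN_TIME: days+HH:MM:SS. */
struct run_time {
    long days;
    int hours, minutes, seconds;
};

/* Wall-clock seconds of earlier runs plus the run in progress.
   Assumption: the run in progress started at JobCurrentStartDate. */
long total_run_seconds(double remote_wall_clock_time,
                       long server_time, long job_current_start_date)
{
    assert(server_time >= job_current_start_date);
    return (long)remote_wall_clock_time
           + (server_time - job_current_start_date);
}

/* Split a number of seconds into days, hours, minutes and seconds. */
struct run_time split_run_time(long total_seconds)
{
    struct run_time rt;

    assert(total_seconds >= 0);
    rt.days    = total_seconds / 86400;
    rt.hours   = (int)(total_seconds % 86400 / 3600);
    rt.minutes = (int)(total_seconds % 3600 / 60);
    rt.seconds = (int)(total_seconds % 60);
    return rt;
}
```

With the values above, total_run_seconds(61103.0, 1096547198L, 1096541885L) gives 66416 seconds, which split_run_time turns into 0 days, 18 hours, 26 minutes and 56 seconds. That is five seconds more than the RUN_TIME of 0+18:26:51 reported by the plain condor_q, which was run just before.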

Angel M de Vicente 2004-10-25