
HTCondor(6): HTCondor and IDL

Please note that all the SIEpedia's articles address specific issues or questions raised by IAC users, so they do not attempt to be rigorous or exhaustive, and may or may not be useful or applicable in different or more general contexts.

IMPORTANT: This documentation is deprecated. It will not be further updated. The new documentation for HTCondor can be found here

HTCondor and IDL

How to run IDL jobs with HTCondor unencumbered by IDL licences

A recurring question has been whether IDL jobs can be run with HTCondor. The use of licensed IDL with HTCondor is limited by the number of licenses available at any given time (which means you could perhaps run only 20-30 jobs simultaneously). For that reason, we strongly recommend that you use the IDL Virtual Machine (IDL VM) whenever possible: it lets you run an IDL "executable" file (a SAVE file) without the need for licenses, so there is no limit on the number of jobs you can run concurrently. Most of you probably know the necessary steps to create a SAVE file, but if in doubt see here for an example of how to create such a file.

Note: If for any reason you are not able to generate a SAVE file, please contact us and we will help you find other ways of executing IDL with HTCondor. Remember that running IDL jobs without the Virtual Machine consumes licenses, and in that case you must limit the number of concurrent jobs with a submit file command like concurrency_limits = idl:40 (see the sketch below).
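
For reference, this is how such an option would look inside a submit file (a minimal sketch; the value 40 is just the one used above, and the actual number of simultaneous jobs it allows depends on how the idl limit is configured in our pool, so ask us if in doubt):

   # Jobs that use licensed IDL (no VM) must declare the licenses they consume,
   # so that HTCondor does not start more jobs than the available licenses allow
   concurrency_limits = idl:40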

Submitting a job to HTCondor using the IDL Virtual Machine (for the impatient)

All you will need to do in order to run your IDL jobs with the Virtual Machine is:

  1. Modify your IDL program so that it takes an argument (from 0 to N-1, where N is the number of jobs you want to submit with HTCondor) and acts according to that argument. A sample IDL program to illustrate this could be the following one (we will name it subs.pro):
   PRO SUBS

   ; command_line_args() returns the values given after "-args" as an array of strings
   args = command_line_args()

   print, 'Original argument   ', args(0)
   print, 'Modified   ', args(0)*2

   print, 'Wasting ', args(0), ' seconds'
   wait, args(0)

   print, 'I (IDL) have finished...'
   END
  2. Create a SAVE file from it. Usually you just need to compile your program and generate the SAVE file with your compiled routines. The name of the SAVE file has to be the same as that of the routine you want to execute. If you have any issues creating this file, please check the page with more information and examples:
   [...]$ idl
   IDL> .FULL_RESET_SESSION
   IDL> .COMPILE subs.pro
   IDL> RESOLVE_ALL 
   IDL> SAVE, /ROUTINES, FILENAME='subs.sav'
   IDL> exit
   [...]$ 
  3. Verify that this works with the IDL Virtual Machine without HTCondor (the IDL Virtual Machine will show a splash screen, where you have to press the "Click To Continue" button, and then the program will be executed):
   [...]$ idl -vm=subs.sav -args 10
   IDL Version 8.3 (linux x86_64 m64). (c) 2013, Exelis Visual Information Solutions, Inc.

   Original argument   10
   Modified         20
   Wasting 10 seconds
   I (IDL) have finished...
   [...]$
  4. Write the HTCondor submit file. If you are new to HTCondor, you might want to look at our documentation about submit files (check also other sections like Introduction, Useful commands or FAQs). In the following example you will need to modify:
     • The arguments line, which has 4 items: the first one is the path to the SAVE file; the second one is the argument to pass to it; the third one is 1 if you use a left-handed mouse, and 0 otherwise; and the fourth one is 1 if you want verbose messages for debugging, or 0 otherwise.
     • NOTE: keep the line "next_job_start_delay = 1" as it is; it delays consecutive job start-ups, which helps to avoid problems when several IDL VMs are initialized at the same time on the same machine (see the note about held jobs below).
    N            = 20
    ID           = $(Cluster).$(Process)
    FNAME        = idl_vm
    Universe     = vanilla                   
    Notification = error
    should_transfer_files   = YES 
    when_to_transfer_output = ON_EXIT                                               

    output       = $(FNAME).$(ID).out
    error        = $(FNAME).$(ID).err
    Log          = $(FNAME).$(Cluster).log    

    transfer_input_files   = subs.sav
     # Use the next command when specific output files have to be copied back to your machine:
    #transfer_output_files  = 
    Executable   = /home/condor/SIE/idlvm_with_condor.sh
    arguments    = subs.sav $(Process) 0 1

    next_job_start_delay = 1                                  
    queue $(N)
  5. Submit it to HTCondor and go for a cup of coffee while the programs are executed...
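
For example, assuming you saved the submit description above in a file called subs.submit (the file name is just an example), the submission and a quick check would look like this:

   [...]$ condor_submit subs.submit
   [...]$ condor_q                    # check the status of your jobs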

Note: Why do some of my jobs get the "on hold" status?

When executing jobs with the IDL VM, it can happen that some jobs get the "on hold" status. That means some problems occurred with your jobs and HTCondor is waiting for you to solve them before continuing with the execution. You can use the condor_q -hold command to get more information about why they were held. If there is no apparent cause and you are sure that your jobs are correct, the problem might be related to the initialization of the IDL Virtual Machine: sometimes this process takes longer than usual on some specific machines, and if in the meantime more jobs try to initialize other IDL VMs on the same machine, some of them may fail and your jobs will get the "on hold" status. This can happen randomly and there is no easy way to avoid it.

If you are 100% sure that your program runs fine and the problem is caused by IDL, then you can use the condor_release -all command: all your held jobs will get the idle status again, so they will hopefully run with no problems on other machines. If some of your jobs fail again, you may need to repeat the condor_release command several times until all the jobs are done. If that happens too often, you can automate the releases: for instance, you can add a periodic_release command in your submit file (see the example below) and HTCondor will periodically release your held jobs, or you can use a combination of condor_release and shell tools like crontab, watch, etc.
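
As a sketch of the first option, a line like the following could be added to the submit file shown above (the expression is only an example; adapt it to your needs):

   # Example only: periodically release held jobs, giving up once a job
   # has already been started 5 times
   periodic_release = (NumJobStarts < 5)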

On the other hand, if your jobs get the "on hold" status again after being released, then the problem might not be related to IDL and you should check your application to find the error (remember that you can get more information about held jobs using condor_q -hold).
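
For example, to check why your jobs were held and to release them once you have reviewed them:

   [...]$ condor_q -hold        # lists your held jobs together with the hold reason
   [...]$ condor_release -all   # puts all your held jobs back in the idle state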
