IBM-cluster: Batchjobs

All jobs on the IBM-cluster must be executed as batchjobs through the queueing system, IBM's LoadLeveler. It works a bit differently from other queueing systems you may know (e.g. LSF, NQS or PBS). The queueing system is responsible for providing the resources required by the job. Resources are never overbooked: if a job requires N CPUs, it gets those N CPUs exclusively when it starts running. One caveat, however, follows from the design of the Power4: the L3 cache is shared between two CPUs, and you cannot control what is going on on the "neighbor" CPU.

Currently these queues (classes) are defined:

| Queue | Description | Limits/remarks |
|-------|-------------|----------------|
| q1    | For serial jobs. | Jobs may be suspended by jobs from qpar and qexp. |
| qi    | For non-urgent serial jobs (idle queue). | These jobs will be suspended by any job from other queues. |
| qpar  | For parallel jobs requiring at most 8 processors. | On Sleipner or Fenris, but not mixed. Max 48 processors for these jobs in total. Suspends jobs in q1 if necessary. |
| qexp  | For short (less than 1 hour) high-priority jobs. | On all machines. Suspends jobs in q1 if necessary. |
| quu   | Run the job on Hugin or Munin. One or more CPUs. | Jobs will not be suspended by other jobs. |
| qsl   | Run the job on Sleipner. One or more CPUs. | Max 8 processors for these jobs in total. Jobs will not be suspended by other jobs. |
| qfe   | Run the job on Fenris. One or more CPUs. | Max 8 processors for these jobs in total. Jobs will not be suspended by other jobs. |
| qmult | For parallel jobs having threads on Sleipner and Fenris simultaneously. | Only for validated users. Contact staff for access to this queue. |
| qncwh | For special jobs requiring extra resources. | Only for validated users. Contact staff for access to this queue. |
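
As a rough aid to reading the queue table, the sketch below maps a CPU count to one of the general-purpose queues. The function name pick_class is hypothetical, and the rule only encodes the limits listed above (one CPU: q1; two to eight CPUs: qpar). It ignores qi, qexp and the node-specific queues, which you would choose for other reasons.

```shell
#!/bin/sh
# pick_class NCPUS: print a suggested class for a job needing NCPUS processors.
# Hypothetical helper; it only encodes the limits from the queue table.
pick_class() {
    ncpus=$1
    if [ "$ncpus" -eq 1 ]; then
        echo q1          # serial jobs
    elif [ "$ncpus" -ge 2 ] && [ "$ncpus" -le 8 ]; then
        echo qpar        # parallel jobs, at most 8 processors
    else
        echo "no general queue for $ncpus CPUs; contact staff" >&2
        return 1
    fi
}

pick_class 4
```

The value printed is what you would put after "# @ class =" in the jobscript.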

A LoadLeveler job is a jobscript consisting of two parts: the first part contains instructions to LoadLeveler itself, and the second part contains the actual commands to be carried out by the system. Below are examples of jobscripts for both sequential and parallel jobs.

Notice that each of the node-specific queues, qsl, qfe and quu, can occupy at most 8 CPUs. This means that jobs submitted to these queues may stay inactive even if there are idle CPUs on the system. The advantage of using node-specific queues is that the job will never be suspended; the drawback is that the job will probably wait longer before it is started.
Consider using this construct instead:

    # @ class = q1
    # @ requirements = (Machine == "sleipner") 

When the job starts, a unique directory is created in the /scratch filesystem. You can refer to this directory via the SCRDIR environment variable, as shown in the jobscripts below. When the job terminates, the scratch directory and its contents are automatically erased.

(If 'root' owns the file DONOTREMOVE in SCRDIR, the directory will be moved to a safe place instead of being erased. Saved directories are kept for a couple of weeks before they are deleted. Contact staff for further information.)
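
Because the scratch directory is erased when the job ends, anything worth keeping has to be copied out before the script finishes. The following is a minimal sketch of that pattern, not a complete jobscript. It assumes mktemp is available; RESULTDIR is a hypothetical variable that defaults to a temporary directory here so that the sketch is harmless to try outside the batch system. In a real job you would point it at something like $HOME/resultdir.

```shell
#!/bin/sh
# Outside the batch system SCRDIR is unset, so fall back to a temporary
# directory for a dry run; inside a job the system-provided value is used.
: "${SCRDIR:=$(mktemp -d)}"

# Hypothetical destination for results (assumption, not a site convention).
RESULTDIR=${RESULTDIR:-$(mktemp -d)}
mkdir -p "$RESULTDIR"

cd "$SCRDIR" || exit 1
echo "intermediate data" > work.tmp   # stays in scratch and is erased with it
echo "final result"      > output     # stand-in for the program's real output

# Copy what must survive before the job ends.
cp output "$RESULTDIR/" || echo "copy to $RESULTDIR failed" >&2
```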

Useful commands for handling batchjobs:

Submit jobs to the system
% llsubmit jobscript
Delete a pending or running job
% llcancel jobid
Hold a pending job to prevent it from starting
% llhold jobid
Release a held job
% llhold -r jobid
Display jobs in the queues
% js
Display statistics about running job distribution
% bj
Display detailed information about jobs
% llq -l -x
Display the reason why a job is pending
% llq -s jobid
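
A typical sequence of these commands might look like the sketch below. The run wrapper and the JOBID variable are hypothetical additions for illustration: run only prints a command when the tool is not installed, so the sequence can be read through on a machine without LoadLeveler. On the cluster itself the commands would really be executed, so replace jobscript and the job id with your own values first.

```shell
#!/bin/sh
# run CMD [ARGS...]: execute CMD if it is installed, otherwise only print it.
run() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "(not installed) $*"
    fi
}

JOBID=${JOBID:-jobid}          # placeholder; use the id reported by llsubmit

run llsubmit jobscript         # submit the job
run llq -s "$JOBID"            # ask why it is still pending
run llhold "$JOBID"            # keep it from starting for now
run llhold -r "$JOBID"         # release it again
run llcancel "$JOBID"          # remove the job after all
```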

You can monitor the system processes with the top command.
The current use of the /scratch filesystem can be monitored with the duscr command.

Example of a jobscript for a serial batchjob:

#!/bin/sh
# @ job_name = serjob
# @ job_type = serial
# @ class = q1
# @ input = /dev/null
# @ output = $(job_name).$(jobid).o
# @ error =  $(job_name).$(jobid).e
# @ notification = never
# @ queue

cp where/my/program/is/a.out $SCRDIR/
cp where/my/indata/is/* $SCRDIR/
cd $SCRDIR
./a.out > output
cp output $HOME/resultdir/
echo === Job finished at `date` ====
#

Example of a jobscript for a (naively) parallel job:

#!/bin/sh
# @ job_name = par4job
# @ job_type = parallel
# @ total_tasks = 4
# @ class = qpar
# @ input = /dev/null
# @ output = $(job_name).$(jobid).o
# @ error =  $(job_name).$(jobid).e
# @ notification = never
# @ queue

cd where/my/program/is
./program > $SCRDIR/output.1 &
./program > $SCRDIR/output.2 &
./program > $SCRDIR/output.3 &
./program > $SCRDIR/output.4 &
wait
grep "Result:" $SCRDIR/output.*
echo === Job finished at `date` ====
#
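
A bare wait, as in the script above, waits for all background processes but does not report whether any of them failed. If that matters, one option is to remember each process id and wait for them one at a time. The sketch below illustrates this with a stand-in function, task, in place of ./program; the function and the choice of which task fails are made up for the demonstration.

```shell
#!/bin/sh
# Stand-in for ./program: every task succeeds except task 3.
task() {
    [ "$1" -ne 3 ]
}

pids=""
for i in 1 2 3 4; do
    task "$i" &                 # start each instance in the background
    pids="$pids $!"             # remember its process id
done

failed=0
for pid in $pids; do
    # "wait PID" returns that process's exit status; a bare "wait" returns 0
    wait "$pid" || failed=$((failed + 1))
done

echo "$failed of 4 tasks failed"
```

Run as written, this prints "1 of 4 tasks failed".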

Example of a jobscript for a parallel MPI job:

#!/bin/sh
# @ job_name = mpijob
# @ job_type = parallel
# @ total_tasks = 4
# @ class = qpar
# @ executable = /usr/bin/poe
# @ arguments = /full/path/to/mpiprogram inputfile
# @ environment = MP_SHARED_MEMORY=yes
# @ input = /dev/null
# @ output = $(job_name).$(jobid).o
# @ error =  $(job_name).$(jobid).e
# @ notification = never
# @ queue
#

Note: /usr/bin/poe is the initiator of the MPI program (on other systems you would use mpirun). The program specified as the value of the arguments keyword (here: /full/path/to/mpiprogram) is the actual MPI program.

Example of a jobscript for a parallel MPI job with a manual call to poe:

#!/bin/sh
# @ job_name = mpijob
# @ job_type = parallel
# @ total_tasks = 4
# @ class = qpar
# @ environment = MP_SHARED_MEMORY=yes 
# @ input = /dev/null
# @ output = $(job_name).$(jobid).o
# @ error =  $(job_name).$(jobid).e
# @ notification = never
# @ queue

cp path/to/mpiprogram $SCRDIR
cp path/to/indata/input_files.* $SCRDIR
cd $SCRDIR
/usr/bin/poe ./mpiprogram argument1 argument2 > logfile
cp output_files.* $HOME/path/to/output_files/ 
grep error logfile
echo === Job finished at `date` ====
#

You may prefer this setup if you want to prepare the input data (copy it into place) before actually running the MPI program, and if you want to do some postprocessing after the program has terminated.

Example of a jobscript for an auto-parallel (or OpenMP) job:

#!/bin/sh
# @ job_name = openmpjob
# @ job_type = parallel
# @ total_tasks = 4
# @ class = qpar
# @ input = /dev/null
# @ output = $(job_name).$(jobid).o
# @ error =  $(job_name).$(jobid).e
# @ notification = never
# @ queue

cd where/my/program/is
export OMP_NUM_THREADS=4
./openmpprogram > $SCRDIR/output
grep "Result:" $SCRDIR/output
echo ========= Job finished ===================
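
In the script above the number 4 appears twice, in total_tasks and in OMP_NUM_THREADS, and the two are meant to agree. One way to keep them in step is to generate the jobscript from a single value. The sketch below shows this; make_openmp_job is a hypothetical helper, and the generated text simply mirrors the example above.

```shell
#!/bin/sh
# make_openmp_job N: print a jobscript in which total_tasks and
# OMP_NUM_THREADS are both taken from N. Hypothetical helper.
make_openmp_job() {
    n=$1
    # \$ keeps LoadLeveler/shell variables literal in the generated script.
    cat <<EOF
#!/bin/sh
# @ job_name = openmpjob
# @ job_type = parallel
# @ total_tasks = $n
# @ class = qpar
# @ input = /dev/null
# @ output = \$(job_name).\$(jobid).o
# @ error = \$(job_name).\$(jobid).e
# @ notification = never
# @ queue

cd where/my/program/is
export OMP_NUM_THREADS=$n
./openmpprogram > \$SCRDIR/output
EOF
}

make_openmp_job 4
```

Redirect the output to a file and pass that file to llsubmit.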


Example of a jobscript using the multi-node queue, qmult:

#!/bin/sh
# @ job_name = multi-mpi
# @ job_type = parallel
# @ node = 2
# @ tasks_per_node = 2
# @ class = qmult
# @ environment = MP_SHARED_MEMORY=yes
# @ network.mpi = gigabit
# @ input = /dev/null
# @ output = $(job_name).$(jobid).o
# @ error =  $(job_name).$(jobid).e
# @ notification = start
# @ queue

/usr/bin/poe ./mpiprogram

Note: Contact staff to get access to this queue. The job will run 4 threads in total, 2 on Sleipner and 2 on Fenris. The line # @ network.mpi = gigabit is added automatically if it is not specified; it instructs the threads to communicate via the Gigabit network between the two machines. Threads on the same machine communicate via shared memory.

Automatic generation of batchjob-scripts for Gaussian jobs:


Gaussian 03 Rev. B05 is available for all users affiliated with the University of Aarhus. The easiest way to use G03 is via the job submission utility, subg03. Just type "subg03 inputfile" (where inputfile is the Gaussian command file). The script will automatically create a batchjob, serial or parallel according to the %nproc directive in the inputfile, and submit the job to the system.
See here for further instructions on how to use the subg03 utility, or type "subg03 --help" at the command prompt.
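
The choice between a serial and a parallel job can be pictured as follows. This is only a sketch of the idea and not the actual subg03 implementation: the function name nproc_of and the sample input file are made up for illustration, and mktemp is assumed to be available.

```shell
#!/bin/sh
# nproc_of FILE: print the value of the %nproc directive in a Gaussian
# input file, or 1 if the directive is absent.
nproc_of() {
    n=$(sed -n 's/^%[Nn][Pp][Rr][Oo][Cc] *= *\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
    echo "${n:-1}"
}

# Made-up input file for the demonstration.
input=$(mktemp)
printf '%%nproc=4\n%%mem=256MB\n# HF/6-31G(d)\n' > "$input"

n=$(nproc_of "$input")
if [ "$n" -gt 1 ]; then
    echo "parallel job with $n tasks"
else
    echo "serial job"
fi
rm -f "$input"
```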

Further LoadLeveler documentation:


You can browse the complete IBM LoadLeveler manual here (requires a PDF viewer, e.g. Acrobat Reader).