Batch jobs on Grendel
All jobs on Grendel must be executed as batch jobs through the queueing
system.
Grendel uses Torque, which was formerly known as PBS
or OpenPBS.
When a job starts, a unique directory is created in a local /scratch
filesystem on each node allocated to the job.
You can refer to this directory as /scratch/$PBS_JOBID.

To benefit from the backfilling mechanism, all jobs should specify a realistic wallclock time. Backfilling helps jobs start earlier; the drawback is that if the specified wallclock time is too short, the job will be terminated abnormally when the limit is reached. The wallclock time can be changed during job execution with qalter -l walltime=hh:mm:ss jobid. However, to avoid fooling the backfill mechanism, only the sysadmin can raise a job's wallclock time.
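
One way to arrive at a realistic wallclock time is to time a test run and add a safety margin. The following is a minimal sketch of that idea, not a tool provided on Grendel: the function name walltime_with_margin and the default margin of 20% are assumptions made for illustration. The script only prints the qsub command line and does not submit anything.

```shell
#!/bin/bash
# Hypothetical helper (not part of Torque): turn a measured runtime in
# seconds plus a safety margin in percent into an hh:mm:ss walltime string.
walltime_with_margin() {
    local seconds=$1
    local margin_pct=${2:-20}   # assumed default margin: 20%
    # Add the margin, rounding the extra time up to a whole second.
    local total=$(( seconds + (seconds * margin_pct + 99) / 100 ))
    printf '%d:%02d:%02d\n' \
        $(( total / 3600 )) $(( (total % 3600) / 60 )) $(( total % 60 ))
}

# Example: a test run took 2 hours (7200 s); request 25% extra.
wt=$(walltime_with_margin 7200 25)
echo "qsub -l walltime=$wt jobscript"
```

With these numbers the printed request is walltime=2:30:00. Since only the sysadmin can raise the limit of a running job, it is safer to choose the margin generously for the first run and tighten it once the runtime is known.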
Useful commands for handling batch jobs on Grendel:
Example of a job script for a serial batch job:

#!/bin/csh
#PBS -l nodes=1:ppn=4
#PBS -l walltime=2:30:00
#PBS -q q4
echo "========= Job started at `date` =========="
# copy inputdata and the executable to the scratch-directory
cd path/to/my/inputdata
cp *.dat /scratch/$PBS_JOBID
cp prog.exe /scratch/$PBS_JOBID
# change directory to the local scratch-directory, and run:
cd /scratch/$PBS_JOBID
./prog.exe > out
# copy home the outputdata:
cp out $PBS_O_WORKDIR/
echo "========= Job finished at `date` =========="

This job can be submitted with this command:

qsub jobscript

The #PBS lines in the job script are an alternative way to pass arguments to the queueing system.

Example of a job script for parallel execution of several serial tasks

To achieve the best node utilisation, an easy trick is to run several processes at the same time in one job. This can be done by starting a number of processes in the background and putting a wait statement at the end. The wait statement is very important: if it is missing, the job will just fork the processes and exit. Example:

#!/bin/csh
#PBS -l nodes=1:ppn=4
#PBS -l walltime=2:30:00
#PBS -q q4
echo "========= Job started at `date` =========="
cd /scratch/$PBS_JOBID
myprogram arg1 > outdata1 &
myprogram arg2 > outdata2 &
myprogram arg3 > outdata3 &
myprogram arg4 > outdata4 &
wait
cp outdata* $PBS_O_WORKDIR/
echo "========= Job finished at `date` =========="

The four processes are started in the background (&), and when all have finished, the output data files are copied back. The wait statement makes the job script wait until all (child) processes have finished.

Example of a job script for parallel execution of tasks on several nodes

The principle above can be extended so that the job uses several nodes. The script below also calls the mem and cpus commands (see "Hints") to get the amount of memory and the number of CPU cores in each node.
#!/bin/bash
echo "========= Job started at `date` =========="
Mreq=1.5 # 'myprogram' requires Mreq GB.
N=0
# Loop over the distinct nodes allocated to this job
for node in `sort -u < "$PBS_NODEFILE"`; do
    Mnode=`mem $node` # Get amount of memory
    Cnode=`cpus $node` # Get number of CPU cores
    # Number of instances that fit in the node's memory (rounded down)
    instances=$(echo "scale=0; $Mnode / $Mreq" | bc)
    [ $instances -gt $Cnode ] && instances=$Cnode # Max Cnode cpus available
    [ $instances -lt 1 ] && continue # Node does not have enough memory
    for i in $(seq $instances) ; do
        ((N++))
        # Start one instance on the node, in the background
        rsh -n $node "cd path; ./myprogram < input.$N > output.$N" &
    done
done
# Wait until all background instances have finished
wait
echo "========= Job finished at `date` =========="
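
The two ideas used in the scripts above, limiting the number of instances by both memory and CPU cores, and waiting for all background processes, can be tried on any machine without a queueing system. The following is a sketch under stated assumptions: the function names instances_for_node and run_and_wait are hypothetical, the node sizes are made up, memory is given in whole megabytes so that no bc is needed, and a short sleep stands in for myprogram.

```shell
#!/bin/bash
# instances_for_node MEM_MB CORES REQ_MB   (hypothetical helper)
# Number of program copies that fit on a node: limited by memory
# (MEM_MB / REQ_MB, rounded down) and by the number of CPU cores.
instances_for_node() {
    local mem_mb=$1 cores=$2 req_mb=$3
    local n=$(( mem_mb / req_mb ))
    (( n > cores )) && n=$cores
    echo "$n"
}

# run_and_wait COUNT DIR   (hypothetical helper)
# Start COUNT placeholder tasks in the background, then wait for all of
# them. Each task stands in for "myprogram < input.N > output.N".
run_and_wait() {
    local count=$1 dir=$2 i
    for (( i = 1; i <= count; i++ )); do
        ( sleep 1; echo "task $i done" > "$dir/output.$i" ) &
    done
    wait   # without this, the script would continue before the tasks finish
}

# Made-up node: 8192 MB of memory and 4 cores; each copy needs 1536 MB.
n=$(instances_for_node 8192 4 1536)
workdir=$(mktemp -d)
run_and_wait "$n" "$workdir"
echo "instances started: $n"
ls "$workdir"
rm -rf "$workdir"
```

In this made-up case memory alone would allow five copies, but the node has only four cores, so four instances are started. If the wait line is removed, the directory listing can run before any output file has been written, which is the same failure described above for job scripts.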