Hints and FAQ for Grendel
| Q1:
| How do I change my password?
|
|
| Use the command yppasswd.
It will ask for the current password, the new password, and the
new password once more.
|
|
|
| Q2:
| How to use the express queue and how is it implemented?
|
|
| The purpose of the express queue, qexp, is to give short
jobs the opportunity to start earlier or immediately, instead of waiting a long time
in the normal queue. This is implemented by permanently allocating four
X2200 nodes and two HP (Nehalem) nodes to serve this queue exclusively.
Non-qexp jobs cannot run on these nodes; qexp jobs, however, can run on any
node in the cluster, provided it is available. qexp enforces a wallclock limit
of 1 hour.
A user can have only one job running in this queue at a time, and a job can allocate
at most 4 nodes.
To use the queue, jobs must specify the queue qexp, and it may be useful to
specify the node type as well, e.g.:
qsub -q qexp -l nodes=2:nehalem:ppn=8 jobscript
qsub -q qexp -l nodes=2:x2200:ppn=8 jobscript
qsub -q qexp -l nodes=2:dell:ppn=4 jobscript
or request the queue and node type via inline #PBS statements, e.g.:
#!/bin/sh
#PBS -q qexp
#PBS -l nodes=2:nehalem:ppn=8
#PBS -N MPIjob
...
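A request can be checked against these limits before it is submitted. The following is only a minimal sketch and not one of the Grendel tools: the helper names hms_to_seconds and fits_qexp are hypothetical, and the limits (at most 4 nodes, at most 1 hour of wallclock time) are simply those stated above.

```shell
#!/bin/sh
# Hypothetical helpers (not Grendel tools): check a request against the
# qexp limits quoted above (at most 4 nodes, at most 1 hour wallclock).

# Convert HH:MM:SS to seconds.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# fits_qexp NODES WALLTIME(HH:MM:SS): succeeds if the request is within limits.
fits_qexp() {
    secs=$(hms_to_seconds "$2")
    [ "$1" -ge 1 ] && [ "$1" -le 4 ] && [ "$secs" -gt 0 ] && [ "$secs" -le 3600 ]
}

if fits_qexp 2 00:30:00; then
    echo "request fits qexp"
else
    echo "request exceeds qexp limits; use a normal queue"
fi
```

With Torque the wallclock time can be requested alongside the nodes, e.g. -l nodes=2:nehalem:ppn=8,walltime=00:30:00.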
|
|
|
| Q3:
| How to select the Fat Nodes for my job?
|
|
| There are 25 Fat Nodes in the Grendel cluster. These are
SUN x2200 machines, each with 2 quad-core AMD Opteron 2.3 GHz
CPUs, 32 GB memory and 2 TB scratch disk. These nodes can be
selected by supplying these flags to the qsub command:
qsub -q qfat -l nodes=5:ppn=8 jobscript
In this example the job will allocate 5 nodes, each with 8 processors.
(Note that a processor here is the same as a CPU core.)
Likewise, to ensure that a job runs only on the 4-core Dell nodes, use:
qsub -q q4 -l nodes=5:ppn=4 jobscript
This example allocates all 4 processors on each of
5 Dell SC1435 machines, so 20 processors are allocated
in total for this job.
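The processor total is simply nodes times ppn. As an illustration only, the arithmetic can be scripted; total_procs is a hypothetical helper, not a Grendel command, and it assumes a missing ppn means 1, which is the Torque default.

```shell
#!/bin/sh
# Hypothetical helper: compute nodes x ppn from a Torque-style resource string.
total_procs() {
    nodes=$(printf '%s\n' "$1" | sed -n 's/^nodes=\([0-9][0-9]*\).*/\1/p')
    ppn=$(printf '%s\n' "$1" | sed -n 's/.*ppn=\([0-9][0-9]*\).*/\1/p')
    [ -n "$nodes" ] || return 1          # not a nodes=... string
    echo $(( nodes * ${ppn:-1} ))        # ppn defaults to 1 when omitted
}

total_procs nodes=5:ppn=8   # the qfat example: prints 40
total_procs nodes=5:ppn=4   # the q4 example: prints 20
```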
|
|
|
| Q4:
| How many nodes/processors are free?
|
|
| The command nodes displays information about
free nodes.
To see a simple graphical view of the cluster, use the command
gnodes.
|
|
|
| Q5:
| How many CPUs does a node have?
|
|
| The command
cpus displays how many CPUs (cores) the current
node has. This number can be used in a generic jobscript
to determine how many processes can be started on a node, e.g.:
#!/bin/bash
echo "========= Job started at `date` =========="
echo "Host: `hostname -s` has `cpus` CPUs"
cd some/where
for i in $(seq `cpus`) ; do
./myprogram < input.$i > output.$i &
done
wait
echo "========= Job finished at `date` =========="
Here we start as many instances of the program
myprogram as there are CPUs in the node.
The input and output files are parametrized accordingly.
After the subprocesses running myprogram have been started
in the loop, the job blocks at wait until all
instances of myprogram have finished. Then the
job continues and finishes.
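A plain wait returns zero regardless of how the background processes ended, so a failed instance goes unnoticed. One way to detect failures is to record each process ID and wait for them one at a time. This is only a sketch: run_instances is a hypothetical function, and demo_task stands in for the real myprogram call.

```shell
#!/bin/sh
# run_instances N CMD...: start N background copies of CMD (each gets its
# index as last argument), wait for all, return the number of failures.
run_instances() {
    n=$1; shift
    pids=""
    i=1
    while [ "$i" -le "$n" ]; do
        "$@" "$i" &            # start instance i in the background
        pids="$pids $!"        # remember its process ID
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    return "$failed"
}

# Stand-in for "./myprogram < input.$i > output.$i": instance 2 fails.
demo_task() { [ "$1" -ne 2 ]; }

if run_instances 3 demo_task; then
    echo "all instances finished successfully"
else
    echo "$? instance(s) failed"
fi
```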
|
|
|
| Q6:
| How much memory does a node have?
|
|
| The command
mem displays how many GB of memory the current
node has. This number can be used in a generic jobscript
to determine how many processes can be started on a node, e.g.:
#!/bin/bash
echo "========= Job started at `date` =========="
echo "Host: `hostname -s` has `mem` GB memory"
echo "Host: `hostname -s` has `cpus` CPUs"
Mreq=2.5 # 'myprogram' requires Mreq GB.
instances=$(echo "scale=0; `mem` / $Mreq" | bc)
[ $instances -gt `cpus` ] && instances=`cpus`
if [ $instances -lt 1 ]; then
echo "Insufficient memory to run program. Exiting"
exit 1
fi
cd some/where
for i in $(seq $instances) ; do
./myprogram < input.$i > output.$i &
done
wait
echo "========= Job finished at `date` =========="
The number of instances of the program myprogram is
determined by the memory requirement of each instance, and
limited by the number of CPUs in the node. If the memory
is insufficient to run even one instance of the program, the
script will exit.
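The calculation in the script is min(floor(mem / Mreq), cpus). As a sketch that can be tried without the mem and cpus commands, the same arithmetic is shown here with the values passed as arguments. The helper max_instances is hypothetical, and awk is used in place of bc purely for convenience.

```shell
#!/bin/sh
# Hypothetical helper: max_instances MEM_GB CPUS MREQ_GB
# prints min(floor(MEM_GB / MREQ_GB), CPUS).
max_instances() {
    awk -v m="$1" -v c="$2" -v r="$3" 'BEGIN {
        n = (r > 0) ? int(m / r) : 0    # instances that fit in memory
        if (n > c) n = c                # never more than the CPU count
        print n
    }'
}

max_instances 32 8 2.5   # memory allows 12, capped by 8 CPUs: prints 8
max_instances 8 4 2.5    # memory allows only 3: prints 3
max_instances 2 4 2.5    # not even one instance fits: prints 0
```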
|
|
|
| Q7:
| How do I use the local /scratch filesystem?
|
|
| Each execution node is equipped with a local /scratch filesystem
which is much faster than the shared home filesystem. Jobs should
use the /scratch filesystem while they are running to limit network
traffic to the home filesystem.
Here is an example:
#!/bin/bash
echo "========= Job started at `date` =========="
./myprogram > /scratch/$PBS_JOBID/out
grep "Energy minimum" /scratch/$PBS_JOBID/out > results
cp /scratch/$PBS_JOBID/out out.$PBS_JOBID
echo "========= Job finished at `date` =========="
Here the output is written to the file /scratch/$PBS_JOBID/out.
It resides in the local /scratch filesystem in a job-specific directory,
/scratch/$PBS_JOBID. This directory is created automatically
when the job starts, and is deleted (together with its contents!)
when the job terminates. Therefore, remember to copy important
files back before the job ends.
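Because the job-specific directory disappears when the job terminates, it can be safer to register the copy-back step once, so that it also runs if the script exits early. A minimal sketch, assuming a hypothetical helper copy_back; the trap line is shown as a comment because it only makes sense inside a batch job.

```shell
#!/bin/sh
# Hypothetical helper: copy_back SRC_DIR DEST_DIR FILE...
# Copies each named file that exists in SRC_DIR to DEST_DIR, appending the
# job ID to the name (as in "out.$PBS_JOBID" in the example above).
copy_back() {
    src=$1; dest=$2; shift 2
    for f in "$@"; do
        if [ -f "$src/$f" ]; then
            cp "$src/$f" "$dest/$f.${PBS_JOBID:-nojob}"
        fi
    done
}

# In a jobscript, register the copy once near the top, e.g.:
#   trap 'copy_back "/scratch/$PBS_JOBID" "$PBS_O_WORKDIR" out' EXIT
```

Whether the trap still fires when the batch system kills the job depends on the shell and the signal used, so this does not replace an explicit copy at the end of the script.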
|
|
|
| Q8:
| How can I see my jobs?
|
|
| Use the Torque command qstat, or use
js, which also displays node information.
Use mj or bjobs to see only your own jobs.
Use bj or bj -u to get an overview of
current users on the system. bj -s also shows the current allotment.
A graphical view of the cluster can be obtained with the
gnodes command. Information about a specific user's jobs, or
about a single job, can be seen with gnodes username or
gnodes jobid.
To see the efficiency of jobs in terms of node utilization, use the je
command.
|
|
|
| Q9:
| Tips for running Gaussian jobs.
|
|
| The easiest way to run a Gaussian job is to use the
subg03 utility. subg03 has many useful options;
type subg03 -h for a brief listing.
A job will be generated and submitted to the queueing system.
Please note that by default it will allocate 1 node per
processor requested by the %nprocLinda=N instruction in
the Gaussian command file.
This can be overridden with the -ppn1 flag to subg03.
With this flag the job won't waste a core doing nothing. Jobs
requiring a moderate amount of memory (< 1 GB) should use this flag.
If the cluster is very busy, the -ppn1 flag will normally
shorten the time the job waits in the queue.
Examples:
subg03 -h # List all options to subg03.
subg03 gaussjob.com # Submit a Gaussian job.
subg03 -q q8 gaussjob.com # Submit the job to queue q8
See also the Gaussian 03 page for Grendel.
|
|
|
| Q10:
| How do I compile and link an MPI job?
|
|
| More than one MPI implementation is installed on Grendel;
which one to use depends, for example, on the compiler.
First of all, add the path to MPICH2 to your PATH
environment variable. Choose one of these:
# To use MPICH2 built w. Portland compilers
export PATH=/com/mpich2-1.0.4p1-pgi/bin:$PATH (Bourne-shell syntax)
set path = (/com/mpich2-1.0.4p1-pgi/bin $path) (C-shell syntax)
# To use MPICH2 built w. Intel compilers
export PATH=/com/mpich2-1.0.4p1/bin:$PATH (Bourne-shell syntax)
set path = (/com/mpich2-1.0.4p1/bin $path) (C-shell syntax)
Next, compile your program (or its fragments). Depending on which version
of MPICH2 was included in the PATH, the corresponding compiler suite
will be used by mpif90:
mpif90 -c progfrag1.f -O2
mpif90 -c progfrag2.f -O2
mpif90 -c progfrag3.f -O2
Finally, link the whole program:
mpif90 -o program.x progfrag1.o progfrag2.o progfrag3.o
The usual flags and arguments for linking with, e.g., the BLAS library
can be used here too.
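The compile and link steps above follow a fixed pattern, so they can be generated from the list of source files. This is only an illustrative sketch: build_commands is a hypothetical helper, and the FC variable (defaulting to mpif90) is an assumption introduced so that another compiler wrapper can be substituted.

```shell
#!/bin/sh
# Sketch: print the compile and link commands for a list of Fortran sources.
# FC defaults to mpif90 as in the text; the helper name is hypothetical.
build_commands() {
    out=$1; shift
    objs=""
    for src in "$@"; do
        echo "${FC:-mpif90} -c $src -O2"     # one compile line per source
        objs="$objs ${src%.f}.o"             # progfrag1.f -> progfrag1.o
    done
    echo "${FC:-mpif90} -o $out$objs"        # final link line
}

build_commands program.x progfrag1.f progfrag2.f progfrag3.f
```

Piping the output to sh -e would run the commands and stop at the first one that fails.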
|
|
|
| Q11:
| How do I run an MPI job (MPICH2)?
|
|
| Below is an example of an MPI job script, ready to submit to the
queueing system.
NOTE: the version of MPICH2 used in this example
resides in /com/mpich2-1.0.4p1-pgi/bin. It was built with the Portland
compilers.
If the program was compiled with the Intel compilers,
the MPICH2 version to use resides in /com/mpich2-1.0.4p1/bin.
Set the PATH environment variable accordingly in the job script.
If running MPICH2 for the first time, create a file ~/.mpd.conf
in your home directory, e.g.:
echo "MPD_SECRETWORD=SW92seSam" > ~/.mpd.conf
chmod 600 ~/.mpd.conf
The jobscript for the MPI-job looks like:
#!/bin/sh
#PBS -l nodes=16:ppn=2
#PBS -N MPIjob
export PATH=/com/mpich2-1.0.4p1-pgi/bin:$PATH
cd $PBS_O_WORKDIR
# Generate the hostfile, with fully qualified domain names:
mpdfile=mpd.$$
awk '{printf("%s.grendel.cscaa.dk\n", $1)}' $PBS_NODEFILE |\
sort -u > $mpdfile
# Start the virtual machine on the nodes:
mpdboot --totalnum=16 --mpd=/com/mpich2-1.0.4p1-pgi/bin/mpd \
--file=$mpdfile --rsh=rsh
# Optional, show the participating nodes:
echo '===== ============================='
mpdtrace -l
echo '===== ============================='
# NB: If we used Intel compilers we must define LD_LIBRARY_PATH first:
# export LD_LIBRARY_PATH=/com/intel/fce/9.0/lib:$LD_LIBRARY_PATH
# Now, run the MPI-program:
mpiexec -n 32 ./mpiprogram arg1 arg2
# Shutdown the virtual machine, and clean up:
mpdallexit
rm -f $mpdfile
#
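The numbers 16 (for mpdboot --totalnum) and 32 (for mpiexec -n) are hard-coded in the script above and must match the #PBS -l nodes=16:ppn=2 request. As a sketch, they could instead be derived from $PBS_NODEFILE, which under Torque lists each allocated node once per requested processor. The helper name nodefile_counts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<unique hosts> <total slots>" for a nodefile.
nodefile_counts() {
    hosts=$(sort -u "$1" | wc -l | tr -d ' ')   # distinct node names
    slots=$(wc -l < "$1" | tr -d ' ')           # one line per processor
    echo "$hosts $slots"
}

# Inside a job one might then write (untested sketch):
#   set -- $(nodefile_counts "$PBS_NODEFILE")
#   mpdboot --totalnum=$1 ... ; mpiexec -n $2 ./mpiprogram arg1 arg2
```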
|
|
|
| Q12:
| How to use OpenMPI
|
|
| OpenMPI
is a comprehensive MPI-2 implementation that integrates
with the queueing system.
When running OpenMPI jobs there is no need for machinefiles,
because this information is retrieved directly from the queueing system.
First of all, set up the OpenMPI environment:
source /com/OpenMPI/1.4.1/intel/bin/openmpi.sh (Bourne-shell)
source /com/OpenMPI/1.4.1/intel/bin/openmpi.csh (C-shell)
(Note that there may also be an OpenMPI build for the Portland compiler suite,
e.g. in /com/OpenMPI/1.4.1/pgi/bin.)
Then compile and link the program:
mpif90 -c progfrag1.f -O2
mpif90 -c progfrag2.f -O2
mpif90 -c progfrag3.f -O2
mpif90 -o OpenMPI-program progfrag1.o progfrag2.o progfrag3.o
The usual flags and arguments for linking with, e.g., the BLAS library
can be used here too.
You might get a warning like: feupdateenv is not implemented and
will always fail
This can sometimes be fixed by including the standard mathematical library when linking:
mpif90 -o OpenMPI-program progfrag1.o progfrag2.o progfrag3.o -limf -lm
Running an OpenMPI program requires access to the mpiexec program
and several DSOs (runtime shared libraries for the Intel
compiler and OpenMPI).
A typical OpenMPI jobscript therefore looks like this (in Bourne-shell syntax):
#!/bin/sh
#PBS -l nodes=16:ppn=2
#PBS -N OpenMPIjob
echo "========= Job started at `date` =========="
# To enable OpenMPI:
source /com/OpenMPI/1.4.1/intel/bin/openmpi.sh
# To enable MKL:
source /com/intel/Compiler/11.1/064/mkl/tools/environment/mklvarsem64t.sh
cd $PBS_O_WORKDIR
mpiexec ./OpenMPI-program arg1 arg2
echo "========= Job finished at `date` =========="
#
This job will spawn two (ppn=2) instances on each of 16 nodes, that is 32 instances
of the OpenMPI program in total.
If the program requires more memory, it may be necessary to reserve a
whole node for each instance of the program. In this case change the
mpiexec command to:
mpiexec -bynode -n 16 ./OpenMPI-program arg1 arg2
and submit with
-l nodes=16:ppn=4 -q q4
to get 16 whole DELL SC1435 nodes (16 x 8 GB memory) or
-l nodes=16:ppn=8 -q q8
to get 16 whole SUN x2200 nodes (16 x 16 GB memory)
If possible, OpenMPI will first try to carry out inter-node communication
via InfiniBand. If that isn't possible (the nodes may not have any InfiniBand
hardware), it will communicate via gigabit Ethernet. If two processes are
located within the same node, OpenMPI will communicate via shared memory
regions.
Going from gigabit Ethernet to InfiniBand, the latency is ca. 1000 times smaller
and the effective bandwidth is ca. 20-25 times larger.
|
|
|
| Q13:
| How to use the Intel MKL library?
|
|
| The Intel MKL library (Intel Math Kernel Library) contains
routines for BLAS 1,2,3, LAPACK, FFTW etc.
To link to this library:
ifort -o prog.exe prog.f \
-L/com/intel/mkl/10.1.1.019/lib/em64t \
-lmkl_lapack -lmkl -lguide -lpthread
To run a program using the MKL library, set the LD_LIBRARY_PATH
environment variable to contain the path to the library, f.ex.:
export LD_LIBRARY_PATH=/com/intel/mkl/10.1.1.019/lib/em64t:$LD_LIBRARY_PATH
./prog.exe
BLAS3 and some LAPACK routines can run in parallel using several OpenMP
threads if the environment variable OMP_NUM_THREADS is set
to a number larger than 1. This can be useful if the program will
run exclusively on a node with several CPU cores. E.g.:
#!/bin/sh
#PBS -l nodes=1:ppn=2
export LD_LIBRARY_PATH=/com/intel/mkl/10.1.1.019/lib/em64t:$LD_LIBRARY_PATH
export OMP_NUM_THREADS=2
./prog.exe
By default, all available CPU cores in a node will be used!
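To keep the thread count in step with the ppn value requested from the queueing system, OMP_NUM_THREADS can be derived rather than hard-coded. A sketch, assuming that $PBS_NODEFILE lists the node once per allocated core (the usual Torque behaviour) and using a hypothetical helper slots_on_host:

```shell
#!/bin/sh
# Hypothetical helper: slots_on_host NODEFILE HOST
# Counts the lines in NODEFILE whose short host name equals HOST.
slots_on_host() {
    awk -v h="$2" '{ split($1, a, "."); if (a[1] == h) n++ }
                   END { print n + 0 }' "$1"
}

# In a jobscript (sketch):
#   OMP_NUM_THREADS=$(slots_on_host "$PBS_NODEFILE" "$(hostname -s)")
#   export OMP_NUM_THREADS
```

Comparing only the part of the name before the first dot lets the helper cope with nodefiles that contain either short or fully qualified names.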
|
|
|
| Q14:
| How to profile programs?
|
|
| Use the gprof utility. Here is an example:
% ifort -p -g -c x1.f
% ifort -p -g -c x2.f
% ifort -o x.out -p x1.o x2.o
Now run the executable as normal. A file gmon.out
will be created. It contains the profiling data in a binary format.
Use the gprof utility to analyze its contents:
% gprof x.out gmon.out
gprof takes a lot of flags. See man gprof for details.
|
|
|
| Q15:
| How to use Scalapack on Grendel?
|
|
| To compile a program that calls Scalapack routines, just include
the Scalapack makefile in the Makefile, as in the example
below.
# Makefile for making a Scalapack -program
OBJS = pdscaex.o pdscaexinfo.o pdlaread.o pdlawrite.o
EXE = xdscaex
#
PATH := /com/OpenMPI/1.4.1/intel/bin:${PATH}
include /com/lib/scalapack_openmpi/SLmake.inc
MYFFLAGS = -i_dynamic
all: $(EXE)
.f.o : ; $(F77) -c $(F77FLAGS) $*.f
.c.o : ; $(CC) -c $(CCFLAGS) $(CDEFS) $*.c
$(EXE): $(OBJS) ; $(FCLOADER) $(FCLOADFLAGS) $(MYFFLAGS) -o $@ $(OBJS) $(STLIBS)
/com/lib/scalapack_openmpi/SLmake.inc contains the macros
needed for compiling and linking the program.
Browse it to see the definitions of the macros.
To run the program, use a jobscript like this:
#!/bin/sh
#PBS -l nodes=4:ppn=1
#PBS -N scalapack
#
export PATH=/com/OpenMPI/1.4.1/intel/bin:$PATH
cd ${PBS_O_WORKDIR}
#
mkl=/com/intel/mkl/9.1.023/lib/em64t
openmpi=/com/OpenMPI/1.4.1/intel/lib
intel=/com/intel/Compiler/11.1/064/lib/intel64
export LD_LIBRARY_PATH=${mkl}:${openmpi}:${intel}:$LD_LIBRARY_PATH
#
mpiexec ./xdscaex
#
The program must be run as a normal OpenMPI program (see
how to use OpenMPI above).
|
|
|
| Q16:
| How to use Intel's version 11.1.046 compilers and math. library?
|
|
| Version 11.1.046 of Intel's compilers and math library (MKL)
has been installed.
It is currently not the default version, so first do:
source /com/intel/Compiler/11.1/046/bin/intel64/ifortvars_intel64.sh
(Bourne-shells, sh, ksh, bash)
source /com/intel/Compiler/11.1/046/bin/intel64/ifortvars_intel64.csh
(C-shells, csh, tcsh)
Check with ifort -V or icc -V; it should report that
version 11.1.046 is used.
To link with Intel's math library, MKL (e.g. if a program needs some
LAPACK routines), do:
ifort myprogram.f \
-L/com/intel/Compiler/11.1/046/mkl/lib/em64t \
-Wl,--start-group \
-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
-Wl,--end-group \
-liomp5 -lpthread
To run the program, remember to define the environment variable LD_LIBRARY_PATH, e.g.:
export LD_LIBRARY_PATH=\
/com/intel/Compiler/11.1/046/lib/intel64:/com/intel/Compiler/11.1/046/mkl/lib/em64t
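Note that the export above replaces any existing value of LD_LIBRARY_PATH. When other libraries are already on the path, the new directories can be prepended instead. The following sketch uses a hypothetical helper, prepend_ld_path, which skips directories that are already present and avoids leaving an empty entry (an empty entry makes the loader search the current directory).

```shell
#!/bin/sh
# Hypothetical helper: prepend_ld_path DIR...
# Prepends each DIR to LD_LIBRARY_PATH unless it is already listed.
# With several arguments, the last one ends up first in the path.
prepend_ld_path() {
    for dir in "$@"; do
        case ":${LD_LIBRARY_PATH:-}:" in
            *":$dir:"*) ;;   # already present, skip
            *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
        esac
    done
    export LD_LIBRARY_PATH
}

# Usage with the directories from the text (sketch):
#   prepend_ld_path /com/intel/Compiler/11.1/046/mkl/lib/em64t \
#                   /com/intel/Compiler/11.1/046/lib/intel64
```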
Intel has a very convenient web page that shows how to link with MKL.
In a browser, open
the "how to link" page and make these selections:
|
Select OS: Linux
Select processor architecture: Intel(R) 64
Select compiler: Intel or Intel compatible
Select dynamic or static linking: Dynamic
Select your integers length: 64-bit (ilp64)
Select sequential or multi-threaded version of Intel® MKL: Multi-threaded
Rest of the fill-ins: What is appropriate...
|
Substitute $MKLPATH with
/com/intel/Compiler/11.1/046/mkl/lib/em64t
|
|
|
| Q17:
| How do I run Embarrassingly Parallel jobs efficiently?
|
|
| An Embarrassingly Parallel job (EP-job) consists of several sub-jobs,
called tasks, with no dependency or communication between them. EP-jobs can
achieve an excellent degree of utilization of the hardware (CPUs
and memory) because the individual tasks don't have to wait for each other or
for the communication channels to become ready. The problem with EP-jobs,
however, is that it is cumbersome to start them, especially if the job spans
multiple nodes on a cluster.
A tool to launch them in an easy-to-understand and elegant way was needed,
and that was the motivation for CSCAA to develop the dispatch tool.
In fact, dispatch provides answers to these common problems regarding
EP-jobs:
- How to start EP-tasks transparently on the available
hardware resources? There should be no difference in starting
EP-tasks whether the tasks will be running on one or more cluster nodes.
- How to "queue" tasks if the number of tasks is larger than the
number of available processors (i.e. CPU cores)? When a task finishes,
a waiting task is allowed to start, but never before! The only UNIX/Linux
command with an analogous capability is probably 'make -jN' (N>1).
- If the EP-job spans more than one node, how to distribute the tasks in a
balanced manner between the nodes?
- How to specify the EP-job topology with respect to nodes and CPU cores?
In order to use dispatch it is important that the EP-tasks can be
'parametrized' in a one-dimensional way. A very easy way to do this is to
create a set of small shell scripts, one for each task, say script1,
script2 and so on.
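The task scripts themselves can be generated rather than written by hand. The following is a sketch under assumptions: make_task_scripts is a hypothetical helper, and the generated scripts run a placeholder program, myprogram, on input.i and write output.i.

```shell
#!/bin/sh
# Hypothetical helper: make_task_scripts DIR COUNT
# Writes DIR/script1 ... DIR/scriptCOUNT; task i runs a placeholder program
# "myprogram" on input.i and writes output.i (names are assumptions).
make_task_scripts() {
    dir=$1; count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        cat > "$dir/script$i" <<EOF
#!/bin/sh
./myprogram < input.$i > output.$i
EOF
        chmod +x "$dir/script$i"     # task scripts must be executable
        i=$((i + 1))
    done
}

# Example: make_task_scripts /absolute/path/to/tasks 100
```

Because each script reads and writes files carrying its own index, the tasks do not overwrite each other's output.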
A master script is then created in the same directory as the
task scripts. In principle it could be as simple as:
#!/bin/bash
cd "$(dirname "$0")" || exit 1
"./$1" > "$1.log" 2>&1
Now, each EP-task could in principle be run by:
./master scriptj
In order to launch a number of these tasks, dispatch is employed in a
PBS batch job. This could look like:
#PBS -l nodes=N:ppn=P
dispatch -s /absolute/path/to/master script1 script2 ... scriptT
Observe in the jobscript above that the dispatch command doesn't
involve any specification of the job topology (nodes and CPUs). This is done
entirely by the Torque (aka PBS) resource request. Also, dispatch doesn't
expect any relationship between the number of available processors, N*P, and
the number of EP-tasks, T. If T<N*P the job simply doesn't utilize all
available CPUs; if T>N*P some of the tasks must wait until a predecessor
has finished.
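The relationship between N, P and T can be illustrated with a little arithmetic. The helper below, ep_load, is hypothetical and is not part of dispatch; it only reports how many tasks can start at once, how many CPUs stay idle, and how many tasks initially have to wait.

```shell
#!/bin/sh
# Hypothetical helper: ep_load N P T
# prints "<tasks started at once> <idle CPUs> <tasks initially waiting>".
ep_load() {
    slots=$(( $1 * $2 ))                   # available processors, N*P
    if [ "$3" -le "$slots" ]; then
        echo "$3 $(( slots - $3 )) 0"      # T <= N*P: some CPUs may idle
    else
        echo "$slots 0 $(( $3 - slots ))"  # T > N*P: the excess tasks wait
    fi
}

ep_load 2 8 10   # T < N*P: prints "10 6 0"
ep_load 2 8 40   # T > N*P: prints "16 0 24"
```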
If the job includes more than one node (i.e. N>1), the first up to N*P tasks
are distributed "node by node" to obtain a balanced load on the participating nodes.
NB: It is the responsibility of the task scripts to avoid clashes of,
e.g., output and temporary files! dispatch does not transfer any
environment variables or shell aliases; these must be set in the master script or
the individual task scripts.
For historical reasons, dispatch also has a second calling syntax.
Suppose a directory, mydir, contains a number of sub-directories,
subdir1, subdir2 and so on, and each of these contains a script
with the same name, say run.
To start the scripts
/absolute/path/to/mydir/subdirj/run
collectively as an EP-job, the following dispatch command is used:
#PBS -l nodes=N:ppn=P
dispatch /absolute/path/to/mydir run subdir1 subdir2 ... subdirT
Observe that this method doesn't need a "master" script.
Contact staff for more information.
|