Hints and FAQ for Grendel

Q1: How do I change my password?
Q2: How to use the express queue and how is it implemented?
Q3: How to select the Fat Nodes for my job?
Q4: How many nodes/processors are free?
Q5: How many CPUs does a node have?
Q6: How much memory does a node have?
Q7: How to use the local /scratch filesystem.
Q8: How can I see my jobs?
Q9: Tips for running Gaussian jobs.
Q10: How to compile and link an MPI job
Q11: How to run an MPI job (MPICH2).
Q12: How to use OpenMPI
Q13: How to use the Intel MKL library?
Q14: How to profile programs?
Q15: How to use Scalapack on Grendel?
Q16: How to use Intel's version 11.1.046 compilers and math. library?
Q17: How to run Embarrassingly Parallel jobs efficiently.


Q1: How do I change my password?
  Use the command yppasswd. It will ask for the current password, then for the new password twice.

Q2: How to use the express queue and how is it implemented?
  The purpose of the express queue, qexp, is to let short jobs start earlier, possibly immediately, instead of waiting a long time in the normal queue. This is implemented by permanently reserving four X2200 nodes and two HP (Nehalem) nodes exclusively for this queue. Non-qexp jobs cannot run on these nodes; qexp jobs, however, can run on any node in the cluster, provided it is available. qexp enforces a wallclock limit of 1 hour. A user can have only one job running in this queue at a time, and a job can allocate at most 4 nodes.
To use the queue, a job must specify the queue qexp, and it may be useful to specify the node type as well, e.g.:
qsub -q qexp -l nodes=2:nehalem:ppn=8 jobscript
qsub -q qexp -l nodes=2:x2200:ppn=8 jobscript
qsub -q qexp -l nodes=2:dell:ppn=4 jobscript
or request the queue and node type via inline #PBS statements, e.g.:
   #!/bin/sh
   #PBS -q qexp
   #PBS -l nodes=2:nehalem:ppn=8
   #PBS -N MPIjob
   ...

Q3: How to select the Fat Nodes for my job?
  There are 25 Fat Nodes in the Grendel cluster. These are SUN x2200 machines, each with 2 quad-core AMD Opteron 2.3 GHz CPUs, 32 GB memory and 2 TB scratch disk. These nodes can be selected by supplying these flags to the qsub command:
    qsub -q qfat -l nodes=5:ppn=8 jobscript
In this example the job will allocate 5 nodes, each with 8 processors. (Note that a processor is the same as a CPU core.)
Likewise, to ensure that a job will only run on the 4-core Dell nodes, use:
    qsub -q q4 -l nodes=5:ppn=4 jobscript
The example shows a job allocating all 4 processors on each of 5 Dell SC1435 machines. Thus, 20 processors in total will be allocated for this job.
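The queue and ppn values used above can be tied together in a small helper. This is only a sketch: the function name whole_node_request and the class names "fat" and "dell" are made up for illustration, and the function prints a qsub line instead of submitting anything.

```shell
#!/bin/sh
# Hypothetical helper (not a Grendel command): map a node class to the
# queue and cores-per-node values given in the text above.
#   fat  -> queue qfat, 8 cores per node (SUN x2200)
#   dell -> queue q4,   4 cores per node (Dell SC1435)
whole_node_request() {
  class=$1; nodes=$2
  case $class in
    fat)  queue=qfat; ppn=8 ;;
    dell) queue=q4;   ppn=4 ;;
    *)    echo "unknown node class: $class" >&2; return 1 ;;
  esac
  # Print the command rather than running it, so this works anywhere.
  echo "qsub -q $queue -l nodes=$nodes:ppn=$ppn jobscript"
  echo "total processors: $((nodes * ppn))"
}

whole_node_request fat 5
whole_node_request dell 5
```

Requesting every core of a node (ppn equal to the core count) is what makes the allocation a whole-node allocation.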

Q4: How many nodes/processors are free?
  The command nodes displays information about free nodes.
To see a simple graphical view of the cluster, use the command gnodes.

Q5: How many CPUs does a node have?
  The command cpus displays how many CPUs (cores) the current node has. This number can be used in a generic jobscript to determine how many processes can be started on a node, e.g.:
   #!/bin/bash
   echo "========= Job started  at `date` =========="
   echo "Host: `hostname -s` has `cpus` CPUs"
   cd some/where
   for i in $(seq `cpus`) ; do
     ./myprogram < input.$i > output.$i &
   done
   wait
   echo "========= Job finished at `date` =========="
Here we start as many instances of the program myprogram as there are CPUs in the node. The input and output files are parametrised accordingly. After the subprocesses running myprogram have been started in the loop, the job waits at wait until all instances of myprogram have finished. Then the job continues and finishes.
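A plain wait does not tell you whether any instance failed. One possible variation, sketched here with a stand-in function instead of ./myprogram and a fixed task count instead of `cpus` (both are assumptions made so the sketch runs anywhere), records each process ID and waits for them one by one:

```shell
#!/bin/sh
# Stand-in for ./myprogram so the sketch runs anywhere;
# task number 2 fails on purpose.
myprogram() { [ "$1" -ne 2 ]; }

ntasks=3          # on Grendel this would be `cpus`
pids=""
for i in $(seq "$ntasks"); do
  myprogram "$i" &
  pids="$pids $!"           # remember the PID of each background task
done

failed=0
for pid in $pids; do
  # 'wait PID' returns the exit status of that particular background job.
  wait "$pid" || failed=$((failed + 1))
done
echo "$failed of $ntasks tasks failed"
```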

Q6: How much memory does a node have?
  The command mem displays how many GB of memory the current node has. This number can be used in a generic jobscript to determine how many processes can be started on a node, e.g.:
   #!/bin/bash
   echo "========= Job started  at `date` =========="
   echo "Host: `hostname -s` has `mem` GB memory"
   echo "Host: `hostname -s` has `cpus` CPUs"
   Mreq=2.5  # 'myprogram' requires Mreq GB.
   instances=$(echo "scale=0; `mem` / $Mreq" | bc)
   [ $instances -gt `cpus` ] && instances=`cpus`
   if [ $instances -lt 1 ]; then
     echo "Insufficient memory to run program. Exiting"
     exit 1
   fi
   cd some/where
   for i in $(seq $instances) ; do
     ./myprogram < input.$i > output.$i &
   done
   wait
   echo "========= Job finished at `date` =========="
The number of instances of the program myprogram is determined by the memory requirement of each instance and limited by the number of CPUs in the node. If the memory is insufficient to run even one instance of the program, the script will exit.
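The arithmetic in the script can be tried out in isolation. The sketch below is an illustration, not part of the Grendel tools: the function name instances_for is made up, the memory and CPU figures are passed in as arguments instead of coming from mem and cpus, and awk is used in place of bc.

```shell
#!/bin/sh
# instances_for MEM_GB MREQ_GB NCPUS   (hypothetical helper)
# Prints how many program instances fit: floor(MEM/MREQ), capped at NCPUS.
instances_for() {
  awk -v mem="$1" -v req="$2" -v ncpu="$3" 'BEGIN {
    n = int(mem / req)        # truncate, like bc with scale=0
    if (n > ncpu) n = ncpu    # never start more instances than CPUs
    print n
  }'
}

instances_for 8 2.5 4     # 8 GB node, 4 cores:  memory is the limit
instances_for 32 2.5 8    # 32 GB node, 8 cores: the CPU count is the limit
instances_for 2 2.5 4     # not enough memory for even one instance
```

With these inputs the three calls print 3, 8 and 0. A result of 0 corresponds to the "Insufficient memory" branch of the jobscript.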

Q7: How to use the local /scratch filesystem.
  Each execution node is equipped with a local /scratch filesystem, which is much faster than the shared home filesystem. Jobs should use the /scratch filesystem while they are running to limit network traffic to the home filesystem. Here is an example:
   #!/bin/bash
   echo "========= Job started  at `date` =========="
   ./myprogram > /scratch/$PBS_JOBID/out
   grep "Energy minimum" /scratch/$PBS_JOBID/out > results
   cp /scratch/$PBS_JOBID/out out.$PBS_JOBID
   echo "========= Job finished at `date` =========="
Here the output is written to the file /scratch/$PBS_JOBID/out. It resides on the local /scratch filesystem in a job-specific directory, /scratch/$PBS_JOBID. This directory is created automatically when the job starts and is deleted (together with its contents!) when the job terminates. Therefore, remember to copy important files back.
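For testing such a script outside the batch system, it can help to fall back to a temporary directory when the job-specific scratch directory does not exist. The following is a sketch under that assumption: myprogram is replaced by a stand-in function with a made-up output format, and the fallback directories are not something Grendel provides.

```shell
#!/bin/sh
# Use the job-specific scratch directory when it exists; otherwise fall back
# to a temporary directory so the sketch also runs outside a batch job.
scratch="/scratch/${PBS_JOBID:-none}"
[ -d "$scratch" ] && [ -w "$scratch" ] || scratch=$(mktemp -d)

# Outside a batch job, PBS_O_WORKDIR is unset; use a throwaway directory then.
workdir=${PBS_O_WORKDIR:-$(mktemp -d)}

# Stand-in for ./myprogram (hypothetical output format).
myprogram() { echo "step 1"; echo "Energy minimum: -1.0"; }

# Write the bulky output to the fast local disk.
myprogram > "$scratch/out"

# Extract the small result, then copy the full output back before the job
# ends and the scratch directory is deleted.
grep "Energy minimum" "$scratch/out" > "$workdir/results"
cp "$scratch/out" "$workdir/out.${PBS_JOBID:-local}" || echo "copy-back failed" >&2
```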

Q8: How can I see my jobs?
  Use the Torque command qstat, or use js, which also displays node information.
Use mj or bjobs to see only your own jobs.
Use bj or bj -u to get an overview of current users on the system. bj -s also shows the current allotment.
A graphical view of the cluster can be obtained with the gnodes command. Information about a specific user's jobs, or about a single job, can be seen with gnodes username or gnodes jobid.
To see the efficiency of jobs in terms of node utilization, use the je command.

Q9: Tips for running Gaussian jobs.
  The easiest way to run a Gaussian job is to use the subg03 utility. subg03 has many useful options; type subg03 -h for a brief listing. A job will be generated and submitted to the queueing system. Please note that by default it will allocate 1 node per processor requested by the %nprocLinda=N instruction in the Gaussian command file. This can be overridden with the -ppn1 flag to subg03. With this flag the job won't leave a core idle. Jobs requiring a moderate amount of memory (< 1 GB) should use this flag. If the cluster is very busy, the -ppn1 flag will normally result in a shorter wait in the queue.
Examples:
   subg03 -h                     # List all options to subg03.
   subg03 gaussjob.com           # Submit a Gaussian job.
   subg03 -q q8 gaussjob.com     # Submit the job to queue q8
See also the Gaussian 03 page for Grendel.

Q10: How to compile and link an MPI job
  More than one MPI implementation is installed on Grendel; which one to use depends, e.g., on the compiler.

First, add the MPICH2 directory to your PATH environment variable. Choose one of these:

   # To use MPICH2 built w. Portland compilers
   export PATH=/com/mpich2-1.0.4p1-pgi/bin:$PATH   (Bourne-shell syntax)
   set path = (/com/mpich2-1.0.4p1-pgi/bin $path)  (C-shell syntax)
   # To use MPICH2 built w. Intel compilers
   export PATH=/com/mpich2-1.0.4p1/bin:$PATH       (Bourne-shell syntax)
   set path = (/com/mpich2-1.0.4p1/bin $path)      (C-shell syntax)
Next, compile your program (or its fragments). Depending on which version of MPICH2 was added to the PATH, mpif90 will use the corresponding compiler suite:
   mpif90 -c progfrag1.f -O2
   mpif90 -c progfrag2.f -O2
   mpif90 -c progfrag3.f -O2
Finally, link the whole program:
   mpif90 -o program.x progfrag1.o progfrag2.o progfrag3.o
The usual flags and arguments for linking with e.g. the BLAS library can be used here too.

Q11: How to run an MPI job (MPICH2).
  Below is an example of an MPI job script, ready to submit to the queueing system.
NOTE: The version of MPICH2 used in this example resides in /com/mpich2-1.0.4p1-pgi/bin. It was built with the Portland compilers.
If the program was compiled with the Intel compilers, the MPICH2 version to use resides in /com/mpich2-1.0.4p1/bin.
Set the PATH environment variable accordingly in the job script.

If you are running MPICH2 for the first time, create a file ~/.mpd.conf in your home directory, e.g.:

   echo "MPD_SECRETWORD=SW92seSam" > ~/.mpd.conf
   chmod 600 ~/.mpd.conf
The jobscript for the MPI job looks like this:
   #!/bin/sh
   #PBS -l nodes=16:ppn=2
   #PBS -N MPIjob
   export PATH=/com/mpich2-1.0.4p1-pgi/bin:$PATH
   cd $PBS_O_WORKDIR
   # Generate the hostfile, with full-qualified-domainnames:
   mpdfile=mpd.$$
   awk '{printf("%s.grendel.cscaa.dk\n", $1)}' $PBS_NODEFILE |\
     sort -u > $mpdfile
   # Start the virtual machine on the nodes:
   mpdboot --totalnum=16 --mpd=/com/mpich2-1.0.4p1-pgi/bin/mpd \
           --file=$mpdfile --rsh=rsh
   # Optional, show the participating nodes:
   echo '=====   ============================='
   mpdtrace -l
   echo '=====  ============================='
   # NB: If we used Intel compilers we must define LD_LIBRARY_PATH first:
   # export LD_LIBRARY_PATH=/com/intel/fce/9.0/lib:$LD_LIBRARY_PATH
   # Now, run the MPI-program:
   mpiexec -n 32 ./mpiprogram arg1 arg2
   # Shutdown the virtual machine, and clean up:
   mpdallexit
   rm -f $mpdfile
   #
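The numbers 16 and 32 in the script above have to match the nodes=16:ppn=2 request. As a sketch of one way to avoid hard-coding them, both can be derived from $PBS_NODEFILE, which in Torque contains one line per allocated processor. The stand-in nodefile and its host names below are made up so that the sketch runs outside a batch job, and the commands are only printed, not executed.

```shell
#!/bin/sh
# Outside a batch job there is no $PBS_NODEFILE, so build a small stand-in
# (2 nodes x 2 processors; the host names are made up).
if [ -z "${PBS_NODEFILE:-}" ]; then
  PBS_NODEFILE=$(mktemp)
  printf '%s\n' node1 node1 node2 node2 > "$PBS_NODEFILE"
fi

# Distinct hosts: the value for mpdboot --totalnum.
nnodes=$(sort -u "$PBS_NODEFILE" | wc -l | tr -d ' ')
# One line per processor: the value for mpiexec -n.
nprocs=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')

# Print the resulting commands instead of running them.
echo "mpdboot --totalnum=$nnodes ..."
echo "mpiexec -n $nprocs ./mpiprogram arg1 arg2"
```

For the nodes=16:ppn=2 request above this would give 16 and 32, the values used in the jobscript.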

Q12: How to use OpenMPI
  OpenMPI is a comprehensive MPI-2 implementation that integrates with the queueing system.
When running OpenMPI jobs, there is no need for machinefiles because this information is retrieved directly from the queueing system.

First, set up the OpenMPI environment:

  source /com/OpenMPI/1.4.1/intel/bin/openmpi.sh    (Bourne-shell)
  source /com/OpenMPI/1.4.1/intel/bin/openmpi.csh   (C-shell)
(Note that an OpenMPI build for the Portland compiler suite may also exist, e.g. in /com/OpenMPI/1.4.1/pgi/bin.)
Then compile and link the program:
   mpif90 -c progfrag1.f -O2
   mpif90 -c progfrag2.f -O2
   mpif90 -c progfrag3.f -O2
   mpif90 -o OpenMPI-program progfrag1.o progfrag2.o progfrag3.o
The usual flags and arguments for linking with e.g. the BLAS library can be used here too.
Note that you might get a warning like: feupdateenv is not implemented and will always fail
This can sometimes be resolved by including the math libraries when linking:
   mpif90 -o OpenMPI-program progfrag1.o progfrag2.o progfrag3.o -limf -lm

Running an OpenMPI program requires access to the mpiexec program and to several DSOs (shared libraries loaded at runtime, for the Intel compiler and OpenMPI).
A typical OpenMPI jobscript therefore looks like this (in Bourne-shell syntax):
   #!/bin/sh
   #PBS -l nodes=16:ppn=2
   #PBS -N OpenMPIjob
   echo "========= Job started  at `date` =========="
   # To enable OpenMPI:
   source /com/OpenMPI/1.4.1/intel/bin/openmpi.sh
   # To enable MKL:
   source /com/intel/Compiler/11.1/064/mkl/tools/environment/mklvarsem64t.sh
   cd $PBS_O_WORKDIR
   mpiexec ./OpenMPI-program arg1 arg2
   echo "========= Job finished at `date` =========="
   #
This job will spawn two instances (ppn=2) on each of 16 nodes, that is, 32 instances of OpenMPI-program in total.

If the program requires more memory, it may be necessary to reserve a whole node for each instance of the program. In this case change the mpiexec command to:

   mpiexec -bynode -n 16 ./OpenMPI-program arg1 arg2
and submit with -l nodes=16:ppn=4 -q q4 to get 16 whole Dell SC1435 nodes (16 x 8 GB memory), or with -l nodes=16:ppn=8 -q q8 to get 16 whole SUN x2200 nodes (16 x 16 GB memory).
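The effect of -bynode can be pictured with a toy model of rank placement. This is a simplified illustration only, not OpenMPI's actual mapping code: the function place_ranks and the node names n1..n4 are made up. "byslot" fills all slots of a node before moving on (the default in this OpenMPI generation), while "bynode" deals the ranks out round-robin.

```shell
#!/bin/sh
# place_ranks MODE NRANKS PPN NODE...   (hypothetical helper)
# Prints "rank R -> node" lines for two mapping policies:
#   byslot: fill all PPN slots of a node before moving to the next one
#   bynode: deal ranks out round-robin, one per node at a time
place_ranks() {
  mode=$1; nranks=$2; ppn=$3; shift 3
  nnodes=$#
  rank=0
  while [ "$rank" -lt "$nranks" ]; do
    if [ "$mode" = bynode ]; then
      idx=$(( rank % nnodes + 1 ))
    else
      idx=$(( (rank / ppn) % nnodes + 1 ))
    fi
    eval "node=\${$idx}"      # pick the idx-th node name from the arguments
    echo "rank $rank -> $node"
    rank=$((rank + 1))
  done
}

echo "byslot:"; place_ranks byslot 4 2 n1 n2 n3 n4
echo "bynode:"; place_ranks bynode 4 2 n1 n2 n3 n4
```

With 4 ranks on 4 nodes, byslot packs two ranks each onto n1 and n2, whereas bynode puts one rank on each node, so every instance has a node's memory to itself.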

If possible, OpenMPI will first try to do all inter-node communication via Infiniband. If that isn't possible (the nodes may not possess any Infiniband hardware), it will communicate via gigabit Ethernet. If two processes are located within the same node, OpenMPI will communicate via shared memory regions.
Going from gigabit Ethernet to Infiniband, the latency becomes about 1000 times lower
and the effective bandwidth about 20-25 times higher.


Q13: How to use the Intel MKL library?
  The Intel MKL library (Intel Math Kernel Library) contains routines for BLAS 1,2,3, LAPACK, FFTW etc.
To link to this library:
   ifort -o prog.exe prog.f \
     -L/com/intel/mkl/10.1.1.019/lib/em64t \
     -lmkl_lapack -lmkl -lguide   -lpthread
To run a program using the MKL library, set the LD_LIBRARY_PATH environment variable to contain the path to the library, e.g.:
   export LD_LIBRARY_PATH=/com/intel/mkl/10.1.1.019/lib/em64t:$LD_LIBRARY_PATH
   ./prog.exe
BLAS3 and some LAPACK routines can run in parallel on several OpenMP threads if the environment variable OMP_NUM_THREADS is set to a number larger than 1. This can be useful if the program will run exclusively on a node with several CPU cores, e.g.:
   #!/bin/sh
   #PBS -l nodes=1:ppn=2
   export LD_LIBRARY_PATH=/com/intel/mkl/10.1.1.019/lib/em64t:$LD_LIBRARY_PATH
   export OMP_NUM_THREADS=2
   ./prog.exe
By default, all available CPU cores in a node will be used!

Q14: How to profile programs?
  Use the gprof utility. Here is an example:
   % ifort -p -g -c x1.f
   % ifort -p -g -c x2.f
   % ifort -o x.out -p x1.o x2.o
Now run the executable as normal. A file gmon.out will be created. It contains the profiling data in a binary format. Use the gprof utility to analyze its contents:
   % gprof x.out gmon.out
gprof takes a lot of flags. See man gprof for details.

Q15: How to use Scalapack on Grendel?
  To compile a program that calls Scalapack routines, just include the Scalapack makefile in the Makefile, as in the example below.
   # Makefile for making a Scalapack -program
   OBJS = pdscaex.o pdscaexinfo.o pdlaread.o pdlawrite.o
   EXE  = xdscaex
   #
   PATH := /com/OpenMPI/1.4.1/intel/bin:${PATH}
   include /com/lib/scalapack_openmpi/SLmake.inc
   MYFFLAGS = -i_dynamic
   all: $(EXE)
   .f.o : ; $(F77) -c $(F77FLAGS) $*.f
   .c.o : ; $(CC) -c $(CCFLAGS) $(CDEFS) $*.c
   $(EXE): $(OBJS)
            $(FCLOADER) $(FCLOADFLAGS) $(MYFFLAGS) -o $@ $(OBJS) $(STLIBS)
/com/lib/scalapack_openmpi/SLmake.inc contains the macros needed for compiling and linking the program.
Browse it to see the definitions of the macros.

To run the program, use a jobscript like this:

   #!/bin/sh
   #PBS -l nodes=4:ppn=1
   #PBS -N scalapack
   #
   export PATH=/com/OpenMPI/1.4.1/intel/bin:$PATH
   cd ${PBS_O_WORKDIR}
   #
   mkl=/com/intel/mkl/9.1.023/lib/em64t
   openmpi=/com/OpenMPI/1.4.1/intel/lib
   intel=/com/intel/Compiler/11.1/064/lib/intel64
   export LD_LIBRARY_PATH=${mkl}:${openmpi}:${intel}:$LD_LIBRARY_PATH
   #
   mpiexec ./xdscaex
   #
The program must be run as a normal OpenMPI program (see Q12 on how to use OpenMPI).

Q16: How to use Intel's version 11.1.046 compilers and math. library?
  The new version 11.1.046 of Intel's compilers and math. library (MKL) has been installed.
Currently it isn't the default version, so first do:
source /com/intel/Compiler/11.1/046/bin/intel64/ifortvars_intel64.sh     (Bourne-shells, sh, ksh, bash)
source /com/intel/Compiler/11.1/046/bin/intel64/ifortvars_intel64.csh    (C-shells, csh, tcsh)
Check with ifort -V or icc -V; it should report that version 11.1.046 is in use.

To link with Intel's math library, MKL (e.g. if a program needs some LAPACK routines), do:

   ifort myprogram.f \
     -L/com/intel/Compiler/11.1/046/mkl/lib/em64t        \
     -Wl,--start-group                                   \
        -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core  \
     -Wl,--end-group                                     \
     -liomp5 -lpthread
To run the program, remember to define the LD_LIBRARY_PATH environment variable, e.g.:
export LD_LIBRARY_PATH=\
/com/intel/Compiler/11.1/046/lib/intel64:/com/intel/Compiler/11.1/046/mkl/lib/em64t
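The export above replaces any existing value of LD_LIBRARY_PATH. If other libraries (for example those of OpenMPI) are already on the path, prepending may be preferable. The helper below is a sketch; the function name prepend_ld_path is made up and not something installed on Grendel.

```shell
#!/bin/sh
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH unless it is
# already listed. ${VAR:+...} avoids a stray ':' when the variable is empty
# (the dynamic loader treats an empty entry as the current directory).
prepend_ld_path() {
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$1:"*) ;;   # already present, nothing to do
    *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
  esac
  export LD_LIBRARY_PATH
}

prepend_ld_path /com/intel/Compiler/11.1/046/mkl/lib/em64t
prepend_ld_path /com/intel/Compiler/11.1/046/lib/intel64
echo "$LD_LIBRARY_PATH"
```

Calling the function twice with the same directory leaves the variable unchanged, so it is safe to use in scripts that may be sourced more than once.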

Intel has a very convenient webpage which shows how to link with MKL.
In a browser, open how to link and make these selections:

    Select OS:                                                  Linux
    Select processor architecture:                              Intel(R) 64
    Select compiler:                                            Intel or Intel compatible
    Select dynamic or static linking:                           Dynamic
    Select your integers length:                                64-bit (ilp64)
    Select sequential or multi-threaded version of Intel® MKL:  Multi-threaded
    Rest of the fill-ins:                                       What is appropriate...
Substitute $MKLPATH with /com/intel/Compiler/11.1/046/mkl/lib/em64t

Q17: How to run Embarrassingly Parallel jobs efficiently.
  An Embarrassingly Parallel job (EP-job) consists of several sub-jobs, called tasks, with no dependency or communication between them. EP-jobs can achieve an excellent degree of utilization of the hardware (CPUs and memory) because the individual tasks don't have to wait for each other or for the communication channels to become ready. The problem with EP-jobs, however, is that they are cumbersome to start, especially if the job spans multiple nodes in a cluster.
A tool to launch them in an easy-to-understand and elegant way was needed, and that was the motivation for CSCAA to develop the dispatch tool. dispatch addresses these common problems regarding EP-jobs:
  • How to start EP-tasks transparently on the available hardware resources? Starting EP-tasks should work the same way whether the tasks run on one or on several cluster nodes.
  • How to "queue" tasks if the number of tasks is larger than the number of available processors (i.e. CPU cores)? When a task finishes, a waiting task is allowed to start, but never before! The only UNIX/Linux command with an analogous capability is probably 'make -jN' (N>1).
  • If the EP-job spans more than one node, how to distribute the tasks between the nodes in a balanced manner?
  • How to specify the EP-job topology with respect to nodes and CPU cores?
In order to use dispatch, it is important that the EP-tasks can be 'parametrised' in a one-dimensional way. A very easy way to do this is to create a set of small shell scripts, one for each task, say script1, script2 and so on. A master script is then created in the same directory as the task scripts. In principle it could be as simple as:
    #!/bin/bash
    cd `dirname $0`
    ./$1 > $1.log 2>&1
Now, each EP-task could in principle be run by: ./master scriptj.
To launch a number of these tasks, dispatch is used in a PBS batch job. This could look like:
    #PBS -l nodes=N:ppn=P
    dispatch -s /absolute/path/to/master  script1 script2 ... scriptT
Note that in the jobscript above the dispatch command doesn't involve any specification of the job topology (nodes and CPUs). This is handled entirely by the Torque (aka PBS) resource request. Also, dispatch doesn't expect any relationship between the number of available processors, N*P, and the number of EP-tasks, T. If T<N*P, the job simply doesn't utilize all available CPUs; if T>N*P, some of the tasks must wait until a predecessor has finished. If the job includes more than one node (i.e. N>1), the first up to N*P tasks are distributed "node by node" to obtain a balanced load on the participating nodes.
NB: It is the responsibility of the task scripts to avoid clashes of e.g. output and temporary files! dispatch does not transfer any environment variables or shell aliases; these must be set in the master script or in the individual task scripts.
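To try out a master script and task scripts before submitting, the queueing behaviour (at most P tasks at a time, a waiting task starting when one finishes) can be imitated on a single machine. The sketch below is NOT the dispatch tool and does not distribute tasks across nodes; it uses xargs -P, a GNU/BSD extension, as a stand-in, and the values of P and T and the task contents are made up.

```shell
#!/bin/bash
# Local stand-in for the task queueing described above (not dispatch itself):
# run T task scripts with at most P of them running at once on ONE machine.
dir=$(mktemp -d)          # throwaway directory for the demo
P=2                       # concurrent tasks (would correspond to ppn)
T=5                       # number of EP-tasks

# The master script from the text.
cat > "$dir/master" <<'EOF'
#!/bin/bash
cd `dirname $0`
./$1 > $1.log 2>&1
EOF
chmod +x "$dir/master"

# One small task script per task, parametrised by its number.
for j in $(seq "$T"); do
  printf '#!/bin/bash\necho "task %s done"\n' "$j" > "$dir/script$j"
  chmod +x "$dir/script$j"
done

# xargs -P keeps up to P tasks running; a waiting task starts when one finishes.
for j in $(seq "$T"); do echo "script$j"; done |
  xargs -n 1 -P "$P" "$dir/master"

# Each task leaves its own log file, so there are no output clashes.
ls "$dir"/*.log | wc -l
```

Because every task writes to its own $1.log, the tasks cannot overwrite each other's output, which is the kind of clash the note above warns about.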
 
For historical reasons, dispatch also has a second calling syntax. Suppose a directory, mydir, contains a number of sub-directories, subdir1, subdir2 and so on, and each of these contains a script with the same name, say run. To start the scripts /absolute/path/to/mydir/subdirj/run collectively as an EP-job, the following dispatch command is used:
    #PBS -l nodes=N:ppn=P
    dispatch /absolute/path/to/mydir run  subdir1 subdir2 ... subdirT
Note that this method doesn't need a "master" script.
Contact staff for more information.