Hints and FAQ for the IBM cluster
| Q1:
| How do I compile programs to produce 32-bit or 64-bit executables?
|
|
| The 'xlf' command invokes the generic Fortran compiler
(see 'man xlf'). To get 64-bit executables, you must set the
environment variable OBJECT_MODE to 64 before any compiling,
linking or archiving ('ar').
In a Bourne shell: OBJECT_MODE=64; export OBJECT_MODE
In a C shell (tcsh): setenv OBJECT_MODE 64
If OBJECT_MODE is 32 or unset, you will get 32-bit executables.
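The steps above can be sketched for a Bourne-type shell as follows. This is only an illustration: the names prog.f and prog are placeholders that do not come from this FAQ, and the compiler call is guarded so that the snippet does nothing on a machine without 'xlf'.

```shell
#!/bin/sh
# Select 64-bit mode for the compiler, the linker and 'ar'.
OBJECT_MODE=64
export OBJECT_MODE

# prog.f and prog are placeholder names (assumption, not from the FAQ).
if command -v xlf >/dev/null 2>&1 && [ -f prog.f ]; then
    xlf -o prog prog.f
    file prog    # shows what kind of executable was produced
else
    echo "xlf or prog.f not found; nothing compiled"
fi
```

Because OBJECT_MODE is exported, it also applies to any later 'ar' or link step started from the same shell.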
|
|
|
| Q2:
| How do I compile an "auto-parallel" program (a sequential
program parallelized by the compiler)?
|
|
| Use this compiler command: % xlf_r -O3 -qsmp=auto prog.f
To run the program, first set the number of threads the program
will use, then execute the program (example for C shell):
setenv OMP_NUM_THREADS 4
./a.out
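For users of a Bourne-type shell, the same run step might look like the sketch below. The thread count 4 is taken from the C shell example; the check for ./a.out is an addition that only makes the snippet safe to run before the program has been compiled.

```shell
#!/bin/sh
# Number of threads for the auto-parallelized program.
OMP_NUM_THREADS=4
export OMP_NUM_THREADS

# a.out is the compiler's default output name.
if [ -x ./a.out ]; then
    ./a.out
else
    echo "./a.out not found; compile prog.f first"
fi
```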
|
|
|
| Q3:
| How to compile MPI programs?
|
|
| The 'mpxlf_r' command compiles and links an MPI program. If you are
compiling in 64-bit mode (i.e. OBJECT_MODE=64), remember to include
the linker flag '-lmpi_r', i.e. 'mpxlf_r -lmpi_r'. See 'man mpxlf' and
'man poe'. NB: It is important that you use the 'mpxlf_r' command
instead of 'mpxlf'; otherwise you may get runtime errors such as:
MPCI non-recoverable error...[devinit.c, 892], pid=61322, rc=324.
|
|
|
| Q4:
| How can I monitor my programs and jobs?
|
|
| To monitor your UNIX processes, use 'top'. There are other
utilities as well: 'nmon' and 'topas' monitor more aspects of the
system, while 'iostat', 'vmstat' and 'sar' are probably useful only to
your system manager.
To monitor jobs, use 'js'.
To see why a job isn't starting, use: llq -s jobid
|
|
|
| Q5:
| Where can I find online documentation (manuals)?
|
|
| Most of the IBM manuals are available as PDF files at IBM.
The most relevant ones are available from
our software page.
|
|
|
| Q6:
| My program dies with a "not enough memory" error, when I request
(allocate) more than ca. 200 MB of memory, or it dies with
"1525-108 Error encountered while attempting to allocate a data object."
|
|
| This error can probably be circumvented by recompiling your program
in 64-bit mode. To do so, first set the OBJECT_MODE environment
variable to "64": 'export OBJECT_MODE=64' in Bourne shell or
'setenv OBJECT_MODE 64' in C shell. Then recompile your program.
If your program must be in 32-bit mode, link it using the
-bmaxdata:0x80000000 option. If you already have a
32-bit program (you can tell, because the command 'file a.out' returns:
a.out: executable (RISC System/6000) or object module not stripped)
then you can enlarge the data area to about 2 GB by using the mklarge
utility: 'mklarge a.out'
If your program is in 64-bit mode, make sure that you are not
using the -bmaxdata:0x80000000 option. It would limit the data area to 2 GB,
which probably isn't enough.
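The 2 GB figure follows directly from the hexadecimal value given to -bmaxdata. A short check with shell arithmetic, as a sketch (the variable names are illustrative only):

```shell
#!/bin/sh
# 0x80000000 bytes expressed in megabytes and gigabytes.
MAXDATA_BYTES=$((0x80000000))
MAXDATA_MB=$((MAXDATA_BYTES / 1024 / 1024))
MAXDATA_GB=$((MAXDATA_MB / 1024))
echo "$MAXDATA_BYTES bytes = $MAXDATA_MB MB = $MAXDATA_GB GB"
```

The same arithmetic can be used to work out the hexadecimal value for a smaller data area, should a 32-bit program need less than the maximum.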
|
|
|
| Q7:
| Where can I find the scratch-files belonging to a running job?
|
|
| Please note that Sleipner, Fenris, Hugin and Munin each have a
local /scratch filesystem, just as ordinary (Beowulf) clusters
usually have. In batch jobs, the environment variable SCRDIR
points to the unique scratch directory that has been assigned to
the job. Using this environment variable consistently will make your
jobscript work on whatever node the job is started on.
If you want to check your scratch files interactively while a
job is running, however, you need to know which node it is running on.
The 'js' and 'llq' commands will show this (to see the node name for a
preempted job you must use 'js -H'). When logged in to Sleipner you
can then go to the 'local' scratch directories on each node by
doing:
cd /scratch/hostname
Example: cd /scratch/fenris
Important: Don't use the 'cd /scratch/hostname'
construction in jobscripts, as it might be rather inefficient
(it is based on NFS), and it will not generally work.
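Inside a jobscript, using SCRDIR instead of a hard-coded path might look like the sketch below. The fallback to /tmp and the scratch file name are assumptions made so that the snippet can also be tried outside the batch system; in a real batch job SCRDIR is already set.

```shell
#!/bin/sh
# SCRDIR is set by the batch system; the fallback to /tmp is an assumption
# that only lets this sketch run outside a batch job.
WORKDIR="${SCRDIR:-/tmp}"
cd "$WORKDIR" || exit 1

# Hypothetical scratch file name, made unique with the process ID.
SCRATCH_FILE="$WORKDIR/myjob.$$.tmp"
echo "intermediate data" > "$SCRATCH_FILE"
ls -l "$SCRATCH_FILE"

# Clean up before the job ends.
rm -f "$SCRATCH_FILE"
```

Because the script never names a node, it works unchanged on whichever node the job is started on.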
|
|
|
| Q8:
| How can I control which node will execute my job?
|
|
| You can control which node will execute your job in two ways. The
first method is to use the node-specific queues qsl, qfe and
quu. These queues start the job on Sleipner, Fenris or Hugin/Munin.
Note that jobs in these queues cannot be preempted, and will not
preempt other jobs!
The second method is to specify a LoadLeveler node requirement
in the jobscript and use the "normal" queues. For example, to specify
that a qexp job must run on Fenris, use this recipe:
#!/bin/sh
# @ job_name = myjob
# @ job_type = serial
# @ class = qexp
# @ requirements = (Machine == "fenris")
# requirements = (Machine == "fenris" || Machine == "sleipner")
# @ input = /dev/null
# @ output = $(job_name).$(jobid).o
# @ error = $(job_name).$(jobid).e
# @ notification = never
# @ queue
... shell commands ...
|
|
|
| Q9:
| How can I specify when my job may start?
|
|
| If you want to submit a job that must not start before
a specific date and time, add the following line to your LoadLeveler script
before the "# @ queue" statement:
# @ startdate = 10/25/2003 14:35
The job will not be started before 25-Oct-2003 14:35. Note that
the job is not guaranteed to start at exactly that time
if adequate resources are not available.
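Put together with the directives from the node-requirement recipe in Q8, a complete jobscript with a start date might look like the sketch below. The job name, class and the final shell commands are placeholders; only the position of the startdate line before "# @ queue" matters here.

```shell
#!/bin/sh
# @ job_name = myjob
# @ job_type = serial
# @ class = qexp
# @ startdate = 10/25/2003 14:35
# @ input = /dev/null
# @ output = $(job_name).$(jobid).o
# @ error = $(job_name).$(jobid).e
# @ notification = never
# @ queue
# Shell commands start here; the lines below are only placeholders.
JOB_HOST=$(uname -n)
echo "Job started on $JOB_HOST at $(date)"
```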
|
|
|
| Q10:
| Why should I read the manuals when porting my programs to IBM?
|
|
| We have found that several of the problems users experience when
porting their programs to the IBM/AIX platform basically originate
in small differences in the implementation of routines. In particular,
be careful when using the BLAS routines in ESSL. They may differ from
"normal" behaviour, but they work as documented!
See the documentation on our
software page.
|
|
|
| Q11:
| I accidentally deleted some of my files. How can I get them back?
|
|
| Contact the staff and ask to have the files restored from
backup. Provide the following information:
- Full pathnames of the files and/or directories to
be restored.
- Date and time when the files/directories were deleted, or the last
time they were known to exist.
- Pathname of a directory where you want the data
restored. You may specify 'original', which means
that the data will be restored to where it was when backed up.
This requires that no new data has been written to that directory since.
Alternatively, you can use the dsm command, which starts an
interactive graphical tool for restoring files.
Remember, if a deleted file hasn't been restored after 30 days, it
will expire from the backup system and can no longer be retrieved.
At most two versions of a file are kept in the backup system.
|
|
|
| Q12:
| How do I review the messages that flash across the screen when I log in?
|
|
| Use this command: nyt
|