
 

Hints and FAQ for the IBM cluster

Q1: How to compile programs, producing 32/64 bits executables?
Q2: How to compile an "auto-parallel" program (a sequential program parallelized by the compiler)?
Q3: How to compile MPI programs?
Q4: How can I monitor my programs and jobs?
Q5: Where can I find online documentation (manuals)?
Q6: My program dies with a "not enough memory" error, when I request (allocate) more than ca. 200 MB of memory, or it dies with "1525-108 Error encountered while attempting to allocate a data object."
Q7: Where can I find the scratch-files belonging to a running job?
Q8: How can I control which node will execute my job?
Q9: How can I specify when my job may start?
Q10: Why should I read the manuals when porting my programs to IBM?
Q11: I accidentally deleted some of my files. How can I get them back?
Q12: How do I review the messages flashing over the screen when I log in?


Q1: How to compile programs, producing 32/64 bits executables?
  The 'xlf' command invokes the generic Fortran compiler (see 'man xlf'). To get 64-bit executables, set the environment variable OBJECT_MODE to 64 before any compiling, linking or archiving ('ar'). In a Bourne shell: OBJECT_MODE=64; export OBJECT_MODE. In a C shell (tcsh): setenv OBJECT_MODE 64. If OBJECT_MODE is 32 or unset, you get 32-bit executables.
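As a minimal sketch of the Bourne-shell variant, the fragment below sets OBJECT_MODE and compiles only if 'xlf' is actually available. The file name prog.f is a placeholder, and the guard around 'xlf' is an addition of this sketch so that it does nothing harmful on a machine without the compiler.

```sh
# Select 64-bit mode for all later compile, link and 'ar' steps.
OBJECT_MODE=64
export OBJECT_MODE

# prog.f is a placeholder source file, not a file named on this page.
if command -v xlf >/dev/null 2>&1 && [ -f prog.f ]; then
    xlf prog.f    # builds a.out in the mode selected by OBJECT_MODE
else
    echo "xlf or prog.f not found; OBJECT_MODE is set to $OBJECT_MODE"
fi
```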

Q2: How to compile an "auto-parallel" program (a sequential program parallelized by the compiler)?
  Use this compiler command: % xlf_r -O3 -qsmp=auto prog.f
To run the program, first set the number of threads the program should use, then execute the program (example for C shell):
setenv OMP_NUM_THREADS 4
./a.out
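For Bourne-shell users, the same two steps might look like the sketch below. The test for ./a.out is an addition of this sketch, so that the fragment can be pasted even before the program has been compiled.

```sh
# Bourne-shell equivalent of the C-shell example: set the thread count.
OMP_NUM_THREADS=4
export OMP_NUM_THREADS

# Run the program only if it has already been built.
if [ -x ./a.out ]; then
    ./a.out
else
    echo "a.out not built yet; OMP_NUM_THREADS is set to $OMP_NUM_THREADS"
fi
```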

Q3: How to compile MPI programs?
  The 'mpxlf_r' command compiles and links an MPI program. If you are compiling in 64-bit mode (i.e. OBJECT_MODE=64), remember to include the linker flag '-lmpi_r', i.e. 'mpxlf_r -lmpi_r'. See 'man mpxlf' and 'man poe'. NB: It is important to use the 'mpxlf_r' command instead of 'mpxlf'; otherwise you may get runtime errors like this one:
MPCI non-recoverable error...[devinit.c, 892], pid=61322, rc=324.

Q4: How can I monitor my programs and jobs?
  To monitor your UNIX processes, use 'top'. There are other utilities as well: 'nmon' and 'topas' monitor more aspects of the system, while 'iostat', 'vmstat' and 'sar' are probably useful only to your system manager.
To monitor jobs, use 'js'.
To see the reason why a job isn't starting, use: llq -s jobid

Q5: Where can I find online documentation (manuals)?
  Most of the IBM manuals are available as PDF files at IBM. The most relevant ones are available from our software page.

Q6: My program dies with a "not enough memory" error, when I request (allocate) more than ca. 200 MB of memory, or it dies with "1525-108 Error encountered while attempting to allocate a data object."
  This error can probably be circumvented by recompiling your program with the 64-bit compiler. To do so, first set the OBJECT_MODE environment variable to "64": 'export OBJECT_MODE=64' in Bourne shell or 'setenv OBJECT_MODE 64' in C shell. Then recompile your program.
If your program must be in 32-bit mode, link it using the -bmaxdata:0x80000000 option. If you already have a 32-bit executable (you can tell, if the command 'file a.out' returns:
a.out: executable (RISC System/6000) or object module not stripped)
then you can enlarge the data area to ca. 2 GB by using the mklarge utility: 'mklarge a.out'
If your program is in 64-bit mode, make sure that you are not using the -bmaxdata:0x80000000 option. It limits the data area to 2 GB, which probably isn't enough.
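The size encoded in the -bmaxdata value can be checked with ordinary shell arithmetic. This is only a sketch of the conversion from hexadecimal bytes to megabytes; the variable names are illustrative, and it assumes a shell whose arithmetic is wider than 32 bits.

```sh
# The -bmaxdata value is a byte count written in hexadecimal.
maxdata=0x80000000

# Convert bytes to megabytes (1 MB = 1024 * 1024 bytes here).
maxdata_mb=$(( $maxdata / 1024 / 1024 ))

echo "-bmaxdata:$maxdata allows a data area of $maxdata_mb MB"
```

The result, 2048 MB, is the 2 GB limit mentioned above.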

Q7: Where can I find the scratch-files belonging to a running job?
  Note that Sleipner, Fenris, Hugin and Munin each have a local /scratch filesystem, just as ordinary (Beowulf) clusters usually have. In batch jobs, the environment variable SCRDIR points to the unique scratch directory which has been assigned to the job. Using this environment variable consistently will make your jobscript work on whatever node the job is started on.
If you want to check your scratch files interactively while a job is running, however, you need to know which node it is running on. The 'js' and 'llq' commands will show this (to see the node name for a preempted job you must use 'js -H'). When logged in to Sleipner you can then go to the 'local' scratch directories on each node by doing:
cd /scratch/hostname
Example: cd /scratch/fenris
Important: Don't use the 'cd /scratch/hostname' construction in jobscripts. It can be rather inefficient (it is based on NFS), and it will not generally work.
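Inside a jobscript, using SCRDIR might look like the sketch below. The fallback to TMPDIR or /tmp is an assumption of this sketch, added only so the fragment can be tried outside a batch job, and scratch.dat is a hypothetical file name.

```sh
# In a batch job SCRDIR is set by the batch system. The fallback is
# only for trying this fragment outside a job (an assumption).
workdir="${SCRDIR:-${TMPDIR:-/tmp}}"

# Do all scratch I/O relative to the assigned directory.
cd "$workdir" || exit 1
echo "intermediate data" > scratch.dat    # hypothetical scratch file
ls -l "$workdir/scratch.dat"
```

Because the script never names a host, it behaves the same on whichever node the job is started.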

Q8: How can I control which node will execute my job?
  You can control which node will execute your job in two ways. The first method is to use the node-specific queues qsl, qfe and quu. These queues start the job on Sleipner, Fenris or Hugin/Munin, respectively. Note that jobs in these queues cannot be preempted, and will not preempt other jobs!
The second method is to specify a LoadLeveler node requirement in the jobscript and use the "normal" queues. For example, to specify that a qexp job must run on Fenris, use this recipe:
 #!/bin/sh
 # @ job_name = myjob
 # @ job_type = serial
 # @ class = qexp
 # @ requirements = (Machine == "fenris")
 #   requirements = (Machine == "fenris" || Machine == "sleipner")
 # @ input = /dev/null
 # @ output = $(job_name).$(jobid).o
 # @ error =  $(job_name).$(jobid).e
 # @ notification = never
 # @ queue
 ... shell commands ...

Q9: How can I specify when my job may start?
  If you want to submit a job which must not start before a specific date/time, add the following line to your LoadLeveler script before the "# @ queue" statement:
 # @ startdate = 10/25/2003 14:35
The job will not be started before 25-Oct-2003 14:35. Note that it is not guaranteed to start at that time: it starts only once adequate resources are available.
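A small helper can catch a mistyped date before the job is submitted. This is a sketch only: the function name startdate_line is hypothetical, and the pattern checks just the MM/DD/YYYY HH:MM shape used in the example above, not whether the date is a valid calendar date.

```sh
# Print a startdate directive if the argument looks like MM/DD/YYYY HH:MM.
startdate_line() {
    case "$1" in
        [0-1][0-9]/[0-3][0-9]/[0-9][0-9][0-9][0-9]" "[0-2][0-9]:[0-5][0-9])
            printf '# @ startdate = %s\n' "$1" ;;
        *)
            echo "expected MM/DD/YYYY HH:MM, got: $1" >&2
            return 1 ;;
    esac
}

# The date from the example above passes the check.
startdate_line "10/25/2003 14:35"
```

A date written as 25-Oct-2003 is rejected with a message on standard error.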

Q10: Why should I read the manuals when porting my programs to IBM?
  We have found that several of the problems users experience when porting their programs to the IBM/AIX platform basically originate in small differences in the implementation of routines. In particular, be careful when using the BLAS routines in ESSL. They may differ from "normal" behaviour, but they work as documented!
See the documentation on our software page.

Q11: I accidentally deleted some of my files. How can I get them back?
  Contact the staff and ask to have the files restored from backup. Provide this information:
  • Full pathname specifications of the files and/or directories to be restored.
  • Date and time when the files/directories were deleted, or last time they were known to exist.
  • Pathname of a directory where you want the data restored. You may specify "original", which means that the data will be restored to the place where they were when backed up. This requires that no new data has been written to this directory.
Alternatively, you can use the dsm command, which starts an interactive graphical tool for restoring files.
Remember that a deleted file which has not been restored within 30 days expires from the backup system and can no longer be retrieved. At most two versions of a file are kept in the backup system.

Q12: How do I review the messages flashing over the screen when I log in?
  Use this command: nyt