PBS


Using the PBS Cluster

Jobs are submitted to the cluster by creating a PBS job command file that specifies certain attributes of the job, such as how long the job is expected to run and how many nodes of the cluster are needed (e.g. for parallel programs). PBS then schedules when the job is to start running on the cluster (based in part on those attributes), runs and monitors the job at the scheduled time, and returns any output to the user once the job completes.

Logging on to the Cluster

Logins to the cluster are done via SSH to one of the head nodes; you should not log in directly to any of the compute nodes. You will need an SSH client on your computer - either OpenSSH on a *nix system or SecureCRT on a Windows PC.

Log onto power[1..6].cs.virginia.edu; these are the head nodes for the cluster, and act as the control console for the queues. These machines are appropriate for any interactive work such as source code editing, compilation, and submitting jobs through PBS.

Centurion001 is the PBS server node, and jobs are executed on the centurion[2..64] nodes by default. There are also the radio, generals, lava and realitytv cluster queues, but you should check with root first before using those queues.

The sections below will outline what you need to know to set up and run your jobs on the various clusters. "Screen Capture" examples are given in the gray boxes to show what you should expect to see as output from various commands.

Configuring Your Account

Use of the CS PBS system assumes some familiarity with the Unix/Solaris software environment. In order to use PBS for batch job submission, it may be necessary to configure some of your Unix account startup files. General information about the Unix operating system can be found here.

When a job is submitted to the cluster through PBS, a new login to your account is initiated, and any initialization commands in your startup files (.profile, .variables.ksh, .kshrc, etc.) are executed. In this case (running in batch mode) it is necessary to disable interactive commands such as tset and stty. If these precautions are not taken, error messages will be written to the batch job's error file and your program may not run. The recommended procedure for disabling the interactive sections of the startup files is to test the environment variable PBS_ENVIRONMENT, which is set when PBS runs. If the variable has been set, meaning a PBS job has initiated the login, the interactive parts of the startup files are skipped. CS Department profiles do not run any commands interactively by default, so unless you have modified this yourself, there is nothing to be concerned about.
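
For example, a guard like the following in .profile or .kshrc (a minimal sketch; the guarded stty/tset lines are only illustrative) skips terminal-dependent commands when PBS initiates the login:

# Skip interactive-only commands when PBS initiates the login;
# PBS sets PBS_ENVIRONMENT (to PBS_BATCH or PBS_INTERACTIVE).
if [ -z "$PBS_ENVIRONMENT" ]; then
    # Interactive login: terminal-dependent commands are safe here.
    stty erase '^H'
    tset -Q
fi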

Your CS account is completely separate from your ITC account, even though it has the same 'username' and UID. Your password and home directory are not shared between the two systems, so please remember that this is a different password.

Password-less Login

You will need to set up ssh for password-free login on the CS Departmental Systems. In order to do this, you will need to set up a public/private key pair for use with SSH. Log onto one of the interactive front ends (power[1..6]), and do the following:

1. Change directories to your .ssh directory:

jpr9c@power1
: /af13/jpr9c ; cd .ssh

jpr9c@power1
: /af13/jpr9c/.ssh ; 

2. Generate an ssh key-pair using ssh-keygen:

jpr9c@power1
: /af13/jpr9c/.ssh ; ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/af13/jpr9c/.ssh/id_rsa): 

Go ahead and press "enter" and use the default file name. Next, just leave the "passphrase" blank - remember, you want to be able to log onto other systems using SSH without having to type in anything:

Enter passphrase (empty for no passphrase):
Enter same passphrase again:

The keys will be stored right where they need to be by default:

Your identification has been saved in /af13/jpr9c/.ssh/id_rsa.
Your public key has been saved in /af13/jpr9c/.ssh/id_rsa.pub.
The key fingerprint is:
a5:99:57:33:ee:d5:4f:b9:28:ff:91:4d:66:22:0e:5e jpr9c@power1

3. Next, we need to let the SSH client and server know that this key is to be used to allow you to log in, so add the public key to the "authorized_keys" file:

jpr9c@power1
: /af13/jpr9c/.ssh ; cat id_rsa.pub >> authorized_keys

Be sure the permissions are set correctly on this file:

jpr9c@power1
: /af13/jpr9c/.ssh ; chmod 644 authorized_keys

jpr9c@power1
: /af13/jpr9c/.ssh ; ls -l authorized_keys 
-rw-r--r-- 1 jpr9c uucp 617 2008-07-14 13:39 authorized_keys

4. Because SSH uses asymmetric keys for host as well as user identification, the first time you connect to a new host you are prompted to add the host's public key to your known_hosts file. Since you need to be able to log onto the different machines without any keyboard interaction, you'll need to add these host keys to your known_hosts file in advance. The keys are available in the pbs_hosts file; download a copy (either use the command below, or 'right-click' and 'save-as') and append it to your known_hosts file:

jpr9c@power1
: /af13/jpr9c/.ssh ; wget http://www.cs.virginia.edu/~csadmin/pbs/pbs_hosts
--15:31:14--  http://www.cs.virginia.edu/~csadmin/pbs/pbs_hosts
           => `pbs_hosts'
Resolving www.cs.virginia.edu... 128.143.137.29
Connecting to www.cs.virginia.edu|128.143.137.29|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 78,676 (77K) [text/plain]

100%[====================================>] 78,676        --.--K/s             

15:31:14 (135.30 MB/s) - `pbs_hosts' saved [78676/78676]

jpr9c@power1
: /af13/jpr9c/.ssh ; cat pbs_hosts >> known_hosts

5. Test your key setup to be sure you can log onto another system without typing in a password:

jpr9c@power1
: /af13/jpr9c/.ssh ; ssh power2
Linux power2 2.6.24-19-generic #1 SMP Fri Jul 11 21:01:46 UTC 2008 x86_64

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To access official Ubuntu documentation, please visit:
http://help.ubuntu.com/
Last login: Mon Aug 25 10:00:29 2008 from roark.cs.virginia.edu
No mail for jpr9c

It is absolutely essential that you have password-less SSH working for your PBS jobs to run! The PBS queue uses the SSH suite (scp, ssh) to move your files around the cluster and to copy your results back to you after your job has run. If you have to type anything to log onto one node from another, review your steps and see the troubleshooting section below.

Using PBS

The PBS resource management system handles the management and monitoring of the computational workload on the Department clusters. Users submit "jobs" to the resource management system (the PBS server, Centurion001), where they are queued until the system is ready to run them. PBS selects which jobs to run, when, and where, according to the node attributes requested by the users. Resource routing queues help ensure that users are handled as efficiently as possible and that throughput is maximized.

To use PBS, you create a batch job command file which you submit to the PBS server to run on a PBS queue. A batch job file is simply a shell script containing the set of commands you want run on some set of cluster compute nodes. It also contains directives which specify the characteristics (attributes), and resource requirements (e.g. number of compute nodes and maximum runtime) that your job needs. Once you create your PBS job file, you can reuse it if you wish or modify it for subsequent runs.

PBS also provides a special kind of batch job called interactive-batch. An interactive-batch job is treated just like a regular batch job, in that it is placed into the queue and must wait for resources to become available before it can run. Once it is started, however, the user's terminal input and output are connected to the job in what appears to be an rlogin session to one of the compute nodes. Many users find this useful for debugging their applications or for computational steering.

PBS provides two user interfaces for batch job submission: a command line interface (CLI) and a graphical user interface (GUI). Both provide the same functionality; the CLI lets you type commands at the system prompt. This guide covers only the CLI. If you are an experienced *nix and X-windows user and prefer the GUI, information on configuring and using the xpbs interface can be found in Chapter 5 of the PBS Pro User Guide, which also covers PBS in greater detail.

PBS Job Command File

To submit a job to run on the Centurion cluster, a PBS job command file must be created. The job command file is a shell script that contains PBS directives; these directives are preceded by #PBS. The following is an example of a PBS command file to run a serial job, which would require only 1 processor on 1 node.

#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l walltime=12:00:00
#PBS -o output_filename
#PBS -j oe
#PBS -m bea
#PBS -M userid@virginia.edu
 
cd $PBS_O_WORKDIR
./your_executable

The first line identifies this file as a shell script. The next several lines are PBS directives that must precede any commands to be executed by the shell (e.g. the last two lines). The PBS directives are defined in the table below:

PBS Directive                         Function
 
#PBS -l nodes=1:ppn=1          Specifies a PBS resource requirement of
                               1 compute node and 1 processor per node.
 
#PBS -l walltime=12:00:00      Specifies a PBS resource requirement of
                               12 hours of wall clock time to run the job.
 
#PBS -o output_filename        Specifies the name of the file where job
                               output is to be saved. If omitted, output
                               is saved to a file named after the script
                               with the job ID appended.
 
#PBS -j oe                     Specifies that job output and error messages
                               are to be joined in one file.
 
#PBS -m bea                    Specifies that PBS send email notification
                               when the job begins (b), ends (e), or
                               aborts (a).
 
#PBS -M userid@virginia.edu    Specifies an alternate email address where PBS
                               notification is to be sent.

#PBS -V                        Specifies that all environment variables
                               are to be exported to the batch job.

The following is an example of a PBS email notification to the user at the end of the job:

Date: Tue, 2 Sep 2008 12:43:09 -0500
From: root 
To: jpr9c@cs.virginia.edu
Subject: PBS JOB 9563.centurion001
 
PBS Job Id: 9563.centurion001
Job Name:   script.sh
Execution terminated
Exit_status=0
resources_used.cpupercent=02
resources_used.cput=00:00:01
resources_used.mem=64248kb
resources_used.ncpus=1
resources_used.vmem=81036kb
resources_used.walltime=00:00:02

Note that the walltime-used information in the email should be used to accurately estimate the walltime resource requirement in the PBS job command file for future job submissions so that PBS can more effectively schedule the job. When submitting a particular PBS job for the first time, the walltime requirement should be overestimated to prevent premature job termination.

After the PBS directives in the command file, the shell executes a change-directory command to $PBS_O_WORKDIR, a PBS variable holding the directory from which the PBS job was submitted. Normally this is also where the program executable is located. Other shell commands can be executed as well. In the last line, the executable itself is invoked.
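
For a parallel job the same structure applies; here is a sketch (assuming an MPI program launched with mpiexec, as in the MPI examples later in this document; the resource values are illustrative):

#!/bin/sh
#PBS -l nodes=4:ppn=1
#PBS -l walltime=01:00:00
#PBS -j oe

# Run from the submission directory on the allocated nodes
cd $PBS_O_WORKDIR
mpiexec ./your_mpi_executable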

Submitting a Job

The PBS qsub command is used to submit job command files for scheduling and execution. For example, to submit your job with a PBS command file called "pbs_test.sh", the syntax would be

jpr9c@power1
: /af13/jpr9c/work/pbs ; qsub pbs_test.sh 
9563.centurion001

Notice that upon successful submission of a job, PBS returns a job identifier of the form <jobid>.centurion001, where <jobid> is an integer number assigned by PBS to that job. You'll need the job identifier for any actions involving the job, such as checking job status, deleting the job, or specifying job dependencies as described below.

There are many options to the qsub command as can be seen by typing man qsub at the command prompt on power[1..6].cs.virginia.edu or looking at the PBS Pro User Guide. Three of the more useful ones are the -W option for allowing specification of additional job attributes, the -I option, which declares that the job is to be run "interactively", and the -l option, which allows resource requirements to be listed as part of the qsub command. These are discussed below.

qsub Options
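
A few illustrative invocations (file names and values are examples only; -W depend=afterok is a standard PBS job-dependency attribute):

qsub -l nodes=2:ppn=1,walltime=01:00:00 pbs_test.sh   # resource requirements on the command line
qsub -I debug.sh                                      # run the job interactively
qsub -W depend=afterok:9563.centurion001 pbs_next.sh  # start only after job 9563 exits successfully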

Displaying Job Status

The qstat -a command is used to obtain status information about jobs submitted to PBS.

jpr9c@power1
: /af13/jpr9c ; qstat -a

centurion001: 
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
9517.centurion0 bcs8d    centurio applu_1000  20134   1   1  512mb 10:00 R 02:04
9518.centurion0 bcs8d    centurio apsi_10000  29066   1   1  512mb 10:00 R 02:04
9520.centurion0 bcs8d    centurio bzip2graph   7502   1   1  512mb 10:00 R 02:03
9521.centurion0 bcs8d    centurio crafty_100   7515   1   1  512mb 10:00 R 02:03
9522.centurion0 bcs8d    centurio eoncook_10  26435   1   1  512mb 10:00 R 02:03
9523.centurion0 bcs8d    centurio equake_100  26444   1   1  512mb 10:00 E 02:01
9525.centurion0 bcs8d    centurio fma3d_1000  17953   1   1  512mb 10:00 R 02:03
9526.centurion0 bcs8d    centurio galgel_100  19263   1   1  512mb 10:00 R 02:03
9529.centurion0 bcs8d    centurio gzipgraphi    749   1   1  512mb 10:00 R 02:04
9530.centurion0 bcs8d    centurio lucas_1000  10537   1   1  512mb 10:00 R 02:03
9531.centurion0 bcs8d    centurio mcf_100000  10548   1   1  512mb 10:00 R 02:03
9532.centurion0 bcs8d    centurio mesa_10000  16103   1   1  512mb 10:00 R 02:03
9533.centurion0 bcs8d    centurio mgrid_1000  16113   1   1  512mb 10:00 R 02:03
9534.centurion0 bcs8d    centurio parser_100   4316   1   1  512mb 10:00 R 02:03
9538.centurion0 bcs8d    centurio twolf_1000  12837   1   1  512mb 10:00 R 02:03
9539.centurion0 bcs8d    centurio vortexone_  12846   1   1  512mb 10:00 R 02:03
9541.centurion0 bcs8d    centurio wupwise_10   8063   1   1  512mb 10:00 R 02:03

The first five fields of the display are self-explanatory. The sixth and seventh fields, titled NDS and TSK in the above display, indicate the total number of nodes and processors respectively required by each job. The ninth field indicates the required walltime (hrs:min.) and the last field shows the elapsed runtime. The tenth field titled S indicates the state of the job. The job state can have the following values:

State              Definition
 
E          Job is exiting after having run
H          Job is held
Q          Job is queued, eligible to run or be routed
R          Job is Running
T          Job is in transition (being moved to a new location)
W          Job is waiting for its requested execution time to be reached
S          Job is suspended

To see more specific information on your particular job, you can run "qstat -f <jobid>.centurion001":

jpr9c@centurion001
: /af13/jpr9c/work/pbs ; qstat -f 10252.centurion001
Job Id: 10252.centurion001
    Job_Name = script.sh
    Job_Owner = jm6dg@centurion001.cs.virginia.edu
    resources_used.cpupercent = 95
    resources_used.cput = 02:13:54
    resources_used.mem = 1351476kb
    resources_used.ncpus = 1
    resources_used.vmem = 2500800kb
    resources_used.walltime = 38:47:19
    job_state = R
    queue = centurion
    server = centurion001
    Checkpoint = u
    ctime = Wed Sep 10 17:46:04 2008
    Error_Path = centurion001.cs.virginia.edu:/uf8/jm6dg/fractal/driver/src/scr
	ipt.sh.e10252
    exec_host = centurion039/1
    exec_vnode = (centurion039:ncpus=1)
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Sep 10 21:35:50 2008
    Output_Path = centurion001.cs.virginia.edu:/uf8/jm6dg/fractal/driver/src/sc
	ript.sh.o10252
    Priority = 0
    qtime = Wed Sep 10 17:46:04 2008
    Rerunable = True
    Resource_List.mem = 400mb
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=1
    Resource_List.place = scatter
    Resource_List.select = 1:ncpus=1
    Resource_List.walltime = 100:00:00
    stime = 1221096950
    session_id = 29207
    job_dir = /uf8/jm6dg
    Variable_List = PBS_O_HOME=/uf8/jm6dg,PBS_O_LANG=en_US.UTF-8,
	PBS_O_LOGNAME=jm6dg,
	PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bi
	n:/usr/games:/usr/pbs/bin,PBS_O_MAIL=/var/mail/jm6dg,
	PBS_O_SHELL=/usr/cs/bin/bash,PBS_O_HOST=centurion001.cs.virginia.edu,
	PBS_O_WORKDIR=/uf8/jm6dg/fractal/driver/src,PBS_O_SYSTEM=Linux,
	PBS_O_QUEUE=centurion
    comment = Job run at Wed Sep 10 at 21:35 on (centurion039:ncpus=1)
    etime = Wed Sep 10 17:46:04 2008

PBS Resources

Queues

The CS department has a number of clusters, each of which currently has its own queue. General users should use only the default queue; for permission to use another queue, please send mail to root.
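
Once you have permission, a job can be directed to a specific queue with qsub's standard -q option; for example (queue name illustrative):

qsub -q radio pbs_test.sh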

Centurion (default queue)

This is the default queue; please run your jobs here. If you want to run on another queue, please clear it with root first.

Resources:

126 CPUs (AMD Opteron(tm) Processor 242), 2GB RAM

Limits:

1792MB Mem/Job

Radio

Resources:

48 CPUs (AMD Opteron(tm) Processor 246); Radio[1..5] have 4GB RAM, Radio[6..24] have 2GB RAM

Limits:

Debugging PBS Jobs

Ordinarily, you should compile your program on one of the head/interactive nodes (power[1..6].cs.virginia.edu); the software and operating system environment (e.g., filesystem layout, libraries) on the head nodes is identical to the PBS compute nodes. If your job runs correctly on a head node, it should run correctly when submitted through PBS. Please remember: to run in batch mode, your program must not require interactive input from the user (mouse or keyboard); stdin should come from a file or socket. To run a program interactively, see below.

Most PBS users choose PBS because they need to run very large numbers of jobs: either multiple runs of slightly altered jobs, or the same job run against multiple data sets. Because you will likely be queuing a very large number of jobs, and because we try to minimize the user limits on PBS resources, you can potentially consume all of the department's PBS resources. Please check your jobs to see whether they are running as expected. If you feel a job is hung, please qdel it, and do not submit large batches until you are confident they run as expected.
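
To delete a job, pass qdel the job identifier that qsub returned; for example:

qdel 9563.centurion001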

Normal Error Reporting

PBS gives you two types of error reporting: batch processing errors are reported via email, and the stderr output of jobs is captured and returned to you in a file named <script_name>.e<jobid>.

To get PBS to report more extensively on batch processing errors, use the "-m bae" script option so that PBS reports when it starts your job, when it aborts your job (and why), and when the job ends. PBS will not report jobs that are switched to the "HELD" state because of errors in batch-processing the job itself. If your job is "HELD" (status "H" in qstat), it will stay that way indefinitely (or until the administrator cancels it), so you need to investigate why there were errors! To obtain more detailed information on why the job was held, use:

qstat -f <jobid>.centurion001

If your job has a system hold, please send mail to root@cs.virginia.edu and include the output of that command in your email; we will need to see which host is causing your job to fail.

If your job runs through the batch system successfully, you will get notification that PBS has run your job, and two files back: the stdout and stderr of your job. These are returned to the working directory from which you ran qsub, unless you specify an alternate $PBS_WRK_DIR or a "-o <dir/filename>" option in your PBS script. If your results are not what you expect, examine the stderr file <scriptname>.e<jobid>.
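
For example, for the job submitted earlier (script pbs_test.sh, job 9563) with no -o directive, the returned files would be:

pbs_test.sh.o9563    (stdout)
pbs_test.sh.e9563    (stderr)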

Running Interactively

Interactive Console Shell

The -I option of qsub declares that a job must be run "interactively". The job will be queued and scheduled as any PBS batch job, but when executed, the standard input, output, and error streams of the job are connected through qsub to the terminal session in which qsub is running. To acquire a node in interactive mode, you can construct a trivial PBS script such as the following, which for purposes of this example will be called debug.sh:

#!/bin/sh
#PBS -l walltime=10:00:00

Now submit debug.sh with

qsub -I debug.sh

Once the PBS interactive job executes, the terminal session will be logged into one of the compute nodes allocated by PBS (you will have an interactive bash session on the compute node). The executable can then be invoked manually from the command prompt. After you have completed your interactive session, be sure to exit from the shell on the compute node so that the node can be returned to PBS for other jobs. Exiting the shell terminates your interactive PBS job.
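
The resource request can also be given directly on the qsub command line, with no script file at all (a sketch; values are illustrative):

qsub -I -l nodes=1:ppn=1,walltime=10:00:00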

Interactive GUI

It is best to avoid the need for a graphical user interface when using PBS interactively. If you must use one, e.g. for developing a Matlab script, it is best to use it on the frontend. Keep in mind that an X server must be installed on your own system; the dept. Linux systems already have XFree86 built in, and the Windows systems are deployed with Exceed. If there is an absolute need for a graphical interface on the compute nodes, a more complex process is required. With Exceed under Windows: open a console window and at the prompt, type

      ipconfig

This will return your IP address; for the purposes of this example, suppose it is 128.143.67.37.

Once on the node assigned by PBS, you must set your environment display; type:

      export DISPLAY=cordelia.cs.virginia.edu:0

or

      export DISPLAY=128.143.67.37:0

depending on whether you know the name or only the IP address of your local machine.

If you use tcsh, type:

      setenv DISPLAY jellybean.itc.virginia.edu:0

(and similarly for an IP address). You should now be able to run X applications on the compute node assigned by PBS.
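
To verify that the display is set correctly, try a simple X client first (assuming a standard client such as xclock is installed on the compute node):

      xclock &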


Common Errors

Password-less SSH not properly set up

If your password-free ssh keys are not set up properly, you will get SSH debugging output back (similar to the trace shown under "scp: ambiguous target" below). Please step through the procedures outlined above. Note: old host keys for any of the PBS nodes in your ~<user>/.ssh/known_hosts file can cause problems; be sure to remove all old keys and append the keys from the download above to your known_hosts file.

"bad UID"

This generally happens when you are not logged into one of the authorized front ends; you must submit your jobs from one of the power nodes. If you are submitting from a power node, then it is likely your job is being scheduled on a compute node that is having authentication problems. Please report this to root@cs.virginia.edu with the full output of the email you got back from PBS so we can track down the problem node.

"scp: ambiguous target"

If you have spaces in your path, you may get this error back from PBS when it tries to write the output of your job back to your home directory. You will get an email indicating that the job executed but that the copy of results was left in the 'undelivered' directory on the compute node; specifically, you will see the message above in the output (an extract is shown below). If this happens, check your path to be sure there is no whitespace in it! Unlike GUIs, command-line shells (and other binaries) use whitespace as a field separator (to separate the ARGV elements), and spaces in a path can confuse them. In general, it is a bad habit to use spaces when working on Unix systems; use underscores "_" as non-whitespace placeholders.

In the example below, renaming the path from "CS 654" to "CS_654" fixes the problem.

debug1: Next authentication method: publickey
debug1: Trying private key: /af21/vm9u/.ssh/identity
debug1: Offering public key: /af21/vm9u/.ssh/id_rsa
debug1: Server accepts key: pkalg ssh-rsa blen 277
debug1: read PEM private key done: type RSA
debug1: Authentication succeeded (publickey).
debug1: channel 0: new [client-session]
debug1: Entering interactive session.
debug1: Sending environment.
debug1: Sending env LANG = en_US.UTF-8
debug1: Sending command: scp -v -d -t /af21/vm9u/wrk/CS 644/
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
scp: ambiguous target
vm9u@centurion010:/var/spool/PBS/undelivered$ debug1: channel 0: free: client-session, nchannels 1
debug1: fd 0 clearing O_NONBLOCK
debug1: fd 1 clearing O_NONBLOCK
debug1: Transferred: stdin 0, stdout 0, stderr 0 bytes in 0.0 seconds
debug1: Bytes per second: stdin 0.0, stdout 0.0, stderr 0.0
debug1: Exit status 1
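
The fix itself is a single rename, quoting the old name so the shell sees it as one argument:

mv "CS 654" CS_654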

MPI communication errors

Shared Memory Bug

There is presently a bug in the p4 (default) comm device of the x86_64 mpich library with shared memory. If you run a job using mpiexec with the default comm (mpich-p4) on an SMP node (all our nodes are SMP nodes), you may have a shared memory problem. This problem is likely to occur if you compile and run an MPI job with the default options, request more than one CPU, and get multiple CPUs on the same node (i.e., the most likely scenario). If you request as many nodes as the number of CPUs you want, your job will likely not have any errors.
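
In PBS resource terms (illustrative directive lines):

#PBS -l nodes=1:ppn=4    # four processes share one SMP node: may trigger the bug
#PBS -l nodes=4:ppn=1    # one process per node: avoids the shared-memory path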

The error reported in your error file will look like:

bm_slave_1_478: (0.035156) process not in process table; my_unix_id = 478 my_host=centurion002
bm_slave_1_478: (0.035156) Probable cause:  local slave on uniprocessor without shared memory
bm_slave_1_478: (0.035156) Probable fix:  ensure only one process on centurion002
bm_slave_1_478: (0.035156) (on master process this means 'local 0' in the procgroup file)
bm_slave_1_478: (0.035156) You can also remake p4 with SYSV_IPC set in the OPTIONS file
bm_slave_1_478: (0.035156) Alternate cause:  Using localhost as a machine name in the progroup
bm_slave_1_478: (0.035156) file.  The names used should match the external network names.
bm_slave_1_478:  p4_error: p4_get_my_id_from_proc: 0
p0_477: (98.054688) net_send: could not write to fd=4, errno = 32
rm_14580: (-) net_recv failed for fd = 3
rm_26216: (-) net_recv failed for fd = 3
rm_31238: (-) net_recv failed for fd = 3

If you see something like that, please specify the mpiexec option to not use shared memory:

mpiexec -mpich-p4-no-shmem ./my_mpi_prog

When the next release of the OSC mpiexec for PBS becomes available, we will upgrade to MPICH2 which should resolve this problem.

IPC errors "p4_error: semget failed for setnum:"

Several users have reported that MPI_INITIALIZE() fails because a previous MPI job crashed on a node and left the MPI semaphores locked. The typical error looks like:

rm_13010:  p4_error: semget failed for setnum: 0
p0_15425:  p4_error: net_recv read:  probable EOF on socket: 1
p0_15425: (6.359375) net_send: could not write to fd=4, errno = 32
rm_6498:  p4_error: net_recv read:  probable EOF on socket: 3
rm_12042:  p4_error: net_recv read:  probable EOF on socket: 3

There is a simple utility included with MPI which is used to clean up any semaphores left behind for a given user, called (suggestively) "cleanipcs"; it is located in /usr/sbin on all the MPI-enabled systems and is part of your default environment. There are two ways to invoke this simple command: either before your script executes, or after; the latter is a very simple cleanup.

To clean up any old semaphores left behind from your previous (crashed) jobs, insert the following loop in your PBS script file before you call mpiexec; it loops through all the nodes you are assigned and calls the cleanipcs command:

## Loop through nodes before executing my job to be sure there aren't any MPI shm semaphores left
for i in `cat $PBS_NODEFILE | sort -u` ; do
     echo "removing IPC shm segments on $i"
     ssh $i "/usr/sbin/cleanipcs"
done

After this runs, you should see something similar to this in your output file:

removing IPC shm segments on centurion004
removing IPC shm segments on centurion005
removing IPC shm segments on centurion006
removing IPC shm segments on centurion007
removing IPC shm segments on centurion008
removing IPC shm segments on centurion009
removing IPC shm segments on centurion010
removing IPC shm segments on centurion011
removing IPC shm segments on centurion015
removing IPC shm segments on centurion016
removing IPC shm segments on centurion017
removing IPC shm segments on centurion018
removing IPC shm segments on centurion019


Alternately, you can simply "source" the utility at the end of your script; this is generally good practice anyway, as it ensures proper cleanup after your job completes, even if there is a crash. Just place the following line at the bottom of your script:

# Cleanup any leftover IPC semaphores after execution
. /usr/sbin/cleanipcs

Note the little "." at the beginning of that line; this is the shell operator that tells the interpreter to "source" the file which follows, i.e. read and execute it in the current shell. If you are using bash, the builtin 'source' command can be used instead of the '.'.
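
The two forms are equivalent in bash:

. /usr/sbin/cleanipcs        # POSIX "dot" command
source /usr/sbin/cleanipcs   # bash builtin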

"bad interpreter" error

If you create your script file on a Windows PC with a text editor, your file may be saved using "DOS" text. The end-of-line marker in a DOS/Windows text file is a carriage-return character followed by a linefeed, whereas on a *nix system only a linefeed is used. The extra carriage return is ignored by many *nix applications - they display the file normally and "hide" the extra character. An application that does show it renders it as CTRL+M (shown here in /bin/vi on Solaris):

#!/bin/bash^M
#PBS -l walltime=00:02:00^M
#PBS -l select=4:mpiprocs=1^M
#PBS -m a^M
#PBS -o bigtest.mpich-p4-no-shmem^M
#PBS -j oe^M
^M
^M
# show my node file^M
cat $PBS_NODEFILE^M
^M
cd $PBS_O_WORKDIR^M
mpiexec -mpich-p4-no-shmem ./testMPI^M

Although most editors on the dept. Ubuntu systems will hide this extra character, the *nix shell (bash) and PBS will not ignore it. You will get an error message back from the PBS server that looks like this:

jpr9c@power1
: /af13/jpr9c/work/mpi ; more bigtest.mpich-p4-no-shmem 
No mail for jpr9c
/bin/stty: standard input: Invalid argument
-bash: /var/spool/PBS/mom_priv/jobs/92840.centurion001.SC: /bin/bash^M: bad interpreter: No such file or directory

The system believes that the file name for the interpreter is "bash^M" not "bash".

To work around this, there is a handy utility for stripping off this extra character: "dos2unix", which is installed on the dept. systems. On the Ubuntu systems, all you need to do is run the program once:

jpr9c@power1
: /af13/jpr9c/work/mpi ; dos2unix test.dos.sh

jpr9c@power1
: /af13/jpr9c/work/mpi ; 

That's all! Then you should be able to successfully submit your job.
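
If you are unsure whether a script has DOS line endings before submitting it, the file utility will typically flag them (look for "with CRLF line terminators" in its output):

file test.dos.sh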