10.0 MPI

MPI (Message Passing Interface) is a standard for writing parallel programs in message-passing environments. For more information please see the MPI web site, <http://www-unix.mcs.anl.gov/mpi/>.

The current Legion implementation supports a core MPI interface, which includes message passing, data marshaling, and heterogeneous data conversion. Legion supports legacy (native) MPI codes and provides an enhanced MPI environment using Legion features such as security and placement services. A link-in replacement MPI library uses the primitive services provided by Legion to support the MPI interface. MPI processes map directly to Legion objects.

The remainder of this section is divided into two parts: Legion MPI (section 10.1, below) and native MPI (section 10.2). Legion MPI programs have been adapted to run in Legion, are linked against the Legion libraries, and can only be run on machines that have the Legion binaries installed. Native MPI programs need not fulfill any of these requirements (although they may: a Legion MPI program can be run as a native MPI program, but not vice versa).

10.1 Legion MPI

10.1.1 Task classes

MPI implementations generally require that MPI executables reside in a given place on disk. Legion's implementation objects serve a similar role. We have provided a tool, legion_mpi_register, to register executables of different architectures for use with Legion MPI.

10.1.2 Installing Legion MPI

The Legion MPI library can be found in $LEGION/src/ServiceObjects/MPI.

10.1.3 Compilation

MPI code may be compiled as before. Link it against libLegionMPI and the basic Legion libraries. The final linkage must be performed using a C++ compiler or by passing C++-appropriate flags to ld. A sample Legion MPI makefile is shown in Figure 9 (below).

Figure 9: Sample Legion MPI makefile

C       =       gcc
CC      =       g++
F77     =       g77
OPT     =       -g
INC_DIR =       /home/appnet/Legion/include/MPI
FFLAGS  =       -I$(INC_DIR) -ffixed-line-length-none 

mpitest: mpitest.o
        legion_link -Fortran -mpi mpitest.o -o mpitest

mpitest.o: mpitest.f
        $(F77) -c mpitest.f -I$(INC_DIR) $(FFLAGS)

mpitest_c: mpitest_c.o
        legion_link -mpi mpitest_c.o -o mpitest_c

mpitest_c.o: mpitest_c.c
        $(C) $(CFLAGS) -c mpitest_c.c -I$(INC_DIR)

10.1.4 Register compiled tasks

Run legion_mpi_register. Usage of this tool is:

legion_mpi_register <class name>
   <binary path> <platform type>
   [-help]

The example below registers /myMPIprograms/vdelay (the binary path) as using a Linux architecture.

$ legion_mpi_register vdelay /myMPIprograms/vdelay linux 

If necessary, Legion creates the MPI-specific contexts (/mpi, /mpi/programs, and /mpi/instances), a class object for this program, and an implementation for this program, and registers the name in Legion context space. In a secure system (a Legion system with security turned on), these contexts are placed in /home/<user_name>. If the user is logged in as admin or the system is running without security, the mpi contexts are placed in the / (root) context.

The legion_mpi_register command may be executed several times if you have compiled your program on different architectures.
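For example, to add a Solaris binary of the same program to the vdelay class (the binary path and platform name here are illustrative, not from the distribution):

$ legion_mpi_register vdelay /myMPIprograms/sol/vdelay solaris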

10.1.5 Running the MPI application

MPI programs are started using the program legion_mpi_run:

legion_mpi_run
   {-f <options file> [<flags>]} |
   {-n <num processors> [<flags>]
      <program class> [<args>]}

There are a number of parameters and flags for this command. See page 76 in the Reference Manual for more information.

To control object scheduling, use a specification file. This can be a local file or a Legion file object, and it contains a list of acceptable hosts and how many objects can be run on each host. You can use the legion_make_schedule command to produce such a file or write one by hand. To write a specification file by hand, list one host (by context path) and one optional integer (indicating the number of objects the host can create; the default is 1) per line. A host can be listed multiple times in one file, in which case the integer values accumulate. E.g.:

/hosts/BootstrapHost	5
/hosts/slowMachine1
/hosts/bigPBSqueue 	100
/hosts/slowMachine2 	1 

Use the -hf or -HF flag to use a specification file.
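For example, assuming the list above is saved in a local file named schedule (a hypothetical name), and assuming -hf names a local file while -HF names a Legion file object:

$ legion_mpi_run -n 4 -hf schedule /mpi/programs/mpitest_c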

You can start multiple MPI applications with an option file. All of the applications must have been previously registered (with legion_mpi_register) and must use a common MPI_COMM_WORLD. As the example below shows, an option file has one application per line, including any necessary arguments as they would appear on the command line. Each line can also contain any of the legion_mpi_run flags except the -f flag.

-n 2 /mpi/programs/mpitest
-n 3 /mpi/programs/mpitest_c 

This would start a run of five instances (two of mpitest and three of mpitest_c). All five instances would use a single MPI_COMM_WORLD.

You must include -f when you run legion_mpi_run in order to use an option file. Note that if you use an option file, the -n flag, program name, and any arguments will be ignored. Any other <flags> will be treated as defaults and applied to all processes executed by this command, unless otherwise specified in the option file.
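For example, if the two lines above are saved in a local file named myoptions (a hypothetical name):

$ legion_mpi_run -f myoptions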

If you don't use an option file, you must provide an MPI program name on the command line, along with any flags and arguments. For example:

$ legion_mpi_run -n 2 /mpi/programs/vdelay 

Note that the application name here is the full context space path for the class created by legion_mpi_register in section 10.1.4.

As the program is running you can view your application's instances in /mpi/instances/<program_name>:

$ legion_ls /mpi/instances/vdelay 

In a secure system the instances will be in /home/<user_name>/mpi/instances/<program_name>. If you run multiple versions of an application simultaneously, you can use the -p flag to specify a different context to hold your instance LOIDs.

To view all of your instances, use legion_list_instances:

$ legion_list_instances /mpi/programs/<program_name> 

To destroy them, use the legion_destroy_instances command. Be aware, though, that this command will destroy all of the class's instances.
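For example (assuming the command takes the same class-path argument as legion_list_instances):

$ legion_destroy_instances /mpi/programs/<program_name>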

You can use legion_mpi_probe to check on jobs you've started on remote hosts and to move input and output files between your local and execution hosts. Please see page 74 in the Reference Manual for more information.

MPI programs cannot be deactivated.

10.1.6 Example

A sample MPI program is included in a new Legion system, in both C (mpitest_c.c) and Fortran (mpitest.f) versions. Go to the directory containing the program's binary files:

$ cd $LEGION/src/ServiceObjects/MPI/examples 

You must then register whichever version you wish to use with the legion_mpi_register command. The sample output below uses mpitest_c and shows Legion creating three new contexts (mpi, programs, and instances, the last two being subcontexts of mpi), creating and tagging a task class, and registering the implementation of the mpitest_c program.

$ legion_mpi_register mpitest_c \
      $LEGION/bin/$LEGION_ARCH/mpitest_c linux
"/mpi" does not exist - creating it
"/mpi/programs" does not exist - creating it
"/mpi/instances" does not exist - creating it
Task class "/mpi/programs/mpitest_c" does not exist -
	creating it
"/mpi/instances/mpitest_c" does not exist - creating it
Set the class tag to: mpitest_c
Registering implementation of mpitest_c
$ 

In order to view the output of the program, you will need to create and set a tty object if you have not already done so (see page 71).

$ legion_tty /context_path/mytty 

You can now run the program. Since you are not using an option file, you must use the -n flag to specify how many processes the program should run. If you do not specify which hosts the program should run its processes on, Legion will arbitrarily choose hosts from the host objects in your system. Here it runs three processes on three different hosts.

$ legion_mpi_run -n 3 /mpi/programs/mpitest_c
Hello from node 0 of 3 hostname Host2
Hello from node 1 of 3 hostname Host3
Node 0 has a secret, and it's 42
Node 1 thinks the secret is 42
Hello from node 2 of 3 hostname BootstrapHost
Node 2 thinks the secret is 42
Node 0 thinks the secret is 42
Node 0 exits barrier and exits.
Node 1 exits barrier and exits.
Node 2 exits barrier and exits.
$ 

Another time it might run the three processes on two hosts.

$ legion_mpi_run -n 3 /mpi/programs/mpitest_c
Hello from node 1 of 3 hostname BootstrapHost
Hello from node 0 of 3 hostname Host3
Node 0 has a secret, and it's 42
Node 1 thinks the secret is 42
Hello from node 2 of 3 hostname BootstrapHost
Node 2 thinks the secret is 42
Node 0 thinks the secret is 42
Node 0 exits barrier and exits.
Node 1 exits barrier and exits.
Node 2 exits barrier and exits.
$ 

10.1.7 Input and output files

If your program requires input files and/or generates output files, you may need to move these files between your local host and one or more remote hosts. There are two ways to accomplish this when you start the job: through flags (in an option file or on the command line) and through subroutines in your code. After the job has started, you can use legion_mpi_probe (see page 74 in the Reference Manual).

10.1.7.1 Flags

Like legion_run, legion_mpi_run uses -in/-out and -IN/-OUT flags to copy input and output files between your local host or context space and a remote host. The -in/-IN flags copy a specified input file to a job's remote host before execution. The -out/-OUT flags copy a specified output file from the remote host back to your local host or context space once the program has finished. You can repeat these flags to copy multiple files.

Note that by default files named with these flags are distributed to node 0 only. Use the -a flag if you need to get these files to other nodes. Be aware, though, that this flag changes the naming pattern of your output files: the object's MPI object identification is appended to the end of the file name. For example:

$ legion_mpi_run -n 2 -a -OUT done \
    /mpi/programs/mpiFoo 

When mpiFoo has finished, it will have produced an output file named done on each of the two nodes. Legion will copy both files to your local directory and name them done.1183374393-0.0 and done.1183374393-0.1.

You can also use wildcards to work with groups of files. The following characters are wild when used with -in/-out and -IN/-OUT:

*	match 0 or more characters
?	match any one character
[-]	match any character listed between the brackets (use these to specify a range of characters)
\	treat the following character as a literal

So, you could use * to identify any file whose name begins with "done." as an input file for mpiFoo:

$ legion_mpi_run -n 2 -IN done.* \
	/mpi/programs/mpiFoo 

Similarly, if you want to use done.1, done.2, done.3 ... done.9 as your input files, you can use square brackets to identify them as a group:

$ legion_mpi_run -n 2 -IN done.[0-9] \
    /mpi/programs/mpiFoo 

Wildcards can only be used with file names, not with directories.

Please see page 76 in the Reference Manual for more information on these flags.

10.1.7.2 Subroutines

You may prefer to edit your program so that it transparently accesses other files for input or output. Legion has three subroutines that allow your program to read and write files in Legion context space: LIOF_LEGION_TO_TEMPFILE, LIOF_CREATE_TEMPFILE, and LIOF_TEMPFILE_TO_LEGION.

The first, LIOF_LEGION_TO_TEMPFILE, lets you copy a file from Legion context space into a local file. For example,

call LIOF_LEGION_TO_TEMPFILE ('input', INPUT, ierr)
open (10, file = INPUT, status = 'old') 

will copy a file with the Legion filename input into a local file and store the name of that local file in the variable INPUT. The second line opens the local copy of the file.

The other two subroutines can be used to create and write to a local file. For example:

call LIOF_CREATE_TEMPFILE (OUTPUT, IERR)
open (11, file = OUTPUT, status = 'new')
call LIOF_TEMPFILE_TO_LEGION (OUTPUT, 'output', IERR) 

The first line creates a local file and stores the file's name in the variable OUTPUT. The second line opens the local copy of the file. The third line, called after your program has written and closed the file, copies the local file named in OUTPUT into a Legion file with the name output.

While this approach allows you to run MPI programs and transparently read and write files remotely, it has one limitation: it does not support heterogeneous conversion of data. If you run this program on several machines that have different integer formats, such as Intel PCs (little-endian) and IBM RS/6000s (big-endian), using unformatted I/O may produce surprising results. If you want to use such a heterogeneous system, you will have to either use formatted I/O (all files are text) or use the "typed binary" I/O calls instead of Fortran READ and WRITE statements. These "typed binary" I/O calls are discussed in "Buffered I/O library, low impact interface" in the Legion Developer Manual.

10.1.8 Scheduling MPI processes

As noted above, Legion may not choose the same number of host objects as the number of processes specified with the -n parameter. If you specify three processes, Legion may arbitrarily choose to run them on one, two, or three hosts.

We will run mpitest on four hosts, which we will assume are already part of the system. If you wish to specify which hosts are used to run your processes, use the -h flag. It tells Legion to look in a given context for a set of potential hosts for running your program. You can create this context with legion_make_hostlist (see page 30 in the Reference Manual) or by hand.

To do it by hand, create the new context, then link the desired hosts to names in the new context. For example, to create an mpi-hosts context, run legion_context_create:

$ legion_context_create mpi-hosts 

Then create context names for the hosts that you wish to use for this program. Use legion_ln to map the new names to the host objects' existing LOIDs.

$ legion_ln /hosts/Host1 /mpi-hosts/mpitest_Host1
$ legion_ln /hosts/Host2 /mpi-hosts/mpitest_Host2
$ legion_ln /hosts/Host3 /mpi-hosts/mpitest_Host3
$ legion_ln /hosts/Host4 /mpi-hosts/mpitest_Host4 

The program can now be run with legion_mpi_run, using -h to specify that Legion should look in /mpi-hosts for hosts and -n to specify that they be spread over four nodes. Note that the nodes are placed on the hosts in order.

$ legion_mpi_run -h /mpi-hosts -n 4 /mpi/programs/mpitest
Hello from node 0 of 4 hostname Host1.xxx.xxx
Hello from node 1 of 4 hostname Host2.xxx.xxx
Node 0 has a secret, and it's 42
Hello from node 2 of 4 hostname Host3.xxx.xxx
Node 1 thinks the secret is 42
Node 2 thinks the secret is 42
Hello from node 3 of 4 hostname Host4.xxx.xxx
Node 3 thinks the secret is 42
Node 0 thinks the secret is 42
Node 0 exits barrier and exits.
Node 1 exits barrier and exits.
Node 2 exits barrier and exits.
Node 3 exits barrier and exits.
$ 

The number of processes and the number of host objects do not have to match: if you want to run only two processes on the hosts named in mpi-hosts, use -n 2.

$ legion_mpi_run -h /mpi-hosts -n 2 /mpi/programs/mpitest
Hello from node 0 of 2 hostname Host1.xxx.xxx
Node 0 has a secret, and it's 42
Hello from node 1 of 2 hostname Host2.xxx.xxx
Node 1 thinks the secret is 42
Node 0 thinks the secret is 42
Node 0 exits barrier and exits.
Node 1 exits barrier and exits.
$ 

The program runs on the first two hosts listed in the mpi-hosts context.[1]

10.1.9 Debugging support

A utility program, legion_mpi_debug, allows the user to examine the state of MPI objects and print out their message queues. This is a handy way to debug deadlock. Usage is:

legion_mpi_debug [-q]
   {[-c] <program instance context>}
   [-help]

The <program instance context> is the context that is holding your program's instances. Normally it is /mpi/instances/<program_name> or (in a secure system) /home/<user_name>/mpi/instances/<program_name>, unless you specified otherwise (with -p) when you started the program. The command will return a list of all of the program's instances and what each one is doing.

The -q flag will list the contents of the queues. For example:

$ legion_mpi_debug -q /mpi/instances/vdelay
Process 0 state: spinning in MPI_Recv source 1 tag 0 comm?
Process 0 queue:

(queue empty)

Process 1 state: spinning in MPI_Recv source 0 tag 0 comm?
Process 1 queue:

MPI_message source 0 tag 0 len 8
$ 

Do not be alarmed that process 1 hasn't picked up the matching message in its queue: it will be picked up after the command has finished running.

There are a few limitations in legion_mpi_debug: if an MPI object doesn't respond, the command will hang, and it won't go on to query additional objects. An MPI object can respond only when it enters the MPI library; if it is in an endless computational loop, it will never reply.

Output from a Legion MPI object to standard output or Fortran unit 6 is automatically directed to a tty object (see page 71). Note, though, that this output will be mixed with other output in your tty object, so you may wish to segregate MPI output. Legion MPI provides a special set of library routines for printing on the Legion tty (created with the legion_tty command); they cause the MPI object's output to be printed in segregated blocks.

LMPI_Console_Output (string)
	Buffers a string to be output later.

LMPI_Console_Output_Flush ()
	Flushes all buffered strings to the tty.

LMPI_Console_Output_And_Flush (string)
	Buffers and flushes in one call.

The Fortran version of these routines adds a carriage return at the end of each string, and omits trailing spaces. The C version does not.
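For example, a minimal C sketch (the rank and result variables are hypothetical, and sprintf requires <stdio.h>); note the explicit newlines, since the C routines do not add them:

	char line[128];
	sprintf(line, "node %d: result = %g\n", rank, result);
	LMPI_Console_Output(line);                    /* buffered, not yet printed */
	LMPI_Console_Output_And_Flush("node done\n"); /* flushes the whole block */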

10.1.10 Checkpointing support

As you increase the number of objects or the length of time for which an MPI application runs, you increase the probability that the application will crash due to a host or network failure.

To tolerate such failures, we provide a checkpointing library for SPMD-style (Single Program Multiple Data) applications.[2] SPMD-style applications are characterized by a regular communication pattern. Typically, each task is responsible for a region of data and periodically exchanges boundary information with a neighboring task.

We exploit the regularity of SPMD applications to provide an easy-to-use checkpointing library. Users are responsible for (1) deciding when to take a checkpoint and (2) writing functions to save and restore checkpoint state. The library provides checkpoint management support, i.e. fault detection, checkpoint storage management, and recovery of applications. Furthermore, users are responsible for taking checkpoints at a consistent point in the program. In the general case, this requirement would overwhelm most programmers. However, in SPMD applications, there are natural places to take checkpoints, i.e. at the top or bottom of the main loop [2].

10.1.10.1 Example

To use the checkpoint library, users are responsible for the following tasks: (1) recovery of checkpoint data, (2) saving the checkpoint data, and (3) deciding how often to checkpoint.

The following example is included in the standard Legion distribution ($LEGION/src/ServiceObjects/MPI/ft_examples) and consists of tasks that perform some work and exchange boundary information at each iteration.

Example: Sample uses of the checkpoint library

//
// Save state. State consists of the iteration count.
//
void do_checkpoint(int iteration) {
	int position = 0;

	// pack state into a buffer
	MPI_Pack(&iteration, 1, MPI_INT, user_buffer,
		1000, &position, MPI_COMM_WORLD);

	// save buffer into checkpoint
	MPI_FT_Save(user_buffer, position);

	// done with the checkpoint
	MPI_FT_Save_Done();
}

//
// Restore state. State consists of the iteration count.
//
void do_recovery(int rank, int *start_iteration) {
	int position = 0, size;

	// restore buffer from checkpoint
	size = MPI_FT_Restore(user_buffer, 1000);

	// extract state from buffer
	MPI_Unpack(user_buffer, size, &position,
		start_iteration, 1, MPI_INT, MPI_COMM_WORLD);
}

//
// C pseudo-code
// Simple SPMD example
//
// mpi_stencil_demo.c
//
main(int argc, char **argv) {
	// declaration of variables omitted...

	// MPI initialization
	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &nodenum);
	MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

	// Determine if we are in recovery mode
	recovery = MPI_FT_Init(nodenum);
	if (recovery)
		do_recovery(nodenum, &start_iteration);
	else
		start_iteration = 0;

	for (iteration = start_iteration;
	     iteration < NUM_ITERATIONS; ++iteration) {

		exchange_boundary();

		// ...do some work here...

		// take a checkpoint every 10th iteration
		if (iteration % 10 == 0)
			do_checkpoint(iteration);
	}
}

The first function called is MPI_FT_Init(), which initializes the checkpointing library and returns TRUE if the object is in recovery mode. If in recovery mode, we restore the last consistent checkpoint. Otherwise, we proceed normally and periodically take a checkpoint.

In do_checkpoint() we save state by packing variables into a buffer using MPI_Pack() and then calling MPI_FT_Save(). To save more than one data item, we can pack several items into a buffer (by repeatedly calling MPI_Pack) and then call MPI_FT_Save(). When the state to be saved is very large, we can break it down into smaller chunks and call MPI_FT_Save() for each chunk. When all data items for the checkpoints have been saved, we call MPI_FT_Save_Done().
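For instance, here is a minimal sketch of saving two items, an iteration count and a data array, in one checkpoint (not from the distribution; the do_checkpoint2 name, the buffer size, and the grid array are hypothetical):

/* a hypothetical two-item checkpoint: pack both items, save once */
void do_checkpoint2(int iteration, double *grid, int n) {
	char buffer[65536];          /* hypothetical buffer size */
	int position = 0;

	MPI_Pack(&iteration, 1, MPI_INT, buffer, sizeof(buffer),
		&position, MPI_COMM_WORLD);
	MPI_Pack(grid, n, MPI_DOUBLE, buffer, sizeof(buffer),
		&position, MPI_COMM_WORLD);

	MPI_FT_Save(buffer, position);   /* save the packed bytes */
	MPI_FT_Save_Done();              /* mark the checkpoint complete */
}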

In do_recovery() we recover the state. Note that the sequences of MPI_Unpack() and MPI_FT_Restore() must be mirror images of the sequences of MPI_Pack() and MPI_FT_Save() in do_checkpoint().
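A matching recovery sketch (same hypothetical names) must unpack the items in exactly the order they were packed:

/* hypothetical mirror image of do_checkpoint2 */
void do_recovery2(int *iteration, double *grid, int n) {
	char buffer[65536];
	int position = 0;
	int size = MPI_FT_Restore(buffer, sizeof(buffer));

	MPI_Unpack(buffer, size, &position, iteration, 1,
		MPI_INT, MPI_COMM_WORLD);
	MPI_Unpack(buffer, size, &position, grid, n,
		MPI_DOUBLE, MPI_COMM_WORLD);
}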

10.1.10.2 API (C & Fortran)

C API                             Fortran API                        Description

int MPI_FT_Init(int rank)         MPI_FT_INIT(RANK, RECOVERY)        Initializes the checkpointing
                                  INTEGER RANK, RECOVERY             library. Returns TRUE if in
                                                                     recovery mode.

int MPI_FT_On()                   MPI_FT_ON(FT_IS_ON)                Returns TRUE if this application
                                  INTEGER FT_IS_ON                   is running with checkpointing
                                                                     turned on.

void MPI_FT_Save(                 MPI_FT_SAVE(BUFFER, SIZE, IERR)    Saves checkpoint data onto the
    char *buffer, int size)       <type> BUFFER(*)                   Checkpoint Server.
                                  INTEGER SIZE, IERR

void MPI_FT_Save_Done()           MPI_FT_SAVE_DONE(IERR)             Done with this checkpoint.
                                  INTEGER IERR

int MPI_FT_Restore(               MPI_FT_RESTORE(BUFFER, SIZE,       Restores data from the current
    char *buffer, int size)           RET_SIZE)                      checkpoint.
                                  <type> BUFFER(*)
                                  INTEGER SIZE, RET_SIZE
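For example, a program can guard its checkpoint calls so that the same binary also runs correctly when fault tolerance is off (a sketch, reusing the iteration variable from the example above):

	/* take checkpoints only when running with fault tolerance on */
	if (MPI_FT_On() && iteration % 10 == 0)
		do_checkpoint(iteration);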

10.1.10.3 Running the above example

First, run the legion_ft_initialize script on the command line. You should see the following output:

$ legion_ft_initialize
legion_ft_initialize: Could not find a CheckpointStorage object ...
Attempting to register one
Stateful class "/CheckpointStorage" does not exist - creating it.
Registering implementation for class "/CheckpointStorage"
legion_ft_initialize: success initializing FT SPMD Checkpointing support
$

Note that you only need to run this command once.

Compile the application and register the program with Legion:

$ cd $LEGION/src/ServiceObjects/MPI/ft_examples
$ make reg_stencil 

Create an instance of the CheckpointStorage server and place it in context space:

$ legion_create_object /CheckpointStorage /home/<user_name>/Foo

There is a set of special legion_mpi_run flags for running in a fault tolerant mode:

-ft
	Turn on fault tolerance.

-s <checkpoint server>
	Specifies the checkpoint server to use.

-R <checkpoint server>
	Recovery mode: restart from the last consistent checkpoint on the given server.

-g <ping interval>
	Ping interval, in seconds. Default is 90 seconds.

-r <reconfiguration interval>
	When a failure is detected (an object has not responded in the last <reconfiguration interval> seconds), restart the application from the last consistent checkpoint. Default is 360 seconds.

These flags are used in specific combinations.

  • You must use -ft in all cases
  • You must use either -s or -R
  • The -g and -r flags are optional

Using these rules, run the application and specify a ping interval and reconfiguration time:

$ legion_mpi_run -n 2 -ft -s /home/<user_name>/Foo -g 200 \
   -r 500 /mpi/programs/mpi_stencil_demo 

10.1.10.4 Recovering from failure

If an object fails (i.e., has not responded in the last <reconfiguration interval> seconds), the application automatically restarts itself.

10.1.10.5 Restarting application

Under catastrophic circumstances (e.g., prolonged network outages), the recovery protocol described above may not work. In this case users can restart (i.e., rerun) the entire application with the -R flag.[3] Continuing the above example:

$ legion_mpi_run -n 2 -ft -R /home/<user_name>/Foo -g 200 \ 
   -r 500 /mpi/programs/mpi_stencil_demo 

The application will restart from the last consistent checkpoint. Note that the number of processes and the checkpoint server name must match those of the original run.

Once the application has successfully completed, you should delete the checkpoint server object by typing:

$ legion_rm -f /home/<user_name>/Foo 

If you would like to test the restart mechanism, you can simulate a failure and then restart the application: kill the running application by typing ^C (Control-C), then rerun it with the -R flag as shown above.

10.1.10.6 Compiling/makefile

To compile C applications with the checkpointing library, link your application with the -lLegionMPIFT and -lLegionFT libraries. For Fortran applications, also link with the -lLegionMPIFTF library.
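As a rough sketch of a C link line (whether legion_link passes extra -l options through to the linker as shown is an assumption; the sample makefile in ft_examples is authoritative):

mpi_stencil_demo: mpi_stencil_demo.o
        legion_link -mpi mpi_stencil_demo.o -lLegionMPIFT -lLegionFT \
            -o mpi_stencil_demo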

A sample makefile and application is provided in the standard Legion distribution package in the $LEGION/src/ServiceObjects/MPI/ft_examples directory.

10.1.10.7 Another example

For a more complicated example, please look at the application in $LEGION/src/Applications/ocean. This Fortran code contains the skeleton for an ocean modeling program.

10.1.10.8 Limitations

We use pings and timeout values to determine whether an object is alive or dead. If the timeout values are set too low, we may falsely declare an object dead. We recommend setting the ping interval and reconfiguration time to conservatively high values.

We do not handle file recovery. When running applications that write data to files, users are responsible for recovering the file pointer.

10.1.11 Functions supported

Please note that data packed/unpacked using the MPI_Pack and MPI_Unpack functions will not be properly converted in heterogeneous networks.

All MPI 1.1 functions are currently supported in Legion MPI. Some MPI 2.0 functions are also supported; please contact us at legion-help@virginia.edu for more information.

10.1.12 Running a Legion MPI code with the fewest changes

Please see the Legion web site for an on-line tutorial called "Running an MPI code in Legion with the fewest changes." The tutorial shows how to adapt a sample MPI program to run in Legion. Please note, however, that while this "fewest changes" approach allows you to run the program and to transparently read and write files remotely, it does not support heterogeneous conversions of data. If you run this program on several machines which have different integer formats, such as Intel PCs (little-endian) and IBM RS/6000s (big-endian), using unformatted I/O will produce surprising results. If you want to use such a heterogeneous system, you will have to either use formatted I/O (all files are text) or use the "typed binary" I/O calls instead of Fortran READ and WRITE statements. These "typed binary" I/O calls are discussed in "Buffered I/O library, low impact interface" in the Legion Developer Manual.

10.2 Native MPI

Native MPI code is code written for an MPI implementation. Legion supports running native MPI programs without any changes. You only need the binary and a host to run it on. You can, if you wish, adapt your program to make Legion calls.

To run these programs, you will need a Legion host object with native MPI properties set. Please see section 15.0 in the System Administrator manual for more information.

10.2.1 Task classes

Native MPI implementations generally require that MPI executables reside in a given place on disk. We have provided a tool, legion_native_mpi_register, to register executables of different architectures for use with native MPI.

10.2.2 Compilation

Native MPI code may be compiled independently of Legion, unless your code makes Legion calls (see section 10.2.5, below). In that case, you must link your program to the Legion libraries. A sample makefile for this situation is shown in Figure 10.

Figure 10: Sample makefile for native MPI code that makes Legion calls

CC      =       mpiCC
MPI_INC =  /usr/local/mpich/include

mpimixed:       mpimixed.c
        $(CC) -g -I$(MPI_INC) -I$(LEGION)/include -D$(LEGION_ARCH) -DGNU \
             $(LEGION)/lib/$(LEGION_ARCH)/$(CC)/libLegion1.$(LEGION_LIBRARY_VERSION).so \
             $(LEGION)/lib/$(LEGION_ARCH)/$(CC)/libLegion2.$(LEGION_LIBRARY_VERSION).so \
             $(LEGION)/lib/$(LEGION_ARCH)/$(CC)/libLegion1.$(LEGION_LIBRARY_VERSION).so \
             $(LEGION)/lib/$(LEGION_ARCH)/$(CC)/libBasicFiles.so $< -o $@

10.2.3 Register compiled tasks

Run legion_native_mpi_register. Usage of this tool is:

legion_native_mpi_register <class name>
   <binary path> <architecture>
   [-help]

The example below registers /myMPIprograms/charmm (the binary path) as using a Linux architecture.

$ legion_native_mpi_register charmm \ 
    /myMPIprograms/charmm linux 

You can register a program multiple times, perhaps with different architectures or on different platforms. If you have not registered this program before, this creates a context in the current context path (the context's name will be the program's class name: charmm, in this case) and registers the name in Legion context space.

10.2.4 Running a native MPI application

MPI programs are started using the program legion_native_mpi_run. Usage is:

legion_native_mpi_run
   [-v] [-a <architecture>]
   [-h <host context path>]
   [-IN <local input file>]
   [-OUT <local result file>]
   [-in <Legion input file>]
   [-out <Legion result file>]
   [-n <nodes>] [-t <minutes>]
   [-legion] [-help] [-debug]
   <program context path>
   [<arg 1> <arg 2> ... <argn>]

The following parameters are used for this command:

-h <host context path>
	Specify the host on which the program will run. The default setting is the system's default placement, which is to pick a compatible host and try to run the object. If the host fails, the system will try another compatible host.

-v
	Verbose option.

-a <architecture>
	Specify the architecture on which to run.

-IN <local input file>
	Specify an input file that should be copied to the remote run from the local file system.

-OUT <local result file>
	Specify a result file that should be copied back from the remote run to the local file system.

-in <Legion input file>
	Specify an input file that should be copied to the remote run from Legion context space.

-out <Legion result file>
	Specify a result file that should be copied out from the remote run into Legion context space.

-n <nodes>
	Specify the number of nodes for the remote MPI run.

-t <minutes>
	Specify the amount of time requested for the remote MPI run. This option is only meaningful if the host selected to run the remote program enforces time limits for jobs.

-legion
	Indicate that the application makes Legion calls (see section 10.2.5, below).

<arg1> <arg2> ... <argn>
	Arguments to be passed to the remote MPI program.

You can examine the running objects of your application using:

$ legion_ls program_name 

This context will have an entry for each object in this MPI application.

If you need to kill the program and its implementations, run:

$ legion_rm program_name 

10.2.5 Making Legion calls from native MPI programs

The -legion flag indicates that your MPI code makes Legion calls (e.g., BasicFile calls). If you do not use this flag and your code attempts to make Legion calls, your program may not run correctly. If your program makes Legion calls, you must:

  • link your program with the Legion libraries (see section 10.2.2, above)
  • specify the -legion flag when you run the code

10.2.6 Example

We have provided two sample native MPI programs, available in $LEGION/src/ServiceObjects/MPI/examples/. The first, nativempi.c, produces exactly the same output as the mpitest_c.c program discussed in section 10.1.6. The second, mixedmpi.c, is the same code but with Legion calls added.

Please note two important adaptations that were made to mixedmpi.c in order to access Legion files. There are two new include files:

#include "legion/Legion_libBasicFiles.h"
#include "legion/LegionNativeMPIUtils.h"

and a new function call:

LegionNativeMPI_init(&argc, &argv);

These lines must be added to a native MPI code that makes any kind of Legion call.
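Put together, the skeleton of such an adapted program might look like the sketch below. This is not the actual mixedmpi.c; in particular, calling LegionNativeMPI_init before MPI_Init is an assumption, so check the sample source for the actual placement.

#include <mpi.h>
#include "legion/Legion_libBasicFiles.h"
#include "legion/LegionNativeMPIUtils.h"

int main(int argc, char **argv) {
	/* assumed to come first; see mixedmpi.c for the actual ordering */
	LegionNativeMPI_init(&argc, &argv);

	MPI_Init(&argc, &argv);

	/* ... MPI communication and Legion BasicFile calls ... */

	MPI_Finalize();
	return 0;
}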

10.2.7 Scheduling native MPI processes

Legion does not schedule native MPI processes. When a program is run with legion_native_mpi_run, the local mpirun utility on the destination host schedules the processes.


[1] If you use the -h flag in this way, be sure to double-check the order in which the host objects are listed in your context.

[2] Support for generic communication patterns will be provided in a future release of Legion.

[3] The restart mechanism can also be used to migrate applications by specifying a different set of hosts with the -h flag.
