Legion Parameter Space Studies
The Legion tutorials offer quick and simple instructions for various key procedures for a Legion system. More complete explanations of all of the procedures discussed here are available on separate pages.
Depending on how your system is set up, you may need to set up your access to your system before you can run Legion commands. This will probably involve running a command such as this:
$ . ~LEGION/setup.sh
or
$ source ~LEGION/setup.csh
The exact syntax will depend on what kind of shell you are using and on where your Legion files are installed. Consult your system administrator for more information.
Parameter Space Studies
- One common problem faced in computational science is efficiently running a program--either serial or parallel--many times with many different inputs. For example, we might want to learn about the lift generated by an airplane wing at various airspeeds, angles of attack, and air pressures. We have a serial program that takes airspeed, angle of attack, and air pressure as input parameters; each individual problem might be small enough to run on a single node. Yet the total runtime of all the jobs might be CPU-years.
What can Legion do to help this situation? Ideally, we might imagine that we could run these jobs on computational resources all over the country, on machines that the user does not necessarily have an account on, using secure methods to access input and output files transparently. The system would also ensure that jobs ruined by system faults would be re-run.
Legion provides most of these capabilities today. We have constructed the first version of a system for parameter space studies, targeted at serial programs and at programmers with no experience or interest in parallel programming. This system consists of a few hundred lines of code and mostly re-uses existing Legion mechanisms.
"Flogging" a Parameter Space Study
- We perform a parameter space study by running the same program on many different inputs ("flogging" the program over the inputs). To do flogging, we need a few basic components:
- A serial program which reads one or more input files, and writes one or more output files, compiled normally for one or more architectures. Source code is not needed.
- A program which writes out a series of input files for all the different parameter combinations wanted by the user. This program is called the "generator."
- A small file which describes the input files and output files read and written by the serial program.
The ideal flogger could take these inputs, and run as many copies of the serial program as possible on all resources available to the Legion system, returning the output files to the user.
Since we are only now completing our first implementation of scheduling in Legion, our prototype flogger has the same interface as the ideal flogger, but also takes a number N as input, and attempts to run N copies of the serial program simultaneously. As these programs finish, more copies are immediately started.
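The "keep N copies running" behavior described above can be illustrated with a small local sketch. This is not Legion code and does not use any Legion API: the function flog, the PeakCounter helper, and the fake_serial_run stand-in are hypothetical names, and a thread pool on one machine stands in for remote execution.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time


def flog(inputs, run_one, n):
    """Apply run_one to every input, keeping at most n runs in flight."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Each worker picks up the next pending input as soon as it finishes
        # its current one, so up to n runs stay active until inputs run out.
        return list(pool.map(run_one, inputs))


class PeakCounter:
    """Records the largest number of runs that were active at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1


def demo(n=3, jobs=6):
    """Run six fake jobs with a limit of three and report the peak overlap."""
    counter = PeakCounter()

    def fake_serial_run(name):
        # Hypothetical stand-in for one run of the serial program.
        with counter:
            time.sleep(0.01)
        return name.replace("in.dat.", "out.dat.")

    names = [f"in.dat.{i:05d}" for i in range(jobs)]
    return flog(names, fake_serial_run, n), counter.peak
```

Calling demo() returns the output names in input order together with the peak number of simultaneous runs, which never exceeds n.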
- Legion uses the legion_run_multi command to run a previously registered serial program with all of the different input files, using a simple specification file to describe the names of the expected input and output files.
The example serial program in this tutorial performs a Monte-Carlo simulation related to the physics of small collections of atoms. The program accepts two inputs: a size of the space, and the fraction of the space occupied by atoms. We will run this program for a number of combinations of sizes and fractions.
- Set up the Legion environment if you have not already done so.
- Use . ~LEGION/setup.sh or source ~LEGION/setup.csh, depending on which shell you are using.
- Start a tty object
- This is where Legion will send all the output the serial program writes to the tty, as well as status information issued by some internal parts of the Legion system.
$ legion_tty mytty
For more information on using tty objects, please see the tutorial or the Basic User Manual (available here in postscript or PDF).
- Go to the directory with sample files
- In Lab 1 you created a subdirectory containing the example files. Go there now:
$ cd ~/UserName/examples/parameter
- Make the example programs. Type
- Generate a series of input files.
- We have provided a generator program:
./generator 4 8 4 0.3 0.51 0.1
- This command line varies the size of the space from four to eight in steps of four (giving 4 and 8), and the coverage fraction from 0.3 to 0.51 in steps of 0.1 (giving 0.3, 0.4, and 0.5). The resulting grid has 2 x 3 = 6 combinations.
If you now run ls in the directory, you will see 6 new input files for the serial program, named in.dat.00000 through in.dat.00005.
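To make the grid concrete, here is a minimal sketch of what a generator of this kind might do. It is not the source of the provided generator program: the helper names stepped and write_inputs, the one-line "size fraction" file format, and the ordering (size varies slowest) are all assumptions made for illustration.

```python
import itertools
import math
import os
import tempfile


def stepped(lo, hi, step):
    """Return lo, lo+step, ... up to hi, with a small tolerance for floats."""
    count = math.floor((hi - lo) / step + 1e-9) + 1
    # Computing each value from its index avoids accumulating rounding error.
    return [round(lo + i * step, 10) for i in range(count)]


def write_inputs(directory, sizes, fractions):
    """Write one numbered input file per (size, fraction) combination."""
    paths = []
    for index, (size, frac) in enumerate(itertools.product(sizes, fractions)):
        path = os.path.join(directory, f"in.dat.{index:05d}")
        with open(path, "w") as handle:
            # Hypothetical file format: the real in.dat layout is not shown
            # in this tutorial.
            handle.write(f"{size} {frac}\n")
        paths.append(path)
    return paths


def demo():
    """Generate the grid from the example command line in a scratch directory."""
    sizes = stepped(4, 8, 4)
    fractions = stepped(0.3, 0.51, 0.1)
    with tempfile.TemporaryDirectory() as workdir:
        paths = write_inputs(workdir, sizes, fractions)
        return [os.path.basename(p) for p in paths]
```

With the arguments from the example command line, the sketch produces two sizes and three fractions, and therefore six files named in.dat.00000 through in.dat.00005.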
Our serial program, serial.c, actually reads from a file named in.dat and writes to a file named out.dat. If you copy one of these input files (e.g., in.dat.00002) to the name in.dat, you can run the serial program serial and get an output file named out.dat:
cp in.dat.00002 in.dat
- The flogger, which we will use to run the serial program on all of the different input files, uses a simple specification file to describe the names of the expected input and output files. Take a look at this specification file:
- You should see these contents:
in in.dat in.dat.*
- in identifies the line as a description of the input files (so it is a keyword). in.dat is the file name that the serial program expects for its input file. The final string, in.dat.*, is a pattern that the flogger will use to match all the different input files in the current directory that will be individually fed to different runs of the serial program. out is a keyword indicating the output file specification, and out.dat is the name of the output file from the serial program. If in.dat.123 is an input file, the flogger will direct the output to out.dat.123.
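The naming rule described above can be sketched in a few lines. This is an illustration only, not the flogger's actual implementation: parse_spec and plan_runs are hypothetical names, the "out out.dat" line is inferred from the description of the out keyword rather than copied from a real specification file, and the sketch assumes the pattern is the expected input name followed by ".*".

```python
import fnmatch


def parse_spec(text):
    """Read 'in NAME PATTERN' and 'out NAME' lines from a specification."""
    spec = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "in":
            spec["in"] = (fields[1], fields[2])
        elif fields[0] == "out":
            spec["out"] = fields[1]
    return spec


def plan_runs(spec, filenames):
    """Map each matching input file to the output name that carries its suffix."""
    in_name, pattern = spec["in"]
    out_name = spec["out"]
    prefix = in_name + "."
    plan = {}
    for name in sorted(filenames):
        # Keep only files that match the pattern, e.g. in.dat.123 for in.dat.*
        if fnmatch.fnmatchcase(name, pattern) and name.startswith(prefix):
            suffix = name[len(prefix):]
            plan[name] = f"{out_name}.{suffix}"
    return plan


# Assumed specification text (the 'out' line is an inference, see above).
EXAMPLE_SPEC = "in in.dat in.dat.*\nout out.dat\n"
```

For example, plan_runs(parse_spec(EXAMPLE_SPEC), ["in.dat.123", "in.dat", "notes.txt"]) pairs in.dat.123 with out.dat.123 and ignores the other two names.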
- Register the serial program with the Legion system
- In order for Legion to transfer your binary executable transparently to wherever the program is run, you need to tell Legion about the binary:
legion_register_program UserName_serial serial $LEGION_ARCH
- This command creates a "class object" in the shared Legion namespace with the name UserName_serial. If you compiled your program for several architectures, you could run this command once for each architecture, and Legion would be capable of running this same program on several different architectures.
- Use the legion_run_multi command to run the serial program on all of these input files:
./legion_run_multi -n 3 -p UserName_serial -f flog_spec
- For each input file, a corresponding output file is created. For example, the result of the serial program run on in.dat.00003 is out.dat.00003.
If legion_run_multi crashes in the middle of your run, you can run it again, and it will skip any input file whose output file (for example, out.dat.00000) already exists.
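The restart behavior amounts to a check for existing output files before each run. The sketch below shows that check in isolation, under stated assumptions: pending_inputs is a hypothetical helper, the in.dat. and out.dat. prefixes follow the example in this tutorial, and any output file that exists is treated as complete (how legion_run_multi itself handles a partially written output file is not covered here).

```python
import os
import tempfile


def pending_inputs(directory, in_prefix="in.dat.", out_prefix="out.dat."):
    """List input files in directory that have no matching output file yet."""
    pending = []
    for name in sorted(os.listdir(directory)):
        if not name.startswith(in_prefix):
            continue
        suffix = name[len(in_prefix):]
        if not os.path.exists(os.path.join(directory, out_prefix + suffix)):
            pending.append(name)
    return pending


def demo():
    """Simulate a run interrupted after two of six outputs were written."""
    with tempfile.TemporaryDirectory() as workdir:
        for i in range(6):
            open(os.path.join(workdir, f"in.dat.{i:05d}"), "w").close()
        for i in (0, 1):
            open(os.path.join(workdir, f"out.dat.{i:05d}"), "w").close()
        return pending_inputs(workdir)
```

In the simulated interrupted run, only the four inputs without outputs (in.dat.00002 through in.dat.00005) remain to be computed.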