1. What does this command do?
2. What is the difference between legion_run and legion_run_multi?
3. What are the prerequisites?
4. What are the required parameters?
5. What is a schedule file?
The specification file
6. What is a specification file?
7. What are the keywords?
8. How does the pattern field work?
9. How does a specification file work?
10. A step by step look at a specification file in action
11. What is an exception file?
After the job has started
12. What happens if a job fails?
13. How do I control where the program runs?
Problems while the program is running
14. I need an empty input file to start my program, but Legion ignores it
15. I need an empty output file to signal that my program is done, but Legion is marking the job as incomplete
16. Can I redirect legion_run_multi's output to a file?
17. Can I redirect a particular job's -debug output to a file?
18. Can I use the debug option to debug my program?
19. Some hints to make life easier
What does this command do?
- The legion_run_multi command is essentially a script that distributes copies of a serial program on one or more remote hosts and executes a Legion command on each copy. The default command is legion_run, but you can use the -e flag to run others. The command is fully documented here.
What is the difference between legion_run and legion_run_multi?
- Both of these commands are used to run programs on remote Legion hosts. But where legion_run executes a single copy of a program on a single host, legion_run_multi executes multiple copies of a serial program on one or more remote hosts. You can also use legion_run_multi to run other Legion commands.
What are the prerequisites?
- You must have previously registered the program with Legion, with either legion_register_program or legion_register_runnable. If there are any required input files, they must be visible in your local file space or context space. You must provide the number of processors that can run at any time and/or a schedule file.
You must also have a specification file.
What are the required parameters?
- You must provide the program's Legion class path (created when you registered the program in Legion), a schedule file path or number of processors (via the -s or -n flags), a specification file path, and any required command-line arguments.
What is a schedule file?
- A schedule file is a text file that lists host context paths and the maximum number of processes that can run on each host. If you use a schedule file, those hosts will be used as a pool for distributing your jobs. You cannot use this file to determine which jobs run on which hosts (use an exception file for that). The format is one host per line:
What is a specification file?
- A specification file is a text file that provides the names of the input and output files your program expects to find in the remote execution space. Each line describes one file and has three fields, which indicate whether the file is input or output, where it can be found, and how it should be named. It looks like this:
KEYWORD FILENAME PATTERN
- keyword: The keyword field tells the file's location and purpose.
- filename: The filename field is the name of the file that the program expects to find when it executes.
- pattern: The pattern field is the actual file name or naming pattern of the input prior to execution and output files after execution.
What are the keywords?
- There are nine keywords: constant, in, out, CONSTANT, IN, OUT, stdin, stdout, and stderr.
Files labeled constant, in, or out can be found in context space. Files labeled CONSTANT, stdin, stdout, stderr, IN, or OUT can be found in local file space. A constant/CONSTANT keyword refers to a single input file that is used by every run (e.g., a password file). The std* keywords refer to standard input, output, and error files for the program. The in/IN and out/OUT keywords indicate sets of input and output files.
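The keyword rules above can be sketched in a few lines of Python. This is only an illustration, not part of Legion: the names SpecEntry and parse_spec_line are hypothetical, and the sketch assumes that every line has exactly three whitespace-separated fields (lines that use the std* keywords may be laid out differently).

```python
from typing import NamedTuple

# Keyword groups as described above: lowercase names refer to context space,
# uppercase names and the std* names refer to local file space.
CONTEXT_KEYWORDS = {"constant", "in", "out"}
LOCAL_KEYWORDS = {"CONSTANT", "IN", "OUT", "stdin", "stdout", "stderr"}


class SpecEntry(NamedTuple):
    """Hypothetical record for one specification-file line."""
    keyword: str   # one of the nine keywords
    filename: str  # name the program expects on the remote host
    pattern: str   # file name or naming pattern before/after execution
    space: str     # "context" or "local"


def parse_spec_line(line: str) -> SpecEntry:
    """Split one KEYWORD FILENAME PATTERN line and classify its keyword.

    Assumption: exactly three whitespace-separated fields per line.
    """
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    keyword, filename, pattern = fields
    if keyword in CONTEXT_KEYWORDS:
        space = "context"
    elif keyword in LOCAL_KEYWORDS:
        space = "local"
    else:
        raise ValueError(f"unknown keyword: {keyword!r}")
    return SpecEntry(keyword, filename, pattern, space)


entry = parse_spec_line("IN Foo input*.txt")
print(entry.keyword, entry.filename, entry.pattern, entry.space)
```

Keeping the two keyword groups in separate sets makes the case-sensitivity explicit: IN and in are different keywords that point at different file spaces.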
How does the pattern field work?
- The pattern field refers to sets of files used by and generated from your program. It can contain a single file name (when referring to a constant, for example) or a naming pattern. A naming pattern identifies a set of files with a common pattern in their names. For example, the text files inputA.txt, inputB.txt, and inputC.txt all have "input" and "txt" in their names.
You can use the naming pattern input*.txt to refer to all three files:
IN Foo input*.txt
These three files will be treated as three jobs: job A, B, and C. The IN keyword indicates that the naming pattern refers to files in your local directory.
Similarly, you can use a naming pattern for your output files. If you wanted all of them to have "output" and "txt" in their names, you could use output*.txt:
IN Foo input*.txt
OUT Bar output*.txt
In this case, there are three input files that fit the naming pattern, so the program will run three times. Each time the program runs, it generates an output file called Bar. Legion then copies the contents of each Bar to your local directory and names them according to which job they came from: outputA.txt, outputB.txt, and outputC.txt.
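The way job names fall out of a naming pattern can be sketched in Python. This is an illustration of the idea, not Legion's own matching code: the functions job_name and output_name are hypothetical, the sketch assumes a pattern with exactly one "*" that matches at least one character, and notes.txt is an invented non-matching file.

```python
import re
from typing import Optional


def job_name(pattern: str, filename: str) -> Optional[str]:
    """Return the part of filename matched by '*', or None if it does not fit.

    Assumption: the pattern contains exactly one '*'.
    """
    if pattern.count("*") != 1:
        raise ValueError(f"sketch supports exactly one '*': {pattern!r}")
    prefix, _, suffix = pattern.partition("*")
    match = re.fullmatch(re.escape(prefix) + "(.+)" + re.escape(suffix), filename)
    return match.group(1) if match else None


def output_name(pattern: str, job: str) -> str:
    """Build an output file name by substituting the job name for '*'."""
    return pattern.replace("*", job, 1)


files = ["inputA.txt", "inputB.txt", "inputC.txt", "notes.txt"]
jobs = [j for j in (job_name("input*.txt", f) for f in files) if j is not None]
print(jobs)                                          # ['A', 'B', 'C']
print([output_name("output*.txt", j) for j in jobs])
# ['outputA.txt', 'outputB.txt', 'outputC.txt']
```

The job name is simply whatever the wildcard matched, which is why the same name can be dropped into the output pattern to pair each result with the input that produced it.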
How does a specification file work?
- A simple specification file looks something like this:
IN Foo /localdir/input*Stuff.1
OUT Bar /localdir/*output.1
The first line uses the IN keyword to indicate that any file(s) in local directory space that match the naming pattern in the pattern field are input files for this program. The filename Foo is the file name that the program will expect to use as input when it is running on a remote host. The pattern field indicates that all files whose paths fit the /localdir/input*Stuff.1 pattern should be considered input files for this program. Each possible input file is considered a job and is named according to the part of the file name that matches the wildcard. Thus /localdir/inputAStuff.1 would be job A, /localdir/inputBStuff.1 would be job B, etc.
The second line uses the OUT keyword to indicate that any file(s) in local directory space that match the naming pattern in the pattern field are output files for this program. The filename Bar is the name of the output file that the program will generate on each remote host when it is finished. Legion will then copy it back to your host and assign it a name according to the pattern. The name will correspond to the job name: if /localdir/inputAStuff.1 is the original input file, the output file will be named /localdir/Aoutput.1.
A step by step look at a specification file in action
- Suppose you have the following specification file:
IN Foo input*.txt
in Next /home/myContext/other*.stuff
OUT Bar output*.txt
out Done /home/myContext/finito*
CONSTANT Password /etc/passwd
constant User /home/admin/user-list
Each time the program runs, it will look for Foo, Next, Password, and User in the local directory of whichever host it runs on. It will also write files Bar and Done by the time it is finished. The specification file announces which files will be copied into the remote hosts and how they will get back to your local host.
- Legion will look first for files that match the in and then files that match the IN naming patterns. Suppose that it finds the following:
- It then discards any mismatched patterns.
This leaves five possible jobs: Alpha, Beta, Gamma, Zippo, and Groucho.
- Legion then looks for files that match the next keywords on the list, out and OUT:
- Any mismatched patterns are discarded.
You'll notice that the two discarded files match jobs Zippo and Groucho from step 2. It looks as if those jobs have been run before. The specification file, however, calls for two output files, so Legion assumes that these jobs have not yet successfully run.
- The matching output files match the Alpha job, so Legion assumes that Alpha has already successfully run and crosses it off the list.
That leaves four jobs that need to be run: Beta, Gamma, Zippo, and Groucho.
- Legion selects four hosts to run these jobs, HostA, HostB, HostC, and HostD. It then copies the constants and the sets of input files to the hosts, renaming them as the specification file instructs:
- The jobs run and generate output files Bar and Done on all four hosts. Legion then copies the output files back to your local host or your context space, following the naming pattern to name each set of outputs according to which job produced it. A job is not considered finished until the full set of output files has been generated and named according to the specification file.
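The selection logic in the steps above can be sketched in Python. This is one reading of the walkthrough, not Legion's implementation: it treats a "mismatched" name as one that does not appear under every pattern of its kind. The function names are hypothetical, and the job sets below (including the extra names Harpo and Chico) are invented to be consistent with the walkthrough.

```python
from typing import Dict, List, Set


def common_jobs(jobs_by_pattern: Dict[str, Set[str]]) -> Set[str]:
    """Keep only job names that appear under every pattern."""
    sets = list(jobs_by_pattern.values())
    return set.intersection(*sets) if sets else set()


def jobs_to_run(inputs: Dict[str, Set[str]],
                outputs: Dict[str, Set[str]]) -> List[str]:
    """Candidate jobs have a full set of inputs; finished jobs have a full
    set of outputs. Whatever is left still needs to run."""
    candidates = common_jobs(inputs)
    finished = common_jobs(outputs)
    return sorted(candidates - finished)


# Hypothetical job names found for each input pattern.
inputs = {
    "input*.txt": {"Alpha", "Beta", "Gamma", "Zippo", "Groucho", "Harpo"},
    "/home/myContext/other*.stuff": {"Alpha", "Beta", "Gamma", "Zippo", "Groucho", "Chico"},
}
# Hypothetical job names found for each output pattern: only Alpha has both.
outputs = {
    "output*.txt": {"Alpha", "Zippo"},
    "/home/myContext/finito*": {"Alpha", "Groucho"},
}

print(jobs_to_run(inputs, outputs))  # ['Beta', 'Gamma', 'Groucho', 'Zippo']
```

Zippo and Groucho each have one leftover output file, but because the specification file calls for two, they stay on the list of jobs to run.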
What is an exception file?
- An exception file is an addendum to a specification file. It specifies additional parameters for specific patterns. For example, continuing the previous example, your specification file comes up with the following jobs:
You have already provided input/output information for these jobs in the specification file and command-line flags. However, you may need to further fine-tune Zippo and Groucho (perhaps you want them to run on specific hosts or use a special input file). You can name these two jobs in an exception file and provide any additional instructions. For example:
Zippo -h /hosts/special_host_name
Groucho -v -IN /tmp/myJob/ExtraStuff
NOTE! These instructions are passed to legion_run or whichever command you have specified (with the -e flag). You are essentially providing extra command-line instructions for a particular job. The syntax in the exception file must be legal for that command and for that host. If you tell Legion to run the job on a specific host, make sure that it matches the program's architecture.
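As a rough illustration of how an exception file pairs job names with extra arguments, the Python sketch below splits each line into a job name and the tokens that follow it. The function parse_exception_file is hypothetical and is not part of Legion; the sketch assumes one job per line with whitespace-separated tokens and does no validation of the flags themselves.

```python
from typing import Dict, List


def parse_exception_file(text: str) -> Dict[str, List[str]]:
    """Map each job name to the extra command-line tokens given for it.

    Assumption: one job per line, job name first, whitespace-separated.
    """
    extras: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue  # skip blank lines
        extras[tokens[0]] = tokens[1:]
    return extras


example = """\
Zippo -h /hosts/special_host_name
Groucho -v -IN /tmp/myJob/ExtraStuff
"""

extras = parse_exception_file(example)
print(extras["Zippo"])    # ['-h', '/hosts/special_host_name']
print(extras["Groucho"])  # ['-v', '-IN', '/tmp/myJob/ExtraStuff']
```

Jobs that do not appear in the file get no extra arguments, which matches the idea that the exception file only adjusts the jobs it names.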
What happens if a job fails?
- By default, Legion will simply mark the job as incomplete and abandon it. If you use the -r flag, though, it will restart the job on a selected host (from the same list used when the job was first attempted).
How do I control where the program runs?
- If you want to control placement of specific jobs, use an exception file to specify architectures or hosts. Otherwise, use a schedule file to set up a pool of possible remote hosts.
I need an empty input file to start my program, but Legion ignores it
- Use the -z flag. It allows zero-sized input files.
I need an empty output file to signal that my program is done, but Legion is marking the job as incomplete
- Use the -z flag. It allows zero-sized output files.
Can I redirect legion_run_multi's output to a file?
- Yes, by resetting stdout in the specification file. This is especially helpful if you are using the -debug option.
Can I redirect a particular job's -debug output to a file?
- Yes. In an exception file, give the job name, assign it a remote host, and name a file to catch the job's debug output. For example:
ABC -h /hosts/centurion007.cs.virginia.edu -debug > debug.ABC
This line says that job ABC will run on centurion007.cs.virginia.edu and that its debug output will be directed to a file called debug.ABC.
Can I use the debug option to debug my program?
- No. The -debug flag will give you tons of debugging information about legion_run_multi (i.e., about the Legion objects that are working to remotely start your program), but not about the program itself or its input and output files.
Some hints to make life easier
- Be careful not to run the program on the wrong architecture.
- Avoid asking for conflicting architectures. Be sure to check the specification file, the schedule file, and the exception file (if you use one).
- If you use an exception file, be sure to use legal syntax. Remember that the information in an exception file is passed to the remote host and given directly to legion_run (or whichever command you are running).
- If you use an exception file and a schedule file, check that any hosts named in the exception file also appear in the schedule file.
- Double-check your input and output filenames.
- A program can fail for multiple reasons, including internal bugs. Try to be sure that your program is bug-free before you run it.
- We suggest that you always use -v, especially if the program will be running for more than a few seconds.