UVACSE

From VCGR Wiki



Printing Code

 enscript -2G -e4 -r2 -T 4 --mark-wrapped-lines=arrow --file-align=2 -o <outputfile> <filename>
 ps2pdf <outputfile>
  • only works on a double-sided printer

Getting STDOUT/STDERR

1. Add <Output> and <Error> elements to the <POSIXApplication> element after the <Executable> element and before the <Argument> elements.

<POSIXApplication xmlns="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
  <Executable>Sample.ext</Executable>
  <Output>stdout</Output>
  <Error>stderr</Error>
  <Argument>arg</Argument>
</POSIXApplication>

2. Stage the stdout and stderr files out like any other output file.

<DataStaging>
 <FileName>stdout</FileName>
 <CreationFlag>overwrite</CreationFlag>
 <DeleteOnTermination>true</DeleteOnTermination>
 <Target><URI>rns:/GRID_PATH_WHERE_TO_SAVE_FILE/DESIRED_FILE_NAME</URI></Target>
</DataStaging>

<DataStaging>
  <FileName>stderr</FileName>
  <CreationFlag>overwrite</CreationFlag>
  <DeleteOnTermination>true</DeleteOnTermination>
  <Target><URI>rns:/GRID_PATH_WHERE_TO_SAVE_FILE/DESIRED_FILE_NAME</URI></Target>
</DataStaging>


Maintaining Reza's Jobs

Reza's 100 Windows Jobs (grid user: sarsara)

Login as gbg@camillus

Grid export contains job files located at /localtmp/gbg/rezaExport/convrg

  • 50 jobs in composite directory
  • 50 jobs in quadruple directory
  • Files located in these directories are accessible via the grid for staging in and out.

These are month-long jobs that have been partitioned into one- to two-hour chunks. When a job chunk runs successfully, it writes two files to an Action directory indicating that the next job chunk can be started.

  • *.act file indicating next job chunk to start
  • *.act.done file indicating job was indeed completed (as sometimes *.act file is created regardless of whether the job actually ran successfully)
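
The two marker files suggest a simple consistency check: any *.act file without a matching *.act.done file may belong to a chunk that did not really finish. The following is a minimal sketch of that idea, not the actual progressFix.2.sh script. The function name is hypothetical, and it assumes both files sit in the same directory with the .done file named exactly "<name>.act.done".

```shell
# List *.act files in a directory that have no matching *.act.done file.
# Assumption (not from the notes): both marker files live side by side
# and the confirmation file is named "<name>.act.done".
list_unconfirmed_actions() {
    dir=$1
    for act in "$dir"/*.act; do
        # Skip the literal pattern when the directory has no *.act files.
        [ -e "$act" ] || continue
        if [ ! -e "$act.done" ]; then
            basename "$act"
        fi
    done
}
```

It could be called as list_unconfirmed_actions /localtmp/gbg/jobDispatcher/Action to get a list of chunks worth a closer look.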

To be Done Regularly

Sanity check script to run every couple of days to check whether any jobs are STUCK

    >cd /localtmp/gbg/jobDispatcher
    >./progressFix.2.sh


Job Scheduling Infrastructure

  • located at /localtmp/gbg/jobDispatcher
  • Action directory contains the files of jobs to be started as described above
  • submitter.sh is the job dispatching script.
  • Every couple of hours, submitter.sh concats to output.txt the current iteration of all jobs.

Output files created during Job Dispatching:

  • /localtmp/gbg/jobDispatcher/avg.out - contains averages of each job set's current iteration every two hours
  • /localtmp/gbg/jobDispatcher/output.txt - contains the current iteration for each job
  • /localtmp/gbg/jobDispatcher/proress.out - contains output for which jobs were found to be stuck

Stopping & Restarting from current iteration

1. Stop submitter.sh script

      Get the script's process ID:
         >ps aux | grep submitter.sh
         >kill <pid> (there might be multiple PIDs)

2. Delete/Complete sarsara jobs from the queue (check with Reza whether he has any other jobs) via the qmgr tool in the grid if you have access to trajan

         start grid.sh
         >qmgr /queues/grid-queue

If you have no access to trajan, start the grid client on camillus, save the qstat output (which lists all of Reza's jobs), then write a bash script that kills or completes each job

         >~/XCG/GenesisII/grid
         >qstat /queues/grid-queue > qstat.file
         for each job id in file, 
         if running or requeued
         >qkill /queues/grid-queue <job id>
         >qcomplete /queues/grid-queue <job id>
         if finished
         >qcomplete /queues/grid-queue <job id>
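
The per-job logic above can be sketched as a small shell function that prints the grid shell commands to run rather than running them. This is only a sketch under assumptions: the two-column "<job-id> <state>" input and the lowercase state names are hypothetical, and the real qstat output would first have to be reduced to that form. The function name is illustrative as well.

```shell
# Read "<job-id> <state>" lines on stdin and print the grid shell
# commands that would clean each job out of the given queue.
# Assumption: the input format and state names are hypothetical; real
# qstat output needs to be parsed into this shape first.
emit_queue_cleanup() {
    queue=$1
    while read -r job_id state; do
        case $state in
            running|requeued)
                # Kill first, then mark the job complete.
                echo "qkill $queue $job_id"
                echo "qcomplete $queue $job_id"
                ;;
            finished)
                echo "qcomplete $queue $job_id"
                ;;
        esac
    done
}
```

Printing the commands first makes it possible to review the list before pasting it into the grid shell.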
         

3. Clean Action directory (when no more files are trickling into Action directory, all jobs have been deleted)

         >cd /localtmp/gbg/jobDispatcher/Action
         >rm *

4. Restart jobs

        a. regenerate jsdl and action files for current iterations of jobs
                  >cd /localtmp/gbg/jobDispatcher/Scripts
                  >./restartAll.sh
        b. start submitter.sh to get jobs submitted to grid
                  >cd /localtmp/gbg/jobDispatcher
                  >nohup submitter.sh > sub.out 2> sub.err < /dev/null &

Stopping & Restarting from zero

1. Stop submitter.sh script

      Get the script's process ID:
         >ps aux | grep submitter.sh
         >kill <pid> (there might be multiple PIDs)

2. Delete/Complete sarsara jobs from the queue (check with Reza whether he has any other jobs) as above

3. Clean Action directory (when no more files are trickling into Action directory, all jobs have been deleted)

4. Save old output and then delete the files

         >cd /localtmp/gbg/rezaExport/convrg
         >tar -czf composite.DATE.tar.gz composite
         >tar -czf quadruple.DATE.tar.gz quadruple
         >cd /localtmp/gbg/jobDispatcher/Scripts
         >./rmResults.sh /localtmp/gbg/rezaExport/convrg/composite
         >./rmResults.sh /localtmp/gbg/rezaExport/convrg/quadruple

5. Delete old JSDL XML files

         >cd /localtmp/gbg/jobDispatcher/Scripts
         >./cleanXML.sh quad
         >./cleanXML.sh comp

6. Regenerate initial input files

         >cd /localtmp/gbg/jobDispatcher/Scripts
         >./addOuts.sh /localtmp/gbg/rezaExport/convrg/composite
         >./addOuts.sh /localtmp/gbg/rezaExport/convrg/quadruple

7. Generate JSDL XML files and start jobs from zero

        a. generate jsdl and action files from zero
                  >cd /localtmp/gbg/jobDispatcher/Scripts
                  >./startAll.sh
        b. start submitter.sh to get jobs submitted to grid
                  >cd /localtmp/gbg/jobDispatcher
                  >nohup submitter.sh > sub.out 2> sub.err < /dev/null &


XCG Install Issues

  • Security attribute is not yet valid
This is caused by clock skew; make sure the machine's time is up to date.
  • Container.log only contains two lines. No services start.
In wrapper.log you will see an "info" message about the port being in use. This means another Java XCGContainer process is already running and occupying the port, so the new container cannot start.

Restarting Reza's Jobs

  • Stop jobs via qmgr tool and kill job submitter script
  • Save old results and post link
  • Cleanup Actions folder once all jobs stop running
  • Remove old result files
 ./rmResults.sh <location>
  • Cleanup XML
 ./cleanXML.sh <location>
  • Create XML and scripts for new set of jobs
 ./startAll.sh
  • Check job count in Actions
  • Start submitter script

Adding Desktop BESes to Grid

1. Follow the Installation guide to install a container

An XCG identity is needed so you can install your grid container as your own
Choose the correct installer for your system
Download the installer from the website
Choose the full container option; leave the remaining options at their defaults


2. Send name of your newly installed container to uvacse for initialization. To confirm the name of your machine as perceived by the grid:

  • From the grid shell, list the containers under the uninitialized-containers directory
 $> login ...
 $> ls /uninitialized-containers
  • determine which is your machine and send name to uvacse@virginia.edu

3. Once your container is initialized, you will need to create the BES resource that will enable jobs to be run on your container

  • From the grid shell, run the besmanager tool
 $> besmanager
  • send chosen BES resource location to admin
your resource will be added to the queue
  • configure BES options
choose to have jobs run on your machine anytime or only when you are NOT logged in
  • From the grid shell, run the besmanager tool

Creating updated BES Resources with correct memory specifications

Update Properties File

  • get sample pbs bes properties files from repository
  • uncomment:
edu.virginia.vcgr.genii.native-q.resource-override.physical-memory
  • set to memory value in bytes:
 2 gigabytes = 2.147483648E9 bytes
 3 gigabytes = 3.221225472E9 bytes
 4 gigabytes = 4.294967296E9 bytes
 24 gigabytes = 2.5769803776E10 bytes
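
The values above are binary gigabytes (1 gigabyte = 1024 * 1024 * 1024 bytes). A minimal sketch for computing other sizes follows. The function name is hypothetical, and it prints a plain integer, whereas the notes above write the value in scientific notation; whether the property accepts the plain integer form is not confirmed here.

```shell
# Convert a whole number of binary gigabytes to bytes.
# Assumption: the shell uses 64-bit integer arithmetic.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}
```

For example, gib_to_bytes 24 prints 25769803776, which matches the 2.5769803776E10 entry in the list.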

Create BES Resources

  • unlink bes resources from queue
 cd /queues/grid-queue/resources
 unlink itc-smallmem-pbs.from_elder
 unlink generals-pbs.from-camillus
 unlink centurion-pbs.from-camillus
 unlink sunfire-pbs.from-camillus
 unlink radio-pbs.from-camillus
  • rm bes resources from path (not successful - instead created differently named updated resources)
 cd /bes-containers/UVA/CS/GBG/Generals
 cd /bes-containers/UVA/ITC/Cluster/
  • create new bes resources using updated files
 cd /containers/UVA/CS/GBG/Generals/camillus/Services
 create-resource --creation-properties=props/centurion-pbs.properties --rns GeniiBESPortType /bes-containers/UVA/CS/GBG/Generals/pbs-centurion.from-camillus
 create-resource --creation-properties=props/generals-pbs.properties --rns GeniiBESPortType /bes-containers/UVA/CS/GBG/Generals/pbs-generals.from-camillus
 create-resource --creation-properties=props/radio-pbs.properties --rns GeniiBESPortType /bes-containers/UVA/CS/GBG/Generals/pbs-radio.from-camillus
 create-resource --creation-properties=props/sunfire-pbs.properties --rns GeniiBESPortType /bes-containers/UVA/CS/GBG/Generals/pbs-sunfire.from-camillus
 cd /containers/UVA/ITC/Cluster/elder.itc.virginia.edu/Services
 create-resource --creation-properties=props/smallmem-pbs.properties --rns GeniiBESPortType /bes-containers/UVA/ITC/Cluster/pbs-smallmem.from-elder
 create-resource --creation-properties=props/lc4-pbs.properties --rns GeniiBESPortType /bes-containers/UVA/ITC/Cluster/pbs-lc4.from-elder
  • Add permission so all can read and submit jobs to bes resources
 chmod /bes-containers/UVA/CS/GBG/Generals/pbs-centurion.from-camillus +r --everyone
 chmod /bes-containers/UVA/CS/GBG/Generals/pbs-centurion.from-camillus +x /groups/uva-idp-group
 chmod /bes-containers/UVA/CS/GBG/Generals/pbs-centurion.from-camillus +x /groups/uva-standard-assurance-new
 ...


  • add updated resources back to queue
 ln /bes-containers/UVA/CS/GBG/Generals/pbs-radio.from-camillus /queues/grid-queue/resources/pbs-radio.from-camillus
 ln /bes-containers/UVA/CS/GBG/Generals/pbs-sunfire.from-camillus /queues/grid-queue/resources/pbs-sunfire.from-camillus
 ln /bes-containers/UVA/CS/GBG/Generals/pbs-centurion.from-camillus /queues/grid-queue/resources/pbs-centurion.from-camillus
 ln /bes-containers/UVA/CS/GBG/Generals/pbs-generals.from-camillus /queues/grid-queue/resources/pbs-generals.from-camillus
 ln /bes-containers/UVA/ITC/Cluster/pbs-smallmem.from-elder /queues/grid-queue/resources/pbs-smallmem.from-elder
  • configure resources slots
 qconfigure /queues/grid-queue pbs-radio.from-camillus 32
 qconfigure /queues/grid-queue pbs-sunfire.from-camillus 32
 qconfigure /queues/grid-queue pbs-centurion.from-camillus 64
 qconfigure /queues/grid-queue pbs-generals.from-camillus 24
 qconfigure /queues/grid-queue pbs-smallmem.from-elder 24

How to Test a Port is Open

http://www.wfu.edu/~borwicjh/howto/port-test.html

 telnet <server-name> PORTNUMBER
 telnet modular.math.jmu.edu 18443

UNIX system info

  • /proc/meminfo
  • /proc/cpuinfo
  • uname -a

Installing MPICH2

  • acquired source from
 http://www.mcs.anl.gov/research/projects/mpich2/
  • acquired script for configuring with options from Bill Pemberton
#!/bin/sh

# options used on our cluster with mpich2-1.0.8p1.tar.gz
# change the --prefix option in the configure line

OPTIMIZATION_FLAG="-O3 -fno-strict-aliasing"
CONFIG_ENABLE_F77="--enable-f77"
CONFIG_ENABLE_F90="--enable-f90"
export CC=gcc
export CXX=g++
export FC=gfortran
export F77=gfortran
export F90=gfortran
export CFLAGS="-Wall"
export FFLAGS="-fPIC"
export CXXFLAGS="-fPIC"
export F90FLAGS=""
export CONFIG_FLAGS="--enable-debuginfo --enable-totalview"

export CFLAGS="$CFLAGS $OPTIMIZATION_FLAG -fPIC -g"
export USER_CFLAGS
export MPE_OPTS
export MPE_CFLAGS
export LDFLAGS

./configure --enable-sharedlib --prefix=/home/griduser/mpich2 $CONFIG_ENABLE_F77 $CONFIG_ENABLE_F90 $COMPILER_CONFIG $EXTRA_CFLAG $CONFIG_FLAGS

make
  • followed README file to install
  • followed CS wiki to setup passwordless login to nodes
 https://www.cs.virginia.edu/~csadmin/wiki/index.php/PBS#Logging_on_to_the_Cluster

Updating JSDL Tool

  1. Checkout from svn at
 svn://camillus.cs.virginia.edu:9002/JSDLTool/trunk
  2. Commit changes in svn and get jar file
  3. Place updated jar file in vcgr account at
 public_html/XCG/2.0/downloads/JSDLTool.jar

Headnode Issues

mpdcheck -l reveals:

   **********
   Your unqualified hostname resolves to 127.0.0.1, which is
   the IP address reserved for localhost. This likely means that
   you have a line similar to this one in your /etc/hosts file:
   127.0.0.1   $uqhn
   This should perhaps be changed to the following:
   127.0.0.1   localhost.localdomain localhost
   **********

Since this is not a file I can change, I need to not include the headnode in the ring?

SVN Commands

Creating Branches

  • svn copy SRC DST
Copies the source to the destination. This is not a simple file copy but a repository copy (a special cheap copy that preserves history). This is the main way of creating branches and tags.

To create the XsPrivateCopy branch, you would issue the command:

 svn copy svn://camillus.cs.virginia.edu:9002/GenesisII/trunk svn://camillus.cs.virginia.edu:9002/GenesisII/branches/XsPrivateCopy
 svn copy svn://camillus.cs.virginia.edu:9002/GenesisII/trunk svn://camillus.cs.virginia.edu:9002/GenesisII/branches/karolina/JSDLupdate4Mem

Checking out

Check out the main trunk of the Genesis II repository into a local directory called GenII

 svn checkout svn://camillus.cs.virginia.edu:9002/GenesisII/trunk GenII

Check out branch XsPrivateCopy into a local directory called genii

 svn checkout svn://camillus.cs.virginia.edu:9002/GenesisII/branches/XsPrivateCopy genii

Merging Source in SVN

Merging a branch with the trunk:

  • checkout trunk
 svn checkout svn://camillus.cs.virginia.edu:9002/GenesisII/trunk trunk
  • checkout branch
 svn checkout svn://camillus.cs.virginia.edu:9002/GenesisII/branches/<branchPath> branch
  • get the revision number at which the branch was created (this is the last r<NUM> in the output)
 svn log --stop-on-copy -v svn://camillus:9002/GenesisII/branches/<branchPath> 
  • get latest changes from repository in trunk checkout and note change number
 svn update

  • merge branch into trunk (called from trunk checkout)
 svn merge -r <branchNum>:<repoNum> svn://camillus.cs.virginia.edu:9002/GenesisII/branches/<branchPath>

Merging after implementing updates to propagating memory request to PBS script.

 svn merge -r 1377:1383 svn://camillus.cs.virginia.edu:9002/GenesisII/branches/karolina/JSDLupdate4MemPBS
  • see updated files that will be checked in on commit in trunk checkout
 svn status
  • remove file from being checked-in in trunk checkout
 svn revert <filename>
  • run tests to confirm all is working
 ant clean
 ant
  • finalize changes into repository in trunk checkout
 svn commit 
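
Picking the last r<NUM> out of the log by eye is easy to get wrong on a long branch. A minimal sketch for extracting it is shown below. It assumes the log is produced with the -q option, so that each revision appears as a single line starting with r<NUM>. The function name is hypothetical.

```shell
# Read `svn log --stop-on-copy -q` output on stdin and print the number
# of the last (oldest) revision listed, i.e. where the branch was copied.
# Assumption: revision lines start with "r<NUM> ", as in `svn log -q`.
branch_start_revision() {
    awk '/^r[0-9]+ / { rev = $1 } END { sub(/^r/, "", rev); print rev }'
}
```

It would be used as: svn log --stop-on-copy -q svn://camillus.cs.virginia.edu:9002/GenesisII/branches/<branchPath> | branch_start_revision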

Testing BES/JSDL Changes for Memory Based Scheduling

Bootstrap test grid

script deployments/default/configuration/bootstrap.xml

Login and confirm

login deployments/default/security/keys.pfx

Create queue

 cd /containers/BootstrapContainer/Services
 create-resource --rns QueuePortType /testq
 ln /bes-containers/BootstrapContainer /testq/resources/bs

Copy test files into grid

cp --local-src me/javaHW.xml /
cp --local-src me/hw.java /home/
cp --local-src me/hw.class /home/
cp --local-src me/run.bat /home/
cp --local-src javaHW.xml /
cp --local-src hw.java /home/
cp --local-src hw.class /home/
cp --local-src run.sh /home/

Run job directly on resource

 run --jsdl=me/javaHW.xml /bes-containers/BootstrapContainer
 run --jsdl=me/javaHW2.xml /bes-containers/BootstrapContainer
 run --jsdl=javaHW.xml /bes-containers/BootstrapContainer

Run job via queue

 cp javaHW.xml /testq/submission-point
 cp javaHW2.xml /testq/submission-point

Check results

 cat /home/stdout
 cat /home/stderr

Memory request values

 1 MB = 1000000
 105 GB = 112345678901
 <TotalPhysicalMemory>
   <UpperBoundedRange>1000000</UpperBoundedRange>
 </TotalPhysicalMemory>

Testing BES/JSDL Changes

Bootstrap test grid

script deployments/default/configuration/bootstrap.xml

Login and confirm

login deployments/default/security/keys.pfx

Create PBS BES resource

 cd /containers/BootstrapContainer/Services
 create-resource --creation-properties=mpi-test.properties --rns GeniiBESPortType /bes-containers/power2-pbs
 create-resource --creation-properties=me/pbs.properties --rns GeniiBESPortType /bes-containers/pbs
 ln /bes-containers/pbs /testq/resources/pbs
 chmod /bes-containers/pbs +r+x --everyone

Copy test file into grid

cp --local-src  mpiTest.o /home/mpiTest.o

Run test

 run --jsdl=mpiTestPPN.xml /bes-containers/power2-pbs
 run --jsdl=javaHW.xml /bes-containers/pbs
 run --jsdl=me/javaHW2.xml /bes-containers/pbs

Check results

 cat /home/stdout
 cat /home/stderr

PBS Info

Command examples

  • qsub -I - submit an interactive job
  • qsub -q queue - submit a job directly to a specified queue
  • qstat - list information about queues and jobs
  • qstat -q - list all queues on system
  • qstat -Q - list queue limits for all queues
  • qstat -a - list all jobs on system
  • qstat -au userid - list all jobs owned by userid
  • qstat -s - list all jobs with status comments
  • qstat -r - list all running jobs
  • qstat -f jobid - list all information known about specified job
  • qstat -Qf queue - list all information known about specified queue
  • qstat -B - list summary information about the PBS server
  • qdel jobid - delete specified job
  • qalter jobid - modify the attributes of the job or jobs specified by jobid
  • pbsnodes -a - list all nodes that are currently available and some node characteristics
  • pbsnodes -l - list all nodes that are currently offline

man PBS_Resources

  • nodes - Number and/or type of nodes to be reserved for exclusive use by the job.

VI Replacing

Replace all occurrences of ten with 10

:1,$s/ten/10/g
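
The same substitution can be run outside the editor with sed, which is handy for scripted edits. This is a minimal sketch with a hypothetical function name. Note that the pattern, like the vi command above, also matches inside longer words, so "often" becomes "of10".

```shell
# Replace every occurrence of "ten" with "10" in the text read from stdin.
# The pattern is not anchored to word boundaries, so it also matches
# inside longer words.
replace_ten() {
    sed 's/ten/10/g'
}
```

For a file, this would be run as replace_ten < input.txt > output.txt, where both file names are placeholders.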

Setting Up Hess@VT

Config Steps

  1. Installed grid
  2. Compiled R and added to PATH
  3. Configured container
     script configure-containers.xml hess.arc.vt.edu /containers/VT hess.arc.vt.edu
  4. Created bes resource for pbs queue
     create-resource --creation-properties=hess.properties --rns GeniiBESPortType /bes-containers/VT/hess.arc.vt.edu
     chmod /bes-containers/VT/hess.arc.vt.edu +r --everyone
     chmod /bes-containers/VT/hess.arc.vt.edu +x /groups/uva-idp-group
     chmod /bes-containers/VT/hess.arc.vt.edu +x /groups/uva-standard-assurance-new
  5. Ran test R job on BES resource
     run --jsdl=singleTestL.xml /bes-containers/VT/hess.arc.vt.edu
  6. Added BES resource to grid queue and configured slots
     ln /bes-containers/VT/hess.arc.vt.edu /queues/grid-queue/resources/vt-pbs.from-hess
     qconfigure /queues/grid-queue vt-pbs.from-hess 16

PBS Node info for hess as of 6/30/09

  • 96 total nodes
  • 53 free nodes
  • 25 busy nodes
  • 12 offline nodes

Adding Matching Params

  • Syntax for adding matching parameter for scheduling
 matching-params /bes-containers/<path> "add(<param>, <value>)"
 matching-params /bes-containers/UVA/ITC/Cluster/elder-smallmem "add(supports-mpi, true)"

Grid-Queue Resource Status

Slots vs Available CPUs as of 6/29/2009

  • Radio-pbs 32/52
  • Centurion-pbs 64/128
  • Sunfire-pbs 32/44
  • Generals-pbs 24/32
  • THNA233 2/2
  • trajan 8/8
  • pompey 8/8

Persistent Datastaging

Changes to JSDL:

0

 <DataStaging>
   <FileName>R-2.9.1-win32.exe</FileName>
   <CreationFlag>dontOverwrite</CreationFlag>
   <DeleteOnTermination>false</DeleteOnTermination>
   <Source><URI>rns:/home/sarsara/csExport/R-2.9.1-win32.exe</URI></Source>
 </DataStaging>


1

 </JobName>(after)
 <JobAnnotation>R-2.9.1</JobAnnotation>


2

 (before)<OperatingSystem>
 <FileSystem name="SCRATCH"/>


3

 (before)</POSIXApplication>
 <Argument filesystemName="SCRATCH">R-2.9.1-win32.exe</Argument>


4

 (before)<CreationFlag>dontOverwrite
 <FilesystemName>SCRATCH</FilesystemName>

Declaring Persistent Datastaging in JSDL

1. Choose a global name for jobs that will be using the persistent files. After the <JobName> element, insert the following element declaring this name. This element should be included by all jobs that will be using the persistent files.

<JobAnnotation>GLOBAL_JOB_NAME</JobAnnotation>


2. In the <DataStaging> element that declares the staging-in of a file that is to persist, add a <FilesystemName> element and set the <CreationFlag> element to dontOverwrite and the <DeleteOnTermination> element to false. These changes indicate that the file should not be downloaded if a file with that name already exists on disk.

<DataStaging>
  <FileName>FILE_NAME</FileName>
  <FilesystemName>SCRATCH</FilesystemName>
  <CreationFlag>dontOverwrite</CreationFlag>
  <DeleteOnTermination>false</DeleteOnTermination>

  <Source><URI>rns:PATH_TO_FILE_IN_GRID</URI></Source>
</DataStaging>

3. Before the <OperatingSystem> element under the <Resources> element, insert the following element declaring that we are using the filesystem.

<FileSystem name="SCRATCH"/>

4. If you are passing the name of a persistent file as an argument to your program, insert filesystemName="SCRATCH" in that argument's XML tag to indicate it should be pulled from the filesystem if possible.

<Argument filesystemName="SCRATCH">FILE_NAME</Argument>


Creating Lightweight Export with Security for User

export --create /containers/camillus/Services/LightWeightExportPortType /localtmp/gbg/rezaExport /home/sarsara/tmpExport
 vcgr:$>chmod /containers/camillus +rx /users/sarsara
 vcgr:$>chmod /containers/camillus/Services +rx /users/sarsara
 vcgr:$>chmod /containers/camillus/Services/LightWeightExportPortType +rx /users/sarsara
 
chmod <export dir> +rwx <user>
export --create /containers/UVA/CS/GBG/Generals/sulla/Services/LightWeightExportPortType /localtmp/gbg/tester /home/tester/sullaExport
export --create /containers/UVA/CS/GBG/Generals/camillus/Services/LightWeightExportPortType /localtmp/gbg/rezaExport /home/sarsara/csExport

Container Log Locations

/GeniiNet/GeniiNetInstall/ext/JavaServiceWrapper/bin/container.log
or
/GeniiNet/GeniiNetInstall/container.log

Poster Printing

Printer Properties>Options>ConfigName>Portrait D 24 x 36

  • Ensure paper size is set to 24 x 36.
  • Saving to jpeg/image formats can flatten layers.
  • Always look at print preview mode to check on margins.

Nohup

 nohup myprogram > foo.out 2> foo.err < /dev/null &
 nohup submitter.sh > sub.out 2> sub.err < /dev/null &

SVN CheckUp

If SVN is not responding in Eclipse, check via ps aux whether the SVN server is running on Camillus on port 9002. If it is not running, restart SVN using the startSVN script located in the gbg home directory. Run it only on Camillus.

Connecting to XCG Grid

 connect http://www.cs.virginia.edu/~vcgr/GeniiNet/root.xml GeniiNetFullContainer

Installing Genii containers that point to pbs queue on cluster front ends

Setting up PBS Queue on Elder, Cedar, & Dogwood

Installing Genii

  1. Removed clean.sh if applicable
  2. Created local directory for genii state and linked to .genesisII where applicable
 ~/.genesisII-2.0 -> /export/vanet/genesisII-state
  3. Created local directory for install and created link
  ~/XCG/GenesisII -> /export/vanet/GeniiNet2_0Install
  4. Downloaded newest installer from website and ran on commandline to install under ~/XCG/GenesisII
  ./XCG-installer.sh -c
  5. Checked that container is running
  ps aux | grep java

Configuring Containers

Examined old script (configure-containers.xml) and picked out the functions such that the itc cluster nodes will have their security initialized and will be linked at the correct permanent location. New script is located on cicero at ~/XCG/GenesisII/KarsScripts/configure-containers.xml.

Functions for the new configuration script:

  • <function name="makeDirectories"> - Creates directory with passed in path under /containers ; path may already exist
  • <function name="configureContainerService"> - Gives read rights to everyone; gives read and execute rights to all under groups
  • rmUndo, unlinkUndo, attempUndo - remove/unlink/undo function
  • <function name="configureContainer"> - sets read rights for everyone on uninitialized containers and their services; unlinks from uninitialized containers and links to /containers (relativePath - path where container should appear under /containers)
  • <function name="configureContainers"> (containerPattern - arg1, relativePath - arg2) - calls makeDirectories for relative path; configures each container; If 3 arguments, configureContainers function is called.

Commandline arguments:

 <container-pattern> <relative-path-name> <container-link-name>

  • <container-pattern> - name of container under /uninitialized-containers
/uninitialized-containers/<container-pattern>
  • <relative-path-name> - where to place container under /containers
/containers/<relative-path-name>
  • <container-link-name> - what to name new container
/containers/<relative-path-name>/<container-link-name>

Note: It took over 10 minutes for each container to be configured. Having trouble with lc3 aka dogwood. Keeps failing while security is being set for container services. Reinstalling fixed problem.

Creating PBS BES Resources

A pbs.properties file needs to be created for each new BES resource. Example files under SVN in example-nativeq directory. Also under ~/XCG/GenesisII/KarsScripts on Cicero.

For ITC Clusters, copied file as /itc-pbs.properties and changed the following per emailing with Katherine.

  • qname = workq
  • mpich1 commandline-args=-comm mpich-p4

Under each new containers services, created resource as follows:

>create-resource --creation-properties=example-nativeq/itc-pbs.properties --rns GeniiBESPortType /bes-containers/UVA/ITC/<node_name>-workq

Command syntax:

 create-resource [--creation-properties=<properties-file>] [--rns | --url] <service-path-or-url> [<new-rns-path>]
  • Add permission so all can read and submit jobs to bes resource.
>chmod /bes-containers/UVA/ITC/<node_name>-workq +r --everyone
>chmod /bes-containers/UVA/ITC/<node_name>-workq +x /groups/uva-idp-group
>chmod /bes-containers/UVA/ITC/<node_name>-workq +x /groups/uva-standard-assurance-new

Testing new PBS BES Resources

Ran sample job to confirm newly added queues are working.

>run --jsdl=example-jsdl/example-posix-jsdl.xml <queue_name>-workq

  • cedar: success
  • dogwood: success
  • elder: success

Create New "Grid-Queue" Queue

8/7/9 - Creating new queue to resolve queue issues.

1. Under titus container services, create new queue resource. There is no requirement to do this on titus. Any container that provides a QueuePortType will do. We are choosing titus as this container was chosen to host the grid-queue previously.

Command syntax: create-resource [--creation-properties=<properties-file>] [--rns | --url] <service-path-or-url> [<new-rns-path>]

>create-resource --rns QueuePortType /queues/grid-queue

2. Add resources to which the queue will be submitting jobs by linking in the desired bes-containers. In this case we want the grid-queue to submit to all available resources. We link in the bes-containers under the queue's resources.

Command Syntax: ln <source-path> [<target-path>]

>ln /bes-containers/<resource-path> /queues/grid-queue/resources/<resource-name>

3. Add permission so all can read and submit jobs to queue.

>chmod /queues/grid-queue +r --everyone
>chmod /queues/grid-queue +x /groups/uva-idp-group
>chmod /queues/grid-queue +x /groups/uva-standard-assurance-new

4. Configure queue to allow only a set number of slots on each individual resource (the values used are listed below)

>qconfigure /queues/grid-queue <resource> <# of slots>


 vt-pbs.from-hess(24) 
 generals-pbs.from-camillus(24) 
 centurion-pbs.from-camillus(64) 
 trajan-win(8)
 pompey-win(8) 
 sunfire-pbs.from-camillus(32) 
 radio-pbs.from-camillus(32)
 THN...(2)

5. Time to completion: 1 hour

Create New "Windows Short" Queue

Create a queue for short windows jobs that submits to trajan.

1. Under existing trajan container services, create new queue resource. There is no requirement to do this on trajan. Any container that provides a QueuePortType will do.

Command syntax: create-resource [--creation-properties=<properties-file>] [--rns | --url] <service-path-or-url> [<new-rns-path>]

>create-resource --rns QueuePortType /queues/windows-short-queue

2. Add resource to which queue will be submitting jobs by linking in desired bes-container. In this case we want the windows-short-queue to submit all jobs to trajan so we link in trajan's bes-container under the queue's resources.

Command Syntax: ln <source-path> [<target-path>]

>ln /bes-containers/trajan.cs.virginia.edu /queues/windows-short-queue/resources/trajan

3. Add permission so all can read and submit jobs to queue.

>chmod /queues/windows-short-queue +r --everyone
>chmod /queues/windows-short-queue +x /groups/uva-idp-group
>chmod /queues/windows-short-queue +x /groups/uva-standard-assurance-new

4. Configure queue to allow only a set number (in this case 8) of slots from one individual

>qconfigure /queues/windows-short-queue trajan 8

5. Time to completion: 1 hour

Restarting Genii Containers

After restarting a container, connect to the Grid and ensure you can ping the container. If not, consult container log (container.log) for debugging info. This is located in the installation directory.

Check Status/Restart Windows Container

Pull up the Services dialog under control panels. Locate the Genesis II Container Service. Its status will be indicated. If not running, you can left click and choose the restart option.

Restart Linux Container

Log into the machine and cd into install directory. Run the manage script with the status or start option:

>./Genesis-manage-script status|start

Queue Info

Bes-Containers

vcgr:$>ls /bes-containers bes-containers:

  • BootstrapContainer
  • UVA
  • camillus
  • centurion001-pbs
  • pompey.cs.virginia.edu
  • romulus.cs.virginia.edu
  • sulla
  • titus
  • trajan.cs.virginia.edu

Linux Queue

vcgr:$>ls /queues/linux-queue/resources resources:

  • titus
  • dogwood-workq
  • camillus
  • sulla
  • cedar-workq
  • centurion001-pbs

Windows R Queue

Hosted on titus

Memory Configs

A currently working scheme for heap memory limits:

  • Cicero container with Linux-queue: 1024MB
  • Centurion container with PBS BES: 1024MB
  • Camillus container with Test-queue: 1024MB
  • Dogwood(lc3)/Cedar(lc2)/Elder(lc4) with PBS BES: 1024MB

Login Info

  • Default login time of 1 week instituted 3/23.
  • Get verbose login info via whoami and verbosity flag.

Setting Permissions for New Users

vcgr:$>chmod /groups/uva-idp-group null +x /home/<user>

Blas Package Installing

Ways to list what's installed

  • locate blas
  • rpm -qa | grep -i blas
  • yum list installed | grep -i lapack

Debugging Reza's Jobs & Linux Queue

03/23 - job resubmission after two new patches to resolve bulk qstats and default login times; installing patches with Jes

Creating Resolver for Replicated Directories

Creating Resolvers

Locating Files

Differences between unix cmds:

  • which - shows the full path of (shell) commands
  • locate - provides a way to index and quickly search for files on your system
  • find - search for files in a directory hierarchy

Running MPI on ITC Clusters

Following information pulled from ITC site.

Compiling

To compile parallel programs, the open source MPI libraries OpenMPI and MPICH2 have been provided. A module corresponding to the compiler you wish to use must be loaded in order to set up the correct environment.

 module load ompi-intel

Once the module is loaded the following commands should be used to compile programs that use MPI code:

  • mpicc [options] file.c (C)
  • mpiCC [options] file.C (C++)
  • mpif77 [options] file.f (Fortran 77)
  • mpif90 [options] file.f (Fortran 90)



Sample PBS Scripts

OpenMPI Job

#!/bin/sh
#PBS -l nodes=4:ppn=2
#PBS -l walltime=12:00:00
#PBS -m abe
#PBS -M mst3k@virginia.edu

cd $PBS_O_WORKDIR

source /opt/Modules/default/init/sh
module add ompi-pgi

NP=`wc -l < $PBS_NODEFILE`

mpiexec --prefix $OMPI_MPIHOME -np $NP -machinefile $PBS_NODEFILE myexec

MPICH2 Job

#!/bin/sh
#PBS -l nodes=4:ppn=2
#PBS -l walltime=12:00:00
#PBS -m abe
#PBS -M mst3k@virginia.edu

cd $PBS_O_WORKDIR

mpiexec -comm mpich2-pmi myexec