== Submitting Jobs to the CSDMS HPCC ==

The CSDMS High Performance Computing Cluster uses Torque/Maui as its job scheduler. With Torque you can allocate resources, schedule and manage job execution, and monitor the status of your jobs.

Torque takes its instructions from options given on the command line and from directives embedded in comments of the shell script that runs your program. This page describes basic Torque usage; please visit the Torque website for a more complete guide.
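For example, a one-hour walltime limit can be requested either on the command line or, equivalently, through a #PBS comment inside the script itself (the script name run_my_prog.sh is just a placeholder):

<geshi>
# On the qsub command line:
> qsub -l walltime=1:00:00 run_my_prog.sh

# Or as a directive embedded in run_my_prog.sh:
#PBS -l walltime=1:00:00
</geshi>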

== The Torque Cheat Sheet ==

=== Frequently Used Commands ===

{|
! Command !! Description !! Basic Usage !! Example
|-
| qsub || Submit a PBS job || qsub [script] || > qsub job.pbs
|-
| qstat || Show status of PBS batch jobs || qstat [job_id] || > qstat 44
|-
| qdel || Delete a PBS batch job || qdel [job_id] || > qdel 44
|-
| qhold || Hold PBS batch jobs || qhold [job_id] || > qhold 44
|-
| qrls || Release hold on PBS batch jobs || qrls [job_id] || > qrls 44
|}

=== Check Queue and Job Status ===

{|
! Command !! Description
|-
| qstat -q || List all queues
|-
| qstat -a || List all jobs
|-
| qstat -au <userid> || List jobs for userid
|-
| qstat -r || List running jobs
|-
| qstat -f <job_id> || List full information about job_id
|-
| qstat -Qf <queue> || List full information about queue
|-
| qstat -B || List summary status of the job server
|-
| pbsnodes || List status of all compute nodes
|}

=== Job Submission Options ===

{|
! Command !! Description
|-
| #PBS -N myjob || Set the job name
|-
| #PBS -m ae || Mail status when the job completes
|-
| #PBS -M your@email.address || Mail to this address
|-
| #PBS -l nodes=4 || Allocate the specified number of nodes
|-
| #PBS -l walltime=1:00:00 || Inform the PBS scheduler of the expected runtime
|-
| #PBS -t 0-5 || Start a job array with IDs that range from 0 to 5
|}
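All of these directives can also be given directly on the qsub command line. As a sketch combining several of the options above (resource requests after a single -l are joined with commas; job and file names are placeholders):

<geshi>
> qsub -N myjob -m ae -M your@email.address -l nodes=4,walltime=1:00:00 job.pbs
</geshi>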

== Basic Usage ==

Torque dynamically allocates resources for your job; all you need to do is submit the job to the queue (with qsub) and Torque will find the resources for it. Note, though, that Torque is not aware of the details of the program that you want to run, so you may need to tell it what resources you require (memory, nodes, CPUs, etc.).

=== Submitting a job ===

To submit a job to the queue you must write a shell script that Torque will use to run your program. In its simplest form, a Torque command file looks like the following:

<geshi>
#!/bin/sh

my_prog
</geshi>

This shell script simply runs the program my_prog. To submit this job to the queue, use the qsub command,

<geshi>
> qsub run_my_prog.sh
</geshi>

where the file run_my_prog.sh contains the code snippet above. Torque responds with the job number and the server name,

<geshi>
45.beach.colorado.edu
</geshi>

In this case Torque has identified your job with job number 45. Your job is now in the default queue and will run as soon as resources are available for it. By default, the standard output and error of your script are redirected to files in your home directory named <job_name>.o<job_no> and <job_name>.e<job_no>, respectively. Thus, for our example, standard output will be written to run_my_prog.sh.o45, and standard error to run_my_prog.sh.e45.
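Once the job finishes, you can inspect these files like any other text file, for example:

<geshi>
> cat run_my_prog.sh.o45   # standard output of job 45
> cat run_my_prog.sh.e45   # standard error of job 45
</geshi>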


=== Deleting a job ===

If you want to delete a job that you have already submitted, use the qdel command. This immediately removes your job from the queue and kills it if it is already running. To delete the job from the previous example (job number 45),

<geshi>
> qdel 45
</geshi>

=== Check the status of a job ===

Use qstat to check the status of a job. It returns a brief status report of all of your jobs that are either queued or running. For example,

<geshi>
> qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
45.beach.colorado.edu     STDIN            username               0 R workq
46.beach.colorado.edu     STDIN            username               0 Q workq
</geshi>

In this case, job number 45 is running ('R') and job number 46 is queued ('Q'). Both were submitted to the queue workq.

== Advanced Usage ==

As mentioned before, Torque is not aware of what resources your program will need, so you may need to give it some hints. This can be done on the command line when calling qsub or within your Torque command file. Torque parses comments of the form #PBS within your command file; the text that follows is interpreted as if it were given on the command line with the qsub command. Please see the qsub man page for a full list of options (man qsub).

Job Submission: There are options in the shell script that can be used to customize your job. Continuing with the example of the previous section, the command script could be customized,

<geshi>
#!/bin/sh
#PBS -N example_job
#PBS -l mem=2gb
#PBS -o my_job.out
#PBS -e my_job.err

my_prog
</geshi>

Here we have renamed the job example_job, told Torque that the job will use 2GB of memory, and redirected standard output and error to the files my_job.out and my_job.err, respectively. Torque looks for lines that begin with #PBS at the beginning of your command file (ignoring a first line starting with #!). Once it encounters a non-blank line that is not a #PBS comment, it ignores any further directives.
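The same customization can be applied without editing the script by giving the options to qsub directly; command-line options generally take precedence over the corresponding #PBS lines. A sketch of an equivalent submission of the script above:

<geshi>
> qsub -N example_job -l mem=2gb -o my_job.out -e my_job.err run_my_prog.sh
</geshi>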

Commonly used directives include:

<geshi>
#PBS -r n                  # The job is not rerunnable
#PBS -r y                  # The job is rerunnable
#PBS -q testq              # The queue to submit to
#PBS -N testjob            # The name of the job
#PBS -o testjob.out        # The file to print the output to
#PBS -e testjob.err        # The file to print the error to

# Mail directives
#PBS -m abe                # The points during execution at which to send an email
#PBS -M me@colorado.edu    # Whom to mail

# Resource directives
#PBS -l walltime=01:00:00  # Specify the walltime
#PBS -l pmem=100mb         # Per-process memory allocation for the job
#PBS -l nodes=4            # Number of nodes to allocate
#PBS -l nodes=4:ppn=3      # Number of nodes and the number of processors per node
</geshi>

You can use any of the above options in the script to customize your job. If all of the above options were used, the job would be named testjob and be put into the queue testq. It would run for at most 1 hour and mail me@colorado.edu at the beginning and end of the job (and on abort). It would use 4 nodes with 3 processors per node, for a total of 12 processors, with 100 MB of memory per process.

=== Job Arrays ===

Sometimes you may want to submit a large number of jobs based on the same script. An example might be a Monte Carlo simulation where each run uses a different input file or set of input files. Torque handles this situation with job arrays, which allow you to submit a large number of jobs with a single qsub command. For example,

<geshi>
> qsub -t 10-23 my_job_script.sh
</geshi>

submits 14 jobs to the queue, each sharing the same script and running in a similar environment. When the script runs for each job, Torque defines the environment variable PBS_ARRAYID, set to the array index of that job; for the above example, the indices range from 10 to 23. The script can then use PBS_ARRAYID to take a particular action depending on its ID, for instance gathering the input files identified by that ID, as in the sketch below.
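As a sketch of this pattern (the input_<id>.txt naming scheme is purely hypothetical), a job-array script might pick its input file as follows:

<geshi>
#!/bin/sh
#PBS -N mc_run
#PBS -t 10-23

# Change to the directory from which the job was submitted
# (see the environment-variable section below).
cd ${PBS_O_WORKDIR}

# Each member of the array selects the input file matching its
# array index; input_<id>.txt is only an example naming scheme.
IN_FILE=input_${PBS_ARRAYID}.txt

my_prog ${IN_FILE}
</geshi>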

Torque references the set of jobs generated by such a command with a slightly different naming convention,

<geshi>
> qsub -t 100,102-105 my_job_script.sh
45.beach.colorado.edu
> qstat
45-100.beach.colorado.edu ...
45-102.beach.colorado.edu ...
45-103.beach.colorado.edu ...
45-104.beach.colorado.edu ...
45-105.beach.colorado.edu ...
</geshi>

You can now refer to the jobs as a group or individually. For example, to stop all of the jobs,

<geshi>
> qdel 45
</geshi>

To stop a single job of the group,

<geshi>
> qdel 45-103
</geshi>

=== Torque environment variables ===

Before Torque runs your script, it defines a set of environment variables that you can use anywhere within the script, either in #PBS directives or in commands. For example,

<geshi>
#!/bin/sh
#PBS -N example_job
#PBS -l mem=2gb
#PBS -o my_job.out
#PBS -e my_job.err

IN_FILE=${PBS_O_HOME}/my_input_file.txt

my_prog ${IN_FILE}
</geshi>

Here Torque has set the environment variable PBS_O_HOME to the home directory of the user on the machine where the qsub command was run.

The following environment variables relate to the machine on which qsub was executed:

{|
! Variable Name !! Description
|-
| PBS_O_HOST || The name of the host machine.
|-
| PBS_O_LOGNAME || The login name of the user running qsub.
|-
| PBS_O_HOME || Home directory of the user running qsub.
|-
| PBS_O_WORKDIR || The working directory from which qsub was run.
|}
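These variables are handy because a batch job does not start in the directory from which it was submitted; a common first command in a script is therefore:

<geshi>
# Jobs start in your home directory, not where qsub was run,
# so change back to the submission directory first.
cd ${PBS_O_WORKDIR}
</geshi>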

The following variables relate to the environment on the machine where the job is to be run:

{|
! Variable Name !! Description
|-
| PBS_ENVIRONMENT || Evaluates to PBS_BATCH for batch jobs and to PBS_INTERACTIVE for interactive jobs.
|-
| PBS_O_QUEUE || The original queue to which the job was submitted.
|-
| PBS_JOBID || The identifier that PBS assigns to the job.
|-
| PBS_JOBNAME || The name of the job.
|-
| PBS_NODEFILE || The file containing the list of nodes assigned to a parallel job.
|}
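For example, a parallel job can use PBS_NODEFILE to tell mpirun which nodes it was assigned. This is only a sketch: the -np and -machinefile flags are typical of MPICH-style mpirun wrappers, and my_parallel_prog is a placeholder, so check your site's MPI documentation.

<geshi>
#!/bin/sh
#PBS -l nodes=4:ppn=3

# PBS_NODEFILE lists one line per assigned processor, so counting
# its lines gives the number of processes to start.
NP=`wc -l < ${PBS_NODEFILE}`

mpirun -np ${NP} -machinefile ${PBS_NODEFILE} my_parallel_prog
</geshi>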

=== Check the status of a job ===

You can check the status of your jobs with either Torque or Maui.

Torque provides the qstat command to check job status. Please see the qstat man page for a full list of options (man qstat). Some useful options that were not listed above include:

{|
! Option !! Description
|-
| -n || Show which nodes are allocated to each job.
|-
| -f || Show a full status display.
|-
| -u || Show status for jobs owned by a specified user.
|-
| -q || Show status for a particular queue.
|}
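These options can be combined. For example, to see which nodes your own jobs are running on (replace username with your login name):

<geshi>
> qstat -n -u username
</geshi>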

=== Maui ===

If Maui is installed on your system, you have access to another set of tools. One of these, showq, is similar to qstat: it shows information about the queue.

<geshi>
> showq

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
624                   user1    Running     4    21:00:01  Fri Apr 24 13:34:17
621                   user2    Running     2 95:21:19:49  Mon Apr 20 13:54:06
622                   user2    Running     2 95:21:23:06  Mon Apr 20 13:57:23
623                   user2    Running     2 96:04:13:37  Mon Apr 20 20:47:54

     4 Active Jobs      10 of   20 Processors Active (50.00%)
                         5 of    7 Nodes Active      (71.43%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

Total Jobs: 4   Active Jobs: 4   Idle Jobs: 0   Blocked Jobs: 0
</geshi>