Help:HPCC Torque: Difference between revisions

From CSDMS
Revision as of 11:40, 13 May 2009

Submitting Jobs to the CSDMS HPCC

The CSDMS High Performance Computing Cluster uses Torque/Maui as a job scheduler. With Torque you can allocate resources, schedule and manage job execution, and monitor the status of your jobs.

Torque uses instructions given on the command line and embedded within comments of the shell script that runs your program. This page describes basic Torque usage. Please visit the Torque website for a more complete guide.

The Torque Cheat Sheet

Frequently Used Commands

{| class="wikitable"
! Command !! Description !! Basic Usage !! Example
|-
| qsub || Submit a PBS job || qsub [script] || > qsub job.pbs
|-
| qstat || Show status of PBS batch jobs || qstat [job_id] || > qstat 44
|-
| qdel || Delete PBS batch job || qdel [job_id] || > qdel 44
|-
| qhold || Hold PBS batch jobs || qhold [job_id] || > qhold 44
|-
| qrls || Release hold on PBS batch jobs || qrls [job_id] || > qrls 44
|}

Check Queue and Job Status

{| class="wikitable"
! Command !! Description
|-
| qstat -q || List all queues
|-
| qstat -a || List all jobs
|-
| qstat -au <userid> || List jobs for userid
|-
| qstat -r || List running jobs
|-
| qstat -f <job_id> || List full information about job_id
|-
| qstat -Qf <queue> || List full information about queue
|-
| qstat -B || List summary status of the job server
|-
| pbsnodes || List status of all compute nodes
|}

Within-Script Torque Commands

{| class="wikitable"
! Command !! Description
|-
| #PBS -N myjob || Set the job name
|-
| #PBS -m ae || Mail status when the job aborts or ends
|-
| #PBS -M your@email.address || Mail to this address
|-
| #PBS -l nodes=4 || Allocate specified number of nodes
|-
| #PBS -l walltime=1:00:00 || Inform the PBS scheduler of the expected runtime
|}

Basic Usage

Torque dynamically allocates resources for your job. All you need to do is submit it to the queue (with qsub) and Torque will find the resources for you. Note, though, that Torque is not aware of the details of the program you want to run, so you may need to tell it what resources you require (memory, nodes, cpus, etc.).

Submitting a job

To submit a job to the queue, you must write a shell script that Torque will use to run your program. In its simplest form, a Torque command file looks like the following:
<geshi>
#!/bin/sh

my_prog
</geshi>
This shell script simply runs the program my_prog. To submit this job to the queue, use the qsub command,
<geshi>
> qsub run_my_prog.sh
</geshi>
where the file run_my_prog.sh contains the code snippet above. Torque will respond with the job number and the server name,
<geshi>
45.beach.colorado.edu
</geshi>
In this case Torque has identified your job with job number 45. You have now submitted your job to the default queue, and it will be run as soon as resources are available for it.

By default, the standard output and error of your script are redirected to files in your home directory. They will have the names <job_name>.o<job_no> and <job_name>.e<job_no> for standard output and error, respectively. Thus, for our example, standard output will be written to run_my_prog.sh.o45, and standard error will be written to run_my_prog.sh.e45.
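The file naming rule described above can be sketched in shell. This is only an illustration under stated assumptions: the helper names job_number and output_files are hypothetical and not part of Torque, and the response string is the one from the example rather than the result of a real submission.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Torque) for the naming rule described above.

# Extract the job number from a qsub response such as "45.beach.colorado.edu".
job_number() {
    # Strip everything from the first dot onward.
    printf '%s\n' "${1%%.*}"
}

# Print the default stdout and stderr file names for a job name and a qsub response.
output_files() {
    job_name=$1
    job_no=$(job_number "$2")
    printf '%s.o%s\n%s.e%s\n' "$job_name" "$job_no" "$job_name" "$job_no"
}

# With Torque installed, the response would come from: job_id=$(qsub run_my_prog.sh)
# Here the response from the example above is used directly.
job_id="45.beach.colorado.edu"
output_files run_my_prog.sh "$job_id"
```

Capturing the response of qsub in a variable this way makes it easy to pass the job number on to later qstat or qdel calls.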


Deleting a job

If you want to delete a job that you already submitted, use the qdel command. This immediately removes your job from the queue and kills it if it is already running. To delete the job from the previous example (job number 45), <geshi> >qdel 45 </geshi>

Check the status of a job

Use qstat to check the status of a job. This returns a brief status report of all your jobs that are either queued or running. For example, <geshi> >qstat

      Job id                    Name             User            Time Use S Queue
      ------------------------- ---------------- --------------- -------- - -----
      45.beach.colorado.edu     STDIN            username               0 R workq
      46.beach.colorado.edu     STDIN            username               0 Q workq

</geshi> In this case, job number 45 is running ('R'), and job number 46 is queued ('Q'). Both have been submitted to the workq.
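In a script it can be handy to look up the state of a single job. The following is a minimal sketch, assuming the default qstat column layout shown above, where the state is the second-to-last field. The helper name job_state is hypothetical, and the sample text is copied from the example so that the sketch runs without Torque.

```shell
#!/bin/sh
# Hypothetical helper: print the state column for one job in default qstat output.
job_state() {
    # $1 is the job number; qstat output is read from standard input.
    # Assumes the state is the second-to-last field of each job line.
    awk -v id="$1" '
        { split($1, parts, "[.]") }
        parts[1] == id { print $(NF - 1) }
    '
}

# Sample output copied from the example above; with Torque installed
# you would run: qstat | job_state 46
sample='Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
45.beach.colorado.edu     STDIN            username               0 R workq
46.beach.colorado.edu     STDIN            username               0 Q workq'

printf '%s\n' "$sample" | job_state 45
printf '%s\n' "$sample" | job_state 46
```

For the sample data this prints R for job 45 and Q for job 46.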

Advanced usage

Torque allows you to use advanced features and customizations when running jobs. The sections below continue the sections above.

Job Submission: There are options in the shell script that can be used to customize your job.

A Basic script. <geshi> >cat test.sh

      #!/bin/bash
      #PBS -N testjob
      cat $PBS_NODEFILE
      sleep 30

</geshi>

$PBS_NODEFILE is the location of a file that contains a list of the nodes allocated for this job.

#PBS specifies an option to Torque. Many options are listed below, and more can be found in the man page for qsub.

<geshi>

#PBS -r n # The job is not rerunnable
#PBS -r y # The job is rerunnable
#PBS -q testq # The queue to submit to
#PBS -N testjob # The name of the job
#PBS -o testjob.out # The file to print the output to
#PBS -e testjob.err # The file to print the error to
# Mail Directives
#PBS -m abe # The points during the execution to send an email
#PBS -M me@colorado.edu # Who to mail to
#PBS -l walltime=01:00:00 # Specify the walltime
#PBS -l pmem=100mb # Memory allocation for the job
#PBS -l nodes=4 # Number of nodes to allocate
#PBS -l nodes=4:ppn=3 # Number of nodes and the number of processors per node

</geshi>

You can use any of the above options in the script to customize your job. If all of the above options are used, the job will be named testjob and be put into the testq queue. It will be limited to 1 hour of walltime, and mail will be sent to me@colorado.edu when the job begins, ends, or aborts. It will use 4 nodes with 3 processors per node, for a total of 12 processors, and 100 MB of memory per process.
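When the same options are needed for many jobs, the script itself can be generated. The sketch below is an illustration only: the helper names make_pbs_script and total_procs are hypothetical, and only a few of the options listed above are written out. It prints the script to standard output and shows the processor arithmetic (nodes multiplied by processors per node).

```shell
#!/bin/sh
# Hypothetical helper: print a Torque script using a few of the options listed above.
# Arguments: job name, queue, nodes, processors per node, walltime.
make_pbs_script() {
    cat <<EOF
#!/bin/bash
#PBS -N $1
#PBS -q $2
#PBS -l nodes=$3:ppn=$4
#PBS -l walltime=$5
cat \$PBS_NODEFILE
EOF
}

# Total processors requested is nodes multiplied by processors per node.
total_procs() {
    echo $(( $1 * $2 ))
}

make_pbs_script testjob testq 4 3 01:00:00
total_procs 4 3
```

Redirecting the output of make_pbs_script to a file gives a script that could then be submitted with qsub on a system where Torque is installed.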


Check the status of a job: Torque and Maui allow you to check the status of jobs and the queue status.

In Torque: qstat has many options for checking job status. The basic way is to run the command without any options, as shown above. Again, the man pages are the best resource for information.

Other options include: -n, -f, -Q, -B, -u, -q

The -n option will show which nodes are running which jobs. <geshi> >qstat -n

      server.colorado.edu:
                                                                       Req'd  Req'd   Elap
      Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
      -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
      78.server.colorado     user     workq    STDIN              4811   --   --    --    --  R   --
         node34/0
      79.server.colorado     user     workq    STDIN              4830   --   --    --    --  R   --
         node34/1
      80.server.colorado     user     workq    STDIN              3867   --   --    --    --  R   --
         node33/0
      81.server.colorado     user     workq    STDIN              4821   --   --    --    --  R   --
         node32/0
      82.server.colorado     user     workq    STDIN              4840   --   --    --    --  R   --
         node32/1
      83.server.colorado     user     workq    STDIN              4859   --   --    --    --  R   --
         node32/2

</geshi>
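The node listing can be summarized with a short filter. This is a minimal sketch, assuming each job line is followed by a single node/slot line as in the example above. The helper name jobs_per_node is hypothetical, and the sample input is copied from the example so that the sketch runs without Torque.

```shell
#!/bin/sh
# Hypothetical helper: count running jobs per node from "qstat -n" output on stdin.
# Assumes each job is followed by a single "node/slot" line, as in the example above.
jobs_per_node() {
    awk '
        NF == 1 && $1 ~ /\// {
            split($1, parts, "/")
            count[parts[1]]++
        }
        END { for (node in count) print node, count[node] }
    ' | sort
}

# Sample copied from the example above; with Torque installed
# you would run: qstat -n | jobs_per_node
jobs_per_node <<'EOF'
78.server.colorado     user     workq    STDIN              4811   --   --    --    --  R   --
   node34/0
79.server.colorado     user     workq    STDIN              4830   --   --    --    --  R   --
   node34/1
80.server.colorado     user     workq    STDIN              3867   --   --    --    --  R   --
   node33/0
81.server.colorado     user     workq    STDIN              4821   --   --    --    --  R   --
   node32/0
82.server.colorado     user     workq    STDIN              4840   --   --    --    --  R   --
   node32/1
83.server.colorado     user     workq    STDIN              4859   --   --    --    --  R   --
   node32/2
EOF
```

For the sample data this reports three jobs on node32, one on node33, and two on node34.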

The -f option will show the full details for a specified job.
<geshi>
>qstat -f 84
Job Id: 84.server.colorado.edu

  Job_Name = STDIN
  Job_Owner = username@server.colorado.edu
  resources_used.cput = 00:00:00
  resources_used.mem = 1704kb
  resources_used.vmem = 8028kb
  resources_used.walltime = 00:00:01
  job_state = R
  queue = workq
  server = server.colorado.edu
  Checkpoint = u
  ctime = Fri Apr 24 16:21:51 2009
  Error_Path = server.colorado.edu:/tmp/STDIN.e84
  exec_host = node34/0
  Hold_Types = n
  Join_Path = n
  Keep_Files = n
  Mail_Points = a
  mtime = Fri Apr 24 16:21:53 2009
  Output_Path = server.colorado.edu:/tmp/STDIN.o84
  Priority = 0
  qtime = Fri Apr 24 16:21:51 2009
  Rerunable = True
  Resource_List.neednodes = node34
  session_id = 4877
  substate = 42
  Variable_List = PBS_O_HOME=/tmp,PBS_O_LOGNAME=username,
      PBS_O_PATH= /usr/local/bin:/usr/bin
      PBS_O_SHELL=/bin/tcsh,PBS_SERVER=server.colorado.edu,
      PBS_O_HOST=server.colorado.edu,PBS_O_WORKDIR=/tmp/
      PBS_O_QUEUE=workq
  euser = username
  egroup = server
  hashname = 84.server.colorado.edu
  queue_rank = 83
  queue_type = E
  etime = Fri Apr 24 16:21:51 2009
  start_time = Fri Apr 24 16:21:53 2009
  start_count = 1

</geshi>

The -u option will show all jobs owned by the specified user.

The -Q option will show the queue information. If a specific queue is specified it will only show the information from that queue.

<geshi> >qstat -Q

      Queue              Max   Tot   Ena   Str   Que   Run   Hld   Wat   Trn   Ext T
      ----------------   ---   ---   ---   ---   ---   ---   ---   ---   ---   --- -
      testing              0     0   yes   yes     0     0     0     0     0     0 E
      normal               8     1   yes   yes     0     1     0     0     0     0 E
      short                0     0   yes   yes     0     0     0     0     0     0 E
      long                 0     3   yes   yes     0     3     0     0     0     0 E
      special              0     0   yes   yes     0     0     0     0     0     0 E

</geshi>
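The queue summary can also be reduced to a single number, such as the total count of running jobs. The following is a minimal sketch, assuming the column layout shown above, where Run is the seventh field. The helper name running_total is hypothetical, and the sample rows are copied from the example so that the sketch runs without Torque.

```shell
#!/bin/sh
# Hypothetical helper: sum the Run column of "qstat -Q" output read from stdin.
# Assumes the column layout shown above, where Run is the seventh field.
running_total() {
    awk '
        $1 != "Queue" && $1 !~ /^-/ && NF >= 7 { total += $7 }
        END { print total + 0 }
    '
}

# Sample rows copied from the example above; with Torque installed
# you would run: qstat -Q | running_total
running_total <<'EOF'
Queue              Max   Tot   Ena   Str   Que   Run   Hld   Wat   Trn   Ext T
----------------   ---   ---   ---   ---   ---   ---   ---   ---   ---   --- -
testing              0     0   yes   yes     0     0     0     0     0     0 E
normal               8     1   yes   yes     0     1     0     0     0     0 E
short                0     0   yes   yes     0     0     0     0     0     0 E
long                 0     3   yes   yes     0     3     0     0     0     0 E
special              0     0   yes   yes     0     0     0     0     0     0 E
EOF
```

For the sample data the total is 4: one running job in the normal queue and three in the long queue.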


Maui

If Maui is installed on your system, you will have access to another set of tools. One of these is showq, a tool like qstat that shows the queue information.

<geshi> >showq

      ACTIVE JOBS--------------------
      JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
      624                   user1    Running     4    21:00:01  Fri Apr 24 13:34:17
      621                   user2    Running     2 95:21:19:49  Mon Apr 20 13:54:06
      622                   user2    Running     2 95:21:23:06  Mon Apr 20 13:57:23
      623                   user2    Running     2 96:04:13:37  Mon Apr 20 20:47:54
           4 Active Jobs      10 of   20 Processors Active (50.00%)
                               5 of    7 Nodes Active      (71.43%)
      IDLE JOBS----------------------
      JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


      0 Idle Jobs
      BLOCKED JOBS----------------
      JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


      Total Jobs: 4   Active Jobs: 4   Idle Jobs: 0   Blocked Jobs: 0

</geshi>