Help:HPCC Torque
Submitting Jobs to the CSDMS HPCC
The CSDMS High Performance Computing Cluster uses Torque/Maui as its job scheduler. With Torque you can allocate resources, schedule and manage job execution, and monitor the status of your jobs.
Torque takes instructions from the command line and from directives embedded within comments of the shell script that runs your program. This page describes basic Torque usage; please visit the Torque website for a more complete guide.
The Torque Cheat Sheet
Frequently Used Commands
Command | Description | Basic Usage | Example |
---|---|---|---|
qsub | Submit a pbs job | qsub [script] | > qsub job.pbs |
qstat | Show status of pbs batch jobs | qstat [job_id] | > qstat 44 |
qdel | Delete pbs batch job | qdel [job_id] | > qdel 44 |
qhold | Hold pbs batch jobs | qhold [job_id] | > qhold 44 |
qrls | Release hold on pbs batch jobs | qrls [job_id] | > qrls 44 |
Check Queue and Job Status
Command | Description |
---|---|
qstat -q | List all queues |
qstat -a | List all jobs |
qstat -au <userid> | List jobs for userid |
qstat -r | List running jobs |
qstat -f <job_id> | List full information about job_id |
qstat -Qf <queue> | List full information about queue |
qstat -B | List summary status of the job server |
pbsnodes | List status of all compute nodes |
Within-Script Torque Commands
Command | Description |
---|---|
#PBS -N myjob | Set the job name |
#PBS -m ae | Mail status when the job aborts or ends |
#PBS -M your@email.address | Mail to this address |
#PBS -l nodes=4 | Allocate specified number of nodes |
#PBS -l walltime=1:00:00 | Inform the PBS scheduler of the expected runtime |
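Taken together, the directives above can be combined at the top of a single job script. A minimal sketch, with placeholder job name and e-mail address, and an echo standing in for your real program:

```shell
#!/bin/sh
#PBS -N myjob                  # job name (placeholder)
#PBS -m ae                     # mail on abort and end
#PBS -M your@email.address     # placeholder address
#PBS -l nodes=4                # ask for four nodes
#PBS -l walltime=1:00:00       # expect to run for at most one hour

# Stand-in for your program; replace with the real command.
JOB_MSG="myjob would run here"
echo "$JOB_MSG"
```

To the shell, every `#PBS` line is an ordinary comment, so the script runs unchanged whether or not it is submitted through Torque.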
Basic Usage
Torque dynamically allocates resources for your job. All you need to do is submit the job to the queue (with qsub) and Torque will find the resources for you. Note, though, that Torque is not aware of the details of the program you want to run, so you may need to tell it what resources you require (memory, nodes, cpus, etc.).
Submitting a job
To submit a job to the queue you must write a shell script that Torque will use to run your program. In its simplest form, a Torque command file looks like the following: <geshi>
#!/bin/sh
my_prog </geshi> This shell script simply runs the program my_prog. To submit this job to the queue, use the qsub command, <geshi> > qsub run_my_prog.sh </geshi> where the file run_my_prog.sh contains the code snippet above. Torque responds with the job number and the server name, <geshi> 45.beach.colorado.edu </geshi> In this case Torque has identified your job with job number 45. Your job is now in the default queue and will run as soon as resources are available for it. By default, the standard output and error of your script are redirected to files in your home directory, named <job_name>.o<job_no> and <job_name>.e<job_no>, respectively. Thus, for our example, standard output will be written to run_my_prog.sh.o45 and standard error to run_my_prog.sh.e45.
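When scripting around qsub, the job number can be peeled off the identifier with ordinary shell parameter expansion. A minimal sketch, assuming the 45.beach.colorado.edu format shown above (the identifier is hard-coded here in place of a live qsub call):

```shell
# Normally you would capture this as: JOB_ID=$(qsub run_my_prog.sh)
JOB_ID="45.beach.colorado.edu"     # hard-coded stand-in for qsub output

JOB_NO=${JOB_ID%%.*}               # strip everything after the first dot
OUT_FILE="run_my_prog.sh.o$JOB_NO" # where standard output will land
ERR_FILE="run_my_prog.sh.e$JOB_NO" # where standard error will land

echo "$JOB_NO $OUT_FILE $ERR_FILE"
```

The same `$JOB_NO` can then be passed to qstat or qdel later in the script.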
Deleting a job
If you want to delete a job that you have already submitted, use the qdel command. This immediately removes the job from the queue and kills it if it is already running. To delete the job from the previous example (job number 45), <geshi> > qdel 45 </geshi>
Check the status of a job
Use qstat to check the status of a job. This returns a brief status report of all your jobs that are either queued or running. For example, <geshi> > qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
45.beach.colorado.edu     STDIN            username               0 R workq
46.beach.colorado.edu     STDIN            username               0 Q workq
</geshi> In this case, job number 45 is running ('R'), and job number 46 is queued ('Q'). Both have been submitted to the workq.
Advanced Usage
As mentioned before, Torque is not aware of what resources your program needs, so you may need to give it some hints. This can be done on the command line when calling qsub, or within your Torque command file. Torque parses comments in your command file that begin with #PBS; the text that follows is interpreted as if it were given on the command line with the qsub command. Please see the qsub man page for a full list of options (man qsub).
Job Submission: There are options in the shell script that can be used to customize your job. Continuing with the example of the previous section, the command script could be customized, <geshi>
#!/bin/sh
#PBS -N example_job
#PBS -l mem=2gb
#PBS -o my_job.out
#PBS -e my_job.err
my_prog </geshi> Here we have renamed the job example_job, told Torque that the job will use 2GB of memory, and redirected standard output and error to the files my_job.out and my_job.err, respectively. Torque looks for lines that begin with #PBS at the beginning of your command file (ignoring a first line starting with #!). Once it encounters an executable line (one that isn't blank or a comment), it ignores any further directives.
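That placement rule can be seen in a small runnable sketch (the variable assignment stands in for a real program, and the second -N directive is deliberately misplaced):

```shell
#!/bin/sh
#PBS -N good_name        # parsed: appears before any executable line
STATUS="started"         # first executable line; directive scanning stops here
#PBS -N ignored_name     # ignored: placed after an executable line
echo "$STATUS"
```

Submitted with qsub, this job would be named good_name; the later directive is silently skipped.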
<geshi>
#PBS -r n                    # The job is not rerunnable
#PBS -r y                    # The job is rerunnable
#PBS -q testq                # The queue to submit to
#PBS -N testjob              # The name of the job
#PBS -o testjob.out          # The file to print standard output to
#PBS -e testjob.err          # The file to print standard error to

# Mail directives
#PBS -m abe                  # The points during execution at which to send an email (abort, begin, end)
#PBS -M me@colorado.edu      # Whom to mail
#PBS -l walltime=01:00:00    # Specify the walltime
#PBS -l pmem=100mb           # Memory allocated per process
#PBS -l nodes=4              # Number of nodes to allocate
#PBS -l nodes=4:ppn=3        # Number of nodes and the number of processors per node
</geshi>
You can use any of the above options in your script to customize a job. If all of the above options were used, the job would be named testjob and submitted to the testq queue. It would run for at most 1 hour and mail me@colorado.edu when the job aborts, begins, and ends. It would use 4 nodes with 3 processors per node, for a total of 12 processors, with 100 MB of memory per process.
Torque environment variables
Before Torque runs your script it defines a set of environment variables that you can use anywhere within your script, either in PBS directives or in commands. For example, <geshi>
#!/bin/sh
#PBS -N example_job
#PBS -l mem=2gb
#PBS -o my_job.out
#PBS -e my_job.err
IN_FILE=${PBS_O_HOME}/my_input_file.txt
my_prog ${IN_FILE} </geshi> Here Torque has set the environment variable PBS_O_HOME to the value of the HOME variable in the environment from which the job was submitted.
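Another commonly used variable is PBS_O_WORKDIR, which holds the directory from which qsub was invoked. Since Torque starts your script in your home directory, a frequent pattern is to change to the submission directory first. A minimal sketch (the :-. fallback is only so the script also runs outside Torque, where the variable is unset):

```shell
#!/bin/sh
#PBS -N example_job
# Torque starts the job in your home directory; move to the
# directory the job was submitted from before running anything.
cd "${PBS_O_WORKDIR:-.}"       # fall back to "." when not under Torque
RUN_DIR=$(pwd)
echo "running in $RUN_DIR"
```

This keeps input and output files next to the command file you submitted, rather than scattered in your home directory.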
Check the status of a job:
Torque and Maui allow you to check the status of jobs and of the queues.
In Torque: qstat has many options for checking job status. The basic way is to run the command without any options, as shown above. Again, the man pages are the best resource for information.
Other options include: -n, -f, -Q, -B, -u, -q
The -n option shows which nodes are running which jobs. <geshi> > qstat -n
server.colorado.edu:
                                                           Req'd  Req'd   Elap
Job ID             Username Queue    Jobname  SessID  NDS TSK Memory Time  S Time
------------------ -------- -------- -------- ------ ---- --- ------ ----- - -----
78.server.colorado user     workq    STDIN      4811   --  --     --    -- R   --
   node34/0
79.server.colorado user     workq    STDIN      4830   --  --     --    -- R   --
   node34/1
80.server.colorado user     workq    STDIN      3867   --  --     --    -- R   --
   node33/0
81.server.colorado user     workq    STDIN      4821   --  --     --    -- R   --
   node32/0
82.server.colorado user     workq    STDIN      4840   --  --     --    -- R   --
   node32/1
83.server.colorado user     workq    STDIN      4859   --  --     --    -- R   --
   node32/2
</geshi>
The -f option shows the full details for a specified job. <geshi> > qstat -f 84
Job Id: 84.server.colorado.edu
    Job_Name = STDIN
    Job_Owner = username@server.colorado.edu
    resources_used.cput = 00:00:00
    resources_used.mem = 1704kb
    resources_used.vmem = 8028kb
    resources_used.walltime = 00:00:01
    job_state = R
    queue = workq
    server = server.colorado.edu
    Checkpoint = u
    ctime = Fri Apr 24 16:21:51 2009
    Error_Path = server.colorado.edu:/tmp/STDIN.e84
    exec_host = node34/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Fri Apr 24 16:21:53 2009
    Output_Path = server.colorado.edu:/tmp/STDIN.o84
    Priority = 0
    qtime = Fri Apr 24 16:21:51 2009
    Rerunable = True
    Resource_List.neednodes = node34
    session_id = 4877
    substate = 42
    Variable_List = PBS_O_HOME=/tmp,PBS_O_LOGNAME=username,
        PBS_O_PATH=/usr/local/bin:/usr/bin,
        PBS_O_SHELL=/bin/tcsh,PBS_SERVER=server.colorado.edu,
        PBS_O_HOST=server.colorado.edu,PBS_O_WORKDIR=/tmp/,
        PBS_O_QUEUE=workq
    euser = username
    egroup = server
    hashname = 84.server.colorado.edu
    queue_rank = 83
    queue_type = E
    etime = Fri Apr 24 16:21:51 2009
    start_time = Fri Apr 24 16:21:53 2009
    start_count = 1
</geshi>
The -u option shows all jobs owned by the specified user.
The -Q option shows queue information. If a specific queue is given, only that queue's information is shown.
<geshi> > qstat -Q
Queue            Max Tot Ena Str Que Run Hld Wat Trn Ext T
---------------- --- --- --- --- --- --- --- --- --- --- -
testing            0   0 yes yes   0   0   0   0   0   0 E
normal             8   1 yes yes   0   1   0   0   0   0 E
short              0   0 yes yes   0   0   0   0   0   0 E
long               0   3 yes yes   0   3   0   0   0   0 E
special            0   0 yes yes   0   0   0   0   0   0 E
</geshi>
Maui
If Maui is installed on your system, you have access to another set of tools. One of these is showq, a tool like qstat that shows queue information.
<geshi> > showq
ACTIVE JOBS--------------------
JOBNAME    USERNAME   STATE    PROC    REMAINING            STARTTIME

624        user1      Running     4     21:00:01  Fri Apr 24 13:34:17
621        user2      Running     2  95:21:19:49  Mon Apr 20 13:54:06
622        user2      Running     2  95:21:23:06  Mon Apr 20 13:57:23
623        user2      Running     2  96:04:13:37  Mon Apr 20 20:47:54

     4 Active Jobs      10 of 20 Processors Active (50.00%)
                         5 of  7 Nodes Active      (71.43%)

IDLE JOBS----------------------
JOBNAME    USERNAME   STATE    PROC      WCLIMIT            QUEUETIME

0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME    USERNAME   STATE    PROC      WCLIMIT            QUEUETIME

Total Jobs: 4   Active Jobs: 4   Idle Jobs: 0   Blocked Jobs: 0
</geshi>