Scheduling on regis cluster
From Joe Slagel on Oct. 16, 2009:
Below are my notes on the current Maui/OpenPBS configuration. Hopefully they will clear up any misconceptions that may have emerged.
- The servers regis, regis1-3, and regis4-8 are currently configured as a computing cluster. I use the term "cluster" rather than "grid" because a cluster typically refers to a group of tightly coupled, homogeneous nodes that are densely located and share a dedicated network and filesystems. Grids are typically loosely coupled collections of computers.
- The cluster software on regis consists of Torque (OpenPBS), version 2.2.1, and MAUI, version 3.2.6p19. Torque handles resource management and provides control over batch jobs and distributed compute nodes. MAUI is an open source job scheduler for clusters and supercomputers. The two packages are linked together and communicate back and forth to provide the cluster functionality.
- Torque is installed in /hpc, /PBS
- MAUI is installed in /usr/local/maui (-> /var/local/maui)
- Torque currently has 4 nodes configured (regis, regis1, regis2, and regis3).
- regis1, regis2, and regis have properties xtandem, serial
- regis has properties sequest
- Torque currently has 13 queues: memhog, xtandem, serial, sequest_standrd, sequest_<lab group>...
- The xtandem queue is configured to run a maximum of 6 jobs at once (max_running)
- The xtandem queue is configured to allow a maximum of 500 queued jobs (max_queuable)
- The xtandem queue is configured to run a maximum of 3 jobs at a time for a single user (max_user_run)
- Some useful commands (may require admin privileges):
  qmgr -c 'list node @active'    # lists all nodes
  qmgr -c 'list queue @active'   # lists all queues
  showconfig                     # lists maui scheduler configuration
  showstats                      # lists historical usage/fairshare usage
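
For reference, the xtandem queue limits described above could be expressed as qmgr commands along the following lines. This is only a sketch, not a copy of the actual regis configuration: the helper function set_queue_limit and the DRY_RUN variable are hypothetical names introduced for illustration, and by default the script only prints the commands rather than running them. The attribute names (max_running, max_queuable, max_user_run) and values are the ones listed in the notes.

```shell
#!/bin/sh
# Sketch: print (or run) the qmgr commands matching the xtandem limits in the notes.
# DRY_RUN is a hypothetical switch; it defaults to 1 so nothing on the server changes.
DRY_RUN="${DRY_RUN:-1}"

# set_queue_limit QUEUE ATTRIBUTE VALUE  (hypothetical helper, not part of Torque)
set_queue_limit() {
    cmd="set queue $1 $2 = $3"
    if [ "$DRY_RUN" = "1" ]; then
        # Only show the command that would be run.
        printf "qmgr -c '%s'\n" "$cmd"
    else
        # Requires Torque manager privileges on the server.
        qmgr -c "$cmd"
    fi
}

set_queue_limit xtandem max_running 6     # at most 6 jobs running at once
set_queue_limit xtandem max_queuable 500  # at most 500 jobs queued
set_queue_limit xtandem max_user_run 3    # at most 3 running jobs per user
```

To see what is actually configured for the queue, qmgr -c 'list queue xtandem' prints its current attributes.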