Why hasn't my job started?

You can get information about what is preventing a queued job from running with checkjob:

$ checkjob jobid

The output of checkjob is detailed and fairly technical. In most cases a job remains in the queue because it is waiting for resources to become available (you can check how busy the system is with pbstop). Other likely causes are that it is waiting on a job dependency, or that you have reached the limit on the number of jobs a single user can run simultaneously. If your job has been waiting a long time and you would like help understanding why, contact us.

In the example below the job requested 12 large-memory nodes, and the NOTE on the last line indicates that the scheduler has not yet found a large enough time slot in which it can run (note that it has found only four such nodes available).

$ checkjob 3718378
job 3718378

AName: testme.q
State: Idle
Creds: user:sl151 group:users account:ITS class:p12 qos:p12
WallTime: 00:00:00 of 00:01:00
BecameEligible: Thu Feb 13 12:47:51
SubmitTime: Thu Feb 13 12:47:46
(Time Queued Total: 00:00:10 Eligible: 00:00:04)

NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 24

Req[0] TaskCount: 24 Partition: ALL
Opsys: --- Arch: --- Features: mem48gb
Dedicated Resources Per Task: PROCS: 1 MEM: 1365M


Notification Events: JobFail

IWD: /home/sl151/batch_scheduler
Flags: RESTARTABLE
Attr: checkpoint
StartPriority: 1999
compute-9-0 available: 12 tasks supported
compute-9-4 available: 12 tasks supported
compute-9-7 available: 12 tasks supported
compute-9-13 available: 12 tasks supported
NOTE: job cannot run in partition crunch (insufficient idle nodes available: 4 < 12)
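
Since the useful part of the output is usually the final NOTE, it can help to filter for it. The following is a minimal sketch, assuming the NOTE lines are formatted as in the example above; the helper names checkjob_notes and idle_node_shortfall are made up for illustration and are not part of checkjob.

```shell
# Hypothetical helper (not part of checkjob): reads checkjob output on
# stdin and prints only the NOTE lines, which state why the job cannot run.
checkjob_notes() {
    grep '^NOTE:'
}

# Hypothetical helper: for a NOTE of the form
#   "... (insufficient idle nodes available: 4 < 12)"
# prints "<idle> <needed>". Prints nothing if no such NOTE is present.
# Assumes the wording matches the example output shown above.
idle_node_shortfall() {
    sed -n 's/^NOTE:.*insufficient idle nodes available: \([0-9][0-9]*\) < \([0-9][0-9]*\)).*$/\1 \2/p'
}

# Usage on the cluster (replace jobid with your own job number):
#   checkjob jobid | checkjob_notes
#   checkjob jobid | idle_node_shortfall
```

For the example job, the second helper would report 4 idle nodes against the 12 needed. Other reasons for a job not starting (such as a dependency or a per-user limit) are worded differently and will only show up in the unfiltered NOTE lines.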
