December 18, 2023
Full documentation on submitting and modifying jobs through LSF is here (only available internally). These notes are incomplete and intended for advanced users of the LSF scheduler.
If jobs are using more resources than allocated by the job scheduler, the queue or resource reservation may be changed, but only while the job is in a pending state ("PEND", or "PSUSP" as used in the steps below), not while it is running (RUN) or suspended (USUSP or SSUSP).
Notes
If the running jobs created output files that would prevent a second run of the same job from succeeding, these need to be cleaned up before the jobs run again.
For one or two jobs, use the brequeue / bswitch / bmod commands directly; these instructions apply to large numbers of jobs.
When you have many similar jobs to run, we recommend testing the script within an interactive session (http://rc.partners.org/kbase?cat_id=45&art_id=405) to determine what memory and CPU reservation to make for the batch jobs. While the interactive session is open, you may "ssh" to the node in another terminal and run "top" or other Linux tools to monitor the resources used by your job. Alternatively, look up the historical resource usage of the node in Ganglia: https://hpcweb2.partners.org/ganglia/
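Once a test run has given you a peak memory figure, the reservation for the batch jobs is that figure plus some headroom. The following is a minimal sketch of that arithmetic, not site policy. It assumes the peak was noted in KB and that the reservation is wanted in MB; the function name mem_reservation_mb and the default 20% headroom are hypothetical choices made for illustration. Check which units your LSF configuration expects before passing the result to bmod or bsub.

```shell
# mem_reservation_mb PEAK_KB [HEADROOM_PCT]
# Prints a memory reservation in MB: observed peak plus headroom.
# HEADROOM_PCT defaults to 20 (an illustrative assumption, not site policy).
mem_reservation_mb() {
    peak_kb=$1
    headroom_pct=${2:-20}
    # Scale the peak by (100 + headroom) / 100 using integer arithmetic.
    scaled_kb=$(( peak_kb * (100 + headroom_pct) / 100 ))
    # Convert KB to MB, rounding up to the next whole MB.
    echo $(( (scaled_kb + 1023) / 1024 ))
}

# Usage: a test run that peaked at 16000000 KB, with the default headroom
#   mem_reservation_mb 16000000      # prints 18750
```

Rounding up rather than down keeps the reservation from landing just below the observed peak for small jobs.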
Requeuing many running jobs
To modify jobs that are running or suspended, first requeue them with "brequeue". Under certain circumstances an administrator may requeue jobs that are exceeding their resource reservation, in which case you can either modify the queue and resource requirements of the existing jobs, or kill these jobs and re-submit them from the beginning. If you choose to modify the existing jobs, please notify Scientific Computing, as an administrator will need to resume the jobs.
- Use "bjobs" and awk to filter job numbers - select only the specific user and specific queue
- The "-H" option of brequeue holds the jobs in the PSUSP state after they are returned to the queue, so they do not start running again before you have modified them
- First do a trial run to echo output
bjobs -u abc123 -q normal | awk 'NR > 1 {print $1}' | xargs -I '{}' echo brequeue -H '{}'
- Then execute for real (remove echo from the above)
bjobs -u abc123 -q normal | awk 'NR > 1 {print $1}' | xargs -I '{}' brequeue -H '{}'
- An alternative to removing 'echo' is to copy the first command printed by the dry run, paste it at the terminal prompt, and execute it
- Verify that the command works on the first job listed
- Then copy/paste the remaining commands
- Wait 15-30 minutes depending on the number of jobs moved
- Check status
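The dry-run-then-execute pattern above can also be wrapped in two small shell functions, so that the same pipeline serves both passes. This is a sketch under stated assumptions, not part of LSF: the names job_ids and apply_to_jobs and the DRY_RUN variable are hypothetical, and the filter assumes the default bjobs column layout with the job ID in the first column. Unlike the plain awk filter, it also skips any line whose first field is not a number, such as a wrapped continuation line.

```shell
# job_ids: read default-format `bjobs` output on stdin and print one job ID
# per line. The header row and any line whose first field is not purely
# numeric are skipped.
job_ids() {
    awk 'NR > 1 && $1 ~ /^[0-9]+$/ {print $1}'
}

# apply_to_jobs CMD [ARGS...]: read job IDs on stdin and run "CMD ARGS ID"
# once per ID. With DRY_RUN=1 (the default in this sketch) the command is
# only printed, not executed.
apply_to_jobs() {
    while read -r id; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "$@" "$id"
        else
            # Detach stdin so the command cannot consume the remaining IDs.
            "$@" "$id" </dev/null
        fi
    done
}

# Usage (dry run first, then for real):
#   bjobs -u abc123 -q normal | job_ids | apply_to_jobs brequeue -H
#   bjobs -u abc123 -q normal | job_ids | DRY_RUN=0 apply_to_jobs brequeue -H
```

Defaulting to the dry run means a forgotten variable prints commands rather than requeuing jobs.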
Batch modification of many jobs
- Use "bjobs" and "awk" to filter job numbers - select only the specific user and specific queue
- Check status
bjobs -u abc123 -q normal
- Once all jobs show the PSUSP state, modify the parameters
- E.g., change from the normal queue to the big queue
- Note: Best practice is to dry-run the command first using echo, then execute for real, then verify the changes are successfully applied before moving on to the next command
bjobs -u abc123 -q normal | awk 'NR > 1 && $3 ~ /PSUSP/ {print $1}' | xargs -I '{}' echo bswitch big '{}'
bjobs -u abc123 -q normal | awk 'NR > 1 && $3 ~ /PSUSP/ {print $1}' | xargs -I '{}' bswitch big '{}'
- E.g., increase the memory reservation (the rusage string is quoted so that the shell does not treat the square brackets as a filename pattern)
bjobs -u abc123 -q big | awk 'NR > 1 && $3 ~ /PSUSP/ {print $1}' | xargs -I '{}' echo bmod -M 20000 -R 'rusage[mem=20000]' '{}'
bjobs -u abc123 -q big | awk 'NR > 1 && $3 ~ /PSUSP/ {print $1}' | xargs -I '{}' bmod -M 20000 -R 'rusage[mem=20000]' '{}'
- Finally, verify on a sample of jobs that the new settings have been applied, then resume the jobs
bjobs -l <JOBID>
bjobs -u abc123 -q big | awk 'NR > 1 && $3 ~ /PSUSP/ {print $1}' | xargs -I '{}' echo bresume '{}'
bjobs -u abc123 -q big | awk 'NR > 1 && $3 ~ /PSUSP/ {print $1}' | xargs -I '{}' bresume '{}'
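The "check status" and "once all jobs show as PSUSP" steps can be made less error-prone with a summary of the STAT column, rather than reading a long job list by eye. The following is a sketch under stated assumptions: the function names state_counts and all_in_state are hypothetical, and both assume the default bjobs column layout, with the job ID in column 1 and the state in column 3.

```shell
# state_counts: read default-format `bjobs` output on stdin and print
# "STATE COUNT" for each value found in the STAT column, sorted by state.
state_counts() {
    awk 'NR > 1 && $1 ~ /^[0-9]+$/ {n[$3]++}
         END {for (s in n) print s, n[s]}' | sort
}

# all_in_state STATE: succeed (exit status 0) only if at least one job is
# listed on stdin and the STAT column of every job equals STATE exactly.
all_in_state() {
    awk -v want="$1" '
        NR > 1 && $1 ~ /^[0-9]+$/ { total++; if ($3 == want) ok++ }
        END { exit (total > 0 && ok == total) ? 0 : 1 }'
}

# Usage:
#   bjobs -u abc123 -q normal | state_counts
#   bjobs -u abc123 -q normal | all_in_state PSUSP && echo "all jobs held"
```

all_in_state compares the state exactly, whereas the pipelines above use a pattern match on PSUSP; an empty listing counts as a failure, so a mistyped user or queue name is not mistaken for success.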