Adopt-a-Node Priority Resources

Adopted nodes

The ERISTwo Cluster includes both "general" nodes, available on a fair-share basis to all investigators, and "adopted" nodes, where priority is given to the investigator or group who purchased them. Adopt-a-node owners can also use all shared resources, while incurring zero ($0) support costs for maintenance of their nodes. When idle, these dedicated nodes may share compute capacity with the general pool, benefiting the Mass General Brigham research community as a whole.

Where resource isolation is required for clinical workflows, ERISTwo supports "private clusters" where storage and compute are separated from the main cluster. Contact Scientific Computing for details.

Adopt-a-node purchasing

There is a one-time cost for adopting a node; no maintenance charges will be passed on after the initial purchase. In some cases a one-time networking charge will also apply. To adopt a node, complete this Request Form.

Compute node models are of the hardware class described here: HP BL460c Generation 10 blades. The most recent prices appear in the Request Form, last updated August 6, 2018.

These models have two of the latest Intel processors. Other specifications are available on request. ERIS guarantees a minimum 4.5-year lifetime for adopted nodes from the time they join the compute cluster. After this period, they may remain as dedicated nodes or be decommissioned or repurposed by ERIS.

Access and status information

  • Priority access to adopted nodes is controlled through membership of unix groups - type "groups" to see which groups you are a member of:
groups
acil eris1 PosixUsers
  • All available adopted-node groups, and the access controls that apply to each, can be seen by typing "bsla".
  • Once you know the name of your Service Class, just type "bsla mygroup_sc".  This sample output shows two nodes guaranteed under the Service Class, of which 1 is currently in use.  Access in this case is controlled by membership of the "acil" user group.
bsla acil_sc
SERVICE CLASS NAME:  acil_sc
 -- Service class for ACIL group
ACCESS CONTROL:  USERS[acil/ lsfadmins/]
AUTO ATTACH:  Y
GOAL:  GUARANTEE
                       GUAR    GUAR  TOTAL
POOL NAME       TYPE   CONFIG  USED  USED
acilpool        hosts  2       1     1
  • To see how busy your adopted nodes (your host group) are, type "bhosts mygroup_hg" - using the name assigned to your lab group in place of "mygroup". This example shows node cmu063 with 6 jobs running:
bhosts acil_hg
HOST_NAME   STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
cmu063      ok      -     12   6      6    0      0      0
cmu064      ok      -     12   0      0    0      0      0
  • To see what jobs are submitted (either running or pending) under your Service Class, type "bjobs -u all -sla mygroup_sc" (all of these checks are gathered in the sketch after this list):
bjobs -u all -sla acil_sc 
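Putting these together, a quick status check for your adopted nodes might look like the following sketch, assuming a hypothetical group named "mylab" - substitute your own group name throughout:

groups                        # confirm membership of the mylab unix group
bsla mylab_sc                 # guaranteed vs. used slots under the Service Class
bhosts mylab_hg               # per-node job counts on the adopted nodes
bjobs -u all -sla mylab_sc    # running and pending jobs under the Service Class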

Submitting jobs

See the page on Selecting a queue. Provided that your account is a member of the appropriate group, simply submit jobs to any of the queues listed in the "Priority node allocation" section of that page. When the priority node group is busy, jobs will be scheduled on the next available node, either in the priority group or the general pool. For example, using the medium queue:

bsub -q medium < myscript.lsf 

To force submission to your host group and not the general nodes, use

bsub -q medium -m mygroup_hg < myscript.lsf
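For reference, "myscript.lsf" is an ordinary LSF job script. A minimal sketch is shown below; the job name, core count, output paths, and the command being run are illustrative only:

#!/bin/bash
#BSUB -J myanalysis            # job name (illustrative)
#BSUB -n 1                     # number of cores requested
#BSUB -o myanalysis.%J.out     # standard output; %J expands to the job ID
#BSUB -e myanalysis.%J.err     # standard error

./run_analysis.sh              # hypothetical analysis command

The queue, host group, or Service Class can then be given on the bsub command line as in the examples above.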


Understanding Service Classes (SLAs)

Service Classes govern access to the adopted nodes.  In most cases the Service Class for your node(s) automatically accepts jobs from people who are members of the associated group.

Specifying the Service Class

When nodes are not configured to automatically accept jobs - in which case the output of the "bsla" command (see above) includes the line "AUTO ATTACH: N" - you must specify the Service Class in order to use them, e.g.:

bsub -q long -sla mylab_sc < myscript.lsf

Similarly, if you belong to two lab groups that each have adopted nodes, specify the Service Class under which you want your jobs to run.
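As a sketch, a member of two hypothetical lab groups "lab1" and "lab2" (illustrative names only) would choose between:

bsub -q medium -sla lab1_sc < myscript.lsf
bsub -q medium -sla lab2_sc < myscript.lsf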

Access to ERIS Service Classes

The general pool of nodes on the cluster also uses Service Classes to better schedule certain types of jobs. When your adopt-a-node group is busy, your jobs can run on general cluster nodes instead, but they do so outside of the ERIS Service Classes. If you anticipate your adopted nodes being busy and wish to run jobs on the general nodes, use the following Service Classes:

Queue      Service class
medium     erismedium_sc
normal     erismedium_sc
big        erisbig_sc
big-multi  erisbigmulti_sc

Specifying these Service Classes is not required for jobs to run on the ERIS general nodes, but it helps improve scheduling time.
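For example, to submit to the big queue with its matching Service Class from the table above:

bsub -q big -sla erisbig_sc < myscript.lsf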

Priority node loan policy

When idle, some job slots on adopted nodes are loaned to the "vshort" and "rerunnable" queues. Both queues release job slots in under 15 minutes when jobs are submitted under the SLA, ensuring priority access for node owners.
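To see any loaned jobs currently running on your nodes, you can filter jobs by your host group - a sketch, substituting your own group name for "mylab":

bjobs -u all -m mylab_hg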

Node administration

You can request that nodes stop accepting new jobs; for example, if a problem with a node is causing jobs to fail, you may wish to close it. Request changes by emailing @email.

Notes

Currently the compute cluster schedules jobs on physical servers. ERISTwo may transition to a cloud computing model in which jobs are scheduled on virtual servers; at that time, adopt-a-node owners will be allocated priority on virtual resources equivalent to their physical adopt-a-node capacity.

