ERIS Scientific Computing has implemented a new deep learning GPU Cluster, ERISXdl (ERIS Extreme Deep Learning). This system is built with NVIDIA DGX-1, an integrated system that includes high-performance GPU interconnects which deliver industry-leading performance for AI and deep learning. The current system includes 5 nodes each containing 8 X NVIDIA Tesla V100 GPUs with an aggregate of more than 200 thousand CUDA cores, 25600 Tensor cores, 1280 GB of GPU memory, and 35TB of local storage for data processing, to allow the user to more quickly train larger models as part of a wider series of experiments.
 
Increased productivity and performance benefits come from the fact that ERISXdl is an integrated and NVIDIA-supported hardware-software system that is tuned for deep learning. This platform is therefore ideal for addressing deeper and more complex neural networks which offer dramatic increases in accuracy but at a cost of longer compute times and increased latency. In general ERISXdl represents a good fit for the requirements of Deep Learning and Neural Network modelling more generally. 
 
ERISXdl platform provides:
  • Efficient, high-bandwidth streaming of training data. Each system comes configured with a single 480 GB boot OS SSD, and four 1.92 TB SAS SSDs (7.6 TB total) configured as a RAID 0 striped volume for high-bandwidth performance.
  • Multi-gpu and multi-system with GPU performance designed for HPC and Deep Learning applications. Multi-system scaling of Deep Learning computational workloads, both inside the system and between systems, to match the significant GPU performance of each system.
  • The system memory capacity is higher than the GPU memory capacity to enable simplified buffer management and balance for deep learning workloads.
  • Kubernetes and Docker containerized environments to easily emulate the entire software workflow and maintain portability and reproducibility.
  • Jupyter notebooks for rapid development, integration with Github and HPC scheduler Slurm to distribute the workload across the system.
  • Access to high-bandwidth, low-latency Briefcase storage.

 

ERISXdl Pricing

The pricing of GPU usage has been calculated based upon the initial capital cost for the ERISXdl platform in addition to ongoing maintenance costs. Groups will be charged a flat fee of $0.01 /min (as of Nov 1st 2023) for each GPU requested in a given SLURM job. However, in order to allow for initial testing and debugging of code a free tier called the Basic partition has been created to allow users to run test jobs of up to 10min free of charge. More details about the available partitions can be found at ERISXdl: Using SLURM Job Scheduler .

A HARBOR account will be opened for the PAS group to allow its members to store customized Docker Images that are suitable for running on the dgx compute nodes in Slurm jobs. For convenience, the name of the harbor account will be the lowercase form of the PAS group name. Base images are available from Nvidia's catalog (see below) and which contains a wide selection of popular deep learning applications tuned for GPU use. To learn more about Harbor and using Docker containers on ERISXdl with Slurm, please see the articles linked below.

Please note that fees (per min of GPU time) are reviewed annually and, if necessary, will be updated from October 1st of that year. 

 

Deep Learning Frameworks

Deep learning is a subset of AI and machine learning that uses multi-layered artificial neural networks to deliver state-of-the-art accuracy. GPU-accelerated deep learning frameworks offer flexibility to design and train custom deep neural networks and provide interfaces to commonly-used programming languages.

Users get convenient access to NVIDIA optimized deep learning framework containers with deep learning examples, that are performance tuned and tested for NVIDIA GPUs.

 
Deep Learning Frameworks: 
  • Caffe/Caffe2
  • Microsoft Cognitive Toolkit
  • Pytorch
  • TensorFlow
  • mxnet
  • theano
  • Torch

Take a look at the NVIDIA Catalog to find all the available applications. 

 

Support of Containerized Environments

 
Containers (Docker) available for the ERISXdl system include multiple optimized deep learning frameworks, the NVIDIA DIGITS deep learning training application, third-party accelerated solutions, and the NVIDIA CUDA Toolkit.
 
Advantages of containerized software architecture : 
  • Each deep learning framework is deployed in a separate container and where each framework can use different versions of libraries libc, cuDNN, etc., without interaction from the host operating system or other containers. 
  • Deep learning frameworks in the NVIDIA Docker containers are automatically configured to use parallel routines optimized for the Tesla V100 GPU architecture in the ERISXdl system.
  • As deep learning frameworks are improved for performance or have bug fixes, new versions of the containers are made available in the NVIDIA Container Registry. 
  • Reproducibility is a key advantage of containerization and particularly in research.
  • Containers require fewer system resources than traditional or hardware virtual machine environments since they don't include operating system images and are able to distribute the resources more efficiently.

Users can additionally deploy customized Docker images, possibly resulting from a collaboration, on ERISXdl and so assist in faster model development. Some care is required in these cases however to ensure that the GPUs on the compute nodes are optimally utilized.

 

SLURM Scheduler for Resource Management

 

ERISXdl employs the job scheduler Slurm to manage user access to the cluster's compute resources. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. In order to ensure a robust operation the Slurm controller itself runs inside a kubernetes pod with two-fold redundancy. By this means a failed Slurm control pod can rapidly regenerate on any of the three available login nodes, erisxdl1, erisxdl2 or erisxdl3. clusters.

 

Application Procedure for the ERISXdl platform

We first note that only Groups with fund numbers will be allowed to use Harbor and Slurm on the ERISXdl platform.

Then, the procedure for application is:

  • Confirm your group has a PAS group, and if not follow the registration procedure described here, so that the PI/keygiver authorizes users of the PAS group to access the charged Slurm partitions.
  • Complete the application form for erisxdl services here with a suitable fund number
  • Confirm members of the PAS group have an SciC Linux Clusters Account if they wish to use Harbor and Slurm on ERISXdl. Please note, users registered for erisxdl services get access to the free tier called the Basic partition.

 

Using ERISXdl

For more information about getting started with ERISXdl, please see these articles:

 

Go to KB0038510 in the IS Service Desk