ERISXdl Linux GPU Platform

ERIS Scientific Computing has implemented a new deep learning GPU cluster, ERISXdl (ERIS Extreme Deep Learning). The system is built on NVIDIA DGX-1, an integrated system with a high-performance GPU interconnect that delivers industry-leading performance for AI and deep learning. It currently comprises 5 units, each containing 8 NVIDIA Tesla V100 GPUs, for an aggregate of more than 200 thousand CUDA cores, 25,600 Tensor cores, 1,280 GB of GPU memory, and 35 TB of local storage for data processing. This lets you experiment faster, train larger models, and get insights sooner.
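
The aggregate figures above follow from simple multiplication. The sketch below reproduces them, assuming NVIDIA's published per-GPU specifications for the 32 GB Tesla V100 (5,120 CUDA cores, 640 Tensor cores); those per-GPU values and the function name are assumptions for illustration and are not stated in this article.

```python
# Per-GPU figures: NVIDIA's published Tesla V100 (32 GB) specifications.
# These are assumed here; this article only gives the aggregate totals.
CUDA_CORES_PER_GPU = 5120
TENSOR_CORES_PER_GPU = 640
MEMORY_GB_PER_GPU = 32

UNITS = 5          # DGX-1 units in ERISXdl (from the article)
GPUS_PER_UNIT = 8  # V100 GPUs per unit (from the article)


def aggregate_resources(units: int = UNITS, gpus_per_unit: int = GPUS_PER_UNIT) -> dict:
    """Return cluster-wide GPU totals for the given number of units."""
    gpus = units * gpus_per_unit
    return {
        "gpus": gpus,
        "cuda_cores": gpus * CUDA_CORES_PER_GPU,
        "tensor_cores": gpus * TENSOR_CORES_PER_GPU,
        "gpu_memory_gb": gpus * MEMORY_GB_PER_GPU,
    }


if __name__ == "__main__":
    for name, value in aggregate_resources().items():
        print(f"{name}: {value}")
```

With 5 units this gives 40 GPUs, 204,800 CUDA cores, 25,600 Tensor cores, and 1,280 GB of GPU memory, matching the totals quoted above.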
 
Further productivity and performance benefits come from ERISXdl being an integrated system with a complete, optimized deep learning software platform supported by NVIDIA. As neural networks get deeper and more complex, they provide a dramatic increase in accuracy, but training these higher-accuracy networks requires much more computation time, and their complexity increases prediction latency. ERISXdl provides an environment suited to the requirements of deep learning and other neural network models.
 
The ERISXdl platform provides:
  • Efficient, high-bandwidth streaming of training data. Each system comes configured with a single 480 GB boot OS SSD, and four 1.92 TB SAS SSDs (7.6 TB total) configured as a RAID 0 striped volume for high-bandwidth performance.
  • Multi-GPU and multi-system scaling designed for HPC and deep learning applications. Deep learning workloads can scale both within a system and across systems, to match the significant GPU performance of each system.
  • The system memory capacity is higher than the GPU memory capacity to enable simplified buffer management and balance for deep learning workloads.
  • Kubernetes and Docker containerized environments to easily emulate the entire software workflow and maintain portability and reproducibility.
  • Jupyter notebooks for rapid development, integration with GitHub, and the HPC scheduler Slurm to distribute workloads across the system.
  • Access to high-bandwidth, low-latency Briefcase storage.
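
The RAID 0 capacity mentioned in the list above can be checked with a short calculation. This is a minimal sketch, assuming the usual RAID 0 rule that a striped volume offers the drive count times the smallest drive; the function name is hypothetical. It computes raw capacity only, so the results come out slightly above the rounded totals quoted in this article.

```python
from decimal import Decimal


def raid0_capacity_tb(drive_sizes_tb) -> Decimal:
    """Raw capacity of a RAID 0 stripe: drive count times the smallest drive."""
    sizes = [Decimal(str(size)) for size in drive_sizes_tb]
    if not sizes:
        raise ValueError("at least one drive is required")
    return min(sizes) * len(sizes)


if __name__ == "__main__":
    per_system = raid0_capacity_tb([1.92] * 4)  # four 1.92 TB SSDs per system
    print(f"Per system: {per_system} TB")
    print(f"Across 5 systems: {per_system * 5} TB")
```

RAID 0 stripes data across all drives for bandwidth but provides no redundancy, so this local volume is best treated as scratch space for data processing.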

 

ERISXdl Pricing

GPU usage pricing is based on the initial capital cost of the ERISXdl platform. To allow for initial testing, however, there is a free tier, the Basic partition in Slurm, which permits 10-minute single-GPU jobs up to a maximum of 20 jobs per month. On other partitions, GPU usage is charged at a rate of $0.008 per GPU per minute.
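
To get a feel for these numbers, the sketch below estimates the charge for a job and checks whether a job would fit the free tier. It assumes billing is strictly linear in GPU-minutes with no rounding or minimum charge, which this article does not confirm; the function and constant names are hypothetical.

```python
from decimal import Decimal

RATE_PER_GPU_MINUTE = Decimal("0.008")  # USD per GPU per minute (from the article)
FREE_TIER_MAX_GPUS = 1                  # single-GPU jobs only
FREE_TIER_MAX_MINUTES = 10              # per job
FREE_TIER_MAX_JOBS_PER_MONTH = 20


def estimate_cost(gpus: int, minutes: int) -> Decimal:
    """Estimated charge in USD, assuming linear billing per GPU-minute."""
    if gpus < 1 or minutes < 0:
        raise ValueError("gpus must be >= 1 and minutes must be >= 0")
    return RATE_PER_GPU_MINUTE * gpus * minutes


def fits_free_tier(gpus: int, minutes: int, jobs_already_run_this_month: int) -> bool:
    """True if a job stays within the Basic partition limits described above."""
    return (
        gpus <= FREE_TIER_MAX_GPUS
        and minutes <= FREE_TIER_MAX_MINUTES
        and jobs_already_run_this_month < FREE_TIER_MAX_JOBS_PER_MONTH
    )


if __name__ == "__main__":
    print(f"8 GPUs for 1 hour:  ${estimate_cost(8, 60):.2f}")
    print(f"1 GPU for 24 hours: ${estimate_cost(1, 24 * 60):.2f}")
    print(f"Fits free tier (1 GPU, 10 min, 5 jobs so far): {fits_free_tier(1, 10, 5)}")
```

Under the linear-billing assumption, a full 8-GPU unit costs $3.84 per hour and a single GPU costs $11.52 per day.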

A Harbor account will be opened for the PAS group so that its members can access NVIDIA-specific Podman containers that enable optimal use of ERISXdl's GPUs in Slurm jobs. To learn more about Harbor and using Docker containers on ERISXdl with Slurm, please see the articles linked below.

 

Deep Learning Frameworks

Deep learning is a subset of AI and machine learning that uses multi-layered artificial neural networks to deliver state-of-the-art accuracy. GPU-accelerated deep learning frameworks offer flexibility to design and train custom deep neural networks and provide interfaces to commonly-used programming languages.

Users get easy access to NVIDIA-optimized deep learning framework containers, with deep learning examples, that are performance-tuned and tested for NVIDIA GPUs.

 
Deep Learning Frameworks: 
  • Caffe/Caffe2
  • Microsoft Cognitive Toolkit
  • PyTorch
  • TensorFlow
  • MXNet
  • Theano
  • Torch

Take a look at the NVIDIA Catalog to find all the available applications. 

 

Support of Containerized Environments

 
Containers (Docker) available for the ERISXdl system include multiple optimized deep learning frameworks, the NVIDIA DIGITS deep learning training application, third-party accelerated solutions, and the NVIDIA CUDA Toolkit.
 
Advantages of a containerized software architecture:
  • Each deep learning framework is deployed in a separate container, so each framework can use different versions of libraries such as libc, cuDNN, and others, and run completely independently without interference.
  • Deep learning frameworks in the NVIDIA Docker containers are automatically configured to use parallel routines optimized for the Tesla V100 GPU architecture in the ERISXdl system.
  • As deep learning frameworks are improved for performance or bug fixes, new versions of the containers are made available in the NVIDIA Container Registry. 
  • Reproducibility is a key advantage of containerization, especially in research.
  • Containers require fewer system resources than traditional or hardware virtual machine environments because they don't include operating system images and can share host resources more efficiently.

Users can also bring their own prepared Docker environments and deploy them on ERISXdl, which helps improve collaboration workflows and speeds up model development.

 

SLURM Scheduler for Resource Management

 

To provide high flexibility for development on ERISXdl, the Slurm resource manager has been implemented; it provides access to the Kubernetes environment in which Docker containers are deployed. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. We have set up a free tier partition (Basic) that any Scientific Computing (SciC) Linux Clusters user can use to try the ERISXdl platform and to test and debug their workflows.

Most Slurm partitions carry a usage cost, and partition limits are in place to ensure that no job runs for an extended time with uncontrolled costs.
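
As an illustration of how a job might be sized for the free tier, the sketch below generates a Slurm batch script using standard `sbatch` directives (`--partition`, `--gres`, `--time`). The partition name `Basic` comes from this article, but its exact spelling on the cluster, the GRES name `gpu`, and the helper function itself are assumptions; the articles linked from this page describe how containers are actually launched on ERISXdl.

```python
def build_batch_script(job_name: str, command: str, partition: str = "Basic",
                       gpus: int = 1, minutes: int = 10) -> str:
    """Return the text of a Slurm batch script (hypothetical helper).

    Defaults mirror the free tier limits: one GPU for ten minutes.
    """
    if gpus < 1 or minutes < 1:
        raise ValueError("gpus and minutes must be positive")
    hours, mins = divmod(minutes, 60)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",   # partition name assumed from the article
        f"#SBATCH --gres=gpu:{gpus}",         # GRES name 'gpu' is an assumption
        f"#SBATCH --time={hours:02d}:{mins:02d}:00",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # nvidia-smi simply lists the GPUs visible to the job.
    print(build_batch_script("gpu-test", "nvidia-smi"), end="")
```

The generated text can be saved to a file and submitted with `sbatch`. Setting an explicit `--time` limit is also a simple way to cap the cost of a job on the charged partitions.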

 

Application Procedure for the ERISXdl platform

Please note that only groups with fund numbers will be allowed to use Harbor and Slurm on the ERISXdl platform.

The application procedure is:

  • Confirm your group has a PAS group, or register one here, so that the PI/keygiver can authorize users of the PAS group to access the charged Slurm partitions.
  • Register a chargeback group account (here) with a fund number.
  • Confirm that members of the PAS group have a SciC Linux Clusters account if they wish to use Harbor and Slurm on ERISXdl. Please note that all cluster users get access to the free tier, the Basic partition.

 

Using ERISXdl

For more information about getting started with ERISXdl, please see these articles:

 

Go to KB0038510 in the IS Service Desk