ERISXdl Linux GPU Platform

ERIS Scientific Computing has implemented a new deep learning GPU cluster, ERISXdl (ERIS Extreme Deep Learning). The system is built on NVIDIA DGX-1, an integrated system with high-performance GPU interconnects that delivers industry-leading performance for AI and deep learning. It currently includes 5 units, each containing 8 NVIDIA Tesla V100 GPUs, for an aggregate of more than 200,000 CUDA cores, 25,600 Tensor cores, 1,280 GB of GPU memory, and 35 TB of local storage for data processing, allowing you to experiment faster, train larger models, and get to insights sooner in this new environment.
Further productivity and performance benefits come from the fact that ERISXdl is an integrated system with a complete, optimized software platform for deep learning supported by NVIDIA. As neural networks get deeper and more complex they deliver a dramatic increase in accuracy, but training these higher-accuracy networks requires much more computation time, and their complexity increases prediction latency. ERISXdl provides an environment that fits the requirements of deep learning and other neural network models.
The ERISXdl platform provides:
  • Efficient, high-bandwidth streaming of training data. Each system comes configured with a single 480 GB boot OS SSD and four 1.92 TB SAS SSDs (7.68 TB total) configured as a RAID 0 striped volume for high-bandwidth performance.
  • Multi-GPU and multi-system scaling designed for HPC and deep learning applications, allowing deep learning workloads to scale both within a system and across systems to match the significant GPU performance of each unit.
  • System memory capacity larger than the GPU memory capacity, enabling simplified buffer management and a balanced configuration for deep learning workloads.
  • Kubernetes and Docker containerized environments to easily encapsulate the entire software workflow and maintain portability and reproducibility.
  • Jupyter notebooks for rapid development, integration with GitHub, and the HPC scheduler Slurm to distribute workloads across the system.
  • Access to high-bandwidth, low-latency Briefcase storage.

Deep Learning Frameworks

Deep learning is a subset of AI and machine learning that uses multi-layered artificial neural networks to deliver state-of-the-art accuracy. GPU-accelerated deep learning frameworks offer flexibility to design and train custom deep neural networks and provide interfaces to commonly-used programming languages.

Users get easy access to NVIDIA-optimized deep learning framework containers with deep learning examples that are performance-tuned and tested for NVIDIA GPUs.
Available deep learning frameworks include:
  • Caffe/Caffe2
  • Microsoft Cognitive Toolkit
  • PyTorch
  • TensorFlow
  • MXNet
  • Theano
  • Torch

Take a look at the NVIDIA NGC Catalog to find all the available applications.
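
As a quick sanity check, the short sketch below uses PyTorch (one of the frameworks listed above) to verify that the GPUs are visible and usable; it should behave similarly in any of the CUDA-enabled framework environments, and the matrix size used is arbitrary.

import torch

# Minimal check that a framework can see and use the node's GPUs (PyTorch shown as an example).
if torch.cuda.is_available():
    print(f"CUDA devices visible: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"  [{i}] {torch.cuda.get_device_name(i)}")  # e.g. a Tesla V100 on ERISXdl
    # Run a small matrix multiplication on GPU 0 to confirm the device is usable.
    x = torch.randn(1024, 1024, device="cuda:0")
    print("GPU compute OK, result norm:", (x @ x).norm().item())
else:
    print("No CUDA devices visible - check the container or CUDA runtime configuration.")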

Support of Containerized Environments

Containers (Docker) available for the ERISXdl system include multiple optimized deep learning frameworks, the NVIDIA DIGITS deep learning training application, third-party accelerated solutions, and the NVIDIA CUDA Toolkit.
Advantages of the containerized software architecture:
  • Each deep learning framework is deployed in a separate container, so each framework can use different versions of libraries such as libc and cuDNN and run completely independently, without interference.
  • Deep learning frameworks in the NVIDIA Docker containers are automatically configured to use parallel routines optimized for the Tesla V100 GPU architecture in the ERISXdl system.
  • As deep learning frameworks are improved for performance or bug fixes, new versions of the containers are made available in the NVIDIA Container Registry. 
  • Reproducibility is a key advantage of containerization, especially in research.
  • Containers require fewer system resources than traditional or hardware virtual machine environments because they do not include full operating system images and share the underlying resources more efficiently.

Users can also bring prepared Docker environments to deploy on ERISXdl, improving their collaboration workflows and helping them develop models faster.
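
As an illustration only, the sketch below uses the Docker SDK for Python (pip install docker) to pull an NGC framework image and run a quick GPU check inside it on a local Docker host with the NVIDIA Container Toolkit installed; the image tag and options are assumptions for pre-testing a container you plan to bring, not the ERISXdl deployment procedure itself.

import docker

client = docker.from_env()
image = "nvcr.io/nvidia/pytorch:23.10-py3"   # assumed NGC image tag, for illustration only

# Pull the image and run a one-line GPU check inside it, exposing all GPUs to the container.
client.images.pull(image)
logs = client.containers.run(
    image,
    'python -c "import torch; print(torch.cuda.device_count())"',
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    remove=True,
)
print(logs.decode())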

SLURM Scheduler for Resource Management

To provide high flexibility for development on the ERISXdl system, the resource manager Slurm has been implemented; it provides access to the Kubernetes environment in which Docker containers are deployed. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. We have set up a free-tier partition (Basic) that any ERISOne user can use to try the ERISXdl platform and to test and debug their workflow.

There is a cost associated with the use of most Slurm partitions, and partition limits are in place to ensure that no job runs for an extended time without its cost being controlled.
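
As an illustration of testing a workflow on the free tier, the sketch below composes a minimal batch script for the Basic partition and submits it with sbatch; the directives shown are generic Slurm options, and the exact options required on ERISXdl (for example, how a container image is selected) may differ.

import subprocess
import textwrap

# Compose a minimal test job for the free-tier Basic partition (limits as described in the cost section).
batch_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --partition=Basic      # free-tier partition
    #SBATCH --gres=gpu:1           # free tier allows 1 GPU per job
    #SBATCH --time=00:10:00        # free tier allows 10-minute jobs
    #SBATCH --job-name=erisxdl-test

    nvidia-smi                     # confirm the allocated GPU is visible
""")

with open("erisxdl_test.sh", "w") as f:
    f.write(batch_script)

# Submit the job and print Slurm's response (e.g. "Submitted batch job 12345").
result = subprocess.run(["sbatch", "erisxdl_test.sh"], capture_output=True, text=True)
print(result.stdout or result.stderr)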

User and Groups Requirements

Here are the requirements to use the ERISXdl platform:

  • ERISOne Account. All cluster users get access to the free tier; however, contact Scientific Computing at hpcsupport@partners.org to let us know your interest, and we can follow up on your requirements during the early adopter phase.
  • Chargeback Group Account (Link). Each group needs to request a PAS group to regulate access; the PI/keygiver can authorize users for access to the charged Slurm partitions.
  • Chargeback group access registration (Link). Users must register their authorization in order to be added to the group invoice process.

(Registration links will be available once the platform is open for production.)

ERISXdl Cost (During production)

(The platform will be open to early adopters for free until it is ready for production.)

The cost of ERISXdl usage has been calculated based on the capital cost. There is a free tier on the Basic partition in Slurm, which allows 10-minute jobs using 1 GPU, with a monthly quota of 20 jobs. On any other partition, the cost per job is calculated from the job runtime and GPU usage at $0.008 per GPU-minute.
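
For example, the sketch below estimates the charge for a job on a paid partition, assuming the charge scales linearly with runtime and number of GPUs at the $0.008 per GPU-minute rate quoted above.

RATE_PER_GPU_MINUTE = 0.008  # USD, from the rate quoted above

def estimated_cost(runtime_minutes, gpus):
    """Estimated job cost = runtime in minutes x number of GPUs x rate per GPU-minute."""
    return runtime_minutes * gpus * RATE_PER_GPU_MINUTE

# Example: a 2-hour training job using 4 GPUs.
print(f"${estimated_cost(120, 4):.2f}")  # -> $3.84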