November 29, 2022
- Efficient, high-bandwidth streaming of training data. Each system comes configured with a single 480 GB boot OS SSD, and four 1.92 TB SAS SSDs (7.6 TB total) configured as a RAID 0 striped volume for high-bandwidth performance.
- Multi-GPU and multi-system scaling for HPC and deep learning applications. Deep learning workloads can scale both across the GPUs inside a system and between systems, to match the significant GPU performance of each system.
- The system memory capacity is higher than the GPU memory capacity to enable simplified buffer management and balance for deep learning workloads.
- Kubernetes and Docker containerized environments to easily emulate the entire software workflow and maintain portability and reproducibility.
- Jupyter notebooks for rapid development, integration with GitHub, and the HPC scheduler Slurm to distribute the workload across the system.
- Access to high-bandwidth, low-latency Briefcase storage.
ERISXdl Pricing
The pricing of GPU usage has been calculated based upon the initial capital cost of the ERISXdl platform. To allow for initial testing, however, there is a free tier, called the Basic partition in Slurm, which permits 10-minute single-GPU jobs up to a maximum of 20 jobs per month. For other partitions, GPU usage is charged at a rate of $0.008 per GPU per minute.
A Harbor account will be opened for the PAS group to allow its members to access NVIDIA-specific Podman containers that enable the optimal use of ERISXdl's GPUs in Slurm jobs. To learn more about Harbor and using Docker containers on ERISXdl with Slurm, please see the articles linked below.
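As a rough illustration of the rate above, the following Python sketch estimates the charge for a job and checks whether a job would fit the Basic free tier. It assumes the charge is simply the rate multiplied by the number of GPUs and the number of minutes; how usage is actually metered and rounded is not stated here, and the function names are hypothetical.

```python
from decimal import Decimal

# Figures taken from the pricing text; everything else is an assumption.
RATE_PER_GPU_MINUTE = Decimal("0.008")  # USD per GPU per minute
BASIC_MAX_GPUS = 1                      # Basic partition: single-GPU jobs
BASIC_MAX_MINUTES = 10                  # Basic partition: 10-minute jobs
BASIC_MAX_JOBS_PER_MONTH = 20           # Basic partition: 20 jobs per month


def estimate_gpu_cost(gpus: int, minutes: int) -> Decimal:
    """Estimate the charge in USD, assuming linear billing per GPU-minute."""
    if gpus < 1 or minutes < 0:
        raise ValueError("gpus must be >= 1 and minutes must be >= 0")
    return RATE_PER_GPU_MINUTE * gpus * minutes


def fits_basic_tier(gpus: int, minutes: int, jobs_run_this_month: int) -> bool:
    """Return True if one more job of this size stays within the free-tier limits."""
    return (
        gpus <= BASIC_MAX_GPUS
        and minutes <= BASIC_MAX_MINUTES
        and jobs_run_this_month < BASIC_MAX_JOBS_PER_MONTH
    )


if __name__ == "__main__":
    cost = estimate_gpu_cost(gpus=4, minutes=120)
    print(f"4 GPUs for 120 minutes: ${cost:.2f}")  # prints $3.84
```

Under this assumption, a full hour on one GPU comes to $0.48. `Decimal` is used so that the estimate is not affected by binary floating-point rounding.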
Deep Learning Frameworks
Deep learning is a subset of AI and machine learning that uses multi-layered artificial neural networks to deliver state-of-the-art accuracy. GPU-accelerated deep learning frameworks offer flexibility to design and train custom deep neural networks and provide interfaces to commonly-used programming languages.
- Caffe/Caffe2
- Microsoft Cognitive Toolkit
- PyTorch
- TensorFlow
- MXNet
- Theano
- Torch
Take a look at the NVIDIA Catalog to find all the available applications.
Support of Containerized Environments
- Each deep learning framework is deployed in a separate container, so each framework can use different versions of libraries such as libc and cuDNN and run completely independently, without interference.
- Deep learning frameworks in the NVIDIA Docker containers are automatically configured to use parallel routines optimized for the Tesla V100 GPU architecture in the ERISXdl system.
- As deep learning frameworks are improved for performance or bug fixes, new versions of the containers are made available in the NVIDIA Container Registry.
- Reproducibility is a key advantage of containerization, especially in research.
- Containers require fewer system resources than traditional or hardware virtual machine environments because they don't include operating system images and can distribute resources more efficiently.
Users can also bring prepared Docker environments to deploy on ERISXdl, which can improve their collaboration workflows and help them develop models faster.
SLURM Scheduler for Resource Management
To provide high flexibility for development on the ERISXdl system, the resource manager Slurm has been implemented; it provides access to a Kubernetes environment in which Docker containers can be deployed. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. We have set up a free-tier partition (Basic) that any Scientific Computing (SciC) Linux Clusters user can use to test the ERISXdl platform and to test and debug their workflow.
Most of the Slurm partitions have a cost associated with their usage, and partition limits are in place to ensure that no job runs for an extended time without control of the associated cost.
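To make the idea of a time-limited test job concrete, here is a minimal Python sketch that renders a Slurm batch script for the Basic partition. The partition name and the 10-minute, single-GPU shape come from this article; the helper names, the job name, and the `nvidia-smi` placeholder command are hypothetical. It also assumes the site's Slurm version accepts the standard `--gpus` option of `sbatch`. The container-specific steps required on ERISXdl are not shown.

```python
def format_walltime(minutes: int) -> str:
    """Convert a number of minutes into Slurm's HH:MM:SS time format."""
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}:00"


def render_batch_script(partition: str, gpus: int, minutes: int,
                        command: str, job_name: str = "erisxdl-test") -> str:
    """Build the text of a batch script with an explicit walltime limit."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --gpus={gpus}",            # assumes --gpus is supported
        f"#SBATCH --time={format_walltime(minutes)}",
        "",
        command,                             # placeholder workload
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A job shaped to the free tier: one GPU for at most ten minutes.
    print(render_batch_script("Basic", gpus=1, minutes=10, command="nvidia-smi"))
```

Setting `--time` explicitly lets the scheduler stop the job at a known limit, which bounds the cost on the charged partitions.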
Application Procedure for the ERISXdl platform
We first note that only groups with fund numbers will be allowed to use Harbor and Slurm on the ERISXdl platform.
The application procedure is then:
- Confirm that your group has a PAS group, or register one here, so that the PI/keygiver can authorize users of the PAS group to access the charged Slurm partitions.
- Register a chargeback group account (here) with a fund number.
- Confirm that members of the PAS group have a SciC Linux Clusters account if they wish to use Harbor and Slurm on ERISXdl. Please note that all cluster users get access to the free tier, called the Basic partition.
Using ERISXdl
For more information about getting started with ERISXdl, please see these articles: