January 24, 2022
- Efficient, high-bandwidth streaming of training data. Each system comes configured with a single 480 GB boot OS SSD, and four 1.92 TB SAS SSDs (7.6 TB total) configured as a RAID 0 striped volume for high-bandwidth performance.
- Multi-GPU and multi-system scaling designed for HPC and deep learning applications. Deep learning workloads scale both within a system and across systems, to match the significant GPU performance of each system.
- The system memory capacity is higher than the GPU memory capacity to enable simplified buffer management and balance for deep learning workloads.
- Kubernetes and Docker containerized environments to easily emulate the entire software workflow and maintain portability and reproducibility.
- Jupyter notebooks for rapid development, integration with GitHub, and the HPC scheduler Slurm to distribute workloads across the system.
- Access to high-bandwidth, low-latency Briefcase storage.
Deep Learning Frameworks
Deep learning is a subset of AI and machine learning that uses multi-layered artificial neural networks to deliver state-of-the-art accuracy. GPU-accelerated deep learning frameworks offer the flexibility to design and train custom deep neural networks and provide interfaces to commonly used programming languages.
- Microsoft Cognitive Toolkit
Take a look at the NVIDIA Catalog to find all the available applications.
Support of Containerized Environments
- Each deep learning framework is deployed in a separate container, so each framework can use different versions of libraries such as libc and cuDNN and run completely independently, without interference.
- Deep learning frameworks in the NVIDIA Docker containers are automatically configured to use parallel routines optimized for the Tesla V100 GPU architecture in the ERISXdl system.
- As deep learning frameworks are improved for performance or bug fixes, new versions of the containers are made available in the NVIDIA Container Registry.
- Reproducibility is a key advantage of containerization, especially in research.
- Containers require fewer system resources than traditional or hardware virtual machine environments because they do not include operating system images, and they can distribute resources more efficiently.
Users can also bring their own prepared Docker environments and deploy them on ERISXdl, which improves their collaboration workflows and lets them develop models faster.
SLURM Scheduler for Resource Management
To provide flexibility for development on the ERISXdl system, the resource manager Slurm has been implemented and provides access to a Kubernetes environment in which Docker containers can be deployed. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. We have set up a free-tier partition (Basic) that any ERISOne user can use to try the ERISXdl platform and to test and debug their workflows.
Most Slurm partitions carry a usage cost, and partition limits are in place to ensure that no job runs for an extended time without its cost being controlled.
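As a rough illustration of submitting a short test job to the free-tier partition, the sketch below assembles an `sbatch` command line in Python. The partition name `Basic` and the 10-minute, 1-GPU limits come from this article; requesting the GPU with `--gres=gpu:1`, the job name, and the script name `train_job.sh` are assumptions made for the example, and the actual ERISXdl submission procedure may differ.

```python
import shlex


def build_sbatch_command(script_path, partition="Basic", gpus=1,
                         time_limit="00:10:00", job_name="erisxdl-test"):
    """Return the sbatch argument list for a short GPU test job.

    Assumption: GPUs are requested through the generic-resource flag
    (--gres=gpu:N); the site may expect a different option.
    """
    if gpus < 1:
        raise ValueError("at least one GPU must be requested")
    return [
        "sbatch",
        f"--partition={partition}",   # free-tier partition named in the article
        f"--gres=gpu:{gpus}",         # assumed GPU request syntax
        f"--time={time_limit}",       # HH:MM:SS wall-clock limit
        f"--job-name={job_name}",     # hypothetical job name
        script_path,                  # hypothetical batch script
    ]


# Build the command and print it for review rather than running it.
command = build_sbatch_command("train_job.sh")
print(shlex.join(command))
```

Printing the command first makes it easy to check the requested partition and limits before handing the list to `subprocess.run` on a login node.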
User and Group Requirements
Here are the requirements to use the ERISXdl platform:
- ERISOne Account. All cluster users get access to the free tier; however, please contact Scientific Computing at firstname.lastname@example.org to let us know of your interest so that we can follow up on your requirements during the early adopter phase.
- Chargeback Group Account (Link). Each group needs to request a PAS group for access regulation; the PI/keygiver can then authorize users for access to the charged Slurm partitions.
- User registration of chargeback group access (Link). Users have to register their authorization to be added to the group invoice process.
(Links for registration will be available once the platform opens for production.)
ERISXdl Cost (During production)
(The platform will be open for early adopters for free until it's ready for production.)
The cost of ERISXdl usage has been calculated from the capital cost. A free tier on the Basic partition in Slurm allows 10-minute jobs using 1 GPU, with a monthly quota of 20 jobs. On any other partition, the cost per job is calculated from the job runtime and GPU usage at $0.008 per GPU per minute.
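The pricing rule above can be sketched as a small calculation. The rate and the free-tier limits are taken from this article; the function names are hypothetical, and the assumption that billing scales linearly with runtime (no rounding up to whole minutes, no minimum charge) is ours and should be checked against the actual invoice process.

```python
RATE_PER_GPU_MINUTE = 0.008    # USD, rate quoted in this article
FREE_TIER_MAX_MINUTES = 10     # Basic partition job length limit
FREE_TIER_MAX_GPUS = 1         # Basic partition GPU limit
FREE_TIER_MONTHLY_JOBS = 20    # Basic partition monthly job quota


def job_cost(runtime_minutes, gpus, partition="charged"):
    """Estimate the cost in USD of one job (assumed linear billing)."""
    if runtime_minutes < 0 or gpus < 1:
        raise ValueError("runtime must be >= 0 and gpus >= 1")
    if partition == "Basic":
        return 0.0  # the free tier is not charged
    return round(runtime_minutes * gpus * RATE_PER_GPU_MINUTE, 2)


def fits_free_tier(runtime_minutes, gpus, jobs_this_month):
    """Check whether a job stays within the Basic partition limits."""
    return (runtime_minutes <= FREE_TIER_MAX_MINUTES
            and gpus <= FREE_TIER_MAX_GPUS
            and jobs_this_month < FREE_TIER_MONTHLY_JOBS)


# A 2-hour job on 4 GPUs: 120 min * 4 GPUs * $0.008 = $3.84
print(job_cost(120, 4))
```

Under the same assumption, a full day on a single GPU (1,440 minutes) would come to $11.52.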
Once out of its pilot phase, ERISXdl users will also be charged for container storage on the Harbor registry. To learn more about Harbor and using Docker containers on ERISXdl, see the articles linked below.
For more information about getting started with ERISXdl, please see these articles: