February 21, 2024
Introduction
Python is one of the most popular programming languages for science, engineering, data analytics, and deep learning applications. However, as an interpreted language, it has long been considered too slow for high-performance computing. A GPU Accelerated Computing Linux VDI has recently been added to the Data Enclave environment, which can help address this need for high-performance computing.
GPUs have many more cores than CPUs, so for data-parallel workloads they perform substantially better than CPUs, even though GPUs run at lower clock speeds and lack several of the core-management features of a CPU.
Running a Python script on the GPU can therefore be considerably faster than running it on the CPU. NVIDIA’s CUDA Python provides a driver and runtime API for existing toolkits and libraries to simplify GPU-based accelerated processing. This document provides the know-how and Python examples for running the NVIDIA CUDA Toolkit and cuDNN libraries on a Linux VDI instance.
Requirements or Prerequisites
To get started, the end user needs the following:
- Access to a Linux VDI instance running Ubuntu 20.04 with the following software installed:
- NVIDIA vGPU Driver Version: 460.106.00
- CUDA Version: 11.2
- cuDNN: 8.1.
To run CUDA Python, you’ll need the CUDA Toolkit installed on a system with CUDA-capable GPUs.
- Numba installed: a Python compiler from Anaconda that can compile Python code for execution on CUDA-capable GPUs.
Step-by-Step Instructions
- Installation
You will need to install Anaconda if it is not installed already. You can add Anaconda to your environment as part of the installation.
Here is a Conda setup repo that Rod created: https://gitlab.partners.org/rd482/enclave-conda
The instructions are the same as in Confluence; just copy and paste the command below and it should install everything for you.
$ curl -sfL https://gitlab.partners.org/rd482/enclave-conda/-/blob/main/conda-setup.sh | sh -s
This will install Anaconda3 in your $HOME folder.
Then update your ~/.bashrc file
$ echo "export PATH=${PATH}:${HOME}/anaconda3/bin" >> ~/.bashrc && source ~/.bashrc |
- Alternatively, run the bash installer script to install Anaconda3
With the bash installer script downloaded, run the .sh script to install Anaconda3. Make sure you are in the directory where the installer script was downloaded:
$ ls Anaconda3-2022.05-Linux-x86_64.sh
Run the installer script with bash.
$ bash Anaconda3-2022.05-Linux-x86_64.sh
Accept the License Agreement and allow Anaconda to be added to your PATH. By adding Anaconda to your PATH, the Anaconda distribution of Python will be called when you type $ python in a terminal.
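A quick way to confirm that the Anaconda interpreter is the one being picked up is to check it from Python itself. This is a minimal sketch; the exact path and version string will vary on your instance:

import sys
# should point into your $HOME/anaconda3 installation
print(sys.executable)
# the version string typically mentions the Anaconda build
print(sys.version)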
- Create a Project Environment
Next, create a new environment and activate it.
$ conda create -n GPU_env -y
$ conda activate GPU_env
- Install Python 3.8
$ conda install python=3.8
- Install numba cudatoolkit pyculib
To get started with Numba, the first step is to download and install the Anaconda Python distribution that includes many popular packages (Numpy, SciPy, Matplotlib, iPython, etc.) and “conda,” a powerful package manager.
Once you have Anaconda installed, install the required CUDA packages by typing:
$ conda install numba cudatoolkit pyculib
If the installation fails (pyculib in particular may no longer be available from the default channels), you might have to search for alternative channels or install numba on its own:
$ conda install numba
Check the packages installed.
$ conda list
# packages in environment GPU_env:
#
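You can also confirm that Numba can see the vGPU before writing any scripts. The following is a minimal sketch; the device name and compute capability reported will depend on the vGPU profile assigned to your VDI instance.

from numba import cuda

# prints the driver version and a summary line for each CUDA device found
cuda.detect()

# True if at least one CUDA-capable device is usable from this environment
print("CUDA available:", cuda.is_available())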
- Develop the Python scripts
We will use the numba.jit decorator on the function we want to run on the GPU. The decorator has several parameters, but we will work only with the target backend parameter.
The target backend tells jit which backend to compile the code for ("cpu" or "cuda"); "cuda" corresponds to the GPU.
If "cpu" is passed instead, jit optimizes the code to run faster on the CPU, which also improves its speed.
Create a python script called test_GPU.py with the following code:
from numba import jit, cuda
import numpy as np
# to measure exec time
from timeit import default_timer as timer

# normal function to run on cpu
def func(a):
    for i in range(10000000):
        a[i] += 1

# function optimized to run on gpu
@jit(target_backend='cuda')
def func2(a):
    for i in range(10000000):
        a[i] += 1

if __name__ == "__main__":
    n = 10000000
    a = np.ones(n, dtype=np.float64)

    start = timer()
    func(a)
    print("without GPU:", timer() - start)

    start = timer()
    func2(a)
    print("with GPU:", timer() - start)
- Run the python script
$ python3 test_GPU.py
Output:
without GPU: 3.448683829017682
with GPU: 0.16991339999367483
- Create the test_CUDA_example.py script as below
from __future__ import division
from numba import cuda
import numpy
import math

# CUDA kernel
@cuda.jit
def my_kernel(io_array):
    pos = cuda.grid(1)
    if pos < io_array.size:
        io_array[pos] *= 2  # do the computation

# Host code
data = numpy.ones(256)
threadsperblock = 256
blockspergrid = math.ceil(data.shape[0] / threadsperblock)
my_kernel[blockspergrid, threadsperblock](data)
print(data)
- Run the test cuda example script
$ python3 test_CUDA_example.py
Output:
[2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
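cuda.grid(1) is shorthand for computing the absolute position of the current thread from the built-in block and thread indices. If you want those indices explicitly (for example, as a starting point for 2D data), the same kernel can be written as in the sketch below; this is an illustration, not part of the original example.

from numba import cuda
import numpy

@cuda.jit
def my_kernel_explicit(io_array):
    # absolute position = thread index within the block
    # plus the block's offset into the grid
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    pos = tx + bx * bw  # equivalent to cuda.grid(1)
    if pos < io_array.size:
        io_array[pos] *= 2

data = numpy.ones(256)
my_kernel_explicit[1, 256](data)  # one block of 256 threads
print(data[:8])  # expected: [2. 2. 2. 2. 2. 2. 2. 2.]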
Programming Numba for CUDA
GitHub - ContinuumIO/gtc2017-numba: Numba tutorial for GTC 2017 conference
https://github.com/ContinuumIO/gtc2017-numba.git
Deep Learning
Deep Learning | NVIDIA Developer
Deep learning differs from traditional machine learning techniques in that deep learning models can automatically learn representations from data such as images, video, or text, without hand-coded rules or human domain knowledge. Their highly flexible architectures can learn directly from raw data and can increase their predictive accuracy when provided with more data.
Deep learning is commonly used across apps in computer vision, conversational AI and recommendation systems.
For developers integrating deep neural networks into cloud-based or embedded applications, the Deep Learning SDK includes high-performance libraries that implement building-block APIs for incorporating training and inference directly into their apps. With a single programming model for all GPU platforms, from desktop to data center to embedded devices, developers can start development on their desktop, scale up in the cloud, and deploy to their edge devices with minimal to no code changes.
NVIDIA provides optimized software stacks to accelerate training and inference phases of the deep learning workflow. Learn more on the links below.
PyTorch
PyTorch is a Python package that provides two high-level features:
- Tensor computation (like numpy) with strong GPU acceleration.
- Deep Neural Networks (DNNs) built on a tape-based autograd system.
Reuse your favorite Python packages, such as numpy, scipy and Cython, to extend PyTorch when needed.
PyTorch on NGC | Sample models | Automatic mixed precision
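As a quick illustration of the GPU-accelerated tensor computation described above, the sketch below creates two matrices on the GPU and multiplies them there. It assumes PyTorch has been installed into the conda environment (it is not installed by the steps above).

import torch

# pick the GPU if PyTorch can see one, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# create two matrices on the selected device and multiply them there
a = torch.rand(1000, 1000, device=device)
b = torch.rand(1000, 1000, device=device)
c = a @ b
print(c.shape, c.device)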
TensorFlow
TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) that flow between them. This flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device without rewriting code. For visualizing TensorFlow results, TensorFlow offers TensorBoard, a suite of visualization tools.
TensorFlow on NGC | TensorFlow on GitHub | Sample models | Automatic mixed precision | TensorFlow for JetPack
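A minimal sketch of checking GPU visibility and running an operation in TensorFlow follows. It assumes TensorFlow has been installed into the environment (it is not installed by the steps above).

import tensorflow as tf

# list the GPUs TensorFlow can see through the CUDA/cuDNN stack
print("GPUs:", tf.config.list_physical_devices("GPU"))

# run a simple matrix multiplication; TensorFlow places it on the GPU when one is available
a = tf.random.uniform((1000, 1000))
b = tf.random.uniform((1000, 1000))
c = tf.matmul(a, b)
print(c.shape)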
Model Deployment
For high-performance inference deployment of TensorFlow-trained models:
- Use the TensorFlow-TensorRT integration to optimize and deploy models within TensorFlow.
- Export the TensorFlow model to ONNX and import, optimize, and deploy it with NVIDIA TensorRT, an SDK for high-performance deep learning inference (a sketch of the ONNX export follows this list).
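For the second option, the sketch below exports a small Keras model to ONNX with the tf2onnx package. tf2onnx is an assumption on our part: it is not installed by the steps above and its API may differ between versions, so treat this as an outline rather than a definitive recipe.

import tensorflow as tf
import tf2onnx  # assumed extra dependency, e.g. installed with pip

# a tiny example model; replace this with your trained model
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1),
])

# convert the Keras model to ONNX and write it to disk;
# the resulting model.onnx can then be imported into TensorRT
spec = (tf.TensorSpec((None, 4), tf.float32, name="input"),)
tf2onnx.convert.from_keras(model, input_signature=spec, output_path="model.onnx")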
Limitations
- Only NVIDIA GPUs are supported for now, and only the models on NVIDIA's list of CUDA-enabled GPUs. If your graphics card has CUDA cores, you can proceed with the setup.
- Running a Python script on the GPU can be considerably faster than on the CPU. Note, however, that the data set must first be transferred to the GPU's memory, which takes additional time; for small data sets the CPU may therefore outperform the GPU, as the timing sketch below illustrates.
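To see this overhead directly, you can time the host-to-device copy separately from the computation. The sketch below uses numba.cuda; the numbers will vary with the size of the array and the vGPU profile.

from numba import cuda
import numpy as np
from timeit import default_timer as timer

n = 10000000
a = np.ones(n, dtype=np.float64)

# time the host-to-device transfer on its own
start = timer()
d_a = cuda.to_device(a)
cuda.synchronize()
print("copy to GPU memory:", timer() - start)

# time copying the result back to the host
start = timer()
result = d_a.copy_to_host()
print("copy back to host:", timer() - start)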