In this blog post we look at how to install the NVIDIA CUDA Toolkit and perform its basic setup. In the previous blog post we looked at the typical problems encountered when trying to install the CUDA Toolkit in a virtualized environment with a vGPU. In the follow-on to this post (Part 3) I will detail why this matters for both developers and organizations.
A quick review of Part 1 before we get into the guts of deploying the NVIDIA CUDA Toolkit on a Linux VM. Below is the diagram from the previous post. We can't use the package manager (RPM/Deb) versions of the CUDA Toolkit because they install a prepackaged driver. That means we have to install the toolkit using the run file, which gives us some options around the GPU driver.
So how do we get the NVIDIA CUDA Toolkit installed on a virtual machine? It's a multi-step process rooted in a correctly configured virtual machine (VM). That means we need properly built virtual infrastructure, specifically ESXi hosts with NVIDIA GPUs. Let's start there, but before we do, a couple of quick notes.
- At the time of this writing (October 2017) VMware does not support Pascal vGPUs in Linux desktops running on VMware Horizon. If you need to do this in a supported manner, you will want to use NVIDIA M6, M10, or M60 GPUs.
- I built this in my lab environment, which you can read about here; note that my configuration is not supported by either VMware or NVIDIA. It works for my purposes, but it's not supported.
Now let's get you those three wishes and get started with setting up our hardware.
Physical Hardware
I'm going to assume we will be using currently (October 2017) supported GPUs in our ESXi hosts. That means NVIDIA M6, M10, M60, P4, P40, P6, or P100. If you are still using Kepler GPUs (K1 or K2), these steps should work but haven't been tested and would be unsupported, as the Kepler GPUs have reached end of life. Other GPUs are not currently supported by VMware. You can check whether your GPU is supported in the VMware HCL for Shared Pass-Through Graphics (aka vGPU).
You will need to follow your hardware vendor's instructions for installing the GPUs in the physical servers (ESXi hosts), as hardware vendors are all a bit different, or you may be lucky and find they came pre-installed.
That gets us to the installation of the virtual environment. I won't explain how to install ESXi or add the host to a vCenter instance (or even set up a vCenter environment if you are starting completely from scratch). There are plenty of posts that explain how to do this.
The Base Virtual Environment
For the next three sections I will be summarizing material I presented at the GPU Tech Conference in 2017 (Maxwell-based cards) and at VMworld 2017 (Pascal-based cards) on setting up and configuring vGPUs for a Linux environment. Also, as of this writing (October 13, 2017) it should be noted that VMware does not support the use of Pascal GPUs for Linux virtual desktops; my build, using an NVIDIA P4 GPU, is unsupported. (If you need support for Pascal GPUs with Linux VMs in your environment, be sure to contact your VMware sales representative.)
Now on to configuring the host.
- If we are using the Maxwell GPUs (M6, M10, and M60) we first need to use GPUmodeSwitch and set the card to graphics mode (GTC slides 10 and 11). This is not necessary for Pascal series GPUs.
- At this point we are able to install the Virtual GPU Manager, also known as the VIB. To do this we upload the VIB to a datastore on the ESXi host and install it like any other VIB (GTC slide 12 \ VMworld slide 10); example commands are sketched just after this list.
- If we are using Pascal-based GPUs we will need to change the ECC mode to 0 (disabled) on the GPU. (VMworld slide 11)
- Next we want to set the graphics setting on the ESXi host to Shared Direct (VMworld slide 12)
- We should then check and make sure the GPUs are not enabled for passthrough (GTC slide 13)
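For reference, here is a rough sketch of what those host-side steps look like from the ESXi shell. The datastore path and VIB filename below are placeholders; use the vGPU manager VIB that matches the GRID release you downloaded, and check NVIDIA's documentation for the exact gpumodeswitch procedure on your card.

# (Maxwell only) switch the card from compute mode to graphics mode
gpumodeswitch --gpumode graphics

# put the host in maintenance mode and install the vGPU manager VIB from a datastore
esxcli system maintenanceMode set --enable true
esxcli software vib install -v /vmfs/volumes/datastore1/NVIDIA-vGPU-VMware_ESXi_Host_Driver-<version>.vib
esxcli system maintenanceMode set --enable false

# reboot the host, then confirm the vGPU manager sees the GPU
nvidia-smi

# (Pascal only) disable ECC on the GPU and reboot the host again
nvidia-smi -e 0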
That gets the basics of the virtual environment set up.
VMware Horizon
Now we need to have a quick side chat about virtual desktops as compared to other VMs. To use these virtual desktops to their full capability we need an alternative to the default display adapter to gain access to the VMs; we can't just use the VMware Virtual Console to access the virtual machine. In a few cases it might be acceptable just to let users connect to a development system over SSH, but it will probably be more desirable to access a full GUI.
The simple reason the VMware Virtual Console won't work when a vGPU is used is that, when the virtual machine is configured to use the vGPU, the vGPU is not mapped back to the default console. Think of it like installing a new GPU in a physical desktop: you either plug your monitor into the onboard display adapter or into the video card. The VM's "cable" is plugged into the onboard adapter, not the vGPU. (See diagram on the right.)
So how does one display a desktop to developers? The best way, in my opinion (I work for Dell, a major shareholder of VMware), is to use VMware Horizon, which provides virtual desktop infrastructure (VDI) for displaying desktops. Setup and configuration of VMware Horizon is beyond the scope of this blog. Needless to say, Horizon provides a lot of flexibility and power in the datacenter, and we will be leveraging it for the purposes of this post.
vGPU Licensing
Because of the way we are using the GPU for virtual desktops, we need to license the VMs. This is done through the NVIDIA GRID License Server, which can be set up on either a Linux or a Windows OS. The license server can be downloaded from NVIDIA in the same place you downloaded the VIB and guest OS driver.
Installation is straightforward and outlined in the GRID License Server Release Notes. I also detail the setup and licensing of VMs in GTC slides 15 to 26 and VMworld slides 14 to 19.
You have to set up and use NVIDIA licensing for NVIDIA vGPUs. If you try to run a vGPU without proper licensing, the VM will not function correctly and you will get errors when you attempt to run CUDA applications. The one I see most frequently is CUDA Error code=46 (cudaErrorDevicesUnavailable); it's caused by the VM not having a license.
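For reference, licensing inside the guest is controlled by /etc/nvidia/gridd.conf once the GRID driver is installed. A minimal sketch is below, assuming a license server of your own (the address is a placeholder); check the GRID licensing documentation for the FeatureType value that matches your license edition.

# /etc/nvidia/gridd.conf (in the guest VM)
ServerAddress=gridlicense.example.local
ServerPort=7070
FeatureType=1

# restart the licensing daemon and check the logs for a license checkout
sudo systemctl restart nvidia-gridd
sudo grep -i gridd /var/log/messages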
Virtual Machine Virtual Hardware Configuration
At this point we can build a base virtual machine. After all, we want these desktops to be repeatable, quickly deployable, easily protectable, and, when we are ready to reclaim resources, disposable.
The first task is to build a basic virtual machine. For my initial testing I built a CentOS 7 Linux VM. I patched it, installed base development tools such as gcc, and completed other standard OS deployment operations. Once we have prepared the base VM image, we shut down the guest VM.
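On CentOS 7, that base preparation amounts to something like the following (a rough sketch; package selection will vary with your needs):

# patch the OS and install base development tools (gcc, make, etc.)
sudo yum update -y
sudo yum groupinstall -y "Development Tools"
# open-vm-tools gives us the usual VMware guest integration
sudo yum install -y open-vm-tools
# power the VM off before editing its virtual hardware
sudo shutdown -h now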
- With the guest VM shut down, we will edit the virtual hardware settings for our VM. (GTC slides 28 and 29 \ VMworld slides 22 and 23)
- We will use the new device drop down at the bottom of the edit settings screen to select a Shared PCI Device
- Then we will select the desired vGPU Profile (How much of a vGPU a user can use)
- Lastly we want to click the "Reserve all Memory" button (this is important; otherwise the VM may not power on)
- Then we can power on the VM
- Inside the VM we are going to set up some networking and disable Nouveau; a sample command sequence for the Nouveau and driver steps appears after this list. (GTC slides 32 and 33 \ VMworld slides 24 and 25)
- At this point we can install the NVIDIA GPU drivers. (GTC slides 34 and 35 \ VMworld slides 26 and 27)
- It’s important to note that the driver installed during this step must match the VIB installed in previous steps!
- Once the NVIDIA driver is installed we will install the VMware Horizon Agent on the VM. (GTC slide 35 \ VMworld slide 27)
- At this point we can reboot the VM
- When the VM reboots the VMware Virtual Console of the VM will no longer function and you will need to access the VM using VMware Horizon, SSH, or some other console viewer.
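Here is a rough sketch of the Nouveau and driver steps on CentOS 7. The driver filename is a placeholder; it must be the Linux guest driver from the same GRID release as the VIB installed on the host.

# blacklist the Nouveau driver, rebuild the initramfs, and reboot
cat <<'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
sudo dracut --force
sudo reboot

# after the reboot, install the NVIDIA GRID guest driver from text mode
sudo sh ./NVIDIA-Linux-x86_64-<version>-grid.run
# verify the driver loads and sees the vGPU
nvidia-smi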
At this point I recommend verifying that the VM functions correctly in VMware Horizon by adding it to a dedicated manual desktop pool. If it does we are ready for the next step.
Installing the NVIDIA CUDA Toolkit
Now I will go into a little more detail on the install, since I haven't done a session on setting this up yet. It is important to note that Changjiang's blog on building a deep learning box is what actually helped me figure out how to install the NVIDIA CUDA Toolkit on a virtual machine. As of this blog post I am using version 9.0.176 of the CUDA Toolkit. With that, let's get started.
- Download the run file version of the NVIDIA CUDA Toolkit from the developer site – https://developer.nvidia.com/cuda-downloads
- Perform the pre-installation tasks in the CUDA Toolkit Documentation
- Verify that your VM shows an NVIDIA GPU with the command: lspci | grep -i nvidia
- Verify gcc is installed: gcc --version
- Install the Kernel Headers and Development packages (varies by OS)
- Now skip down to the Runfile Installation section
- Disable Nouveau if you haven’t already
- At this point drop into runlevel 3 (text mode) – when you do this the virtual console will be functional again until you exit the run level.
- As sudo you want to execute the run file: sudo sh ./cuda_<version>_linux.run
- Follow the prompts on screen
- When asked to install the GPU driver, enter No (N); this is the most important part of this process.
The reason is that if you select Yes, the installer will overwrite the already installed driver with the driver included in the package, and, as noted earlier, the driver version in the VM has to match the VIB version.
- Finish answering the prompts and complete the installation of the run file
- At this point we can proceed to the post-installation steps
- Add /usr/local/cuda-9.0/bin to the PATH variable:
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
- We then need to add the 64-bit library path to the LD_LIBRARY_PATH variable:
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
(I've had issues with this variable entry staying set in my CentOS VM; if you have trouble running the examples after a reboot, check whether this variable is empty. A consolidated example, including a way to make these variables persistent, follows this list.)
- Install the writable samples:
cuda-install-samples-9.0.sh <dir>
I typically put these in my home directory (~)
- Make the samples:
cd ~/NVIDIA_CUDA-9.0_Samples
make
This can take a while to run; you may want to do this over lunch.
- Reboot your VM; if you did this via the console, you will need to return to your VMware Horizon connection to the VM.
- Open up a console and change to the location of the files you built. Typically:
cd ~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/
- Run deviceQuery:
./deviceQuery
Output will look something like this (it will not look exactly the same though):
- Run bandwidthTest:
./bandwidthTest
Output will look something like this (it will not look exactly the same though):
- If you are curious what I got for all the different files in my example, you can review them here.
- At this point you are ready to use the NVIDIA CUDA Toolkit or install additional components such as TensorFlow.
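To pull the toolkit steps together, here is an abbreviated sketch of the run file installation on CentOS 7, including one way to make the PATH and LD_LIBRARY_PATH changes survive a reboot (writing them to /etc/profile.d is one option for the variable issue mentioned above; adjust the paths if you are using a different CUDA version):

# kernel headers and development packages for the running kernel
sudo yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)

# drop to text mode (runlevel 3), run the installer, and answer No to the driver prompt
sudo systemctl isolate multi-user.target
sudo sh ./cuda_<version>_linux.run

# persist the environment variables so they survive reboots
cat <<'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF

# reboot; the VM will come back up in the graphical target
sudo reboot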
To summarize the process (see the picture below): we first installed the VIB on our ESXi hosts with GPUs (1). Next we installed the NVIDIA GPU driver on our virtual machine (2). After that we installed the VMware Horizon Agent on the virtual machine (3). Lastly we used the run file install method to install the NVIDIA CUDA Toolkit on the VM (4). At this point we can finish customizing the VM and use it as our master image to deliver to users.
This concludes the installation of the NVIDIA CUDA Toolkit and the second post in my multi-part blog series on installing the NVIDIA CUDA Toolkit on a VM. If you haven't already, be sure to read the previous blog post (Part 1), where we looked at the typical problems encountered when trying to install the CUDA Toolkit. Also be sure to read the next blog in this series (Part 3), about why this is a big deal for organizations and developers.
If you have issues, run into any snags, or have other feedback, please share in the comments or contact me.