I’m very excited to share with everyone that I have successfully installed the NVIDIA CUDA Toolkit on a Linux CentOS 7 virtual desktop running on Horizon 7.1 in my home lab. (No, I’m not the first, but it’s a big deal for me.) You might be thinking “Big deal, you’ve deployed an app on a virtual machine!” or “Why would anyone want to use virtual machines for CUDA development? They want/need a whole dedicated piece of hardware; besides, it’s easier that way!”
When I hear objections like this, all I can think of is the Disney movie Aladdin where the Genie says “Master, I don’t think you quite realize what you’ve got here!”
In this series of blog posts I’m going to let the genie out of the lamp. For this post I’m going to lay out the problem of installing the CUDA Toolkit on a Horizon virtual machine. In the next post I will walk through how to install it on a virtual machine. The post after that will explain why a configuration like this is a significant benefit to developers and companies. There may also be follow-on parts that I haven’t even thought of yet, as I’m still excited about proving this works correctly.
With that let’s rub the lamp and see what we get.
Most people in the virtualization space think of virtual GPUs (vGPUs) as something that provides enhanced graphics capabilities to virtual desktops (using VDI), not something for processing data. Those developing applications that use CUDA (deep learning, machine learning, big data, etc.) tend to think of GPUs as a tool for improved data-processing performance. Both perceptions are barriers to virtualizing developer workloads.
Because of this, CUDA developers don’t consider virtualization and virtualization teams don’t consider CUDA developers. The net result is that each side writes the other off, which leads to no documentation on how to deliver a given product on a given platform (CUDA on a VM). You can see this in both the NVIDIA CUDA Toolkit and in vGPU technology.
Deploying vGPU technology has a fair number of requirements and configuration steps (as you can see in my GTC 17 session on setting up vGPUs for Linux VMs). Many times you are required to use a specific set of matching drivers: one for the ESXi host (in VMware, a VIB) and one for the VM (a guest driver). If those don’t match, the VM may not work correctly, or may not work at all.
With the CUDA Toolkit, the typical install path on Linux is a package manager install (RPM/Deb) that is configured to deploy a specific GPU driver version. That driver version, to the best of my knowledge, has never matched the driver version required for a vGPU, and there is no easy way to change the driver inside the RPM or Deb package.
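To make the mismatch concrete, here is a minimal sketch of my own (not part of the toolkit or NVIDIA’s samples) that prints the CUDA version the installed GPU driver supports next to the CUDA runtime version the toolkit installed. When the guest driver is too old for the toolkit’s runtime, this is roughly where you first hit errors like “CUDA driver version is insufficient for CUDA runtime version.”

```
// version_check.cu -- compile with: nvcc version_check.cu -o version_check
// Prints the CUDA version supported by the loaded GPU driver and the CUDA
// runtime version that shipped with the toolkit. A driver older than the
// runtime is the classic symptom of the RPM/Deb driver problem described above.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int driverVersion = 0, runtimeVersion = 0;

    // Highest CUDA version the loaded GPU driver supports (0 if no driver is loaded).
    cudaDriverGetVersion(&driverVersion);
    // CUDA runtime version installed with the toolkit.
    cudaRuntimeGetVersion(&runtimeVersion);

    printf("Driver supports CUDA:  %d.%d\n",
           driverVersion / 1000, (driverVersion % 100) / 10);
    printf("Toolkit runtime CUDA:  %d.%d\n",
           runtimeVersion / 1000, (runtimeVersion % 100) / 10);

    if (driverVersion < runtimeVersion)
        printf("Driver is older than the runtime -- CUDA apps will fail to run.\n");
    else
        printf("Driver and runtime are compatible.\n");

    return 0;
}
```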
And this is typically where the discussion ends… “My virtual machine can only run driver X.Y.Z” AND “The RPM deploys driver A.B.C” SO… “This just won’t work, make it a physical machine and let’s move on.”
This is the point where we need an “easy button” to press and make the drivers magically match up so we can run the CUDA Toolkit on a VM.
I repeatedly tried installing the CUDA Toolkit on VMs with the package manager installs I outlined above, trying various combinations and orders to see if there was a magic way to get the package installer to accept and match the driver I was using. There’s not. There is another method, though, that will get your VM running the CUDA Toolkit without much issue: the runfile installation method.
You can see this in the image below. We have a VM with vGPU capabilities running on an ESXi host on the left. On the right we have two different deployment methods: package manager and runfile. Package manager based installs won’t work on VMs because the driver installed as part of the package is not compatible with the one needed on the VM. However, the runfile deployment of the CUDA Toolkit will work, as the runfile lets you choose whether to install its bundled GPU driver. I’ll cover the steps of how to deploy the NVIDIA CUDA Toolkit runfile in the next blog post, followed by how this unleashes a powerful genie for both developers and organizations in the third post of this series.
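As a teaser of where this ends up, here is a tiny sanity-check kernel, again a minimal sketch of my own rather than anything from NVIDIA’s samples, of the kind you might use to prove the toolkit really works inside the vGPU-backed VM once the runfile install is done. If it compiles with nvcc and runs clean, the genie is out of the lamp.

```
// vgpu_smoke_test.cu -- compile with: nvcc vgpu_smoke_test.cu -o vgpu_smoke_test
// A minimal vector-add kernel to confirm the CUDA Toolkit can launch work
// on the vGPU inside the virtual machine.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host buffers
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device buffers
    float *da, *db, *dc;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel and wait for it to finish
    add<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %.1f (expected 3.0) -- the vGPU is doing real CUDA work\n", hc[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```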