Harnessing the Power of Python to Control vGPU Management in VMware vSphere – GTC 2021 Session E32023


In this blog I provide some insights on managing vGPUs in vSphere with Python. Controlling vGPUs in vSphere sounds like a simple task, but it’s not as easy as it appears. I got deep into this as Johan and I were working on content for this session. I’ll cover all that and more in this post.

Just Here for the Slides

If you are here for the session slides, you’re in luck. Below you’ll find a copy of the slide deck; it’s the same deck you’ll get through the GTC site. For the video of the session, you’ll need to log in to the GTC website. If you have questions about the session, be sure to use the Contact Me page or leave a message in the comments.

https://www.wondernerd.net/wp-content/uploads/2021/04/E32023Tony-Foster-_-Johan-Van-AmersfoortHarnessing-the-Power-of-Python-to-Control-NVIDIA-vGPU-Management-in-VMware-vSphereFINAL.pdf

For those just getting started with vGPUs, or wondering what they are: they are virtual GPUs. What that means is that VMware, in this case, abstracts the physical GPU so that multiple VMs can use the physical hardware at the same time. That means you can have multiple VMs all running graphics applications, or maybe even an AI workload, because the GPU has been “sliced up.” You can also add up to 4 vGPUs to a single VM, making the VM a vGPU powerhouse. It’s really fun stuff to work with, in my opinion.

Pythoning it up

That gets to the Python side of things. We aren’t talking about how you can program vGPUs as part of a CUDA or PyTorch operation; for the most part that is done the same as if it were a physical host. (I actually delved into this in a GTC 2018 session, S8483 – Empowering CUDA Developers with Virtual Desktops.) Instead, the real focus of this session is how you manage the back-end virtual hardware programmatically. In other words, it’s great to manually add a vGPU to a VM through vCenter, but that isn’t all that helpful when you want to automate the process.

After all, most IT admins would rather write some code to handle a common, repeatable task than be the monkey trained to do the same thing over and over again. You can do all the stuff we talked about in this session through PowerShell and PowerCLI, and I’ve written several blogs on how to automate it that way. That’s all well and good, unless you are looking for a different type of control: Python.

Python is a modern programming language that’s widely used and can easily be ported from one OS to another. Plus, compared to other languages, it doesn’t have nearly as steep a learning curve. There are also a lot of great modules for Python, including pyVmomi, the Python SDK for the vSphere API. All of these are reasons why many programmers developing software around vSphere are using Python.
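
If you haven’t used pyVmomi before, here’s a minimal sketch of connecting to vCenter and grabbing the service content object that the snippets later in this post hang off of. The hostname and credentials are placeholders for your own environment, and disabling certificate checking is a lab-only shortcut.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ssl_context = ssl._create_unverified_context()   # lab only; use proper certificates in production
si = SmartConnect(host="vcenter.example.com",    # placeholder vCenter address
                  user="administrator@vsphere.local",
                  pwd="YourPasswordHere",
                  sslContext=ssl_context)
HostContent = si.RetrieveContent()               # the HostContent object used in the examples below

# ... work with HostContent here ...

Disconnect(si)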

Earlier I mentioned that programming vGPUs may not be as easy as it first appears. Here’s why: GPUs and vGPUs don’t have a lot of documentation about them as programming objects. If you do a Google search for adding a vGPU to a VM with Python, you’re not going to get a lot of results. (My Google search returned zero meaningful results in the first two pages.) Why? Because people haven’t needed to do this all that much… until now.

AI/ML/DL/HPC run it virtually

Why is it needed now? Much of it has to do with the rise of AI/ML/DL/HPC on virtual platforms. (I discussed this in a 2018 VMworld session, VAP2340BU – Driving Organizational Value by Virtualizing AI/ML/DL and HPC Workloads.) And some of it has to do with VDI. Up until late 2020, many thought it was absurd to put these “special workloads” on a virtual environment, because “they require bare-metal performance.” Only a handful of folks understood what power virtualizing them held for the enterprise. AI is no longer the stuff of science experiments; it has real business value and needs to be treated like any other business IT stack. And what do businesses do with IT stacks? They automate.

So how do we crack open the vault and start programmatically controlling our vGPU infrastructure? Where I started was with the VMware vHPC Toolkit on GitHub. It has a lot of good samples and examples; the best thing to do is a Ctrl+F search for “GPU.” The content is fantastic, but the code in there is really chunked up, so you have to bounce all over to figure out which objects you are working with. That’s great once you understand the infrastructure and what you’re trying to code against, not so much when you are learning it.

The Secret Decoder Ring

That’s where the vSphere MOB comes into play. Don’t worry, there’s no offer to refuse. MOB stands for Managed Object Browser, and you can get to it by appending /mob to the end of your vSphere URL. The MOB lets you browse all the objects managed by vSphere, not just GPUs. The problem is finding them. For the GTC session we put together a little MOB cheat sheet slide to help you figure out where things are in your environment and the type of objects they are. This will help you, as you code, figure out which object you need to access and when.
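
For example, if your vCenter were reachable at https://vcenter.example.com (a placeholder address), the MOB would be at https://vcenter.example.com/mob. Log in with your vSphere credentials and you can click through the inventory starting from the content object.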

In the session we talk about two different ways of calling objects. You can navigate to them directly or access them through the use of a container view. Calling them with a container view looks like this:

TempVMlist = HostContent.viewManager.CreateContainerView(HostContent.rootFolder,[vim.VirtualMachine], True)
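
The view that comes back is essentially a live list that vSphere maintains on your behalf, so it’s worth releasing it when you’re done. Continuing from the line above, a quick sketch of reading it and cleaning it up might look like this:

for VM in TempVMlist.view:     # .view holds the managed objects that matched
    print(VM.name)             # quick sanity check: print each VM name
TempVMlist.DestroyView()       # release the container view when finished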

The container view takes a content type, vim.VirtualMachine in this case; the MOB provides these object types so you can easily specify what sort of object you are looking for. You can navigate to objects directly as well.

DataCenterContent = HostContent.rootFolder.childEntity[0] #Assume single DC       
VMs = DataCenterContent.vmFolder.childEntity

These two lines of code navigate directly to the VMs in a vSphere environment instead of creating a container view. This path can be found using the MOB as well. Having both approaches in mind makes it easier to understand what you are doing.
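
One thing to keep in mind with the direct path: vmFolder.childEntity can contain nested folders as well as VMs, so a hedged sketch that walks it (assuming the HostContent connection from earlier) might look like this:

DataCenterContent = HostContent.rootFolder.childEntity[0]     # assume a single datacenter
for entity in DataCenterContent.vmFolder.childEntity:
    if isinstance(entity, vim.VirtualMachine):                # skip nested VM folders
        print(entity.name)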

The main areas for objects are listed below (there’s a short sketch after the list that puts them together):

  • managed_object_ref.config.sharedPassthruGpuTypes #Shared Passthrough GPUs
  • ChildVM.config.hardware.device #VM child hardware device listing
  • isinstance(VMVirtDevice, vim.VirtualPCIPassthrough) #Has a virtual PCI passthrough device
  • hasattr(VMVirtDevice.backing, "vgpu") #Has a backing attribute of vgpu
  • VMVirtDevice.backing.vgpu #Device Backing
  • VMVirtDevice.deviceInfo.label #Device label, e.g. grid_p4-8q
  • VMVirtDevice.deviceInfo.summary #Device summary
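
Here’s that sketch: a rough, hedged example that reports the vGPU profiles each host offers and flags any VM device backed by a vGPU. It assumes the HostContent connection from earlier and, like the session scripts, skips error handling.

HostView = HostContent.viewManager.CreateContainerView(HostContent.rootFolder, [vim.HostSystem], True)
for ESXiHost in HostView.view:
    # vGPU profiles this host can hand out, e.g. ['grid_p4-1q', 'grid_p4-2q', ...]
    print(ESXiHost.name, ESXiHost.config.sharedPassthruGpuTypes)
HostView.DestroyView()

VMView = HostContent.viewManager.CreateContainerView(HostContent.rootFolder, [vim.VirtualMachine], True)
for ChildVM in VMView.view:
    if ChildVM.config is None:   # skip VMs without a populated config
        continue
    for VMVirtDevice in ChildVM.config.hardware.device:
        if isinstance(VMVirtDevice, vim.VirtualPCIPassthrough) and hasattr(VMVirtDevice.backing, "vgpu"):
            print(ChildVM.name, VMVirtDevice.backing.vgpu, VMVirtDevice.deviceInfo.label)
VMView.DestroyView()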

All of the code we discussed in the session is available on GitHub so you can try it out for yourself. It’s important to note there is very little error checking in these scripts; that is intentional, because we don’t know how you intend to use them. So if you plan to use them in production, be sure to add the appropriate error handling.

In the repository we provide details on how you can find which hosts have GPUs, which VMs have vGPUs, how to add a vGPU to a VM, and how to remove one. You probably want to know, as I did, how to get to the stats and all the details you get from the nvidia-smi command.
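
To give a flavor of what the add script in the repository does, here’s a stripped-down, hedged sketch of attaching a vGPU profile to a VM. ChildVM and the profile name are placeholders, and the VM should be powered off before you reconfigure it.

vgpu_profile = "grid_p4-8q"    # placeholder; pick a profile reported by the host

backing = vim.VirtualPCIPassthroughVmiopBackingInfo(vgpu=vgpu_profile)
vgpu_device = vim.VirtualPCIPassthrough(backing=backing)

device_change = vim.vm.device.VirtualDeviceSpec(
    operation=vim.vm.device.VirtualDeviceSpec.Operation.add,
    device=vgpu_device)

config_spec = vim.vm.ConfigSpec(deviceChange=[device_change])
task = ChildVM.ReconfigVM_Task(spec=config_spec)    # reconfigure the (powered-off) VM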

Digging Deeper With NVML

That’s something I was hoping to share in the session. NVIDIA provides a tool called the NVIDIA Management Library, or NVML, which can provide an insane amount of information about your GPUs and vGPUs. The only problem is that VMware doesn’t expose it through the vSphere API. That means the only way to get to the information provided by NVML is through a terminal session (SSH).

It took several emails back and forth to make sure I wasn’t missing anything and that the only way to get the NVML goodness really is through SSH. I can confirm that, as of this writing, this is the case. Unfortunately, that programming is a bit beyond the scope of both the GTC session and this blog.

Hopefully this has helped unlock the secrets needed to programmatically manage your vGPUs with Python.

May your servers keep running and your data center always be chilled.
