Scripting VDI By Day and Compute By Night


You’ve probably read some blog posts or even case studies on VDI by day and compute by night (cycle harvesting), specifically with GPUs, but you really want to see a real implementation of it or try it for yourself. Well, you’re in luck, you’ve found the right blog!

Over the weekend I finished writing a PowerCLI script to do VDI by day and compute by night with vGPU-based VMs. In this post I’m going to share the script and show you how to set it up.

For those not familiar with the term VDI by day and compute by night, here is my example of it. Let’s say you have a bunch of VDI users who are structural engineers. They typically work from 8 in the morning to 5 or so at night on their VDI desktops. They have programs like AutoCAD which require high-end GPUs, but since they are using VDI, those are virtual GPUs (vGPUs). That means you have high-end GPUs that might sit idle for 12 to 16 hours a day, even though the organization spent a lot of money on those GPUs and the rest of the compute resources.

Now what if you could use those GPUs for something other than just the engineers’ virtual desktops? You know, get a little more mileage out of them. The R&D department keeps wanting more servers for their high performance computing (HPC) farm… What could 12 extra hours a day of GPU time, coming from the engineers’ idle VDI hosts, do to help them?

Cycle of VDI by Day and Compute By Night (Cycle Harvesting)

That is my VDI by day and compute by night (cycle harvesting) example. It is re-purposing compute resources from one task (VDI) to supplement another task (HPC, machine learning, deep learning, etc.) when the first task isn’t using them. Furthermore, it releases those resources from the second task whenever they are needed by the primary task.

That’s where the script I wrote comes into play. Trying to do all of the above manually is a pain… Let’s automate it. You’re probably thinking that this is complex to figure out, and besides, if you shut down compute VMs halfway through a job, you lose tons of work and time.

This script addresses those issues and, as of this writing, does it in under 200 lines of code. With that said, let’s dig into the script.

The script is a PowerCLI script that calls the carrying capacity function I wrote about a few weeks ago. In essence, it runs a loop looking for excess resources of a given vGPU profile (which means the vGPUs used for VDI and compute can be different). If it finds spare resources, the script resumes suspended compute VMs, and it continues doing so until the vGPU capacity defined in the script is reached. Then, as VDI desktops start to come back online, it works in reverse, suspending compute VMs from the newest to the oldest (last in, first out, or LIFO). The program loops through this balancing of resources for a number of iterations over a given period of time. The net result is an automated instance of VDI by day and compute by night (or whenever there are sufficient free resources, hence why this is commonly called cycle harvesting). A partial flow chart of the logic is shown in the diagram below (there is a lot more happening in the script than what’s shown in the diagram).
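The resume-until-full, suspend-LIFO behavior described above can be modeled in a few lines. This is a simplified, hypothetical Python model of the loop (the real script does this in PowerCLI against live vCenter data; the function and variable names here are mine):

```python
def balance(free_slots, spare_target, suspended, running):
    """suspended and running hold compute VM names; running acts as a LIFO stack."""
    actions = []
    # Priority 1: VDI. While spare vGPU capacity is short, suspend the newest compute VM.
    while free_slots < spare_target and running:
        vm = running.pop()            # last in, first out
        suspended.append(vm)
        actions.append(("suspend", vm))
        free_slots += 1
    # Priority 2: harvest spare cycles by resuming suspended compute VMs.
    while free_slots > spare_target and suspended:
        vm = suspended.pop()
        running.append(vm)
        actions.append(("resume", vm))
        free_slots -= 1
    return actions                    # an empty list means steady state

# Ordered so Compute000 is resumed first (it sits at the end of the list).
suspended = ["Compute003", "Compute002", "Compute001", "Compute000"]
running = []
print(balance(5, 1, suspended, running))
```

With 5 free vGPU slots and a spare target of 1, the model resumes all four compute VMs; if a VDI desktop then claims the last spare slot, the next pass suspends only Compute003, the most recently resumed VM.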

Cycle Harvesting logic diagram

The order of operations is important in the script. The script first checks whether it needs to suspend compute VMs, as the primary job of the environment is VDI. This ensures that resources are allocated to the priority workload before the script even considers resuming any compute VMs. Then, if there is capacity, compute VMs are started or resumed. If neither action is needed, the environment has reached a “steady state” where resources are being consumed in an optimal manner that prioritizes the VDI workload over the compute workload.
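That ordering boils down to a three-step decision. Here is a minimal sketch of it (hypothetical Python, with names of my own choosing, not the script’s actual code):

```python
def next_action(free_slots, spare_target, suspended_count, running_count):
    # Step 1: check for suspension first; VDI is the priority workload.
    if free_slots < spare_target and running_count > 0:
        return "suspend compute VM"
    # Step 2: only then consider resuming compute into spare capacity.
    if free_slots > spare_target and suspended_count > 0:
        return "resume compute VM"
    # Step 3: neither applies, so the environment is at steady state.
    return "steady state"
```

Because the suspend check comes first, a shortage of spare VDI capacity always wins over an opportunity to resume more compute.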

You can download the script from my GitHub repository: https://github.com/wondernerd/VDIbyDayComputeOtherwise

This is version 0.1 of the script, so there are a lot of refinements still to come. The script shows that the concept works and delivers the expected results. Future iterations will be more robust and will build on this basic model.

I built and tested this function on VMware PowerCLI 11.0.0 build 10380590 and on PowerShell 5.1.14409.1005. It should be backward compatible several generations, back to the point at which vGPU device backing was added to PowerCLI, though I’m not sure exactly when that was.

The beginning of the script contains all the variables needed to control it. You can see these in the code snippet below.

#Import vGPU capacity function
. 'C:\Users\Administrator\Desktop\vGPU System Capacity v1_3.ps1'


#Define the parameters

# VDI Side
$SpareVMcapacity = 1			#How many spare VMs should be able to be powered on

# Compute Side
$ComputeVMbaseName = "Compute"  #The base name of compute VMs, a three digit number will be added at the end
$ComputeCountFormat = "000"		#Zero-padding format for the compute VM number, e.g. number 6 becomes Compute006
$MaxComputeVMs = 4				#Total Number of Compute VMs in use
$ComputevGPUtype = "grid_p4-2q"	#Which vGPU is in the Compute VM (later I will detect this)

# Operations Side Variables
$WorkingCluster = "Horizon"		#Name of the cluster the script operates on
$SecondsBetweenScans = 30		#How long will the program wait between scans
$NumberOfScansToPreform = 10		#How many times should the scan be run 



###############################
###############################
# Operational variables (do not touch)

$vGPUslotsOpen = 0 				#How many vGPU VMs are currently on
$POComputeVMcount = 0			#Number of powered on (PO) compute VMs
$ScanCount = 0					#How Many times the scan has been run
$CurrVMName = ""				#Current VM Name
$ComputeSteadyState = 0			#Flag noting whether the environment has reached steady state


Now let’s walk through how to configure the script. To start with, we need the vGPU System Capacity function; you can download it from my GitHub vGPU Capacity repository. Then update the script with the path to that file (being sure to mind the dot operator at the beginning of the line, which dot-sources the function into the session).

Next we define how many spare VMs should be able to be powered on. I currently use one; you will probably want to derive this value from your failover capacity. Let’s say you have N+1 failover capacity in your environment, meaning a single host can fail and the environment remains functional. To calculate the value, multiply the number of vGPUs the profile supports per card by the number of GPUs in a host by the number of spare hosts you have. You may also want to add in some extra capacity, since it can take a few moments to suspend a compute VM.
Example: The P4-2Q profile supports 4 vGPUs per card and I have one card per system with a fail over capacity of N+1, so I would set this value to 4. (4 * 1 * 1 = 4) Thus allowing all VMs from one host to fail to the other host.
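That failover math is easy to double-check. A quick sketch (the values mirror the P4-2Q example above; the function name and optional headroom parameter are my own additions):

```python
def spare_vm_capacity(vgpus_per_card, cards_per_host, spare_hosts, headroom=0):
    # vGPU profiles per card * GPU cards per host * spare (failover) hosts,
    # plus any extra headroom for the time it takes to suspend a compute VM.
    return vgpus_per_card * cards_per_host * spare_hosts + headroom

# P4-2Q: 4 vGPUs per card, 1 card per host, N+1 failover (1 spare host).
print(spare_vm_capacity(4, 1, 1))  # 4 * 1 * 1 = 4
```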

The next set of values is for the compute VMs. The first variable is the base name of the virtual machines used for compute. In the script I call them “Compute,” which corresponds to the compute VMs’ base name in my vCenter. This is followed by the trailing digit format for the compute VM numbers, which I’ve defined as “000.” These two variables are concatenated in the script to produce the compute VM names.
Example: Compute + number (formatted in 000 form) = Compute001. This is the name of the second compute VM used.

After we have defined the format of the compute VM names, we set how many compute VMs there are; I used 4 in this example. It should be noted that the script uses zero-based counting, so the first VM would be Compute000 and the last would be Compute003. Lastly, we define which vGPU profile the compute VMs use. In future iterations of the script I will just pull the vGPU information from the VMs.
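The base name plus zero-padded counter scheme can be sketched like this (illustrative Python; the script itself builds the names in PowerShell, and the variable names here only mirror the script’s settings):

```python
base_name = "Compute"    # mirrors $ComputeVMbaseName
fmt_width = 3            # mirrors the "000" in $ComputeCountFormat
max_vms = 4              # mirrors $MaxComputeVMs

# Zero-based counting: the first VM is Compute000 and the last is Compute003.
names = [f"{base_name}{i:0{fmt_width}d}" for i in range(max_vms)]
print(names)
```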

Following that section are the operations variables we need to define. The first is the name of the cluster we are working in; in my case I call my cluster “Horizon.” Next we define how often to scan for changes in the environment. I went with 30 seconds, but your environment may be different and may need this sped up or slowed down. Lastly, we set how many times the scan should run. I didn’t build an infinite loop into the system, both for stability purposes and because this is an example script; you can modify the code if you want it to run indefinitely.
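The outer loop, then, is just a fixed number of scans with a pause between them. A minimal sketch, assuming the defaults above (the function signature is mine; the real script does this inline in PowerCLI):

```python
import time

def run_scans(scan_fn, interval=30, count=10, sleep=time.sleep):
    # A fixed number of iterations: no infinite loop, matching the script's design.
    for scan_number in range(count):
        scan_fn(scan_number)
        if scan_number < count - 1:   # no need to wait after the final scan
            sleep(interval)

# With the defaults above, the script monitors for roughly 30 s * 10 scans = 5 minutes.
```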


With those items set, the script is ready. Now to prep the environment. The first thing we want to do is set up our compute VMs. That means building out whatever compute nodes you need for your environment, connecting them to its management, and making sure workloads run correctly on them. You also want to make sure they follow the naming convention we used in the script settings, along with the vGPU type.

You might be thinking, “But I don’t have the resources to have all these compute VMs powered on at once.” That’s fine; you can do it in batches. In my case I would power on one compute VM, configure it, and then suspend it. By suspending it, I’m maintaining its state but releasing its resources. I rinse, lather, and repeat this for the remaining compute VMs and wind up with all of them configured and in a suspended state.

The useful thing about suspended VMs is that their state is maintained while their resources are released. This means we can set everything up and then suspend the VM; anything in process is preserved when the VM is suspended, allowing those operations to complete once it is resumed.

Now comes a VERY important part. Licensing!
Take a look at your VMware Horizon licensing agreement. It probably says something like you can only use Horizon for virtual desktop instances…
That means your compute VMs need to be desktops or you need to license your hosts as standard ESXi assets. I can’t tell you which is best to do. You should work with the VMware Licensing team to determine the best course of action. It’s up to you to make sure you fully comply with the terms of your license agreement.


At this point everything should be in place and ready to go. If you so choose, you can connect to your vSphere environment and run the example script you configured. When you run it, you will see it determine the state of the environment (suspend compute, resume compute, or balanced), and if there is capacity to resume an additional compute VM, it will resume the first compute VM (000). If that balances the resource utilization, it will then continue to monitor the environment. Should the minimum spare capacity be exceeded (because another VDI desktop started up), it will suspend the compute VM that was resumed last (LIFO). It will continue balancing the resources in the environment until it reaches the end of its cycles, at which point it suspends the compute VMs.

VDI by day compute by night (cycle harvesting) balancing

With that, you have a working VDI by day and compute by night setup (or really, compute whenever there are spare resources). Obviously there are many enhancements left to make to this example script, but it validates the basic concepts of cycle harvesting in a virtual environment.

Please leave comments below on what enhancements you would like to see added to this script. And if you would like to contribute, please join me on GitHub.

May your servers keep running and your data center always be chilled.

Update 3-17-19: One of my good friends reminded me that one licensing option to consider for your ESXi hosts is VMware vSphere Scale-Out licensing. You can read more about it here: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/solutions/vmware-vsphere-scale-out-datasheet.pdf Depending on your ESXi hosts and the requirements of your VDI environment, this licensing bundle may not be what you need. Again, it’s best to talk with your VMware rep to figure out what’s right for your environment.

Also, after completing this post, I created a video demo of my script at work. You can see it below. There are three important areas on the screen. Starting at the left, there is the vCenter with 4 compute VMs and 2 holding VMs. In the center is the VDI by day compute by night script. On the right side is a little script I created to randomly power the holding VMs on and off to emulate a VDI environment with desktops being instantiated and destroyed. The 2 holding VMs don’t contain any OS, as the important part of this demonstration is reserving vGPU resources; after all, VDI and VMs are fairly well understood. The compute VMs are CentOS 7 Linux VMs using BOINC (https://boinc.berkeley.edu/trac/wiki/DesktopGrid) for a workload. The demonstration is small because I only have a single host with a single P4 GPU in my lab. You can watch the holding VMs power on and off and the compute VMs respond, though not super fast, as the suspend process takes some time. Enjoy.


Permanent link to this article: https://www.wondernerd.net/scripting-vdi-by-day-and-compute-by-night/