Storage at the Speed of Light


Light moves at a speed of 299,792.458 kilometers per second (km/s), or 186,282.397 miles per second [Gibbs 1997]. What if we could use the speed of light and the vast distances of space as a storage medium? In this post I propose the basic premise to do just that: storage at the speed of light.

I think it’s best to start out with what this is and what it is not.

This isn’t going to make storage faster. In fact, this is a linear, sequential storage platform like tape storage: it has to be read in the order it is transmitted. There is no jumping ahead or back to pick up additional bits.

Storage only flows in one direction. Because we are dealing with massive distances, data can only flow in one direction at a time. This becomes important later on.

Now let’s get into the details of this proposal. Einstein’s famous and simple formula, E = mc², treats the speed of light as a constant. In that formula c is the speed of light, the fastest-moving thing known to humans.

Latency

On Earth we experience latency, for example when we make a phone call halfway around the world. This is caused by the time it takes to transmit data from point A to point B. In my proposal, latency is the storage medium. (Really it’s space.)

The simple formula for storage using the speed of light is distance (d) equals transmission time (t) multiplied by the speed of light (C), divided by 2. This of course assumes that the receiver is capable of receiving data at the same speed as the transmitter, that processing time at each end is 0, and that as soon as data comes in it is sent back out in the opposite direction. This also doesn’t account for the addition of new data into the stream.

Formula for data transmission and storage at the speed of light (in a vacuum): d = (t × C) / 2.

You might be asking why this is divided by 2. It’s because you must have two satellites: one nearest the Earth that transmits, and one a distance d away from the first that receives the signal and then sends it back to the original satellite. Without the second satellite the data would just be broadcast into space, never to be seen again. This is shown in the diagram below:

Example of the Latency Storage Method. Satellites a given distance (d) apart are able to transmit (t) a given amount of data. While traveling from one satellite to the other, the data is in a state of storage. The data can only be read at the satellites; otherwise it is in a stored state.

Example of Latency for Storage

Let’s walk through an example. Say I can transmit and receive at 3 Mb per second and I need to transmit 180 Mb of data (including protocol overhead). How far apart do the two satellites need to be? First we need to see how long we will transmit for: that works out to 60 seconds (180 / 3). We know that light travels at 299,792.458 km/s. Multiplying by 60, the total distance is 17,987,547.48 km, or one light minute (the distance light travels in one minute in a vacuum). We then divide that by two, since the signal spends 30 seconds going out and 30 seconds coming back. So our satellites would, at a minimum, need to be 8,993,773.74 km apart.
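
If you want to check the arithmetic yourself, here is a minimal PowerShell sketch of the same calculation (the variable names are just for illustration):

$LightSpeedKmPerSec = 299792.458
$DataMb             = 180
$LinkMbps           = 3

$TransmitSeconds = $DataMb / $LinkMbps                    #60 seconds of transmission
$LoopLengthKm    = $TransmitSeconds * $LightSpeedKmPerSec #17,987,547.48 km round trip
$SatDistanceKm   = $LoopLengthKm / 2                      #8,993,773.74 km between satellites
"Satellites need to be at least {0:N2} km apart" -f $SatDistanceKm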

In the above scenario it would take 1 minute to get the data back once it was put in the loop. Why? Because we are using latency (and space-time) as the storage medium. Space-time becomes a tape cartridge and our satellites the read heads. We can only read the data when it comes around to a read head, in this case the one closest to Earth; otherwise we have to wait for the data.

Adding Data to the Stream

Now what if we want to continue adding data to this stream? How is that done? We can accomplish this by moving the satellites further apart. So let’s say we wanted to add an additional 180 Mb of data to the stream in the above example. How far do we need to move the satellites apart? That’s fairly simple: we just need to double their distance.

How fast we can move them apart is the hard part. Since our satellite can’t reach the speed of light, the amount of data we can add to the stream depends on how fast the satellite is moving away from the other satellite. This can be figured out…

Let’s assume our satellite is moving at 692,017.92 km/h (430,000 mph), which will be the speed of the Parker Solar Probe later in its journey, “making it the fastest human-made object relative to the sun” (Parker 2018). That means, if our satellite were to travel at the same speed, it would cover 11,533.632 km per minute. Knowing that, we can figure out how fast we can add data to our latency-based storage.

To figure this out we restate our original equation: take the distance gained per minute (11,533.632 km) times 2, then divide by the speed of light expressed per minute (17,987,547.48 km/min). That gives us the time: about an additional 0.00128240 minutes (0.0769441 seconds) of transmission time per minute. This of course assumes the speed is constant.

Formula for transit time of data over a given distance at the speed of light: t = 2d / C.

Knowing that, and that we can transmit at 3 Mbps, we could add about 0.2308 Mb to the stream every minute (0.0769441 × 3 = 0.230832).
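
Here is the same back-of-the-napkin math as a PowerShell sketch (again, the variable names are just for illustration and the satellite speed is assumed constant):

$LightSpeedKmPerMin = 299792.458 * 60   #17,987,547.48 km per minute
$SatSpeedKmPerMin   = 692017.92 / 60    #11,533.632 km per minute
$LinkMbps           = 3

$ExtraSecondsPerMin = ($SatSpeedKmPerMin * 2) / $LightSpeedKmPerMin * 60   #~0.0769 extra seconds of transmit time per minute
$ExtraMbPerMin      = $ExtraSecondsPerMin * $LinkMbps                      #~0.2308 Mb added to the stream per minute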

There are obviously faster long-term storage methods than this, but it’s still a very interesting way to use time and space to retain data for extended periods.

Limitations

There are obvious limitations to this that should be called out here.

  • Hardware speed is not taken into account. Once a satellite receives data it must process it and retransmit it. This takes time and would reduce the amount of data stored in the time/space medium.
  • Because of hardware speeds, a loop could never be filled to maximum capacity, and each half of the loop would need to maintain some “slack” space to accommodate the time it takes for hardware to retransmit data.
  • The formulas don’t account for jitter. If a satellite were to speed up or slow down it would change the capacity of the system.
  • Data would need to be encrypted. Because data would be transmitted in open space and would presumably be transmitted beyond just the two satellites, data encryption methods would need to be used.
  • This does not take into account a reasonable transmission time, i.e. if it takes two weeks for data to make a round trip, is that too long?
  • This also doesn’t account for maximum transmission distances, beyond which cosmic noise would drown out the data transmission. (Which would necessitate adding a checksum system to the protocol.)
  • It might be possible to store more data by using multiple transmit frequencies, effectively increasing storage without increasing distances.

For the longest time most have seen latency as an issue or a delay to be dealt with, but it can be a very powerful storage method given enough space and time. If you consider the research being done to slow the speed of light (see Physicists Slow Speed of Light), it could get very interesting in the years to come.

This whole storage notion is just a random idea I had while volunteering at a Kansas 4-H SpaceTech Experience at the Kansas Cosmosphere. We had the youth doing an activity that mimicked satellites flying past Pluto. The youth could only relay messages to the spacecraft so fast. It reminded me of packets being pulled off a tape by a tape head, and it made me think: why not use this as a storage medium?

I hope you enjoyed the read. May your servers keep running and your data center always be chilled.

Works Cited:

Permanent link to this article: https://www.wondernerd.net/storage-at-the-speed-of-light/

GTC 2019 Wrap Up


It’s the last day of GTC 2019, and those who know me know it’s time for my yearly wrap-up of the trade show. For the folks in charge of the show, it’s feedback for next year’s GTC.

GTC19 Banner - "GTC is where we create what others think is science fiction" - Jensen Huang

For those not familiar with my wrap-up report: on the last day of the show I write down all my positive and negative impressions of the show. That said, here are my thoughts on this year’s GTC.

This year was a great year to attend GTC. There was the Mellanox acquisition announced the week before GTC, the release of the NVIDIA Jetson Nano (for $99), a slew of other new releases, and of course lots of fresh and new knowledge. Many of these things you can read about elsewhere, so I won’t regurgitate those here.

Here are some things from the show that may have gone unnoticed by some:

GTC19 NGC meetup swag. Starting at the bottom left, cartoon sticker, I am AI multi sticker, I am AI laptop sticker, Think bit train fast learn deep sticker, back center, t-shirt with NVIDIA logo.

This year there were several aspects of the show I really enjoyed. The first was the birds of a feather (BoF) sessions and community meetups after the sessions concluded for the day. I was able to attend the NGC community meetup and the Slurm BoF. Both were outstanding, with lots of information to be gained and a wealth of community resources! I think the teams that put these on were surprised by the success of the events.

GTC19 Keynote line with sign in the foreground that says come join us

I wasn’t particularly fond of the wait to get into the general session… the line for K4 seating wrapped around the building, the gates didn’t open until 30 minutes prior to the keynote, and the doors didn’t open till about 15 minutes before the keynote. We were out in the sun getting a tan while we waited. There were obviously more people in the K4 group than any other group there. I would hate to see what would have happened had it rained the day of the keynote.

Continuing that thought, once the doors were opened for the K4 groups to pour in, there was very little lighting at the top level to help us make our way to the seats. This is especially true for people coming in from the very bright outside into a very dark environment. (The whole rods and cones visualization thing.) I would have liked to see either ushers with flashlights at the top of the steps all the way along the hallway of the keynote, or some other form of controllable light while people are trying to quickly find their seats.

I thought the diversity of sessions was wonderful. Every area seemed to have a wide range of options to choose from. The vGPU content wasn’t just focused on VDI; there were all sorts of sessions you could choose from, either based on your interest or to be exposed to something new. I also think they did a good job with session descriptions this year so you could gauge a session’s technical content.

I was talking with Roger (one of the event coordinators) at the end of the show, like I did last year. One thing I talked with him about was how to indicate that a session is targeted more at an executive or business leader. I proposed adding a prefix or suffix to the session number, for example ex12345 or 98765ex. That way it becomes very simple for non-techy types to identify sessions geared toward the business case for these solutions rather than the ones diving deep into the guts of the solutions.

Some of you may be thinking: but GTC is a techy show, not a business show. It is, but the smart techies know that to get funding for their programs they have to get buy-in from business people. If you can bring them to the show, they can see the value and come to better understand the business case. This creates a synergy that can help accelerate programs in organizations.

This leads me to one of my observations at this year’s GTC. The ratio of straight-laced suits to t-shirts and ponytails appears to be almost 50/50 now. That’s right: walk the hallways and you see a lot of folks in suits, almost as many as those in t-shirts. I even proposed a fun little experiment to Roger: mount a camera in the main hallway, use some image recognition, and count how many are in suits vs. t-shirts to show the change in attendance.

That gets me to the show floor. This year there were a lot of cool things. The robotics section was very cool, from the Boston Robotics team to the building-inspection bot. The VR experience was also off the hook. I love that the first question after putting on the VR headset was “how are you with heights?” From there I was standing atop the highest mast of a ship. Absolutely awesome!

I don’t know that the extended hours were the best thing for the vendors, as many of them were dragging by the last day (more than normal). I also spent some time at the Dell booth (I work for Dell), and during sessions the show floor seemed really slow. So again I wonder if there is value in having the show floor open during sessions.

I do like the added space on the show floor that was gained by moving the keynote off site. That made it easier to walk down the aisles and allowed for more exciting exhibits. I’m also fine that they got rid of the Wednesday night party. It didn’t seem anyone really missed it, and it created a lot more opportunities for dinner with friends, business partners, or community meetups.

GTC19 Backpack - with top of it open rigid wire bars built in around the zipper keeping it held open.

Another thing I didn’t care for this year is the design of the bags. The look of them is good, but the metal rods around the top of the bag are really annoying. They made it really difficult to fold the bag up and stick it in my other bag so I could drop it in my room. I get the point of the metal: to keep the shape and make it easier to put things into it. I’m also not a fan of the zipper running long on the main compartment; to me that looks sort of sloppy and reminds me of tool bags I buy at a discount tool store.

GTC19 materials, Gray book and pen set, backpack, t-shirt, badge, drink ticket

I’m also sad that attendees did not get t-shirts this year. I love my t-shirts from GTC; they are so cool. That meant I had to go and purchase them at the NVIDIA store instead, which I did because they are so cool. It would be great to see a shirt again next year.

GTC19 Speaker Gift - A note book and pen

I think it’s really cool that, as the number of speakers continues to grow, they are still nice enough to get speakers a gift. This year it was a nice notebook and pen set.

I am super excited about the NVIDIA Jetson Nano and it was great that the supply of them lasted through the whole show. Purchasing them was super simple, and the badge scanning to keep numbers in check was pretty slick too. I purchased 3 of them. Two at the end of the keynote, and one at the NVIDIA store.

The food this year was fair to middling, or fairly decent as far as conference food goes. I still don’t like having to go down those back steps to the South Hall. I think those steps are a liability waiting to happen, and this is my third year of saying that. I’m glad that, to the best of my knowledge, no one has gotten hurt on those stairs.

One thing I continue to really appreciate that is unique, among the conferences I attend, is the use of drink tickets. It really encourages responsible drinking among attendees and NVIDIA should be commended for that.

Something that would be awesome, though it’s probably a pipe dream, for people like me who blog extensively about the sessions they attend: a row of tables up front in conference rooms, so I don’t have to balance my device on my leg while listening to the session.

I thought it was great that the facility offered a prayer room and nursing room for those attending GTC. I think that really embraces the diversity of GTC attendees and removes barriers to attending the show. Kudos to both the event center and the show team for making these available and making sure they were publicized.

With that I think it’s about time I wrap up my wrap up post. I got to make a lot of new friends and renew my friendship with so many others. I can’t wait to see everyone next year at GTC20.

May your servers keep running and your data center always be chilled.

Permanent link to this article: https://www.wondernerd.net/gtc-2019-wrap-up/

NVIDIA Jetson Nano – a Quick Look


NVIDIA Jetson Nano out of box display

Yesterday, March 18th, the NVIDIA Jetson Nano was announced at GTC. As you left the keynote (or when you got back to the San Jose Convention Center) you could purchase one for $99. I purchased a Nano and thought I’d give you a quick look at it.

Here is a link to the press release announcing the Jetson Nano.

Below is a gallery of shots I took of the Jetson Nano and all the stuff that comes in the box, which isn’t much.

The unit does not come with a power supply, though it uses a micro USB connector like most other small compute devices such as Raspberry Pis. It also does not have wireless networking built in, and it needs a microSD card for the OS.

To get started you can visit https://NVIDIA.com/JetsonNano-Start (note this is case sensitive, and if you want to see a cute 404 error get the capitalization wrong).

The Jetson Nano page walks you through setting it up in six steps, all of which look very straightforward; the most complicated is flashing the SD card.

Some of the things that jumped out at me as I looked at the Jetson Nano are that it has both a full-sized DisplayPort and an HDMI video output. I’m curious about the value of one over the other.

Another area of interest is that the heat sink has holes drilled for a cooling fan and the base board has a connector that supports the addition of a fan. Speaking of the base board, the Jetson Nano module is attached to a base board, allowing the NVIDIA chip to be removed (with a screwdriver). The module seems to use the DDR4 laptop (SO-DIMM) format for the slot between the base board and the chip, or at least that’s what the connector shows. The microSD card slot is on the module itself and not on the base board of the Jetson Nano.

It has a 40-pin expansion header (GPIO, I2C, UART) for different sensors as well as a MIPI-CSI camera connector. There is also a jumper that disables the USB power input and allows only the use of the 5V DC power input.

I’d power it up here in my hotel room, but I left all my display cables back in Kansas and I don’t have a keyboard with me.

Thought I’d share with folks. I haven’t decided what I’m going to do with the Jetson Nano yet, but whatever it is, it will be fun.

May your servers keep running and your data center always be chilled.

Permanent link to this article: https://www.wondernerd.net/nvidia-jetson-nano-a-quick-look/

Scripting VDI By Day and Compute By Night


You’ve probably read some blog posts or even case studies on VDI by day and compute by night (cycle harvesting), specifically with GPUs, but you really want to see a real implementation of it or try it for yourself. Well, you’re in luck, you’ve found the right blog!

Over the weekend I finished writing a PowerCLI script to do VDI by day and compute by night with vGPU-based VMs. In this post I’m going to share the script and show you how to set it up.

For those not familiar with the term VDI by day and compute by night, here is my example of it. Let’s say you have a bunch of VDI users who are structural engineers. They typically work from 8 in the morning to 5 or so at night on their VDI desktops. They have programs like AutoCAD which require high-end GPUs, but since they are using VDI they’re virtual GPUs (vGPUs). That means you have these high-end GPUs that might be sitting idle for 12 to 16 hours a day, and the organization spent a lot of money on those GPUs and the rest of the compute resources.

Now what if you could use those GPUs for something other than just engineers’ virtual desktops? You know, get a little more mileage out of them. The R&D department keeps wanting more servers for their high performance computing (HPC) farm… What could 12 extra hours a day of GPU time, coming from the engineers’ idle VDI hosts, do to help them?

Cycle of VDI by Day and Compute By Night (Cycle Harvesting)

That is my VDI by day, compute by night (or cycle harvesting) example. It is re-purposing compute resources from one task (VDI) to supplement another task (HPC, machine learning, deep learning, etc.) when the first task isn’t using them. Furthermore, it’s releasing resources from the second task when they are needed by the primary task.

That’s where the script I wrote comes into play. Trying to do all of the above tasks manually is a pain… let’s automate it. You are probably thinking: that’s so complex to figure out. And besides, if I shut down my compute VMs after they are halfway through a process, I’ve lost tons of information and time.

This script addresses these issues, and the release, as of this writing, does it in under 200 lines of code. That said, let’s dig into the script.

The script is a PowerCLI script that calls the carrying capacity function I wrote about a few weeks ago. In essence, it performs a loop looking for excess resources of a given vGPU profile (which means the vGPUs used for VDI and compute can be different); if it finds spare resources, the script resumes suspended compute VMs. It continues doing this until the vGPU capacity defined in the script is reached. Then, as VDI desktops start to come back online, it works in reverse, suspending compute VMs from the newest to the oldest (last in, first out (LIFO)). The program loops through this balancing of resources for a number of iterations over a given period of time. The net result is an automated instance of VDI by day and compute by night (or whenever there are sufficient free resources, hence why this is commonly called cycle harvesting). A partial flow chart of the logic is shown in the diagram below (there is a lot more happening in the script than what’s shown in the diagram).

Cycle Harvesting logic diagram

The order of operations is important in the script. The script first checks whether it needs to suspend compute VMs, as the primary job of the environment is ‘VDI.’ This ensures that resources are allocated to the priority workload before even considering whether it should resume any compute VMs. Then, if there is capacity, compute VMs are started or resumed. And if neither option is needed, the environment has reached a “steady state” where resources are being consumed in an optimal manner that prioritizes the ‘VDI’ workload over the compute workload.
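
To make that order of operations concrete, here is a heavily simplified sketch of the balancing loop. It assumes the vGPUSystemCapacity function and the variables defined at the top of the script; the published script does quite a bit more bookkeeping than this:

for ($scan = 0; $scan -lt $NumberOfScansToPreform; $scan++) {
	$freeSlots = vGPUSystemCapacity $ComputevGPUtype $WorkingCluster "connected"
	$computeOn = @(Get-VM "$ComputeVMbaseName*" | Where-Object { $_.PowerState -eq "PoweredOn" })

	if ($freeSlots -lt $SpareVMcapacity -and $computeOn.Count -gt 0) {
		#VDI gets priority: suspend a running compute VM (the real script tracks LIFO order)
		Suspend-VM -VM ($computeOn | Select-Object -Last 1) -Confirm:$false
	}
	elseif ($freeSlots -gt $SpareVMcapacity -and $computeOn.Count -lt $MaxComputeVMs) {
		#Spare capacity beyond the reserve: resume the next compute VM
		$nextName = $ComputeVMbaseName + $computeOn.Count.ToString($ComputeCountFormat)
		Start-VM -VM (Get-VM $nextName) -Confirm:$false   #Start-VM also resumes a suspended VM
	}
	#Otherwise the environment is at steady state; wait for the next scan
	Start-Sleep -Seconds $SecondsBetweenScans
}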

You can download the script from my GitHub repository: https://github.com/wondernerd/VDIbyDayComputeOtherwise

This is version 0.1 of the script, so there are a lot of refinements still to come. The script shows that the concept works and delivers the expected results. Future iterations will be more robust and will build on this basic model.

I built and tested this function on VMware PowerCLI 11.0.0 build 10380590 and on PowerShell 5.1.14409.1005. It should be backwards compatible several generations back to the point that the vGPU device backing was added in PowerCLI. Though I’m not sure when that was.

The beginning of the script contains all the variables needed to control it. You can see these in the code snippet below.

#Import vGPU capacity function
. 'C:\Users\Administrator\Desktop\vGPU System Capacity v1_3.ps1'


#define the parameters

# VDI Side
$SpareVMcapacity = 1			#How many spare VMs should be able to be powered on

# Compute Side
$ComputeVMbaseName = "Compute"  #The base name of compute VMs, a three digit number will be added at the end
$ComputeCountFormat = "000"		#The leading zeros in the compute name, e.g. VM number 6 becomes Compute006
$MaxComputeVMs = 4				#Total Number of Compute VMs in use
$ComputevGPUtype = "grid_p4-2q"	#Which vGPU is in the Compute VM (later I will detect this)

# Operations Side Variables
$WorkingCluster = "Horizon"		#Name of the cluster the script should work in
$SecondsBetweenScans = 30		#How long will the program wait between scans
$NumberOfScansToPreform = 10		#How many times should the scan be run 



###############################
###############################
# Operational variables do not touch

$vGPUslotsOpen = 0 				#How many vGPU VMs are currently on
$POComputeVMcount = 0			#Number of powered on (PO) compute VMs
$ScanCount = 0					#How Many times the scan has been run
$CurrVMName = ""				#Current VM Name
$ComputeSteadyState = 0			#Tracks whether the environment has reached steady state


Now let’s walk through how to configure the script. To start with, we need the vGPU System Capacity function. You can download that from my GitHub vGPU Capacity repository. Then you include the path to that file. (Being sure to mind the dot operator at the beginning of the line.)

Next we define how many spare VMs should be able to be powered on. I currently use one; you will probably want to set this as a derivative of your failover capacity. Let’s say you have N+1 failover capacity in your environment; that means you can have a single host fail and still be functional. To calculate the reserve, multiply the number of vGPUs the profile supports per card, times the number of GPUs in a host, times the number of spare hosts you have. You may also want to add in some extra capacity, since it can take a few moments to suspend a compute VM.
Example: The P4-2Q profile supports 4 vGPUs per card and I have one card per system with a failover capacity of N+1, so I would set this value to 4 (4 * 1 * 1 = 4), thus allowing all VMs from one host to fail over to the other host.
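
As a quick sketch, that reserve calculation looks like this (the helper variable names are just for illustration):

#Spare capacity reserve for N+1 failover, using the numbers from the example above
$vGPUsPerCard = 4
$CardsPerHost = 1
$SpareHosts   = 1
$SpareVMcapacity = $vGPUsPerCard * $CardsPerHost * $SpareHosts   #= 4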

The next set of values is for the compute VMs. The first variable is the base name of the virtual machines used for compute. In the script I call them “Compute,” which corresponds to the compute VMs’ base name in my vCenter. This is followed by the trailing-digit format for the compute VMs, which I’ve defined as “000.” These two variables are then concatenated in the script to name the compute VMs.
Example: Compute + number (formatted in 000 form) = Compute001. This is the name of the second compute VM used (the script counts from zero).
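
In PowerShell terms, the concatenation amounts to something like this:

#How the script builds compute VM names from the two settings above
$ComputeVMbaseName + (1).ToString($ComputeCountFormat)   #=> "Compute001", the second VM with base-zero counting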

After we have defined the format of the compute names, we set how many compute VMs there are. I used 4 in this example. It should be noted that the script uses base-zero counting, so the first VM would be Compute000 and the last would be Compute003. Lastly, we define which vGPU profile the compute VMs will be using. In future iterations of the script I will just pull the vGPU information from the VMs.

Following that section we have the operations variables to define. The first one is the name of the cluster we are working in; in my case I call my cluster “Horizon.” Next we define how often we should scan for changes in the environment. I went with 30 seconds, but your environment may be different and may need to be sped up or slowed down. Lastly, how many times should the scan run? I didn’t build an infinite loop into the system, for stability purposes and because this is an example script; you can modify the code if you want it to run indefinitely.


With those items set, the script is ready. Now to prep the environment. The first thing we want to do is set up our compute VMs. That means building out whatever compute nodes you need for your environment, connecting them into its management, and making sure workloads are running correctly on them. You also want to make sure they follow the naming convention we used in the script settings, along with the vGPU type.

You might be going: but I don’t have the resources to have all these compute VMs powered on at once. That’s fine; you can do it in batches. In my case I would power on one compute VM and configure it. Then I would suspend the VM. By suspending it I’m maintaining its state but releasing its resources. I rinse, lather, and repeat this for all my remaining compute VMs, and wind up with all of the compute VMs configured and in a suspended state.

The thing about suspended VMs is that their state is maintained but their resources are released. This means we can set up everything for use and then suspend the VM. Anything in process is maintained while suspended, allowing any operations to complete once the VM is resumed.
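
If you want to park everything in one go once the compute VMs are configured, a one-liner along these lines should do it (assuming the naming convention from the script settings):

#Suspend every running compute VM so its state is kept but its resources are freed
Get-VM "$ComputeVMbaseName*" | Where-Object { $_.PowerState -eq "PoweredOn" } | Suspend-VM -Confirm:$false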

Now comes a VERY important part. Licensing!
Take a look at your VMware Horizon licensing agreement. It probably says something like you can only use Horizon for virtual desktop instances…
That means your compute VMs need to be desktops or you need to license your hosts as standard ESXi assets. I can’t tell you which is best to do. You should work with the VMware Licensing team to determine the best course of action. It’s up to you to make sure you fully comply with the terms of your license agreement.


At this point everything should be in place and ready to go. If you so choose, you can connect to your vSphere environment and run the example script you configured. When you run it, you will see it determine the state of the environment (suspend compute, resume compute, or balanced), and if there is capacity for it to resume an additional compute VM, it will resume the first compute VM (000). If that balances the resource utilization, it will then continue to monitor the environment. Should the minimum capacity be exceeded (because another VDI desktop started up), it will suspend the compute VM that was resumed last (LIFO). It will continue balancing the resources in the environment till it reaches the end of its cycles, then it will suspend the compute VMs.

VDI by day compute by night (cycle harvesting) balancing

With that you have a running VDI by day and compute by night (or really whenever there are spare resources). Obviously there are many enhancements left to make to this example script, but it validates the basic concepts of cycle harvesting in a virtual environment.

Please leave comments below on what enhancements you would like to see added to this script. And if you would like to contribute, please join me on GitHub.

May your servers keep running and your data center always be chilled.

Update 3-17-19: One of my good friends reminded me one licensing option to consider for your ESXi hosts is the VMware vSphere Scale-Out licensing. You can read more about it here: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/solutions/vmware-vsphere-scale-out-datasheet.pdf Now depending on your ESXi hosts and the requirements for your VDI environment this licensing bundle may not be what you need. Again it’s best to talk with your VMware rep to figure out what’s best for your environment.

Also, after completing this post I created a video demo of my script at work. You can see it below. There are three important areas on the screen. Starting at the left, there is the vCenter with 4 compute VMs and 2 holding VMs. In the center is the VDI by day compute by night script. Lastly, on the right side is a little script I created to randomly power the holding VMs on and off to emulate a VDI environment with desktops being instantiated and destroyed. The 2 holding VMs don’t contain any OS, as the important part of this demonstration is to reserve vGPU resources; after all, VDI and VMs are fairly well understood. The compute VMs are CentOS 7 Linux VMs using BOINC (https://boinc.berkeley.edu/trac/wiki/DesktopGrid) for a workload. The demonstration size is so small because I only have a single host with a single P4 GPU in my lab. You can watch the holding VMs power on and off and the compute VMs respond, though not super fast, as the suspend process takes some time. Enjoy.


Permanent link to this article: https://www.wondernerd.net/scripting-vdi-by-day-and-compute-by-night/

Number of vGPUs Available in vSphere


I’ve been working on a PowerCLI function for the last few months in my free time, and now I’d like to share it with everyone. This is a pretty spiffy function for those folks working with vGPUs, and it’s not just for VDI; it can help those looking to virtualize ML/DL systems too. (Think VDI by day, compute by night.) The function calculates the vGPU carrying capacity of a vSphere environment.

What that means is: say I have an environment with 30 hosts, each of which has a bunch of NVIDIA GPUs in it, and on those hosts are a bunch of VMs with various vGPU profiles attached. This function will determine how many more VMs can be powered on with a given vGPU profile.

Yup, you read that right: how many more VMs using a given vGPU profile can be powered on.

PowerCLI Code Snippet for Carrying Capacity Function

Now let’s make it a bit more interesting. What if you could specify the cluster, or any VI container, in your vSphere environment for that calculation? Well, the function I’m sharing does that too.

Let’s take it a step further: what if I only want to calculate my vGPU carrying capacity for powered-on hosts, or hosts in maintenance mode, or disconnected hosts? The function will do that too…

How about mixed GPUs? For example having M60s, P40s, and P4s all in the same environment. The function deals with it all.

I’ve put the script with the function up on GitHub here: https://github.com/wondernerd/vGPUCapacity

Feel free to do whatever you want with the script. I’ve published it under the GNU General Public License version 3, so it’s free for you to use however you like.

I’m not going to spend much time talking about the concepts and constructs behind it, but I will spend some time talking about how to use the function.

The bulk of the script file is a function that does all the heavy lifting. The function, vGPUSystemCapacity, takes three arguments: one is required, and the other two are optional. The function returns the number of VMs that can be started with the given profile; if an error occurs it returns -1.

 vGPUSystemCapacity vGPUType as String [vGPULocations as String] [vGPUHostState as String {connected,disconnected,notresponding,maintenance}] 
returns Int [-1 on error] 

The required argument is a string corresponding to the vGPU profile, in the format “grid_p40-2q”: “grid_” followed by the physical GPU type (“p40”), followed by a dash, followed by the vGPU profile (“2q”). The vGPU profiles can be found in the NVIDIA vGPU User Guide. This is shown in the following example of a function call requesting the results for a “grid_p40-2q” vGPU profile:

vGPUSystemCapacity "grid_p40-2q" 
200

Invalid vGPU profiles do not cause errors, so if you were to pass the function a value of “ColdPizza” for a vGPU type, the function will return 0, since the system cannot support any “ColdPizza” type vGPUs.

vGPUSystemCapacity "ColdPizza" 
0

When the function is called with two arguments, the second argument is a string that corresponds to the VIContainer[] object (i.e. cluster) you want to calculate the carrying capacity of. For example, if I have a cluster named “Production” I would pass that to the function as its second argument. You can also pass a wildcard character to capture all valid VIContainers. When no second argument is passed, “*” is the default value, which includes everything in the vSphere environment. The example below builds on the previous example, capturing only vGPUs in the cluster “Production.” You can read more about the VIContainer type in the PowerCLI cmdlet reference.

vGPUSystemCapacity "grid_p40-4q" "Production" 
100

The third variation of the function takes the host state into account when calculating the carrying capacity. The third value is a VMHostState[] value that is passed to the function as a string. The valid values for host state are “connected”, “disconnected”, “notresponding”, and “maintenance”. You can read about these in the PowerCLI cmdlet reference document as well. The cool thing about these states is you can string them together as a comma-delimited list to capture multiple state types at once. When no string is passed, the function defaults to “connected,disconnected,notresponding,maintenance” and will gather all host states. Continuing on from our previous example, if we wanted to see the vGPU carrying capacity for connected hosts and hosts in maintenance mode, we would call the function like this:

vGPUSystemCapacity "grid_p40-4q" "Production" "connected,maintenance" 
80

I built and tested this function on VMware PowerCLI 11.0.0 build 10380590 and on PowerShell 5.1.14409.1005. It should be backwards compatible several generations back to the point that the vGPU device backing was added in PowerCLI. Though I’m not sure when that was.

That gets you through working with the vGPUSystemCapacity function I created. If you’ve made it this far you may already have some ideas about what you can do with this. Here are some things I’d like to do with it:

  • Use it to monitor how many more VMs of a given type I can instantiate on a system
  • Capture usage patterns throughout the day, letting me know when I am at peak vGPU utilization in my environment (see the sketch after this list)
  • Use this as a core function to enable VDI by day and compute by night.
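
For that second bullet, a rough logging loop might look like the sketch below. It assumes the vGPUSystemCapacity function has been dot-sourced and that you are connected to vCenter; it borrows the “Production” cluster and “grid_p40-2q” profile from the earlier examples, and the file name is just a placeholder.

#Log the free capacity for one profile every 15 minutes as timestamped CSV lines
while ($true) {
	$free = vGPUSystemCapacity "grid_p40-2q" "Production"
	"{0},{1}" -f (Get-Date -Format "s"), $free | Add-Content vgpu-capacity-log.csv
	Start-Sleep -Seconds 900
}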

Let’s touch on that last bullet a bit. VDI by day and compute by night is a term lots of folks are throwing around these days; in fact I’ve done a blog post on it myself. The premise is very simple: GPUs are expensive, so why let them sit idle in the data center when no one is using them? Capture back that time by letting them crunch on some business problems, traditionally at night. To do that we need to know one important thing: at any given time, how many vGPUs of a given type are available to perform compute tasks?

Now if only there were some sort of function that could tell me that number… then maybe it would be possible to create a PowerCLI script that could manage all of that for me… Hmmm… I wonder what I’m working on next???

That gets us to the end of this post. Hopefully this script is helpful. If you have improvements, post them below or on GitHub. If you run into questions about using the script, drop me a note in the comments below. And if you do something cool with it, please be sure to share it with the community.

May your servers keep running and your data center always be chilled.

Permanent link to this article: https://www.wondernerd.net/number-of-vgpus-available-in-vsphere/

2019 NVIDIA vGPU Community Advisors


Today the 2019 NVIDIA vGPU Community Advisors (NGCA) was announced. I am pleased to share that I have been selected to join this wonderful group for another year. You can view the page and other NGCA members on the NVIDIA vGPU Community Advisors page.

NVIDIA vGPU Community Advisors #NGCA

For those who don’t know what the NGCA is, let me share. In late 2015/early 2016, Rachel Berry (@rhbBSE) and a few others at NVIDIA started a group called the NGCA. The ‘G’ in NGCA stood for GRID, which is what vGPU used to be called back when it was tied to a physical GPU card. Remember the K1 and K2 cards?

The idea was to create a direct line of communication between advocates for virtualizing GPUs and the NVIDIA virtualization team. Not the gaming people or the HPC people, the virtualization people. It’s been really interesting over the last three years to watch how this has morphed into something a bit bigger.

In 2016, providing graphics capabilities to VMs was the big thing: unlock the closet cases of desktops that couldn’t be virtualized. Nowadays it’s not just about VDI; vGPUs are capable of so much more, and you see a much more diverse group in the NGCA. There are those of us who are looking at how to virtualize systems like HPC and deep learning systems to gain great advantages in the data center.

I am excited to return as a member of this prestigious group and am looking forward to a stellar 2019.

Permanent link to this article: https://www.wondernerd.net/2019-nvidia-vgpu-community-advisors/

OpenVPN on iOS 12.x Fails to Connect to PiVPN


I use the PiVPN deployment of OpenVPN on a Raspberry Pi to connect back into my home network when I’m on the road. About a month ago my SD card gave out and I had to rebuild. I decided to go with PiVPN for my OpenVPN deployment. Setup was really easy and it worked great on my Android phone.

This last weekend I was at the in-laws and wanted to VPN in from my iPad to do some stuff at home. That’s where I ran into an issue, and if you are reading this I suspect you might have run into the same issue.

What I found was that I could connect to my VPN from my iPad with iOS 12.x, but I was unable to get any network traffic to or from my iPad while on the VPN. I couldn’t even SSH into my Pi. I couldn’t figure out what was wrong. Everything was working perfectly on my Android phone.

I used my google-fu to try to find answers to what could be going on and tried all sorts of different things. I finally figured out what was up and figured I’d share, so I can fix this again later or help someone else!

The problem was that compression is turned on with PiVPN. And if you look at your settings in your OpenVPN client, you will see that this is flagged as insecure. (See picture below.) A further explanation of this can be found on the OpenVPN website under the security advisories, if you are interested.

OpenVPN Client Screenshot

Understandably we want to turn off the allow compression option. So we tap “NO” on the screen.

Well, if you’ve tried that you probably guessed it’s not that easy to fix. There is more to it than that.

First we have to modify the server config on the Raspberry Pi. To do this SSH into your Pi or open up a console window.

Now we are going to issue the following commands

sudo -i
#You will be prompted for your password at this point
cd /etc/openvpn
vi server.conf

You are now in the VI editor (you could also use your favorite editor).

Inside the config file you want to find the line that says:

compress lz4

You want to move to the beginning of that line, press the ‘i’ key (for insert), then enter a ‘#‘ and a space (without quotes). It should look like this:

# compress lz4

Now press the esc key then type ‘:wq!‘ (without the quotes). This writes (w) and quits (q) forcefully (!).

What we just did was turn off the use of compression on your OpenVPN server. (Not sure why they haven’t shut it off by default in PiVPN.)

With that done, you should be back at the prompt. We now need to reboot the Pi (the easiest way for many people to make sure the service picks up the changes we made to the config file). To do this, type the following:

exit
sudo shutdown -r now
#You may be prompted for your password at this point, and your Pi will reboot

While that’s happening we need to make a few more changes.

First we need to modify your client profiles. (You will need to do this for every profile you’ve created.)

Open up the profile file in something like Notepad++. We want to find the line that says “compress lz4” and either remove it or comment it out using the pound symbol (#). So the top section of the file (before the <ca>) should look something like this:

client
dev tun
proto udp
remote [address] [port]
resolv-retry infinite
nobind
persist-key
persist-tun
remote-cert-tls server
tls-version-min 1.2
verify-x509-name server_[***********] name
cipher AES-256-CBC
auth SHA256
# compress lz4
verb 3
<ca>

Save the file and transfer it to your iOS device. You will need to repeat this for any other devices you created profiles for. (Because we just told the server to stop doing compression.)

Now you can import the file into the OpenVPN app on your iOS device. Then go into the OpenVPN settings screen and make sure allow compression is set to NO.

Now go test it and see if it works!

Now to give credit where credit is due. I couldn’t have figured out how to do this without two posts I found.

First this thread on the OpenVPN forums pointed me in the right direction: https://forums.openvpn.net/viewtopic.php?f=36&t=27195&sid=728ec0b98d2563dc3cecf5b35188843d

And led me to this bug post on the OpenVPN community page: https://community.openvpn.net/openvpn/ticket/1126

I hope this quick post has helped you get your iOS device connecting to your OpenVPN instance.

Till next time, may the lights of your data center stay off and your server fans keep humming.

Permanent link to this article: https://www.wondernerd.net/openvpn-on-ios-12-x-fails-to-connect-to-pivpn/

vGPU PowerCLI Commands


I’m a noob when it comes to PowerCLI; I can Get-VM like most everyone else, but not much more. You may have seen my previous post about Taking Back Resources, where I talked about the theory and logic behind vGPU cycle harvesting. Right now I’m working on putting it into action, and the best way I can think of is a PowerCLI script.

If you try googling GPU options for PowerCLI you’re not likely to find much; either that or I am using the wrong terms. So before I drop some cool cycle harvesting options out there for folks, I want to share two PowerCLI commands for looking at GPU functions in your environment.

The first command I’d like to share shows what vGPU profile is being used by a VM. This is a two-part command: first you assign a VM to a variable, then you explore the VM’s backing. I figured it out from reading through code in the rgel/PowerCLi GitHub repository (https://github.com/rgel/PowerCLi/blob/master/Vi-Module/Vi-Module.psm1). My thanks to Roman and Hans for sharing this wonderful code.

PS C:\> $MyVMs = Get-VM "VMname"
PS C:\> $MyVMs.ExtensionData.Config.Hardware.Device.Backing.vgpu
grid_p4-4q

Yup, that’s all there is to finding out what vGPU profile a VM is using.
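
Building on that same property, a quick sketch like the following should list every VM that has a vGPU profile attached, along with the profile name (it assumes you are already connected to vCenter):

#List each VM that has a vGPU backing and the profile it uses
Get-VM | Where-Object { $_.ExtensionData.Config.Hardware.Device.Backing.vgpu } |
	Select-Object Name, @{N='vGPU'; E={ $_.ExtensionData.Config.Hardware.Device.Backing.vgpu }}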

Now for the next bit of goodness. Finding out what graphics cards are in a given host.

PS C:\> Get-VMHost | Get-VMHostPciDevice -DeviceClass DisplayController

VMHost.Name          Name                                                                   DeviceClass
-----------          ----                                                                   -----------
esxi02.wondernerd... NVIDIA Corporation NVIDIATesla P4                                      DisplayController
esxi02.wondernerd... Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) DisplayController
esxi01.wondernerd... Matrox Electronics Systems Ltd. PowerEdge R610 MGA G200eW WPCM450      DisplayController

Taking it a step further, let’s isolate this to the P4 in my environment. To do this we’ll modify the above just a bit…

PS C:\> Get-VMHost | Get-VMHostPciDevice -deviceClass DisplayController -Name "NVIDIA*"

VMHost.Name          Name                              DeviceClass
-----------          ----                              -----------
esxi02.wondernerd... NVIDIA Corporation NVIDIATesla P4 DisplayController

I found this by poking around the PowerCLI Cmdlet Reference and playing with the output.  (https://code.vmware.com/docs/7336/cmdlet-reference#/doc/Get-VMHostPciDevice.html)

PowerCLI Screen on vGPUs

Hope this helps as you develop your PowerCLI scripts dealing with vGPUs.

Permanent link to this article: https://www.wondernerd.net/vgpu-powercli-commands/


VMworld Session VAP2340BU – Driving Organizational Value by Virtualizing AI/ML/DL and HPC Workloads


Thank you to everyone who was able to attend our session at VMworld! And an even bigger thanks to those visiting who weren’t able to attend in person. Gina and I want to make sure you have the resources you need to understand how you can gain a competitive advantage by virtualizing your AI/ML/DL and HPC workloads. We wanted to share the resources for the session; these are items we found helpful in preparing our content or that may help to further explain ideas presented in our session.

Recording of the session:

Session slides (sorry only available in PDF):

Cover slide for session VAP2340BU

Learn more about:

Permanent link to this article: https://www.wondernerd.net/vap2340bu/