During VMworld San Francisco, I had the opportunity as a VMware vExpert to attend a special breakfast put on by SIOS, where my friend David Klee was speaking. This was a pretty cool session about how SIOS was using data analytics to look at the health of the data center.
You’re probably thinking VMware already has this, and it’s called vRealize Operations. You are correct, they do. But what if there was something more powerful than Operations? That would be pretty cool, and that’s what SIOS does with their SIOS IQ product. The product uses deep learning to determine what normal looks like for a virtual data center.
On the surface this is a pretty darn cool idea. Your DC takes on many characteristics of an organism, and instead of someone having to interpret what’s wrong, the data center can actually tell you what the problem is and how far it spreads. That was my take from the session anyway.
I think that’s pretty darn cool, and that’s enough of the commercial. Let’s get to the meat of this post, and if you hadn’t guessed, it’s time for another one of my crazy ideas…
SIOS technology and deep learning are really cool and powerful for a DC admin. Let’s scale this up a notch. When I talked with Sergey A. Razin following the session, I asked him about increasing the power of the solution with GPUs. This creates a very powerful shift in data center management: instead of just telling you there is a problem, the data center can become predictive and maybe even self-healing.
How can this be done? Instead of single-threaded deep learning, where we have to process data in a somewhat sequential set of operations, let’s use GPUs so we can look at data on a much more massive scale. (See: Getting Started with Deep Learning)
We can take the data from the data center (ultimately more than a human could process) and use a deep learning model to process it. This would allow for a more comprehensive analysis of the data center and what could be done to optimize resources as well as prevent DC failures. For example, one of the metrics that could be looked at and evaluated is CPU core speed, a fairly innocuous item to monitor in and of itself. Now what if the management system started noticing that the processors in a given host were getting slower? That could be a bad processor. But let’s say there were also minor fluctuations in the processor speed of several other CPUs, none of them significant enough on their own to indicate a problem. A GPU-enabled monitoring system, able to look at all of these variances together, could come to the conclusion that there is a cooling/airflow problem, and that it originates at the server that’s clocking down and spreads out like a bulls-eye. We can then start to pinpoint problems in the DC and zero in on them before they become major catastrophes.
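To make the bulls-eye idea a little more concrete, here is a rough Python sketch. It is plain statistics, not deep learning and not anything SIOS actually does. It assumes we have a short clock-speed history per host and know each host's rack slot; the host names, slot layout, sample readings, and the 0.8 threshold are all made up for illustration.

```python
from statistics import mean, pstdev

def z_scores(baseline, current):
    """Score each host's current clock speed against its own history."""
    scores = {}
    for host, history in baseline.items():
        sigma = pstdev(history) or 1e-9  # avoid dividing by zero on flat history
        scores[host] = (current[host] - mean(history)) / sigma
    return scores

def pearson(xs, ys):
    """Pearson correlation, returning 0.0 when either side has no variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def find_epicenter(slots, scores, min_corr=0.8):
    """Return the slowest host if the slowdown fades with distance from it.

    slots maps host -> rack slot number (a hypothetical 1-D layout).
    min_corr is an assumed threshold, not a tuned value.
    """
    epicenter = min(scores, key=scores.get)  # most negative score
    hosts = list(scores)
    distance = [abs(slots[h] - slots[epicenter]) for h in hosts]
    slowdown = [scores[h] for h in hosts]
    # A bulls-eye means scores climb back toward normal as distance grows.
    if pearson(distance, slowdown) >= min_corr:
        return epicenter
    return None

# Hypothetical data: five hosts in adjacent rack slots, clock speeds in GHz.
slots = {"host-a": 1, "host-b": 2, "host-c": 3, "host-d": 4, "host-e": 5}
history = [3.00, 3.02, 2.98, 3.01, 2.99]
baseline = {host: history for host in slots}
current = {"host-a": 2.99, "host-b": 2.97, "host-c": 2.90,
           "host-d": 2.97, "host-e": 2.99}

suspect = find_epicenter(slots, z_scores(baseline, current))
```

In this made-up data only host-c is obviously slow, but its neighbors dip a little too, and the dip shrinks the farther you get from it. A real system would be looking at far more hosts and metrics than this, which is where the GPUs would come in.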
Imagine having deep learning like this where the DC operations system has visibility into both your application workloads and your user workloads (servers and VDI). The load on a VDI system starts to spike, then RAM utilization flares on a SharePoint server. All of a sudden the network monitoring (NSX) sees an increase in outbound traffic to a foreign IP, and other systems in the environment are starting to peak. If the deep learning operations system has been built to understand things like this, you would be able to shut down the networks and control the desktop. This would also provide a forensic starting point to look at what has been impacted in the DC. Imagine what something like this could have done for Sony…
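Here is a rough sketch of what spotting that chain of events might look like in Python. To be clear, this is a hand-written rule, not a learned model; the idea in this post is that deep learning would discover patterns like this on its own. The event sources, event names, and the five-minute window are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some reference point
    source: str       # hypothetical source name, e.g. "vdi"
    kind: str         # hypothetical event name, e.g. "load_spike"

# Hypothetical chain mirroring the scenario above, in the order it unfolds.
SUSPICIOUS_CHAIN = [
    ("vdi", "load_spike"),
    ("sharepoint", "ram_spike"),
    ("network", "outbound_to_unknown_ip"),
]

def find_chain(events, chain=SUSPICIOUS_CHAIN, window=300.0):
    """Return the events that complete the chain in order within
    `window` seconds of the first one, or None if there is no match."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    for i, first in enumerate(ordered):
        if (first.source, first.kind) != chain[0]:
            continue
        matched = []
        for event in ordered[i:]:
            if event.timestamp - first.timestamp > window:
                break  # too late to belong to this chain
            if (event.source, event.kind) == chain[len(matched)]:
                matched.append(event)
                if len(matched) == len(chain):
                    return matched
    return None

# Hypothetical event stream: the full chain plays out in under two minutes.
stream = [
    Event(10.0, "vdi", "load_spike"),
    Event(40.0, "backup", "job_started"),
    Event(55.0, "sharepoint", "ram_spike"),
    Event(90.0, "network", "outbound_to_unknown_ip"),
]
hit = find_chain(stream)
```

The matched events double as the forensic starting point: they tell you which desktop, which server, and which outbound connection to look at first.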
All that said, it’s cool to think about how the data center could be changed by the work SIOS is doing, and changed even further into a self-healing data center by adding GPU capabilities to process more data more quickly and learn more about the DC. This of course would mean that IT could spend less time monitoring the data center and more time innovating in it. For all the business majors out there, that means increased productivity, better business agility, and faster time to market. All very compelling reasons for the creation of such technology.
I wish I could build on this and create an architecture for something like this that I could share. Unfortunately I don’t have the time or a $20,000 GPU to build it. That’s why I shared it with my friends at SIOS and why I’m sharing it with you now. It’s one of my harebrained ideas that may inspire someone to create something cool.
I hope you found this post interesting.
Tony