Farming NVIDIA Jetson Based Thin Clients


In March 2019 NVIDIA released the Jetson Nano, a $99 GPU-equipped embedded micro-computer. I have several and have even distributed them to some brilliant youth to see what they come up with. I've noticed a couple of things about the Nanos that could have very interesting implications for how we process data in organizations and smart cities. Let's take a look.

Jetson Nano sitting on its box

The first thing I've noticed: Stratodesk has blogged about using Nanos as thin client systems for VDI with good graphics capabilities (the Nano's GPU contributing to that). This improves the user experience while keeping the cost of desktop endpoints relatively low. Imagine hundreds of Nano thin clients throughout the cubes of an organization… end users getting a great graphics experience at a pretty economical price. Cubes everywhere glowing green with Jetson Nanos as users' monitors power up…

In most organizations GPU-powered thin clients (or any thin clients, for that matter) are used for about 8 hours a day (when users are in the office) and the rest of the time they sit idle, not delivering value to the organization. They may be on or off, just sitting there idly waiting until the next work day rolls around so they can add value again.

Pie chart showing that a thin client is only used 33% of the time and the rest of the time sits idle.

The second thing I've noticed: at GTC Silicon Valley 2018, Liqid presented session s8539 on pooling and orchestrating NVIDIA Jetson for AI and deep learning on the edge. Essentially, it's the ability to take NVIDIA Jetsons and compose them over a high-speed fabric in a single chassis as one unified system. You could have up to 24 Jetsons composed as a single platform, running in your data center.

Interestingly enough, after doing some digging, the Jetson Nano is one of the Jetson types that can be composed this way. That creates a pretty spiffy AI/DL platform for the data center or a remote location. Pop it in a rack, use the data center's cooling and power, and start cranking on those ever-vexing business questions (like figuring out who the masked singer is).

Independently these are very cool options, both of which can take advantage of the Jetson Nano. At this point, cue the WonderNerd and his harebrained ideas. Why not bring them together and do a little cube farming, IT style? I bounced the idea of composing thin clients into a processing system off someone I know at Liqid to see if it had wings, and, well, let's start putting the idea together and see what you think.

24 cubes, most idle, which could be brought together as a composable resource.

The basic idea is (stop me if you've heard this VDI-by-day concept) that we have a bunch of endpoints with NVIDIA GPUs (like Jetson Nanos) spread through the cubes of an organization. During the day they serve as VDI endpoints for users. When the users go home for the day and the endpoints are sitting idle, we recompose them and turn them into a grid-based computing platform (don't confuse this with NVIDIA GRID, which virtualizes GPUs; here we are aggregating them). It's a concept similar to VDI by day, compute by night, only this is endpoint by day, compute by night.

This isn't necessarily a new idea; people have been doing it for quite a while. It's called grid, or distributed, computing. I've been doing grid computing for years myself, volunteering my systems to a program called World Community Grid (WCG), which has been helping researchers work on world problems like Zika and childhood cancer. If you have some spare compute cycles to share, I recommend participating in the WCG.

The idea behind grid, or distributed, computing is that you have a bunch of systems spread throughout an area. Each system reaches out to a master node to get a bit of work to do; it processes the job and returns the result to the master node. The same bit of work would, in most circumstances, be distributed to three or more systems participating in the grid, and the answer held by the majority is accepted as the correct result. (Some of the foundations of blockchain stem from distributed computing.)
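To make that master/worker idea a bit more concrete, here's a minimal sketch of how a master might hand the same work unit to several grid nodes and accept the majority answer. The node objects and their process() method are hypothetical placeholders, not a real grid API.

```python
from collections import Counter

REPLICAS = 3  # each work unit goes to at least three nodes

def process_on_node(node, work_unit):
    """Stand-in for dispatching a work unit to a remote node and waiting
    for its result (in practice this would go over the network)."""
    return node.process(work_unit)  # hypothetical node API

def run_work_unit(nodes, work_unit):
    # Send the same unit to several participating nodes...
    results = [process_on_node(n, work_unit) for n in nodes[:REPLICAS]]
    # ...and accept the answer the majority agrees on.
    answer, votes = Counter(results).most_common(1)[0]
    if votes < (REPLICAS // 2) + 1:
        raise RuntimeError("no majority; redistribute the work unit")
    return answer
```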

This probably seems straightforward enough: take a bunch of Nanos and compose them as a grid platform. That's all well and good if that's all they were doing, but their primary purpose is as thin client endpoints so users can get work done.

You may be asking, how would thin clients change state from an endpoint to part of a composed infrastructure system? It could be done with some intelligent programming that detects when thin clients are no longer in use and reallocates them to a distributed composable infrastructure. (Much like what I did with my VDI by day, compute by night scripts.) With a bit more intelligence, the code could pay attention to usage habits and remove a thin client from a processing system 30 minutes prior to the expected arrival of the person who uses it as a thin client.
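As a rough sketch of that logic (and very much an assumption about how the APIs would look), something like this could run periodically: it hands idle Nanos to the composed grid and pulls them back ahead of their user's expected arrival. The expected_arrival(), compose(), and decompose() calls are hypothetical.

```python
from datetime import datetime, timedelta

IDLE_THRESHOLD = timedelta(minutes=30)  # no input for this long = idle
LEAD_TIME = timedelta(minutes=30)       # give the desk back 30 min before the user shows up

def reconcile(clients, grid, now=None):
    now = now or datetime.now()
    for client in clients:
        # Hypothetical attributes: in_grid, last_input, expected_arrival()
        arriving_soon = client.expected_arrival() - now <= LEAD_TIME
        if client.in_grid and arriving_soon:
            grid.decompose(client)   # return the Nano to thin client duty
        elif (not client.in_grid and not arriving_soon
              and now - client.last_input >= IDLE_THRESHOLD):
            grid.compose(client)     # idle and nobody due soon: farm it out
```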

You're probably thinking that I'm forgetting one little thing: networking. No one's going to run 10 Gbps lines to end users. That's insane and kills the benefits ($$$) of something like this, and besides, Nanos don't have 10 Gbps links. 1 Gbps links should work in many situations. What may need to happen is a way to add an M.2 card to the thin client to be used as a store-and-forward buffer (the Nano's existing M.2 slot is Key E, meant for wireless cards rather than storage). The microSD slot on the Nano is a consideration, but I'm not sure how durable it would be functioning as a cache. With a buffer it would be possible to fully saturate the link in both directions as material is created and transferred throughout the grid. If an M.2 drive were available, hopefully it would sit on the same bus as the Nano's GPU, so it should be fast enough to feed the GPU and, with the right caching algorithm, minimize the impact on the network and individual nodes.

High speed store and forward storage concept diagram for independent nodes to enable caching of incoming and outgoing data.
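To make the store-and-forward idea a bit more concrete, here's a minimal sketch of a bounded local buffer that absorbs results from the compute job and forwards them over the network as the link allows. The send_over_network callable and the sizing are purely illustrative assumptions.

```python
import queue
import threading

class StoreAndForwardBuffer:
    """Buffer results locally (e.g., on an M.2 device or microSD) and
    drain them onto the 1 Gbps link in the background."""

    def __init__(self, send_over_network, max_items=1024):
        self._q = queue.Queue(maxsize=max_items)  # bounded: backpressures the job
        self._send = send_over_network            # placeholder network call
        threading.Thread(target=self._drain, daemon=True).start()

    def put(self, result):
        """Called by the local compute job; returns once the result is buffered."""
        self._q.put(result)

    def _drain(self):
        while True:
            item = self._q.get()
            self._send(item)   # forward when the link has capacity
            self._q.task_done()
```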

With some M.2 devices being composable as well, it might be possible to create both a local and a unified storage space across the endpoints. This would allow processing jobs to function as both disaggregated and aggregated processes. This might be an area where VMware vSAN could work, though network speed between Nanos may be an issue. The data would need to be placed locally first and then propagated to the other nodes, creating unified storage among all the nodes. This might be one method to aggregate and share processed results among the nodes.

Even with local storage, this would still require the uplinks off the network switches, and the switches' backplanes, to be pretty beefy. The minimum uplink bandwidth to support something like this would be two 10 Gbps ports LAGged together, and 40 Gbps would probably be preferable.

In most cases it would also be best to limit the size of a composed system to a single switch. In other words, if the switch has 24 ports, the maximum system size is 24 nodes. This helps avoid extra hops and network lag, and we don't want to congest the northbound network links with east-west traffic from nodes that aren't on the same switch. Again, this is something that could be addressed with some intelligent programming.
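As a rough illustration of that intelligent programming, here's a hypothetical sketch that groups idle nodes by the switch they hang off and caps each composed system at the switch's port count, so east-west traffic never has to cross the uplinks. The node attributes are assumptions for illustration only.

```python
from collections import defaultdict

def plan_composed_systems(idle_nodes, ports_per_switch=24):
    """Group idle Nanos by switch and cap each group at the port count."""
    by_switch = defaultdict(list)
    for node in idle_nodes:
        by_switch[node.switch_id].append(node)  # hypothetical attribute
    # One composed system per switch, never larger than the switch itself.
    return [members[:ports_per_switch] for members in by_switch.values()]
```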

With the networking out of the way, what could a Jetson Nano cluster deliver for an organization? Here are the Nano specs:

GPU: 128-core Maxwell
CPU: Quad-core ARM A57 @ 1.43 GHz
Memory: 4 GB 64-bit LPDDR4, 25.6 GB/s
Storage: microSD (not included)
Video Encode: 4K @ 30 | 4x 1080p @ 30 | 9x 720p @ 30 (H.264/H.265)
Video Decode: 4K @ 60 | 2x 4K @ 30 | 8x 1080p @ 30 | 18x 720p @ 30 (H.264/H.265)
Camera: 1x MIPI CSI-2 DPHY lanes
Connectivity: Gigabit Ethernet, M.2 Key E
Display: HDMI 2.0 and eDP 1.4
USB: 4x USB 3.0, USB 2.0 Micro-B
Others: GPIO, I2C, I2S, SPI, UART

Let's extrapolate that out to a single 24-node composed system and see what its total power would be…

Component | Single Jetson Nano Capacity | Combined Capacity of 24 Jetson Nanos
GPU | 128-core Maxwell | 3,072 Maxwell cores
CPU | Quad-core ARM A57 @ 1.43 GHz | 96 ARM A57 cores @ 1.43 GHz
Memory | 4 GB 64-bit LPDDR4, 25.6 GB/s | 96 GB LPDDR4 RAM
Connectivity | Gigabit Ethernet, M.2 Key E | 24 Gbps of combined network connectivity

At the GPU level, this infrastructure would be roughly equivalent to one NVIDIA M6000 GPU, which has 3,072 Maxwell cores. At the time of this writing Amazon is selling the M6000 for $1,999.00 USD. Using a base cost model, the single GPU would win: $1,999 (1 x $1,999) vs. $2,400 (24 x $100, rounding the Nano to $100) for Jetson Nanos, plus all the added complexity (switch, storage, etc.) and coding (operational state, network segmentation, etc.) required for a Nano farm.

Though that's not a true apples-to-apples comparison. The M6000 would be used exclusively for workload processing, while a composed system of Jetson Nanos would have a dual function: endpoint and workload processing. That means for 8 hours of the day they serve as endpoints, and for the other 16 (ish) hours they are doing processing. Now, we could weight the cost by the hours used for each purpose, which would work out to $800 for thin client duty and $1,600 for processing, though that's not a fair way to look at it either. The thin clients are required (assuming VDI); they are a sunk cost to the organization. In other words, they are already there, and they cost a given amount (for the purposes of this blog, $100) regardless of whether they are used for 8 hours or 24 hours a day.

That means a better way to approach the cost would be something like: the data processing team pays for the upgraded switch(es), the software to control the cluster, and any back-end equipment needed for a composed system. Granted, this still probably wouldn't be economical at a small scale of 24 thin clients; it would need to be a much larger deployment. Think about an office of 500 cubes: that would be 20 composed systems when no one is in the office.

500 cubicles. Idle cubes are represented by green boxes labeled "Idle"; occupied cubes by blue boxes with a person figure in them.
500 cubicles

That's 20 extra M6000-class GPUs working about 16 hours a day on the organization's problems. That works out to roughly 278 extra processing days a year per composed system, or about 5,560 days for 20 composed systems. (5 days a week x 52 weeks a year = 260 weekdays, times 2/3 of a day = 173.333 days, plus 104 full weekend days = 278 days.) In other words, about 76% of a composed system's time over a given year could go to data processing. That's a pretty good result.
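For anyone who wants to check my math, here's the same back-of-the-envelope calculation in a few lines of Python, using the assumptions above (16 idle hours on weekdays, all 24 hours on weekends).

```python
import math

# Extra processing time per composed system per year.
weekday_days = 5 * 52 * (16 / 24)   # 260 weekdays -> ~173.3 full days of processing
weekend_days = 2 * 52               # 104 weekend days, available all day
per_system = weekday_days + weekend_days     # ~277.3, rounded up to 278 in the text
fleet_total = math.ceil(per_system) * 20     # 278 * 20 = 5,560 days for 20 systems
utilization = per_system / 365               # ~76% of the year
print(math.ceil(per_system), fleet_total, f"{utilization:.0%}")
```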

At this time my concept probably doesn't make complete business sense (give it time). The biggest hindrance is someone taking the time to program the logic described. Additionally, I'm not sure what the performance degradation would be from splitting a compute job across 24 nodes instead of running it on a single processor (there is more latency sending a signal over a TCP/IP network than over a few nanometers of copper). Because of these unknowns, I can't say this will save you tons of money; in fact, I'm not sure what the cost of this at scale would be right now, especially if you calculate the cost based on usage or a direct comparison to a dedicated GPU. Still, it's interesting to consider in large organizations.

This thought process does open up a couple of other interesting opportunities, though, where it could prove advantageous.

One is fractional workload processing. Yes, that thing all my AI/ML/DL/HPC folks want for their workloads. Let's say you have a bunch of small jobs that don't need a whole M6000 GPU; each only consumes, let's say, half the GPU, but you must allocate the whole card to run a job. Now let's say you have 1 million of those jobs to run, and each job takes 3 seconds. That's 3 million seconds to run all the jobs. Now let's say I can optimize the operation by splitting it and running it on two systems, each half the size of the original: that's 1.5 million seconds to process all those jobs. In other words, I've gone from 833.33 hours to 416.67 hours. That's a pretty powerful way to optimize resources to fit the workload. It can also scale up (in other words, 1.5 GPUs instead of 2).
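Here's the same fractional-GPU arithmetic spelled out, just to show where those numbers come from; the job count and runtime are the made-up figures from the paragraph above.

```python
# One million 3-second jobs, each needing only half a GPU.
jobs, seconds_per_job = 1_000_000, 3

one_full_gpu = jobs * seconds_per_job        # 3,000,000 s if jobs run one at a time
two_half_gpus = one_full_gpu / 2             # two half-size systems run jobs in parallel
print(one_full_gpu / 3600, two_half_gpus / 3600)   # ~833.3 h vs ~416.7 h
```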

There is also the possibility of doing this programmatically, meaning a function in the program could determine the optimum processing configuration and compose the infrastructure accordingly. That is much further off, though.

Fractional GPU processing can be done using the composability Liqid proposed in the GTC session mentioned above. It can also be done with virtualization technology such as VMware vSphere. My proposal with Nanos just leverages unused resources already in the organization.

Un-optimized utilization consuming a full GPU (left) compared to optimized fractional GPU from composable resources (right).
Fractional GPU workload processing

The second scenario is also interesting, especially with 5G, microservices, containers, smart cities, and many other advancements. Imagine a smart city with hundreds of smart systems processing things in real time, like traffic lights, utilities, etc. Now imagine them functioning as one large unified system that allows processing power to go where it is needed.

Think about an intersection that is busy at night, with hundreds of cars traveling through it every hour. Across town there is another intersection that is only busy during rush hour. And yet another intersection is only busy on the weekends. A composed infrastructure like the one I've described could deliver an optimized pool of resources across a smart city. The unused processing resources from one area (purple circle) could be used to enhance nearby areas that need more processing power (red circle) during peak times. It becomes a dynamic, proactive city that responds to the changing needs of its residents and visitors.

All of these are interesting concepts; they just need someone to build them… maybe I'll see if I can build a composable Nano infrastructure in my home lab. If you've already built this, please share it with the readers and me by posting a link in the comments below.

May your servers keep running and your data center always be chilled.
