
How The FPGA Can Take On CPU And NPU Engines And Win



We made a joke – sort of – a few years back when we started this publication that future compute engines would look more like a GPU card than like a server as we knew it back then. And one of the central tenets of that belief is that, given how many HPC and AI applications are bound by memory bandwidth – not compute capacity or even memory capacity – some kind of extremely close, very high bandwidth memory would come to all manner of calculating chips: GPUs, CPUs, FPGAs, vector engines, whatever.
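The memory-bound claim can be made concrete with a roofline-style sanity check. The sketch below is illustrative only; the peak flops and bandwidth figures are made-up machine parameters, not numbers for any device in this article:

```python
# Roofline-style check: does a kernel hit the compute ceiling or the
# memory bandwidth ceiling first? All figures here are hypothetical.

def bound_by(flops, bytes_moved, peak_flops, peak_bw):
    """Return which ceiling a kernel hits first on a given machine."""
    intensity = flops / bytes_moved   # arithmetic intensity, flops per byte
    ridge = peak_flops / peak_bw      # machine balance point, flops per byte
    return "compute" if intensity >= ridge else "memory"

# A stream-like triad (a[i] = b[i] + s * c[i]) does 2 flops per 24 bytes
# of double-precision traffic, so its intensity is about 0.083 flops/byte.
# A machine with 10 Tflops and 1 TB/sec has a ridge at 10 flops/byte,
# so almost any streaming kernel is memory-bound on it:
print(bound_by(flops=2, bytes_moved=24, peak_flops=10e12, peak_bw=1e12))
# prints "memory"
```

This is exactly why piling on HBM moves the ridge point and lets bandwidth-starved codes get closer to peak.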

This has turned out to be largely true, at least for now, until another memory approach is invented. And if the FPGA – more precisely, the hybrid compute and networking complexes that we call FPGAs even though they are much more than programmable logic blocks – is going to compete for compute jobs, it will need some kind of high bandwidth main memory tightly coupled to it. Which is why Xilinx is now talking about its high-end Versal HBM device, which has been hinted at in the Xilinx roadmaps since 2018 and which is coming to market in about nine months, Mike Thompson, senior product line manager for the beefy Virtex UltraScale+ and Versal Premium and HBM ACAPs at Xilinx, tells The Next Platform. That is about six months later than expected – it is hard to say with the vagaries of the X axis on many vendor roadmaps as they get further away from the Y axis, but estimate for yourself:

Xilinx has been blazing the high bandwidth main memory trail along with a few other device makers, and not as a science experiment but because many latency sensitive workloads in the networking, aerospace and defense, telecom, and financial services industries simply cannot get the job done with standard DRAM or even the very fast SRAMs that are embedded in FPGA logic blocks.

High bandwidth memory originally came in two flavors for datacenter compute engines, but the market has rallied around one of them.

The MCDRAM variant known as Hybrid Memory Cube (HMC) from Intel and Micron Technology was deployed on the Intel “Knights Landing” Xeon Phi devices, which could be used as compute engines in their own right or as accelerators for plain vanilla CPUs. The Xeon Phi could deliver a little more than 400 GB/sec of memory bandwidth across 16 GB of HMC memory to the heavily vectorized Atom cores on the chip, which was significant for the time. This HMC variant was also used in the Sparc64-IXfx processor from Fujitsu, which was aimed at supercomputers, which had 32 GB of capacity, and which delivered 480 GB/sec of bandwidth across its four memory banks.

But with the A64FX Arm-based processor that Fujitsu designed for the “Fugaku” supercomputer, the world’s most powerful machine, Fujitsu switched to the more common second-generation High Bandwidth Memory (HBM2) variant of stacked, parallel DRAM, which was originally created by AMD and memory makers Samsung and SK Hynix and first used in the “Fiji” generation of Radeon graphics cards at about the same time Intel was rolling out the Xeon Phi chips with MCDRAM in 2015.

Fujitsu put four HBM2 channels on the chip, which delivered 32 GB of capacity and a very respectable 1 TB/sec of bandwidth – an order of magnitude or so more than a CPU socket delivers, just to put that into perspective.

Given the need for high bandwidth and larger capacity than integrated SRAM could offer, Xilinx put 16 GB of HBM2 memory, delivering 460 GB/sec of bandwidth, on its prior generation of Virtex UltraScale+ FPGAs. As you can see, this is about half of what the flops-heavy compute engines of the time were offering, and you will see this pattern again. The speed is balanced against the needs of the workloads and the price point that customers need. Those buying beefy FPGAs have just as much need for high speed SerDes for signaling, so they have to trade off networking and memory to stay within a thermal envelope that makes sense for the use cases.

Nvidia has taken HBM capacity and bandwidth to extremes as it has delivered three generations of HBM2 memory on its GPU accelerators, with the current “Ampere” devices having a maximum of 80 GB of capacity yielding a very impressive 2 TB/sec of bandwidth. And this need for speed – and capacity – is being driven by flops-ravenous AI workloads, which have exploding datasets to chew on. HPC codes running on hybrid CPU-GPU systems can live in smaller memory footprints than many AI codes, which is fortunate, but that may not remain true as more memory becomes available. All applications and datasets eventually grow to consume all capacities and bandwidths.

Some devices fit in the middle of these two extremes when it comes to HBM memory. NEC’s “Aurora” vector accelerators launched four years ago had 48 GB of HBM2 memory and 1.2 TB/sec of bandwidth, beating the “Volta” generation of GPU accelerators from Nvidia at the time. But the updated Amperes launched this year just blow everything else away in terms of HBM2 capacity and bandwidth. Intel has just announced that its future “Sapphire Rapids” Xeon SP processors, now expected next year, will have a variant that supports HBM2 memory, and of course the companion “Ponte Vecchio” Xe HPC GPU accelerator from Intel will have HBM2 memory stacks, too. We don’t know where Intel will end up on the HBM2 spectrum with its CPUs and GPUs, but probably somewhere between the extremes for the CPUs and near the extremes for the GPUs if Intel is truly serious about competing.

The forthcoming Versal HBM devices from Xilinx are taking a middle course as well, for the same reasons that the Virtex UltraScale+ devices did when they were unveiled in November 2016. But Xilinx is also adding in other HBM innovations that reduce latency further than others do per unit of capacity and bandwidth.

The Versal HBM device is based on the Versal Premium device, which we detailed in March 2020. That Versal Premium complex has four super logic regions, or SLRs as Xilinx calls them, and one of these SLRs is swapped out for two banks of eight-high stacks of HBM2e memory. Each stack has a maximum of 16 GB for a total of 32 GB, and memory across the SKUs is available in 8 GB, 16 GB, and 32 GB with varying amounts of compute and interconnect. The SLR immediately adjacent to the swapped-in HBM memory has an HBM controller and an HBM switch – both of which are designed by Xilinx – embedded in it, which Thompson says is relatively small. This HBM switch is a key differentiator.

“One of the challenges with HBM is that you can’t access every memory location from any of the memory ports, and we have 32 memory ports on this device,” explains Thompson. “Other products on the market don’t build in a switch, either, which means they have to spend a considerable amount of soft logic to create a switch of their own, which eats a significant chunk of the logic in these devices and somewhere between 4 watts and 5 watts of power. With other devices using HBM, not having a switch causes huge overhead and additional latency, as memory maps end up being much more annoying than they should be.”

That is yet another piece of FPGA logic being hard-coded in transistors for efficiency, along with the SerDes and many other accelerators. Here is what the Versal HBM block diagram looks like:

As with the Versal Premium devices, the Versal HBM devices have scalar processing engines based on Arm cores, programmable logic that implements the FPGA functionality and its various internal memories, and DSP engines that do mixed-precision math for machine learning, imaging, and signal processing applications. Attached to all of this is the HBM memory plus a slew of hard-coded I/O controllers and SerDes that make data zip into and out of these chips at lightning speed. One of the reasons FPGA customers want HBM memory on such a device is that it has so many different kinds of I/O adding up to so much aggregate bandwidth.

The PCI-Express 5.0 controllers, which support DMA as well as the CCIX and CXL memory coherence protocols, have an aggregate of 1.5 Tb/sec of bandwidth, and the chip-to-chip Interlaken interconnect has an integrated forward error correction (FEC) accelerator and delivers 600 Gb/sec of aggregate bandwidth. The cryptographic engines, which are also hard-coded like the PCI-Express and Interlaken controllers, support AES-GCM at 128 bits and 256 bits as well as the MACsec and IPsec protocols, and deliver 1.2 Tb/sec of aggregate bandwidth, enough to do encryption at 400 Gb/sec to match the line rate of a 400 Gb/sec Ethernet port. The hard-coded Ethernet controllers can drive 400 Gb/sec ports (with 58 Gb/sec PAM4 signaling) and 800 Gb/sec ports (with 112 Gb/sec PAM4 signaling) as well as anything down the Ethernet steps to 10 Gb/sec using legacy 32 Gb/sec NRZ signaling; all told, the chip has an aggregate Ethernet bandwidth of 2.4 Tb/sec.
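Those hard I/O figures can be tallied with a quick back-of-envelope sum to see just how much data the memory system has to keep fed (the per-block numbers are the ones quoted above; the conversion is simply Tb/sec × 1,000 ÷ 8):

```python
# Summing the hard I/O bandwidth figures quoted in the article (Tb/sec)
# to show the aggregate traffic the on-package memory has to balance.
io_tbps = {
    "pcie_gen5": 1.5,    # PCI-Express 5.0 controllers
    "interlaken": 0.6,   # chip-to-chip Interlaken interconnect
    "crypto": 1.2,       # hard-coded cryptographic engines
    "ethernet": 2.4,     # hard-coded Ethernet controllers
}
total_tbps = sum(io_tbps.values())
total_gbytes = total_tbps * 1000 / 8   # Tb/sec -> GB/sec
print(f"{total_tbps:.1f} Tb/sec aggregate, about {total_gbytes:.0f} GB/sec")
```

Roughly 5.7 Tb/sec, or on the order of 700 GB/sec of potential I/O traffic – which is why a few DDR channels on the side simply would not cut it.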

This Versal HBM device is a bandwidth beast on I/O, and for certain applications that means it needs to be a memory bandwidth beast to balance it out. And the Versal HBM device is much more of a beast than the Virtex UltraScale+ HBM device it is going to replace, and proves it on many different metrics beyond HBM memory capacity and bandwidth. That is enabled by architectural changes and the shift from 16 nanometer processes down to 7 nanometers (thanks to fab partner Taiwan Semiconductor Manufacturing Corp).

Thompson says the Versal HBM device has the equivalent of 14 FPGAs of logic, and that its HBM has the equivalent bandwidth of 32 DDR5-6400 DRAM modules.

The device has 8X the memory bandwidth and uses 63 percent less power than four DDR5-6400 modules of the same capacity, Xilinx estimates:
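Those two DDR5 equivalence claims are at least consistent with each other, as a bit of back-of-envelope math shows (the 64-bit module data width is standard DDR; everything else follows from the DDR5-6400 transfer rate):

```python
# Back-of-envelope check of the DDR5 equivalence claims quoted above.
# A DDR5-6400 module moves 6400 MT/sec across a 64-bit data bus.
mts = 6400e6                 # transfers per second
bytes_per_transfer = 8       # 64-bit module data width
dimm_bw = mts * bytes_per_transfer / 1e9      # GB/sec per module

print(f"one DDR5-6400 module: {dimm_bw:.1f} GB/sec")         # 51.2 GB/sec
print(f"32 modules: {32 * dimm_bw / 1000:.2f} TB/sec")       # 1.64 TB/sec
print(f"8X four modules: {8 * 4 * dimm_bw / 1000:.2f} TB/sec")  # 1.64 TB/sec
```

Both the "32 modules" and the "8X four modules" claims land on the same figure, roughly 1.6 TB/sec of HBM2e bandwidth.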

So how will the Versal HBM device stack up against prior Xilinx devices and Intel Agilex devices and Intel and AMD CPUs? Well, you can forget any comparisons to AMD Epyc CPUs with AMD in the middle of buying Xilinx for $35 billion. And Thompson didn’t bring any comparisons to Intel ACAP-equivalent devices, either. But he did bring some charts that pit two-socket Intel “Ice Lake” Xeon SP systems against the Virtex HBM and Versal HBM devices, and here is what that looks like:

In the clinical data recommendation engine test on the left of the chart above, the CPU-only system takes seconds to minutes to run, but the older Virtex HBM device was able to hold a database that was twice as large because of the speed at which it could stream data into the device, and it was 100X faster at making recommendations for treatments. The Versal HBM device held a database twice as large again and derived the recommendations twice as fast. The same relative performance was seen with the real-time fraud detection benchmark on the right.

Here is another way to think about how the Versal HBM device might be used, says Thompson. Say you want to build a next-generation 800 Gb/sec firewall that has machine learning smarts built in. If you want to employ the Marvell Octeon network processor SoC, which can only drive 400 Gb/sec ports, you will need two of them, and they don’t have machine learning. So you will need two Virtex UltraScale+ FPGAs to add that functionality to the pair of Octeons. It will also take a dozen DDR4 DRAM modules to deliver 250 GB/sec of memory throughput. Like this:
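As an aside, the dozen-DIMM figure is plausible: a quick sketch (assuming the standard 64-bit DDR4 module data width) shows that 250 GB/sec across twelve modules implies a mainstream DDR4 speed grade, in the neighborhood of DDR4-2666:

```python
# Sanity check on the dozen-DIMM figure quoted above: what DDR4 speed
# grade does 250 GB/sec across 12 modules imply? (64-bit width assumed.)
target_gbs = 250
modules = 12
per_module = target_gbs / modules            # about 20.8 GB/sec each
implied_mts = per_module * 1e9 / 8 / 1e6     # transfer rate in MT/sec
print(f"{per_module:.1f} GB/sec per module, roughly DDR4-{implied_mts:.0f}")
```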

Presumably, the Versal HBM system is not only better in terms of having fewer devices, more throughput, and less power consumption, but is also cheaper to buy. We don’t know, because Xilinx doesn’t give out pricing. And if not, it surely has to deliver better bang for the buck and better performance per dollar per watt, or there is no sense in playing this game at all. By how much, we would love to know.
