SambaNova claims that its latest chips can significantly outperform Nvidia’s A100 silicon, at least when it comes to machine learning workloads.
The Palo Alto-based AI startup this week unveiled its DataScale systems and Cardinal SN30 accelerator, which the company says are capable of delivering 688 TFLOPS of BF16 performance, twice that of Nvidia’s A100.
However, SambaNova says the gap is even greater in machine-learning training workloads. The company claims its SN30-based DataScale systems are six times faster than Nvidia’s DGX A100 servers when training a 13-billion-parameter GPT model — at least according to its internal benchmarks, so take them with a good dose of salt.
The SN30 is fabricated on a TSMC 7nm process node and packs 86 billion transistors into a single chip. The chip itself is a bit unconventional compared to other high-performance accelerators on the market today, in that it isn’t a traditional GPU, CPU, or FPGA.
SambaNova describes the chip as a reconfigurable dataflow unit, or RDU. “Reconfigurability is key to the architecture, so unlike a GPU or CPU, which have fixed elements, think of it as an on-chip package of compute and memory,” Marshall Choy, SVP of product at SambaNova Systems, told The Register.
In many ways, the RDU is reminiscent of an FPGA, although, as Choy points out, it’s nowhere near as fine-grained.
According to Choy, the closest comparison is a coarse-grained reconfigurable architecture (CGRA), which typically lacks the gate-level control of an FPGA but benefits from lower power consumption and faster reconfiguration times.
“We talk about our chip and hardware as being software defined, because we’re actually reconfiguring it on each input to match the needs of the operator that’s running,” Choy said.
For example, while the chip lacks the large matrix-math engines you might find in a dedicated AI accelerator, it can reconfigure itself to achieve the same results. This is done using SambaNova’s software stack, which extracts common parallel patterns, Choy explained.
Reducing memory bottlenecks
The SN30’s configurability is only one part of the equation; memory is the other, notes Choy.
The chip features 640MB of SRAM cache, combined with a much larger 1TB of external DRAM per socket. Choy says this approach – a relatively small cache backed by a large external DRAM capacity – allows the company’s technology to support large natural language processing (NLP) models more efficiently.
SambaNova’s argument seems to be that to run these big models on off-the-shelf GPUs, you have to pack a lot of those GPUs into a system and pool their onboard memory to hold all that data within fast reach, whereas you need fewer SN30 chips because they can hold the model in their large DDR-attached external DRAM.
For example, you might have an 800GB model that needs ten 80GB Nvidia GPUs just to keep everything in memory, yet not need ten GPUs’ worth of compute to perform the task, so you’re wasting money, energy, and space on unneeded silicon. You could instead do it with a few SN30s and use their large external DRAM to hold the model, or so SambaNova’s logic goes.
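The capacity arithmetic above can be sketched in a few lines. This is an illustration of the article’s example numbers, not SambaNova’s sizing methodology; the function name and the 1TB-per-socket figure (taken as 1024GB) are our assumptions.

```python
import math

def accelerators_needed(model_gb: float, mem_per_device_gb: float) -> int:
    """Minimum number of devices required just to hold the model in memory,
    ignoring compute requirements, activations, and optimizer state."""
    return math.ceil(model_gb / mem_per_device_gb)

# Illustrative numbers from the article: an 800GB model on 80GB-per-GPU
# hardware versus SN30 sockets with roughly 1TB of external DRAM each.
gpus  = accelerators_needed(800, 80)    # ten GPUs just for capacity
sn30s = accelerators_needed(800, 1024)  # a single socket, on capacity alone

print(gpus, sn30s)  # 10 1
```

On capacity alone the model fits in one SN30 socket, though in practice more sockets would be used for compute throughput.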
“If you look at NLP, for example, Nvidia and everyone else is just doing fast compute. We need X amount of memory, so we need that many GPUs,” Choy said. “What we’ve done is design our system to provide 12.8 times more memory than an Nvidia-based [80GB-per-GPU] system.”
So SambaNova appears to be trading the performance hit it takes from using this external DRAM against that memory’s massive and relatively cheap capacity, as well as the performance of its chip architecture.
“We see cases where it may take 1,400 GPUs to do the job. We’re throwing 64 sockets at it because we have 12.8 times the memory,” Choy said.
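The quoted 12.8x figure checks out if you compare like-for-like eight-device systems – our assumption, not a comparison SambaNova spells out – with 1TB taken as 1024GB:

```python
# Eight SN30 sockets with 1TB of external DRAM each, versus an
# eight-GPU system with 80GB of onboard memory per GPU.
sn30_system_gb = 8 * 1024  # 8192GB total
gpu_system_gb  = 8 * 80    # 640GB total

ratio = sn30_system_gb / gpu_system_gb
print(ratio)  # 12.8
```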
It should be noted that SambaNova’s approach to this problem is by no means new. Graphcore used a similar SRAM-plus-external-DRAM memory architecture in its intelligence processing units. Meanwhile, Nvidia’s Grace-Hopper Superchips pair the company’s Arm-compatible CPU with a GH100 GPU backed by 80GB of HBM and 512GB of LPDDR5.
An AI data center as a service
Unlike Nvidia, SambaNova doesn’t sell its accelerators as bare chips or PCIe cards for integration into OEM systems. The SN30 is only available as part of a complete system and is designed for use with the company’s software stack.
“The smallest consumable unit would be a complete eight-socket system from us,” Choy said.
In fact, the systems ship as complete racks with integrated power and networking. In this regard, DataScale is more comparable to Nvidia’s DGX servers, which are designed for rack-scale deployment using the chip giant’s proprietary switches.
Four DataScale systems can be installed in one rack, and the company says it can scale up to 48 racks in large-scale deployments.
Beyond hardware and software, the company also offers fully trained foundation models for customers who lack the expertise or interest to develop and train their own.
According to Choy, this is a frequent request from customers who prefer to focus on the science and data engineering associated with refining datasets rather than training models.
However, AI infrastructure and software remains prohibitively expensive for many customers, with single systems often costing hundreds of thousands of dollars each.
Recognizing this, SambaNova plans to offer its DataScale and SambaFlow software suite as a subscription service from the outset.
Choy says the approach will allow customers to get a return on investment faster and with less risk than outright purchasing AI infrastructure. ®