Cerebras CS-2 Wafer Scale Chip Outperforms Every Single GPU By Leaps & Bounds, Breaks Record of Largest AI Model Trained on A Single Device

OSTN Staff


Cerebras has just announced a milestone for the company: the training of the world’s largest Natural Language Processing (NLP) AI model on a single device. That device is the CS-2, built around the company’s Wafer Scale Engine-2, the largest accelerator chip ever developed and manufactured.

Cerebras fits twenty-billion-parameter workloads on a single chip

The artificial intelligence model trained by Cerebras reached a remarkable twenty billion parameters, and the company did so without scaling the workload across numerous accelerators. The feat matters for machine learning because it reduces the infrastructure and software complexity compared to what models of this size previously required.

The Wafer Scale Engine-2 is etched onto a single 7 nm wafer, the equivalent of hundreds of premium chips on the market, and packs 2.6 trillion transistors. Alongside those transistors, the Wafer Scale Engine-2 incorporates 850,000 cores and 40 GB of integrated on-chip memory, all at a power consumption of 15 kW. Tom’s Hardware notes that “a single CS-2 system is akin to a supercomputer all on its own.”
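As a rough back-of-the-envelope illustration (our arithmetic, not Cerebras’): assuming the weights are stored in half precision (fp16), a 20-billion-parameter model needs about 40 GB just for its weights, which happens to match the WSE-2’s on-chip memory capacity.

```python
# Illustrative arithmetic (not from Cerebras): weight storage for a
# 20-billion-parameter model, assuming fp16 (2 bytes per parameter).

params = 20e9            # model size reported in the article
bytes_per_param = 2      # assumption: fp16/bf16 weights

weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: {weights_gb:.0f} GB")      # -> 40 GB

# Per-core share of the chip's 40 GB on-chip memory, for a sense of scale:
cores = 850_000
print(f"Per-core share: {40e9 / cores / 1024:.0f} KiB")  # ~46 KiB per core
```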

For Cerebras, fitting a 20-billion-parameter NLP model on a single chip means customers avoid the cost overhead of training across thousands of GPUs, along with the hardware and scaling requirements that entails. It also eliminates the technical pain of partitioning models across those devices, which the company calls “one of the most painful aspects of NLP workloads, […] taking months to complete.”

Partitioning is a bespoke problem, unique not only to each neural network being trained but also to the specifications of each GPU and the network tying all the components together, and researchers must solve it before the first stage of training can begin. The resulting setup is also one of a kind: it cannot be carried over to other systems.
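To see why the problem is so bespoke, consider a minimal sketch of the partitioning decision in plain Python. The layer sizes and GPU memory budgets below are invented for illustration; a real plan would also have to account for interconnect bandwidth, activation memory, and pipeline scheduling.

```python
# Hypothetical sketch of the partitioning problem: assigning a model's layers
# to GPUs so that each device's share of the weights fits in its memory.
# Layer sizes and device budgets are invented for illustration only.

layer_params = [2.5e9] * 8          # 8 blocks of ~2.5B params (~20B total)
device_budgets_gb = [16, 16, 16]    # e.g., three 16 GB GPUs (invented fleet)
bytes_per_param = 2                 # assumption: fp16 weights

def partition(layers, budgets_gb):
    """Greedily pack consecutive layers onto devices until each budget fills."""
    plan, device, used = [[]], 0, 0.0
    for size in layers:
        need = size * bytes_per_param / 1e9
        if used + need > budgets_gb[device]:   # current GPU is full: move on
            device += 1
            if device == len(budgets_gb):
                raise RuntimeError("model does not fit on this fleet")
            plan.append([])
            used = 0.0
        plan[device].append(size)
        used += need
    return plan

for gpu, layers in enumerate(partition(layer_params, device_budgets_gb)):
    total_gb = sum(layers) * bytes_per_param / 1e9
    print(f"GPU {gpu}: {len(layers)} layers, {total_gb:.1f} GB of weights")
```

Change the fleet and the whole plan has to be redone from scratch, which is exactly the non-portability described above.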

In NLP, bigger models are shown to be more accurate. But traditionally, only a select few companies had the resources and expertise necessary to do the painstaking work of breaking up these large models and spreading them across hundreds or thousands of graphics processing units. As a result, few companies could train large NLP models – it was too expensive, time-consuming, and inaccessible for the rest of the industry. Today we are proud to democratize access to GPT-3XL 1.3B, GPT-J 6B, GPT-3 13B, and GPT-NeoX 20B, enabling the entire AI ecosystem to set up large models in minutes and train them on a single CS-2.

— Andrew Feldman, CEO and Co-Founder, Cerebras Systems

Admittedly, we have seen systems perform exceptionally well with fewer parameters. One such system is DeepMind’s Chinchilla, a 70-billion-parameter model that consistently outperforms both GPT-3 and Gopher. However, Cerebras’ accomplishment remains significant because researchers will be able to compute and create increasingly elaborate models on the Wafer Scale Engine-2 where other hardware cannot.

The technology that makes this vast parameter count workable is the company’s Weight Streaming, which allows researchers to “decouple compute and memory footprints, allowing for memory to be scaled towards whatever the amount is needed to store the rapidly-increasing number of parameters in AI workloads.” In turn, the time needed to set up training drops from months to minutes with only a few standard commands, making it possible to switch seamlessly between models such as GPT-J and GPT-Neo.
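Cerebras does not detail Weight Streaming’s internals here, but the decoupling it describes can be sketched conceptually: the weights live in an external store and are fed to the chip one layer at a time, so total model size is bounded by external memory rather than on-chip memory. The loop below is a hypothetical illustration in plain Python, not Cerebras code.

```python
# Conceptual (hypothetical) sketch of the decoupling idea: activations stay on
# the device while each layer's weights are streamed in from an off-chip
# store, so only one layer's weights are resident at any moment.

def stream_weights(num_layers):
    """Stand-in for an off-chip parameter store feeding one layer at a time."""
    for layer_id in range(num_layers):
        yield f"weights_for_layer_{layer_id}"    # placeholder payload

def apply_layer(activations, weights):
    # Placeholder compute: a real layer would be a matmul against `weights`.
    return activations + [weights]

def forward_pass(activations, num_layers):
    for weights in stream_weights(num_layers):
        activations = apply_layer(activations, weights)
    return activations

print(forward_pass([], num_layers=3))
```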

Cerebras’ ability to bring large language models to the masses with cost-efficient, easy access opens up an exciting new era in AI. It gives organizations that can’t spend tens of millions an easy and inexpensive on-ramp to major league NLP. It will be interesting to see the new applications and discoveries CS-2 customers make as they train GPT-3 and GPT-J class models on massive datasets.

— Dan Olds, Chief Research Officer, Intersect360 Research


