Intel 4th Gen Xeon “Sapphire Rapids” CPUs Achieve Up To 10x AI Stable Diffusion Performance With AMX

Intel’s 4th Gen Xeon CPUs codenamed Sapphire Rapids have achieved up to 10x performance uplift in AI Stable Diffusion thanks to AMX.

Intel Boosts AI Stable Diffusion Performance With AMX Acceleration on 4th Gen Xeon Sapphire Rapids CPUs

The recently launched Intel 4th Gen Xeon “Sapphire Rapids” CPUs have seen accelerated adoption in the cloud and data center segment. One of the key areas where Intel has put extra effort is their hardware feature set for deep learning acceleration which is boosted with the new AMX (Advanced Matrix Extension) accelerators.

Intel first showcases the average latency of the current-gen Sapphire Rapids versus the last-gen Ice Lake CPUs. The 3rd Gen Xeon CPUs need around 45 seconds to run the Stable Diffusion inference code while the 4th Gen CPUs take 32.3 seconds, roughly 28% lower latency without any changes to the code. So what if Intel were to use an optimized, open-source toolkit for high-performance inference such as OpenVINO?
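Such average-latency figures are typically gathered by timing several end-to-end generations and taking the mean. The sketch below (a minimal illustration, not Intel's actual benchmark harness; the `benchmark_pipeline` helper and its parameters are assumptions) also sanity-checks the reported ~28% reduction:

```python
import time

def average_latency(latencies):
    """Mean wall-clock latency over a list of timed runs."""
    return sum(latencies) / len(latencies)

# Sanity check on the reported figures: 45 s (Ice Lake) vs. 32.3 s
# (Sapphire Rapids) works out to roughly a 28% latency reduction.
REDUCTION_PCT = (1 - 32.3 / 45.0) * 100

def benchmark_pipeline(pipe, prompt, runs=5):
    # Hypothetical harness: `pipe` would come from
    # diffusers.StableDiffusionPipeline.from_pretrained(...), which
    # downloads the model weights, so it is not invoked here.
    times = []
    for _ in range(runs):
        start = time.time()
        pipe(prompt)
        times.append(time.time() - start)
    return average_latency(times)
```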

The answer is an even bigger speedup! With Optimum Intel and OpenVINO, the Xeon CPUs drop the latency to 16.7 seconds, almost a 2x speedup. Further optimizing the code to a fixed resolution (a static input shape) drops the latency to just 4.7 seconds, an additional 3.5x speedup and roughly 6.9x over the untouched code.

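The OpenVINO path described above can be sketched with Optimum Intel's `OVStableDiffusionPipeline`, where `reshape()` fixes the input resolution so static-shape kernels can be compiled. A minimal sketch, assuming `pip install optimum[openvino]`; the model id is an example, and the helper name is hypothetical:

```python
# Fixed 512x512, one image per prompt -- the static shape that enables
# the additional ~3.5x speedup over dynamic-shape OpenVINO inference.
STATIC_SHAPE = {"batch_size": 1, "height": 512, "width": 512,
                "num_images_per_prompt": 1}

def build_static_pipeline(model_id="runwayml/stable-diffusion-v1-5"):
    # Heavy step: downloads the checkpoint and converts it to OpenVINO IR,
    # so this function is defined but not called here.
    from optimum.intel import OVStableDiffusionPipeline

    pipe = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)
    pipe.reshape(**STATIC_SHAPE)  # lock in static input shapes
    pipe.compile()                # pre-compile the optimized graph
    return pipe
```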

As you can see, OpenVINO is a simple and efficient way to accelerate Stable Diffusion inference. When combined with a Sapphire Rapids CPU, it delivers almost 10x speedup compared to vanilla inference on Ice Lake Xeons.

If you can’t or don’t want to use OpenVINO, the rest of this post will show you a series of other optimization techniques. Fasten your seatbelt!

We also enable the bfloat16 data format to leverage the AMX tile matrix multiply unit (TMMU) accelerator present on Sapphire Rapids CPUs.

With this updated version, inference latency is further reduced from 11.9 seconds to 5.4 seconds. That’s more than 2x acceleration thanks to IPEX and AMX.
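In code, the IPEX and bfloat16 combination roughly looks like the sketch below, assuming `torch`, `intel-extension-for-pytorch`, and `diffusers` are installed; the helper names are hypothetical, and `ipex.optimize` with `dtype=torch.bfloat16` is what routes matrix multiplies to the AMX tile units:

```python
AMX_DTYPE = "bfloat16"  # the format the AMX TMMU accelerates

def optimize_for_amx(pipe):
    # Heavy: requires a loaded diffusers pipeline, so not invoked here.
    import torch
    import intel_extension_for_pytorch as ipex

    # Optimize the heaviest submodules for CPU inference in bfloat16.
    pipe.unet = ipex.optimize(pipe.unet.eval(), dtype=torch.bfloat16)
    pipe.vae = ipex.optimize(pipe.vae.eval(), dtype=torch.bfloat16)
    return pipe

def generate(pipe, prompt):
    import torch
    # Run under autocast so activations stay in bfloat16 end to end.
    with torch.cpu.amp.autocast(dtype=torch.bfloat16), torch.inference_mode():
        return pipe(prompt).images[0]
```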

With this final version, inference latency is now down to 5.05 seconds. Compared to our initial Sapphire Rapids baseline (32.3 seconds), this is almost 6.5x faster!
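The cumulative figure checks out against the quoted latencies:

```python
# Quick check of the cumulative speedup claim.
BASELINE_SPR = 32.3  # seconds, unmodified code on Sapphire Rapids
FINAL_SPR = 5.05     # seconds, with system-level opts, IPEX, and bfloat16

SPEEDUP = BASELINE_SPR / FINAL_SPR  # ~6.4x, matching "almost 6.5x"
```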

via Intel

Further system-level optimizations, IPEX, and BF16 bring even more performance to the table, and the results can be seen in the chart provided by Intel itself:

Intel’s Sapphire Rapids Xeon CPUs are currently available for preview on the Amazon EC2 R7iz instances; you can sign up here to access them and see the advantages that the 4th Gen CPU family brings to the table. With Stable Diffusion and similar AI models growing in popularity, it is easy to see why Intel’s CPUs could become a popular choice in this segment.

The post Intel 4th Gen Xeon “Sapphire Rapids” CPUs Achieve Up To 10x AI Stable Diffusion Performance With AMX by Hassan Mujtaba appeared first on Wccftech.