BF16, or bfloat16, is a 16-bit floating-point format derived from the IEEE 32-bit single-precision type (FP32): it keeps FP32's sign bit and 8-bit exponent but truncates the mantissa from 23 bits to 7. It is used to accelerate machine learning by halving storage requirements and increasing the calculation speed of ML algorithms.
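For illustration only (this is not Tachyum code): because a BF16 value is simply the upper 16 bits of the corresponding FP32 bit pattern, conversion between the two formats can be sketched in a few lines of C. The function names are hypothetical, and NaN handling is omitted for brevity.

    #include <stdint.h>
    #include <string.h>

    /* bfloat16 keeps FP32's sign bit and 8 exponent bits but truncates
       the 23-bit mantissa to 7 bits, so a BF16 value is the upper 16
       bits of the corresponding FP32 bit pattern. */
    static uint16_t fp32_to_bf16(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);       /* reinterpret the float */
        bits += 0x7FFF + ((bits >> 16) & 1);  /* round to nearest even */
        return (uint16_t)(bits >> 16);        /* keep the high half */
    }

    static float bf16_to_fp32(uint16_t h)
    {
        uint32_t bits = (uint32_t)h << 16;    /* widening is exact */
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }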
Tachyum has added BF16 support to its GCC 13.2 (GNU Compiler Collection) toolchain and tested the software integration. In hardware, Prodigy supports the same set of floating-point operations for BF16 as for IEEE FP32 and FP64.
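As a rough sketch of what source-level BF16 support looks like, the snippet below uses the __bf16 extension type that mainline GCC 13 offers on BF16-capable targets. The specifics of Tachyum's Prodigy port are not covered in this article, so this is a generic illustration of the usage pattern rather than Prodigy code.

    /* Compile with a GCC 13 toolchain on a BF16-capable target. */
    #include <stdio.h>

    int main(void)
    {
        __bf16 a = (__bf16)1.5f;          /* FP32 constants narrow to BF16 */
        __bf16 b = (__bf16)2.25f;
        float sum = (float)a + (float)b;  /* widen back to FP32 to compute */
        printf("sum = %f\n", sum);        /* prints sum = 3.750000 */
        return 0;
    }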
Tachyum’s Prodigy was designed from the ground up to handle matrix and vector processing. Its vector and matrix features include support for a range of data types (FP64, FP32, TF32, BF16, Int8, FP8, FP4 and TAI); 2x1024-bit vector units per core; AI sparsity and super-sparsity support; and no penalty for misaligned vector loads or stores that cross cache lines.
According to Tachyum, this built-in support delivers high performance for AI training and inference workloads while reducing memory utilisation.
"Two months ago, Tachyum successfully integrated the BF16 datatype into Prodigy’s GCC compiler and software distribution," said Dr. Radoslav Danilak, founder and CEO of Tachyum. “Since there is no standard application test, we developed an AI inference BF16 test application that has since been vetted for use on the Prodigy FPGA. The BF16 test application on FPGA shows that it performs optimally as part of Tachyum AI’s tensor matrix operations.”
Because Prodigy is a Universal Processor suitable for all workloads, Prodigy-powered data centre servers will be able to switch dynamically between computational domains (such as AI/ML, HPC, and cloud) on a single homogeneous architecture.
According to Tachyum, by eliminating the need for expensive dedicated AI hardware and dramatically increasing server utilisation, Prodigy will reduce CAPEX and OPEX significantly while delivering major improvements in performance, power, and economics.
Prodigy integrates 192 high-performance custom-designed 64-bit compute cores to deliver up to 4.5x the performance of the highest-performing x86 processors for cloud workloads, up to 3x that of the highest-performing GPU for HPC, and up to 6x for AI applications.