Why it matters: Built for AI-related tasks, Eos has some noteworthy specs. Nvidia calls it an AI factory, an arguably accurate description. It also showcases what Nvidia’s technologies can do when working at scale.
Nvidia has given enthusiasts their first look at Eos, a data-center-scale supercomputer designed for AI applications. The company first introduced Eos at the Supercomputing Conference in November 2023 but didn't reveal its specs at the time.
Eos sports 576 Nvidia DGX H100 systems – each equipped with eight H100 Tensor Core GPUs, for a total of 4,608 GPUs – along with Nvidia Quantum-2 InfiniBand networking and software. This combination delivers 18.4 exaflops of FP8 AI performance.
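For a rough sense of where that 18.4-exaflop figure comes from, here is a quick back-of-envelope check. It uses only the system counts quoted above plus the peak FP8 throughput (with sparsity) listed on Nvidia's public H100 datasheet; the per-GPU figure is our assumption for illustration, not a number Nvidia cites for Eos specifically.

```python
# Back-of-envelope check of Eos's headline numbers, using figures from this
# article plus Nvidia's H100 datasheet. An illustration, not an official derivation.
DGX_SYSTEMS = 576
GPUS_PER_DGX = 8
FP8_PFLOPS_PER_H100 = 3.958  # H100 SXM peak FP8 with sparsity (assumed from Nvidia's datasheet)

total_gpus = DGX_SYSTEMS * GPUS_PER_DGX
total_exaflops = total_gpus * FP8_PFLOPS_PER_H100 / 1000

print(f"GPUs: {total_gpus}")                   # 4608
print(f"Peak FP8: ~{total_exaflops:.1f} EF")   # ~18.2 EF; rounding ~4 PFLOPS per GPU gives the quoted 18.4
```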
With a network architecture supporting data transfer speeds of up to 400Gb/s, Eos can handle AI workloads such as training large language models and recommender systems, as well as running quantum simulations. Nvidia says it built Eos on lessons learned from prior DGX supercomputers, such as Saturn V and Selene, and its own developers use it for their AI work.
Eos raised eyebrows last year when it ranked No. 9 on the Top500 list of the world's fastest supercomputers – a notable achievement, ServeTheHome points out, since Nvidia shifted its focus away from double-precision (FP64) performance toward AI performance some time ago. The fastest system on the Top500 list is Frontier, housed at Oak Ridge National Laboratory in Tennessee, with an HPL score of 1,194 PFlop/s versus 121.4 PFlop/s for Eos. Chances are good that Eos's score will improve in time.
Last November, Eos completed an MLPerf training benchmark based on the 175-billion-parameter GPT-3 model, trained on one billion tokens, in just 3.9 minutes – a nearly 3x gain over the 10.9 minutes it needed six months earlier. Nvidia claims that because the benchmark uses only a portion of the complete GPT-3 data set, by extrapolation, Eos could now train the full model in just eight days – 73x faster than a system using 512 A100 GPUs, the state of the art when GPT-3 came out in 2020.
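To see how those figures fit together, the quick arithmetic below uses only the numbers quoted in this article; the extrapolation method itself is Nvidia's, and this sketch simply checks the ratios.

```python
# Sanity check of the benchmark claims above, using only numbers quoted in
# this article (an illustration, not Nvidia's actual methodology).
new_run_min = 3.9    # Nov 2023 MLPerf GPT-3 175B result
old_run_min = 10.9   # result from six months earlier
print(f"Speedup: {old_run_min / new_run_min:.1f}x")  # ~2.8x, i.e. "nearly 3x"

eos_full_train_days = 8   # Nvidia's extrapolation to the full GPT-3 data set
a100_speedup = 73         # claimed advantage over a 512x A100 system
a100_days = eos_full_train_days * a100_speedup
print(f"512x A100 estimate: ~{a100_days} days (~{a100_days / 365:.1f} years)")  # ~584 days
```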
Eos also comes with an integrated software stack for AI development and deployment, which includes orchestration and cluster management; accelerated compute, storage, and networking libraries; and an operating system optimized for AI workloads.