The most powerful end-to-end AI supercomputing platform.
Massive datasets, exploding model sizes, and complex simulations require multiple GPUs with extremely fast interconnections and a fully accelerated software stack. The NVIDIA HGX™ AI supercomputing platform brings together the full power of NVIDIA GPUs, NVIDIA® NVLink®, NVIDIA InfiniBand networking, and a fully optimized NVIDIA AI and HPC software stack from the NVIDIA NGC™ catalog to provide the highest application performance. With its end-to-end performance and flexibility, NVIDIA HGX enables researchers and scientists to combine simulation, data analytics, and AI to drive scientific progress.
NVIDIA HGX combines NVIDIA A100 Tensor Core GPUs with high-speed interconnects to form the world’s most powerful servers. With 16 A100 GPUs, HGX has up to 1.3 terabytes (TB) of GPU memory and over 2 terabytes per second (TB/s) of memory bandwidth for unprecedented acceleration.
Compared to previous generations, HGX provides up to a 20X AI speedup out of the box with Tensor Float 32 (TF32) and a 2.5X HPC speedup with FP64. NVIDIA HGX delivers a staggering 10 petaFLOPS, forming the world’s most powerful accelerated scale-up server platform for AI and HPC.
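As a rough sanity check (assuming the commonly cited A100 peak of 624 teraFLOPS of FP16 Tensor Core throughput with structured sparsity, a figure not stated above), the 16-GPU number is consistent:

16 GPUs × 624 TFLOPS ≈ 9,984 TFLOPS ≈ 10 petaFLOPS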
DLRM Training (chart): DLRM on HugeCTR framework, precision = FP16 | NVIDIA A100 80GB batch size = 48 | NVIDIA A100 40GB batch size = 32 | NVIDIA V100 32GB batch size = 32.
Deep learning models are exploding in size and complexity, requiring a system with large amounts of memory, massive computing power, and fast interconnects for scalability. With NVIDIA NVSwitch™ providing high-speed, all-to-all GPU communications, HGX can handle the most advanced AI models. With A100 80GB GPUs, GPU memory is doubled, delivering up to 1.3TB of memory in a single HGX. Emerging workloads on the very largest models like deep learning recommendation models (DLRM), which have massive data tables, are accelerated up to 3X over HGX powered by A100 40GB GPUs.
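To make the communication pattern concrete, here is a minimal PyTorch sketch of the all-to-all embedding exchange that dominates model-parallel training of recommendation models. It is illustrative only: the function name exchange_embeddings and the tensor sizes are hypothetical, it assumes an NCCL backend with one process per GPU (e.g., launched via torchrun), and it is not the HugeCTR API.

```python
import torch
import torch.distributed as dist

def exchange_embeddings(local_lookup: torch.Tensor) -> torch.Tensor:
    """Each rank holds looked-up embedding vectors destined for every
    other rank; all_to_all swaps the slices so each rank receives the
    vectors its own samples need (this is the NVSwitch-bound step)."""
    world = dist.get_world_size()
    send = list(local_lookup.chunk(world))   # one slice per peer; assumes divisibility
    recv = [torch.empty_like(s) for s in send]
    dist.all_to_all(recv, send)              # NCCL all-to-all over NVLink/NVSwitch
    return torch.cat(recv)

if __name__ == "__main__":
    dist.init_process_group("nccl")          # one process per GPU, e.g. via torchrun
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    # Hypothetical sizes: 1,024 lookups of 128-dim embeddings per rank.
    lookups = torch.randn(1024, 128, device="cuda")
    gathered = exchange_embeddings(lookups)
    dist.destroy_process_group()
```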
Big data analytics benchmark (chart): 30 analytical retail queries, ETL, ML, NLP on 10TB dataset | V100 32GB, RAPIDS/Dask | A100 40GB and A100 80GB, RAPIDS/Dask/BlazingSQL.
Machine learning models require loading, transforming, and processing extremely large datasets to glean critical insights. With up to 1.3TB of unified memory and all-to-all GPU communications with NVSwitch, HGX powered by A100 80GB GPUs has the capability to load and perform calculations on enormous datasets to derive actionable insights quickly.
On a big data analytics benchmark, A100 80GB delivered insights with 2X higher throughput over A100 40GB, making it ideally suited for emerging workloads with exploding dataset sizes.
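A minimal sketch of this style of multi-GPU analytics with RAPIDS and Dask; the Parquet path and column names are hypothetical placeholders, and one worker is started per visible GPU:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

if __name__ == "__main__":
    cluster = LocalCUDACluster()   # one Dask worker per GPU on this node
    client = Client(cluster)

    # Hypothetical retail dataset partitioned as Parquet files.
    df = dask_cudf.read_parquet("retail_orders/*.parquet")
    top_stores = (
        df.groupby("store_id")["sale_amount"]
          .sum()
          .nlargest(10)
          .compute()               # executes across all GPUs
    )
    print(top_stores)

    client.close()
    cluster.close()
```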
HPC applications need to perform an enormous number of calculations per second. Increasing the compute density of each server node dramatically reduces the number of servers required, resulting in huge savings in cost, power, and space consumed in the data center. For simulations, high-dimension matrix multiplication requires a processor to fetch data from many neighbors for computation, making GPUs connected by NVIDIA NVLink ideal. HPC applications can also leverage TF32 in A100 to achieve up to 11X higher throughput for single-precision, dense matrix-multiply operations than GPUs from four years earlier.
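In PyTorch, for example, TF32 can be enabled explicitly for dense matrix multiplies. A minimal sketch; TF32 keeps FP32 inputs and outputs while the arithmetic runs on Tensor Cores:

```python
import torch

# Allow TF32 Tensor Core math for FP32 matmuls and convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(8192, 8192, device="cuda")
b = torch.randn(8192, 8192, device="cuda")
c = a @ b   # dispatched to Tensor Cores in TF32 on A100-class GPUs
```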
An HGX powered by A100 80GB GPUs delivers a 2X throughput increase over A100 40GB GPUs on Quantum Espresso, a materials simulation, shortening time to insight.
Top HPC Apps (chart): Geometric mean of application speedups vs. P100. Benchmark applications: Amber [PME-Cellulose_NVE], Chroma [szscl21_24_128], GROMACS [ADH Dodec], MILC [Apex Medium], NAMD [stmv_nve_cuda], PyTorch [BERT-Large Fine Tuner], Quantum Espresso [AUSURF112-jR], Random Forest FP32 [make_blobs (160000 x 64 : 10)], TensorFlow [ResNet-50], VASP 6 [Si Huge] | GPU node with dual-socket CPUs and 4x NVIDIA P100, V100, or A100 GPUs.
Quantum Espresso (chart): Measured using CNT10POR8 dataset, precision = FP64.
NVIDIA HGX is available as a single baseboard with four or eight H100 GPUs, each with 80GB of GPU memory, or with A100 GPUs, each with 40GB or 80GB of GPU memory. The 4-GPU configuration is fully interconnected with NVIDIA NVLink, and the 8-GPU configuration is interconnected with NVIDIA NVSwitch. Up to 32 HGX H100 8-GPU systems can be combined using the NVIDIA NVLink Switch System into a single cluster of up to 256 GPUs. Two HGX A100 8-GPU baseboards can be combined using an NVSwitch interconnect to create a powerful 16-GPU single node.
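A minimal sketch for checking GPU-to-GPU peer reachability on such a node with PyTorch (an illustrative diagnostic, not an NVIDIA-provided tool; on a fully interconnected NVLink/NVSwitch baseboard, every GPU should list every other GPU as a peer):

```python
import torch

n = torch.cuda.device_count()
print(f"{n} GPUs visible")
for i in range(n):
    peers = [j for j in range(n)
             if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"GPU {i} has peer access to: {peers}")
```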
HGX is also available in a PCIe form factor for a modular, easy-to-deploy option, bringing the highest computing performance to mainstream servers.
This powerful combination of hardware and software lays the foundation for the ultimate AI supercomputing platform.
With HGX, it’s also possible to include NVIDIA networking to accelerate and offload data transfers and ensure the full utilization of computing resources. Smart adapters and switches reduce latency, increase efficiency, enhance security, and simplify data center automation to accelerate end-to-end application performance.
The data center is the new unit of computing, and HPC networking plays an integral role in scaling application performance across the entire data center. NVIDIA InfiniBand is paving the way with software-defined networking, In-Network Computing acceleration, remote direct-memory access (RDMA), and the fastest speeds and feeds.
NVIDIA HGX-1 and HGX-2 are reference architectures that standardize the design of data centers accelerating AI and HPC. Built with NVIDIA SXM2 V100 boards and NVIDIA NVLink and NVSwitch interconnect technologies, HGX reference architectures have a modular design that works seamlessly in hyperscale and hybrid data centers to deliver up to 2 petaFLOPS of compute power for a quick, simple path to AI and HPC.