Technologies

Deep dives into the technologies that power AI/HPC infrastructure

NVIDIA CUDA

NVIDIA Corporation

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that enables developers to use GPU acceleration for general-purpose computing.

Key Topics

  • CUDA toolkit, runtime, and driver version compatibility
  • Compute capability (sm_ levels) and GPU architecture mapping
  • Tensor Core programming (FP16, BF16, FP8)
  • NCCL for multi-GPU and multi-node collective communication
  • cuDNN and cuBLAS for deep learning acceleration
  • GPUDirect RDMA for zero-copy GPU↔NIC data transfers
  • CUDA MPS and MIG for GPU sharing

AMD ROCm / HIP

Advanced Micro Devices (AMD)

ROCm (Radeon Open Compute) is AMD's open-source platform for GPU-accelerated computing. HIP (Heterogeneous-Compute Interface for Portability) enables writing code that runs on both AMD and NVIDIA GPUs.

Key Topics

  • ROCm stack components: HIP, rocBLAS, MIOpen, RCCL
  • CDNA architecture (CDNA1/MI100, CDNA2/MI250, CDNA3/MI300)
  • GFX target mapping (gfx908, gfx90a, gfx940/942)
  • AMDGPU kernel driver and ROCm version compatibility
  • HIP porting from CUDA
  • ROCm-aware MPI with Open MPI + UCX

InfiniBand

NVIDIA (Mellanox) · OpenFabrics Alliance

InfiniBand is a high-bandwidth, low-latency interconnect standard widely deployed in Top500 supercomputers and large AI training clusters, where it underpins GPU-to-GPU communication at scale.

Key Topics

  • InfiniBand fabric architecture: HCA, switches, subnet manager
  • Speed generations: HDR (200 Gb/s), NDR (400 Gb/s), XDR (800 Gb/s)
  • IB verbs API (libibverbs, rdma-core)
  • Fat-tree and dragonfly+ topologies for HPC
  • OpenSM subnet manager configuration
  • Adaptive routing and congestion control
  • InfiniBand with NCCL and RCCL

RDMA & RoCE

RDMA over Converged Ethernet

RDMA (Remote Direct Memory Access) lets a NIC move data directly between the memories of two nodes, bypassing the remote CPU and achieving microsecond-scale latency at line rate. RoCE extends RDMA over standard Ethernet infrastructure.

Key Topics

  • RDMA semantics: one-sided ops (READ, WRITE), two-sided (SEND/RECV)
  • RoCEv1 (Ethernet L2 only) vs RoCEv2 (UDP/IP encapsulated, L3-routable)
  • Lossless Ethernet: PFC (Priority Flow Control), ECN, DCQCN
  • GPUDirect RDMA — GPU memory directly mapped to NIC BAR
  • RDMA programming with libibverbs and rdma-core
  • NCCL / RCCL over RDMA for collective operations
  • Network congestion and packet drop impact on AI training

OFED / MLNX_OFED

OpenFabrics Alliance · NVIDIA

OFED (OpenFabrics Enterprise Distribution) is the unified software stack for high-performance interconnects, including InfiniBand and RDMA-capable Ethernet. MLNX_OFED is NVIDIA's (Mellanox) bundled distribution.

Key Topics

  • OFED components: kernel drivers, libibverbs, rdma-core, perftest
  • MLNX_OFED vs upstream rdma-core differences
  • Kernel version compatibility matrix
  • OFED installation and driver configuration
  • Performance tuning: MTU, GID index, RoCE mode
  • OFED upgrades in production clusters
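Before tuning or upgrading, the first step is usually to establish what stack a node is actually running. A few common diagnostic commands, assuming an MLNX_OFED installation (on upstream rdma-core systems, `rdma link show` plays a similar role; output is environment-dependent):

```shell
ofed_info -s      # installed MLNX_OFED version string
ibstat            # HCA port state, rate, and link layer (IB vs Ethernet)
ibdev2netdev      # map RDMA devices (mlx5_*) to network interfaces
show_gids         # GID table: which indices are RoCEv1 vs RoCEv2
```

The GID index reported by `show_gids` is what applications and benchmarks must select to use RoCEv2 on a dual-mode port.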

Open MPI

The Open MPI Project

Open MPI is one of the most widely used MPI implementations in HPC clusters. Its modular architecture integrates directly with UCX, OFED, CUDA, and ROCm for GPU-aware distributed computing.

Key Topics

  • Open MPI architecture: MTL, BTL, PML, COLL framework
  • UCX transport layer — InfiniBand, RoCE, shared memory
  • GPU-aware MPI — CUDA-aware and ROCm-aware builds
  • NCCL vs MPI for collective communication in AI training
  • MPI process placement and binding on NUMA systems
  • Open MPI version ↔ OFED ↔ CUDA compatibility
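The PML/UCX selection, GPU-awareness, and binding topics above typically all surface on the launch line. A hypothetical example (the flag and variable names are real Open MPI/UCX options; the process counts, device name, and application are placeholders):

```shell
# 16 ranks, 4 per node, each bound to a NUMA domain;
# force the UCX point-to-point layer and a CUDA-aware transport set.
mpirun -np 16 --map-by ppr:4:node --bind-to numa \
       --mca pml ucx \
       -x UCX_TLS=rc_x,sm,cuda_copy,cuda_ipc \
       -x UCX_NET_DEVICES=mlx5_0:1 \
       ./train_app
```

`-x` exports environment variables to every rank; pinning `UCX_NET_DEVICES` matters on multi-rail nodes so each rank uses the HCA closest to its NUMA domain.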

NVIDIA DOCA

NVIDIA (BlueField DPU)

DOCA (Data Center Infrastructure-on-a-Chip Architecture) is NVIDIA's software framework for programming BlueField DPUs (Data Processing Units) — enabling network, storage, and security offload from host CPUs.

Key Topics

  • BlueField DPU architecture (BF2, BF3)
  • DOCA SDK — flow engine, RegEx, compress, crypto
  • Network offload: OVS-DOCA, ECMP, VXLAN/GRE acceleration
  • Zero-trust security offload with DPU
  • DOCA RDMA and DMA services
  • Integrating DPU with AI cluster fabric

UCX – Unified Communication X

UCX Consortium (OpenUCX)

UCX is a high-performance communication library that automatically selects the best available transport (InfiniBand, RoCE, shared memory, TCP) for a given hardware configuration. It is the recommended transport layer for Open MPI on InfiniBand and RoCE fabrics.

Key Topics

  • UCX architecture: UCP, UCT, UCS layers
  • Transport selection and priority tuning
  • GPU memory handles — CUDA IPC, ROCm IPC
  • UCX diagnostics with ucx_info and ucx_perftest
  • UCX with Open MPI and NCCL
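The diagnostics bullet above usually starts with two commands: one to confirm the build, one to see which transports and devices UCX detected on the node. A short reference (output is hardware-dependent; `HOST` below is a placeholder for the server's hostname):

```shell
ucx_info -v     # UCX version and build configuration
ucx_info -d     # transports (rc, ud, sm, tcp, ...) and devices detected

# Two-node latency test with ucx_perftest:
ucx_perftest -t tag_lat          # on the server node (waits for a client)
ucx_perftest HOST -t tag_lat     # on the client node, pointing at the server
```

If a transport you expect (e.g. `rc` on an InfiniBand node) is missing from `ucx_info -d`, the problem is usually in the driver/OFED layer rather than in UCX itself.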