Technologies
Deep dives into the technologies that power AI/HPC infrastructure
NVIDIA CUDA
NVIDIA Corporation
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that enables developers to use GPU acceleration for general-purpose computing.
Key Topics
- CUDA toolkit, runtime, and driver version compatibility
- Compute capability (sm_ levels) and GPU architecture mapping
- Tensor Core programming (FP16, BF16, FP8)
- NCCL for multi-GPU and multi-node collective communication
- cuDNN and cuBLAS for deep learning acceleration
- GPUDirect RDMA for zero-copy GPU↔NIC data transfers
- CUDA MPS and MIG for GPU sharing
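The compute-capability mapping above can be sketched as a small lookup table, the same correspondence developers consult when choosing `nvcc -arch`/`-gencode` flags (table values are the publicly documented architecture families):

```python
# Mapping of CUDA compute capabilities ("sm_" levels) to GPU
# architecture families, as used when selecting nvcc -arch targets.
SM_TO_ARCH = {
    "sm_70": "Volta (V100)",
    "sm_75": "Turing (T4)",
    "sm_80": "Ampere (A100)",
    "sm_86": "Ampere (RTX 30xx, A40)",
    "sm_89": "Ada Lovelace (L40, RTX 40xx)",
    "sm_90": "Hopper (H100)",
}

def arch_for_sm(sm: str) -> str:
    """Return the architecture family for a given sm_ target."""
    return SM_TO_ARCH.get(sm, "unknown")

print(arch_for_sm("sm_80"))  # Ampere (A100)
```

A binary compiled only for `sm_80` will not load on an `sm_70` GPU, which is why multi-architecture fat binaries are common in production toolchains.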
AMD ROCm / HIP
Advanced Micro Devices (AMD)
ROCm (Radeon Open Compute) is AMD's open-source platform for GPU-accelerated computing. HIP (Heterogeneous-Compute Interface for Portability) enables writing code that runs on both AMD and NVIDIA GPUs.
Key Topics
- ROCm stack components: HIP, rocBLAS, MIOpen, RCCL
- CDNA architecture (CDNA1/MI100, CDNA2/MI250, CDNA3/MI300)
- GFX target mapping (gfx908, gfx90a, gfx940/942)
- AMDGPU kernel driver and ROCm version compatibility
- HIP porting from CUDA
- ROCm-aware MPI with Open MPI + UCX
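The GFX target mapping listed above can likewise be expressed as a lookup table, mirroring the targets passed to `hipcc` via `--offload-arch`:

```python
# Mapping of AMD GFX compile targets to CDNA GPU generations,
# as passed to hipcc via --offload-arch=<gfx target>.
GFX_TO_GPU = {
    "gfx908": "CDNA1 (MI100)",
    "gfx90a": "CDNA2 (MI210/MI250/MI250X)",
    "gfx940": "CDNA3 (MI300 family)",
    "gfx942": "CDNA3 (MI300 family)",
}

def gpu_for_gfx(target: str) -> str:
    """Return the CDNA generation/GPU for a given gfx target."""
    return GFX_TO_GPU.get(target, "unknown")

print(gpu_for_gfx("gfx90a"))  # CDNA2 (MI210/MI250/MI250X)
```

As with CUDA sm_ levels, a code object built for one gfx target does not run on another, so matching the ROCm build target to the installed GPU is a routine compatibility check.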
InfiniBand
NVIDIA (Mellanox) · OpenFabrics Alliance
InfiniBand is a high-bandwidth, low-latency interconnect standard used in many of the Top500 supercomputers. It underpins GPU-to-GPU communication in AI training clusters at scale.
Key Topics
- InfiniBand fabric architecture: HCA, switches, subnet manager
- Speed generations: HDR (200 Gb/s), NDR (400 Gb/s), XDR (800 Gb/s)
- IB verbs API (libibverbs, rdma-core)
- Fat-tree and dragonfly+ topologies for HPC
- OpenSM subnet manager configuration
- Adaptive routing and congestion control
- InfiniBand with NCCL and RCCL
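The speed generations above follow directly from per-lane signaling rates: a standard InfiniBand port aggregates four lanes, so port bandwidth is lanes × lane rate. A quick sketch of that arithmetic:

```python
# Per-lane data rates (Gb/s) for recent InfiniBand generations.
# A standard port is a 4x link (four aggregated lanes).
LANE_RATE_GBPS = {"EDR": 25, "HDR": 50, "NDR": 100, "XDR": 200}

def port_bandwidth_gbps(generation: str, lanes: int = 4) -> int:
    """Aggregate port bandwidth for a generation and lane count."""
    return LANE_RATE_GBPS[generation] * lanes

print(port_bandwidth_gbps("NDR"))  # 400
```

This is why NDR is quoted as 400 Gb/s per port (4 × 100 Gb/s lanes) even though a single NDR lane runs at 100 Gb/s.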
RDMA & RoCE
RDMA over Converged Ethernet
RDMA (Remote Direct Memory Access) allows direct memory-to-memory transfers between nodes without CPU involvement, achieving microsecond latencies at line rate. RoCE extends this over standard Ethernet infrastructure.
Key Topics
- RDMA semantics: one-sided ops (READ, WRITE), two-sided (SEND/RECV)
- RoCEv1 vs RoCEv2 — L2 vs L3 routable
- Lossless Ethernet: PFC (Priority Flow Control), ECN, DCQCN
- GPUDirect RDMA — GPU memory directly mapped to NIC BAR
- RDMA programming with libibverbs and rdma-core
- NCCL / RCCL over RDMA for collective operations
- Network congestion and packet drop impact on AI training
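The one-sided vs two-sided distinction above is the key semantic split in RDMA verbs: one-sided READ/WRITE and atomics complete without the remote CPU, while SEND/RECV requires the target to have posted a matching receive. A toy classifier makes the split explicit (opcode names here are illustrative labels, not literal libibverbs enum values):

```python
# One-sided RDMA operations complete without remote CPU involvement;
# two-sided SEND/RECV needs the target to post a matching receive.
ONE_SIDED = {"RDMA_READ", "RDMA_WRITE",
             "ATOMIC_CMP_AND_SWP", "ATOMIC_FETCH_AND_ADD"}
TWO_SIDED = {"SEND", "RECV"}

def remote_cpu_involved(op: str) -> bool:
    """Return True if the remote CPU must participate in the transfer."""
    if op in TWO_SIDED:
        return True
    if op in ONE_SIDED:
        return False
    raise ValueError(f"unknown operation: {op}")

print(remote_cpu_involved("RDMA_WRITE"))  # False
```

This is why one-sided operations are attractive for GPU training traffic: the responder's CPU stays out of the data path entirely.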
OFED / MLNX_OFED
OpenFabrics Alliance · NVIDIA
OFED (OpenFabrics Enterprise Distribution) is the unified software stack for high-performance interconnects, including InfiniBand and RDMA-capable Ethernet. MLNX_OFED is NVIDIA's (Mellanox) bundled distribution.
Key Topics
- OFED components: kernel drivers, libibverbs, rdma-core, perftest
- MLNX_OFED vs upstream rdma-core differences
- Kernel version compatibility matrix
- OFED installation and driver configuration
- Performance tuning: MTU, GID index, RoCE mode
- OFED upgrades in production clusters
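One recurring detail in the RoCE tuning mentioned above: the IP traffic-class field carries DSCP in its upper six bits, so a DSCP value is shifted left by two when written as a traffic class (e.g. DSCP 26, a common choice for RoCE traffic, becomes traffic class 104). A one-line sketch of that conversion:

```python
def dscp_to_traffic_class(dscp: int) -> int:
    """DSCP occupies the upper 6 bits of the 8-bit traffic-class/ToS
    field, so converting DSCP -> traffic class is a left shift by 2."""
    if not 0 <= dscp < 64:
        raise ValueError("DSCP is a 6-bit value (0-63)")
    return dscp << 2

print(dscp_to_traffic_class(26))  # 104
```

Mixing up the two encodings (setting a DSCP value where a traffic class is expected, or vice versa) is a common cause of RoCE traffic landing in the wrong QoS queue.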
Open MPI
The Open MPI Project
Open MPI is one of the most widely used MPI implementations in HPC clusters. Its modular architecture integrates directly with UCX, OFED, CUDA, and ROCm for GPU-aware distributed computing.
Key Topics
- Open MPI architecture: MTL, BTL, PML, COLL framework
- UCX transport layer — InfiniBand, RoCE, shared memory
- GPU-aware MPI — CUDA-aware and ROCm-aware builds
- NCCL vs MPI for collective communication in AI training
- MPI process placement and binding on NUMA systems
- Open MPI version ↔ OFED ↔ CUDA compatibility
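The NUMA placement topic above can be illustrated with a minimal sketch, assuming local ranks are distributed round-robin across NUMA domains, which is the effect of something like `mpirun --map-by numa --bind-to numa` on a real cluster:

```python
# Minimal sketch (assumption: round-robin assignment) of mapping
# node-local MPI ranks onto NUMA domains, approximating what
# "mpirun --map-by numa --bind-to numa" arranges.
def numa_domain_for_rank(local_rank: int, numa_domains: int) -> int:
    """Round-robin a node-local rank onto a NUMA domain."""
    return local_rank % numa_domains

# 8 local ranks on a node exposing 4 NUMA domains:
placement = [numa_domain_for_rank(r, 4) for r in range(8)]
print(placement)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Keeping each rank pinned to the NUMA domain closest to its GPU and NIC avoids cross-socket memory traffic, which matters for GPU-aware MPI performance.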
NVIDIA DOCA
NVIDIA (BlueField DPU)
DOCA (Data Center-on-a-Chip Architecture) is NVIDIA's software framework for programming BlueField DPUs (Data Processing Units), enabling network, storage, and security offload from host CPUs.
Key Topics
- BlueField DPU architecture (BF2, BF3)
- DOCA SDK — flow engine, RegEx, compress, crypto
- Network offload: OVS-DOCA, ECMP, VXLAN/GRE acceleration
- Zero-trust security offload with DPU
- DOCA RDMA and DMA services
- Integrating DPU with AI cluster fabric
UCX – Unified Communication X
UCX Consortium (OpenUCX)
UCX is a high-performance communication library that auto-selects the best available transport (InfiniBand, RDMA, shared memory, TCP) for any given hardware configuration. It is the default and recommended transport layer for Open MPI on InfiniBand and RoCE fabrics.
Key Topics
- UCX architecture: UCP, UCT, UCS layers
- Transport selection and priority tuning
- GPU memory handles — CUDA IPC, ROCm IPC
- UCX diagnostics with ucx_info and ucx_perftest
- UCX with Open MPI and NCCL
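UCX's transport selection described above boils down to: from the transports usable on the current hardware, pick the highest-priority one (overridable via the `UCX_TLS` environment variable). An illustrative sketch of that logic, with made-up priority numbers:

```python
# Illustrative UCX-style transport selection: choose the
# highest-priority transport among those available on this host.
# Priority values here are invented for illustration only.
TRANSPORT_PRIORITY = {"shm": 4, "rc_mlx5": 3, "tcp": 1}

def select_transport(available: set) -> str:
    """Pick the best-known transport from the available set."""
    usable = available & TRANSPORT_PRIORITY.keys()
    if not usable:
        raise RuntimeError("no usable transport")
    return max(usable, key=TRANSPORT_PRIORITY.get)

print(select_transport({"tcp", "rc_mlx5"}))  # rc_mlx5
```

In practice `ucx_info -d` lists the transports UCX detected, and restricting them (e.g. `UCX_TLS=rc,sm`) is a common way to debug transport-selection problems.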