Latency Processing Unit (LPU)
The LPU is a latency-optimized and highly scalable architecture that accelerates hyperscale Generative AI models (e.g., GPT-3, LLaMA): multi-billion-parameter, small-batch, and memory-intensive workloads.
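To see why such workloads are memory-bound, note that at batch size 1 every weight must be streamed from memory for each generated token, so memory bandwidth sets a hard floor on per-token latency. The sketch below works through that arithmetic; the parameter count, precision, and bandwidth figure are illustrative assumptions, not LPU specifications.

```python
# Back-of-the-envelope bound on single-batch LLM token latency.
# All numbers below are illustrative assumptions, not LPU specs.

def min_token_latency_ms(num_params: float, bytes_per_param: int,
                         bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency when every weight must be
    read from memory once per generated token (batch size 1)."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1e3  # seconds -> ms

# GPT-3-scale model (175B params), FP16 weights, hypothetical 1 TB/s
# of memory bandwidth: ~350 ms per token, regardless of compute speed.
print(min_token_latency_ms(175e9, 2, 1000.0))
```

Because the bound depends only on model size and bandwidth, sustaining close to the full memory bandwidth is the primary lever for low-latency inference at small batch sizes.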

The streamlined memory access (SMA) unit and the streamlined execution engine (SXE) are designed to maximize memory bandwidth utilization by continuously streaming and processing data with minimal interference, enabling high-speed inference.
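As a minimal software analogy of that streaming principle (not HyperAccel's actual implementation), the sketch below overlaps a fetch stage with a compute stage through a small double buffer, so the memory stream stays busy while earlier tiles are being processed. The tile count, buffer depth, and stage names are hypothetical.

```python
# Software analogy of SMA/SXE-style streaming: one stage continuously
# fetches weight tiles while another consumes them, keeping memory and
# compute overlapped instead of loading a full layer before computing.

import queue
import threading

TILES = 8   # hypothetical number of weight tiles per layer
DEPTH = 2   # double buffering: fetch tile i+1 while computing tile i

def fetch(tiles: "queue.Queue[int]") -> None:
    """SMA analogue: stream weight tiles from memory back to back."""
    for tile_id in range(TILES):
        tiles.put(tile_id)   # stand-in for a DMA burst from DRAM/HBM
    tiles.put(None)          # end-of-stream marker

def compute(tiles: "queue.Queue[int]") -> None:
    """SXE analogue: process each tile as soon as it arrives."""
    while (tile_id := tiles.get()) is not None:
        _ = tile_id * tile_id   # stand-in for matrix-vector work

tiles: "queue.Queue[int]" = queue.Queue(maxsize=DEPTH)
producer = threading.Thread(target=fetch, args=(tiles,))
consumer = threading.Thread(target=compute, args=(tiles,))
producer.start(); consumer.start()
producer.join(); consumer.join()
```

The shallow buffer is the point of the design: it is just deep enough to hide fetch latency, so data is processed as it arrives rather than accumulated, which is what keeps bandwidth utilization near its peak.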