Most institutions measure performance solely by TFLOPS, the peak number of floating-point operations a chip can execute per second. This approach, however, is fundamentally flawed.
In reality, all computing performance, including AI performance, is determined by two pillars: compute and memory. For real-world applications, these must be well balanced. If there are many compute units but data arrives slowly, most units sit idle waiting for operands. Conversely, if compute units are scarce relative to the incoming data, compute becomes the bottleneck for the whole system.
As is widely recognized, today's Large Language Models (LLMs), which dominate AI, are memory-bound workloads. Because of their massive parameter counts and auto-regressive generation, they spend far more time fetching data from memory than computing on it, leaving countless compute cores (e.g., CUDA cores) idle while they wait. For example, NVIDIA GPUs provide a memory bandwidth of only 2–3 TB/s against compute performance exceeding 1,000 TFLOPS. This imbalance, hundreds of floating-point operations available for every byte the memory system can deliver, means that actual system performance is determined primarily by how fast data can be fetched. Relying solely on peak TFLOPS to assess such a system is therefore highly misleading.
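The arithmetic behind this is easy to sketch. The following back-of-envelope calculation uses the illustrative figures above (roughly 1,000 TFLOPS and 3 TB/s, not exact vendor specs) and an assumed 7-billion-parameter FP16 model:

```python
# Back-of-envelope: why LLM decoding is memory-bound (illustrative numbers,
# not vendor specifications).
peak_tflops = 1000          # assumed peak FP16 compute, in TFLOPS
peak_bw_tbs = 3.0           # assumed peak memory bandwidth, in TB/s

# Machine balance: FLOPs the chip can execute per byte it can fetch.
machine_balance = (peak_tflops * 1e12) / (peak_bw_tbs * 1e12)  # ~333 FLOPs/byte

# Single-token decode of an assumed 7B-parameter FP16 model: every parameter
# is read once (2 bytes) and used in ~2 FLOPs (one multiply, one add).
params = 7e9
bytes_moved = params * 2
flops = params * 2
arithmetic_intensity = flops / bytes_moved   # 1 FLOP per byte

# The workload delivers ~1 FLOP/byte, but the chip needs ~333 FLOPs/byte to
# stay busy, so runtime is set by memory traffic, not compute.
compute_time = flops / (peak_tflops * 1e12)
memory_time = bytes_moved / (peak_bw_tbs * 1e12)
print(f"balance: {machine_balance:.0f} FLOPs/byte, "
      f"intensity: {arithmetic_intensity:.0f} FLOP/byte")
print(f"compute time: {compute_time*1e3:.3f} ms, "
      f"memory time: {memory_time*1e3:.2f} ms")
```

With these numbers the memory side takes a few hundred times longer than the compute side, which is exactly the idle-core picture described above.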
In memory-bound scenarios, there are several ways to improve overall performance. The most direct is to adopt higher-performance memory technologies (e.g., HBM3e, HBM4), which is why HBM is attracting so much attention. However, these technologies are costly, difficult to scale quickly, and their supply is largely locked up by a few large tech companies. A more practical approach is to maximize utilization of the bandwidth already available. No chip sustains 100% of its peak: effective memory performance is Peak BW (TB/s) × Utilization (%), and likewise effective compute performance is Peak TFLOPS × Utilization (%). In practice, sustained utilization above 70% is considered excellent.
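For a memory-bound decode, the effective-bandwidth formula translates directly into a tokens-per-second ceiling. A minimal sketch, assuming (as above) a hypothetical 7B-parameter FP16 model whose weights are streamed once per generated token:

```python
def tokens_per_second(peak_bw_tbs, utilization, params, bytes_per_param=2):
    """Upper bound on decode throughput when weight traffic dominates.

    Effective bandwidth = peak bandwidth x utilization; each token requires
    one full pass over the weights. Illustrative model, not a benchmark.
    """
    effective_bw = peak_bw_tbs * 1e12 * utilization   # bytes/s actually delivered
    bytes_per_token = params * bytes_per_param        # weights streamed per token
    return effective_bw / bytes_per_token

# Same 3 TB/s part at 50% vs 90% utilization, assumed 7B FP16 model:
print(tokens_per_second(3.0, 0.5, 7e9))   # lower sustained utilization
print(tokens_per_second(3.0, 0.9, 7e9))   # higher sustained utilization
```

The point of the comparison: raising utilization from 50% to 90% lifts the throughput ceiling by the same 1.8x factor, without touching the memory technology itself.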
Another method is to reuse model data once it is loaded, reducing repeated fetches from external memory. Unfortunately, an LLM must re-read its entire parameter set for every generated token, which rules out simple reuse. To address this, batching is widely adopted: serving multiple users simultaneously so that a single pass over the model weights produces a token for every request in the batch. This improves the data-to-compute balance and alleviates the memory bottleneck.
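The amortization can be sketched numerically. The toy model below assumes weight traffic dominates and deliberately ignores activation and KV-cache traffic, which grows with batch size, so real-world gains taper off rather than scaling forever:

```python
def aggregate_tps(batch, peak_bw_tbs=3.0, util=0.7, params=7e9, bytes_per_param=2):
    """Aggregate tokens/s for a batched, weight-traffic-dominated decode.

    One streaming pass over the weights emits one token per request in the
    batch, so weight traffic is amortized across the batch. Toy model:
    ignores KV-cache and activation traffic, which grow with batch size.
    """
    seconds_per_step = (params * bytes_per_param) / (peak_bw_tbs * 1e12 * util)
    return batch / seconds_per_step   # each step emits `batch` tokens

for b in (1, 8, 32):
    print(f"batch={b:2d}: {aggregate_tps(b):7.0f} tokens/s aggregate")
```

In this idealized model throughput scales linearly with batch size, because the same streamed parameters are reused once per user instead of once in total.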
At HyperAccel, our LLM inference-specialized processor (LPU) introduces a new Streamlined Dataflow architecture that sustains up to 90% of available memory bandwidth, along with built-in hardware batching support to maximize parameter reuse. The chip also natively supports model quantization, reducing parameter precision from 16-bit (FP16) to 8-bit or even 4-bit, which proportionally cuts the bytes that must be streamed per token.
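The effect of quantization on memory traffic is simple to quantify. A small illustration, again assuming a hypothetical 7B-parameter model (the figures show data volume only, not accuracy trade-offs):

```python
def weight_bytes_gb(params, bits):
    """Gigabytes of weight data streamed in one full pass over the model."""
    return params * bits / 8 / 1e9

# Assumed 7B-parameter model at 16-, 8-, and 4-bit precision:
for bits in (16, 8, 4):
    print(f"{bits:2d}-bit: {weight_bytes_gb(7e9, bits):.1f} GB per weight pass")
```

Halving the bit width halves the traffic per token, so on a memory-bound decode a 4-bit model can in principle be served up to 4x faster than the same model in FP16, before accounting for any accuracy loss.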
In conclusion, while our LPU features fewer compute units than a GPU, it achieves significantly higher Tokens Per Second (TPS), the true metric for LLM inference, while greatly reducing the cost of computation.
