Published in: ISCA ’25: Proceedings of the 52nd Annual International Symposium on Computer Architecture
Abstract: The growth of context window size in large language model (LLM) inference poses a distinct challenge of hardware inefficiency. The inefficiency arises from the computational imbalance between the compute-intensive prefill stage and the memory-intensive decode stage of LLM inference. The predominant inference hardware, the GPU, has a large number of cores and excels in the prefill stage, which processes the entire input context at once, but suffers from hardware underutilization in the decode stage, which iteratively generates one output token at a time. In conventional LLMs, batching alleviates this underutilization by generating tokens for multiple requests at once. However, batching becomes infeasible in models with large context windows of over 100K tokens because the Key-Value (KV) activations dominate the physical memory capacity, surpassing the entire model size. In this paper, we propose Hybe, a GPU-NPU hybrid system for efficient LLM inference with a million-token context window. Hybe uses the preexisting GPU for the prefill stage and employs lightweight NPUs for the decode stage. Each NPU includes only the computing resources necessary to fully utilize the given memory bandwidth, thereby achieving maximum hardware efficiency. Furthermore, Hybe introduces fine-grained KV transmission, a kernel scheduling method that immediately offloads partial KV produced by the GPU to the NPU, which significantly reduces the KV memory required on the GPU. Lastly, the Hybe scheduler applies stage-wise pipelining that dynamically assigns queued requests to idle hardware to minimize stalls. Hybe uses an NVIDIA H100 GPU with the inference-optimized vLLM library, and we implement the Hybe NPU in a 4nm process with an equal HBM specification. Hybe achieves 2.1× speedup for Phi-3 with a 100K-token context window and 3.9× energy efficiency for Llama-3 with a 1M-token context window, over H100 GPUs with equal total device count.
Authors: Seungjae Moon, Junseo Cha, Hyunjun Park, Joo-Young Kim
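
Note on the KV memory claim in the abstract: the statement that KV activations can surpass the entire model size at context windows over 100K tokens can be checked with a rough calculation. The sketch below is illustrative only and is not taken from the paper. It assumes a Llama-3-8B-like configuration (32 layers, 8 KV heads, head dimension 128, about 8 billion parameters) stored in 16-bit precision; the abstract does not state which Llama-3 size or precision was evaluated, and the function names are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2):
    """Bytes of key and value activations held for one request."""
    # Factor 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_value * context_tokens)


def weight_bytes(num_params, bytes_per_value=2):
    """Bytes needed to store the model weights."""
    return num_params * bytes_per_value


def crossover_tokens(num_layers, num_kv_heads, head_dim, num_params,
                     bytes_per_value=2):
    """Smallest context length at which KV memory reaches the weight size."""
    per_token = kv_cache_bytes(num_layers, num_kv_heads, head_dim, 1,
                               bytes_per_value)
    total = weight_bytes(num_params, bytes_per_value)
    return -(-total // per_token)  # integer ceiling division


# Assumed configuration (not from the paper): Llama-3-8B-like, 16-bit values.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
PARAMS = 8_000_000_000

if __name__ == "__main__":
    for tokens in (8_000, 100_000, 1_000_000):
        kv = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens)
        print(f"{tokens:>9} tokens: KV = {kv / 1e9:.2f} GB, "
              f"weights = {weight_bytes(PARAMS) / 1e9:.2f} GB")
    print("KV reaches weight size at about",
          crossover_tokens(LAYERS, KV_HEADS, HEAD_DIM, PARAMS), "tokens")
```

Under these assumptions a single request's KV memory overtakes the weights a little past 120K tokens, which leaves no room to batch additional requests on the same device. Models with more KV heads or layers reach that point at shorter contexts.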