Published in: 2025 Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits)
Authors: Jung-Hoon Kim; Sukbin Lim; Junseo Cha; Seungjae Moon; Dongjin Seo; Hunjong Lee
Abstract: This paper presents Adelia, an efficient inference chip for large language models (LLMs) featuring a streamlined dataflow and dual-mode parallelization. The streamlined dataflow directly connects the external memory to Adelia’s LLM-optimized compute engine with matched bandwidth, achieving an effective memory bandwidth utilization of up to 91%. The systolic path between multiple engines facilitates data reuse to enhance computational power without compromising efficiency. Based on runtime status, Adelia dynamically transitions between context mode, which distributes the long context of a single request across engines to optimize latency, and batch mode, which processes inputs from different requests to prioritize throughput. Adelia is fabricated in 4 nm technology and occupies a die area of 5.28 mm². Compared to the H100 GPU, it achieves 1.59x and 2.51x greater memory bandwidth efficiency and throughput efficiency, respectively, across various models.
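The dual-mode parallelization can be pictured as a simple runtime policy that inspects the request queue and picks how work is spread across engines. The sketch below is a minimal illustration of that idea under assumed names and thresholds (the `Request` fields, the `choose_mode` heuristic, and the 8192-token cutoff are all hypothetical), not Adelia's actual control logic.

```python
# Hypothetical sketch of the context-mode / batch-mode decision described in
# the abstract. All identifiers and the threshold are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum, auto


class Mode(Enum):
    CONTEXT = auto()  # split one long-context request across engines (latency)
    BATCH = auto()    # pack tokens from many requests across engines (throughput)


@dataclass
class Request:
    request_id: int
    context_len: int  # pending context tokens (prefill work) for this request


def choose_mode(queue: list[Request], num_engines: int,
                long_context_threshold: int = 8192) -> Mode:
    """Pick a parallelization mode from runtime status (assumed policy).

    If a single request dominates with a long pending context and the queue is
    shallow, distribute that context across engines (context mode) to cut its
    latency; otherwise batch requests across engines to maximize throughput.
    """
    if not queue:
        return Mode.BATCH
    longest = max(queue, key=lambda r: r.context_len)
    if longest.context_len >= long_context_threshold and len(queue) < num_engines:
        return Mode.CONTEXT
    return Mode.BATCH


# Example: one 32K-token request and two short ones on a 4-engine device.
queue = [Request(0, 32768), Request(1, 128), Request(2, 256)]
print(choose_mode(queue, num_engines=4))  # -> Mode.CONTEXT
```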