SwiftKV: An Edge-Oriented Attention Algorithm and Multi-Head Accelerator for Fast, Efficient LLM Decoding
The article introduces SwiftKV Attention, an efficient algorithm designed for low-latency attention inference on edge accelerators, which processes each token in a single pass and avoids resource-intensive operations. It also presents SwiftKV-MHA, an accelerator that supports high-precision attention and low-precision GEMV, enabling fast multi-head parallel decoding.
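The single-pass property described above is commonly achieved with an online (streaming) softmax, which folds the score maximum, the normalizer, and the weighted value sum into one loop over the KV cache. The sketch below illustrates that general technique for one decode-step query; it is an assumption-laden illustration, not the paper's exact SwiftKV algorithm, and all names are hypothetical.

```python
import numpy as np

def single_pass_attention(q, K, V):
    """Attention for one query token over cached keys/values in a single
    pass, using an online softmax: no full score vector is materialized
    and no separate normalization pass is needed.
    Hypothetical sketch, not the paper's SwiftKV implementation."""
    d = q.shape[0]
    scale = 1.0 / np.sqrt(d)
    m = -np.inf                                   # running max of scores
    s = 0.0                                       # running softmax denominator
    acc = np.zeros(V.shape[1], dtype=np.float64)  # running weighted value sum
    for k, v in zip(K, V):
        score = float(q @ k) * scale
        m_new = max(m, score)
        # Rescale previous partial sums when the running max increases.
        correction = np.exp(m - m_new) if m != -np.inf else 0.0
        w = np.exp(score - m_new)
        s = s * correction + w
        acc = acc * correction + w * v
        m = m_new
    return acc / s
```

A two-pass baseline would first scan all scores to compute the softmax, then scan again to accumulate values; the streaming update above matches its output while touching each cached key/value exactly once, which suits memory-constrained edge accelerators.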