KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing
The article introduces KVNAND, a novel architecture that enables large language model inference on edge devices without DRAM by using compute-enabled 3D NAND flash to hold both the model weights and the key-value (KV) cache. This design addresses the low arithmetic intensity and high memory-bandwidth demands of LLM decoding, which make conventional DRAM-based approaches costly on resource-constrained devices.
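To make the low-arithmetic-intensity point concrete, here is a back-of-envelope sketch (not taken from the article; the 7B-parameter fp16 model and the 20 tokens/s target are illustrative assumptions). It estimates the FLOPs-per-byte ratio of a single decode step and the memory traffic it implies:

```python
# Back-of-envelope sketch: why single-token LLM decoding has low
# arithmetic intensity and is memory-bandwidth-bound.
# Model size, precision, and throughput below are assumed values,
# not figures from the KVNAND article.

def decode_arithmetic_intensity(n_params: float, bytes_per_param: int = 2) -> float:
    """Arithmetic intensity (FLOPs per byte) of one decode step.

    Generating one token requires roughly 2 FLOPs per parameter
    (a multiply and an add in each matrix-vector product), while
    every parameter must be read from memory once per token.
    """
    flops = 2.0 * n_params
    bytes_moved = n_params * bytes_per_param
    return flops / bytes_moved

# Example: a 7B-parameter model in fp16 (2 bytes/parameter).
n_params = 7e9
ai = decode_arithmetic_intensity(n_params)      # 1.0 FLOP/byte
gb_per_token = n_params * 2 / 1e9               # ~14 GB of weight reads per token

print(f"arithmetic intensity: {ai:.1f} FLOP/byte")
print(f"weight traffic per token: {gb_per_token:.0f} GB")
# At an assumed 20 tokens/s, that is ~280 GB/s of sustained weight
# traffic, before counting the KV cache. This is why DRAM capacity and
# bandwidth dominate on-device inference, and why moving the computation
# into the flash array (as KVNAND proposes) avoids the bottleneck.
```

At roughly 1 FLOP per byte, decoding is far below the compute-to-bandwidth ratio of typical edge processors, so the memory system, not the arithmetic units, sets the throughput ceiling.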