📢 [𝐈𝐂𝐌𝐋 𝟐𝟎𝟐𝟒] If you are interested in faster and more energy-efficient large language models (LLMs), we invite you to check out the following two papers we presented at ICML 2024!
Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration
Motivation
The attention mechanism is crucial in LLMs, yet our understanding of it remains limited. Recent studies have identified the phenomenon of attention sinks in the initial token, which attracts disproportionately high attention despite containing minimal semantic information. In this paper, we aim to investigate this phenomenon further and develop a training-free technique to enhance LLM performance.
Key Research Questions and Our Answers
- 𝐃𝐨 𝐚𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧 𝐬𝐢𝐧𝐤𝐬 𝐨𝐧𝐥𝐲 𝐞𝐱𝐢𝐬𝐭 𝐢𝐧 𝐭𝐡𝐞 𝐢𝐧𝐢𝐭𝐢𝐚𝐥 𝐭𝐨𝐤𝐞𝐧? We observe that attention sinks are found not only in the initial token but 𝐚𝐥𝐬𝐨 𝐢𝐧 𝐥𝐚𝐭𝐞𝐫 𝐭𝐨𝐤𝐞𝐧𝐬, especially within the intermediate layers of LLMs.
- 𝐖𝐢𝐥𝐥 𝐩𝐫𝐞𝐬𝐞𝐫𝐯𝐢𝐧𝐠 𝐚𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧 𝐬𝐢𝐧𝐤𝐬 𝐚𝐥𝐰𝐚𝐲𝐬 𝐛𝐞𝐧𝐞𝐟𝐢𝐭 𝐋𝐋𝐌𝐬’ 𝐚𝐜𝐜𝐮𝐫𝐚𝐜𝐲 𝐢𝐧 𝐝𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐭 𝐬𝐜𝐞𝐧𝐚𝐫𝐢𝐨𝐬? In contrast to previous findings, we highlight that 𝐧𝐨𝐭 𝐚𝐥𝐥 𝐚𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧 𝐬𝐢𝐧𝐤𝐬 𝐚𝐫𝐞 𝐛𝐞𝐧𝐞𝐟𝐢𝐜𝐢𝐚𝐥 𝐟𝐨𝐫 𝐋𝐋𝐌𝐬. Specifically, for most attention sinks that occur in the middle or later parts of the inputs, reducing their attention scores can lead to improved accuracy.
- 𝐂𝐚𝐧 𝐰𝐞 𝐞𝐧𝐡𝐚𝐧𝐜𝐞 𝐋𝐋𝐌𝐬’ 𝐚𝐜𝐜𝐮𝐫𝐚𝐜𝐲 𝐛𝐲 𝐬𝐨𝐥𝐞𝐥𝐲 𝐦𝐚𝐧𝐢𝐩𝐮𝐥𝐚𝐭𝐢𝐧𝐠 𝐚𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧 𝐬𝐢𝐧𝐤𝐬 𝐰𝐢𝐭𝐡𝐨𝐮𝐭 𝐟𝐢𝐧𝐞𝐭𝐮𝐧𝐢𝐧𝐠? Yes! We propose the training-free 𝐀𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧 𝐂𝐚𝐥𝐢𝐛𝐫𝐚𝐭𝐢𝐨𝐧 𝐓𝐞𝐜𝐡𝐧𝐢𝐪𝐮𝐞 (𝐀𝐂𝐓), which identifies and calibrates ineffective attention sinks on-the-fly during inference, thereby enhancing the performance of LLMs (a simplified sketch of the idea follows this list).
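To make the calibration idea concrete, here is a minimal, hypothetical sketch of how attention sinks could be detected and downscaled from a layer's attention map. It is not the official ACT implementation: the threshold `sink_threshold`, the scaling factor `calib_scale`, and the choice to leave the initial token untouched are illustrative assumptions.

```python
import torch

def calibrate_attention_sinks(attn, sink_threshold=0.3, calib_scale=0.5, protect_first=True):
    """Hypothetical sketch of attention-sink calibration (not the official ACT code).

    attn: attention weights of shape (num_heads, seq_len, seq_len); each row sums to 1.
    A key token is treated as a "sink" if the average attention it receives across
    query positions exceeds `sink_threshold`. Detected sinks (optionally excluding the
    initial token) have their scores scaled down, and each row is then renormalized.
    Causal masking is ignored here for simplicity.
    """
    received = attn.mean(dim=1)                       # (num_heads, seq_len): attention each key receives
    sink_mask = received > sink_threshold             # which key positions look like sinks
    if protect_first:
        sink_mask[:, 0] = False                       # keep the initial-token sink as-is
    scale = torch.ones_like(received)
    scale[sink_mask] = calib_scale                    # downscale only the detected sink columns
    calibrated = attn * scale.unsqueeze(1)            # broadcast over the query dimension
    return calibrated / calibrated.sum(dim=-1, keepdim=True)  # renormalize each query's distribution

# Toy usage: a random attention map whose rows are normalized to 1.
attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)
calibrated = calibrate_attention_sinks(attn)
```

Which sinks count as "ineffective" and how strongly they are calibrated is decided on-the-fly during inference in the actual method; see the paper and code linked below for the real procedure.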
Evaluation Results
Extensive experiments with seven LLMs on 18 datasets across four different tasks consistently validate the effectiveness of ACT, demonstrating up to a 7.30% improvement in accuracy compared to standard inference.
For more technical details, please feel free to check out our paper, project page, and code:
📄 Paper on arXiv: https://arxiv.org/pdf/2406.15765
🌐 Project page: https://yuzz1020.github.io/ACT/
🔗 Codebase on GitHub: https://github.com/GATECH-EIC/ACT
When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models
Motivation
Autoregressive large language models (LLMs) face two significant bottlenecks:
1. The quadratic complexity of attention as the number of tokens grows.
2. The inherently sequential nature of autoregressive generation.
While existing linear attention methods and speculative decoding offer potential solutions, their applicability to autoregressive LLMs and their synergistic potential remain uncertain.
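As background for why linearization matters during decoding, the sketch below shows kernel-based linear attention in the autoregressive setting under generic assumptions (it is not this paper's augmented method): each new token updates a fixed-size running state instead of attending over all cached keys, so the per-step cost does not grow with sequence length. The feature map `elu(x) + 1` and the stabilizing epsilon are common but illustrative choices.

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    # A common kernel feature map used in linear attention (illustrative choice).
    return F.elu(x) + 1.0

def linear_attention_decode_step(q_t, k_t, v_t, state, normalizer):
    """One autoregressive decoding step with kernel-based linear attention (generic sketch).

    q_t, k_t, v_t: current-token query/key/value vectors of shape (d,).
    state:         running sum of phi(k) v^T, shape (d, d).
    normalizer:    running sum of phi(k), shape (d,).
    Cost per step is O(d^2), independent of how many tokens came before.
    """
    phi_q, phi_k = feature_map(q_t), feature_map(k_t)
    state = state + torch.outer(phi_k, v_t)           # accumulate phi(k_t) v_t^T
    normalizer = normalizer + phi_k                   # accumulate phi(k_t)
    out = (phi_q @ state) / (phi_q @ normalizer + 1e-6)
    return out, state, normalizer

# Toy usage: decode 10 tokens with a fixed-size state.
d = 64
state, normalizer = torch.zeros(d, d), torch.zeros(d)
for q, k, v in zip(torch.randn(10, d), torch.randn(10, d), torch.randn(10, d)):
    y, state, normalizer = linear_attention_decode_step(q, k, v, state, normalizer)
```

Softmax attention, by contrast, must recompute scores against every cached key at each step, which is the source of the quadratic total cost noted above.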
Key Contributions
- 𝐅𝐢𝐫𝐬𝐭 𝐂𝐨𝐦𝐩𝐫𝐞𝐡𝐞𝐧𝐬𝐢𝐯𝐞 𝐒𝐭𝐮𝐝𝐲: Our evaluation of seven linear attention methods across three different types of LLMs reveals that linear attention designs developed for encoder-based models are not optimally suited for autoregressive, decoder-based LLMs.
- 𝐄𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐭 𝐚𝐧𝐝 𝐄𝐟𝐟𝐞𝐜𝐭𝐢𝐯𝐞 𝐋𝐢𝐧𝐞𝐚𝐫𝐢𝐳𝐞𝐝 𝐋𝐋𝐌𝐬: We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs (a generic sketch of speculative decoding follows this list for context).
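For readers less familiar with speculative decoding, here is the standard draft-and-verify loop in its simplest greedy form. This is a generic illustration, not the paper's implementation: `draft_model` and `target_model` are assumed placeholder callables mapping token ids to logits, and the greedy acceptance rule simplifies the usual rejection-sampling criterion.

```python
import torch

def speculative_decode(target_model, draft_model, prompt_ids, num_draft=4, max_new=32):
    """Greedy draft-and-verify loop (generic sketch; the model interfaces are assumptions).

    `target_model` and `draft_model` are placeholder callables mapping a (1, seq_len)
    tensor of token ids to logits of shape (1, seq_len, vocab_size).
    """
    ids = prompt_ids
    while ids.shape[1] - prompt_ids.shape[1] < max_new:
        ctx_len = ids.shape[1]
        # 1) Draft: the small model proposes `num_draft` tokens autoregressively (greedy).
        draft_ids = ids
        for _ in range(num_draft):
            next_tok = draft_model(draft_ids)[:, -1].argmax(dim=-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
        proposed = draft_ids[:, ctx_len:]                        # (1, num_draft)
        # 2) Verify: one parallel pass of the large target model over context + draft.
        logits = target_model(draft_ids)                         # (1, ctx_len + num_draft, vocab)
        target_pred = logits[:, ctx_len - 1 : ctx_len - 1 + num_draft].argmax(dim=-1)
        # 3) Accept the longest prefix where draft and target agree, then add one target token.
        agree = (target_pred == proposed)[0].long()
        n_accept = int(agree.cumprod(dim=0).sum())
        bonus = logits[:, ctx_len - 1 + n_accept].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, proposed[:, :n_accept], bonus], dim=-1)
    return ids
```

In broad terms, the verification step depends on scoring all drafted tokens in a single parallel pass, which is where linearized attention needs extra care; see the paper for the actual augmentation that makes the two compatible.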
Evaluation Results
Extensive experiments with seven linear attention models and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs, achieving up to a 6.67x reduction in perplexity on the LLaMA model and up to a 2x speedup during generation compared to prior methods.
For more technical details, please feel free to check out our paper and code:
📄 Paper on arXiv: https://arxiv.org/abs/2406.07368
🔗 Codebase on GitHub: https://github.com/GATECH-EIC/Linearized-LLM