FastV: A Plug-and-Play Inference Acceleration AI Method for Large Vision Language Models Relying on Visual Tokens

Researchers from Peking University and Alibaba Group introduced FastV to address inefficient attention computation in Large Vision-Language Models (LVLMs). Models such as LLaVA-1.5 and Video-LLaVA have advanced the field considerably, but the way their attention mechanism handles visual tokens remains a bottleneck. The researchers found that attention within LVLMs is biased towards textual tokens, so visual information is used inefficiently.

LVLMs currently process multimodal inputs by converting images into tokens and feeding them, together with textual tokens, into a transformer-based decoder. The researchers found that visual tokens, which make up a substantial portion of the input, receive disproportionately low attention scores compared with textual tokens, especially in the deeper layers. This leads to suboptimal use of visual information and hampers both the performance and the computational efficiency of LVLMs. To address this, they propose FastV, a dynamic pruning method that removes unnecessary visual tokens based on their attention scores, significantly reducing computational cost without compromising performance across a variety of vision-language tasks.

The proposed method, FastV, introduces a dynamic pruning mechanism for visual tokens during the inference phase of LVLMs. It ranks the visual tokens by their attention scores and, beyond a certain layer, prunes the less relevant ones so that the later layers process a shorter sequence. This selective pruning significantly reduces the computational burden of LVLMs, particularly in the deep layers, where the attention mechanism allocates little attention to visual tokens. By exploiting this observation, FastV achieves a substantial reduction in FLOPs while maintaining performance across various vision-language tasks.
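To make the ranking-and-pruning step concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `select_tokens_to_keep`, the `prune_ratio` parameter, and the choice to score each visual token by the mean attention it receives from all query positions are hypothetical, and the attention matrix is assumed to be already averaged over heads.

```python
from typing import List, Sequence


def select_tokens_to_keep(
    attn: Sequence[Sequence[float]],
    visual_idx: Sequence[int],
    prune_ratio: float,
) -> List[int]:
    """Return the positions that survive pruning, in their original order.

    attn[q][k] is the attention weight from query position q to key
    position k (assumed here to be averaged over heads). All text tokens
    are kept; only visual tokens are candidates for removal.
    """
    n = len(attn)
    visual = set(visual_idx)

    # Score each visual token by the mean attention it receives
    # (an assumed scoring rule for this sketch).
    received = {k: sum(attn[q][k] for q in range(n)) / n for k in visual}

    # Number of visual tokens to keep after dropping the given fraction.
    n_keep = len(visual) - int(len(visual) * prune_ratio)

    # Highest-scoring tokens first; ties broken by position.
    ranked = sorted(visual, key=lambda k: (-received[k], k))
    kept_visual = set(ranked[:n_keep])

    return [i for i in range(n) if i not in visual or i in kept_visual]


# Toy example: positions 1 and 2 are visual tokens, 0 and 3 are text.
toy_attn = [
    [0.5, 0.1, 0.3, 0.1],
    [0.4, 0.1, 0.4, 0.1],
    [0.3, 0.1, 0.5, 0.1],
    [0.2, 0.1, 0.4, 0.3],
]
kept = select_tokens_to_keep(toy_attn, visual_idx=[1, 2], prune_ratio=0.5)
```

In this toy case position 1 receives less attention than position 2, so it is the one dropped. In a real decoder the returned positions would be used to slice the hidden states before the next layer; that step depends on the model code and is not shown here.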

FastV’s flexibility lets users tune the trade-off between computational efficiency and performance to their specific requirements, making it a versatile and practical option for deploying LVLMs in resource-constrained environments. Because it targets image tokens specifically, FastV reduces computation without compromising the model’s overall functionality.
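The trade-off can be explored with a back-of-the-envelope cost model. The sketch below is an assumption-laden estimate rather than a measurement: it uses a common rough approximation of per-layer transformer cost (attention projections, attention products, and a two-matrix feed-forward block), and the names `layer_flops`, `flops_reduction`, `prune_layer`, and `prune_ratio` are hypothetical. The example dimensions are illustrative and do not come from the article.

```python
def layer_flops(n: int, d: int, m: int) -> int:
    """Rough cost of one decoder layer for n tokens (assumed model).

    4*n*d^2  : query, key, value, and output projections
    2*n^2*d  : attention scores and weighted sum of values
    2*n*d*m  : feed-forward block with intermediate size m
    """
    return 4 * n * d * d + 2 * n * n * d + 2 * n * d * m


def flops_reduction(
    n_text: int,
    n_visual: int,
    prune_layer: int,
    prune_ratio: float,
    num_layers: int,
    d: int,
    m: int,
) -> float:
    """Estimated fraction of FLOPs saved when a share of the visual
    tokens is dropped after `prune_layer` layers."""
    n_full = n_text + n_visual
    n_pruned = n_text + n_visual - int(n_visual * prune_ratio)

    full = num_layers * layer_flops(n_full, d, m)
    pruned = (
        prune_layer * layer_flops(n_full, d, m)
        + (num_layers - prune_layer) * layer_flops(n_pruned, d, m)
    )
    return 1.0 - pruned / full


# Illustrative settings only: a 32-layer decoder with hidden size 4096.
saving = flops_reduction(
    n_text=64, n_visual=576, prune_layer=2, prune_ratio=0.5,
    num_layers=32, d=4096, m=11008,
)
print(f"estimated FLOPs saved: {saving:.1%}")
```

Pruning earlier or more aggressively raises the estimated saving, while pruning later or more conservatively lowers it. The effect of any setting on task accuracy has to be measured separately.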

In conclusion, FastV addresses the inefficiency of attention computation in LVLMs, particularly in how visual tokens are handled. It reduces computational cost without sacrificing output quality across a range of vision-language tasks. Overall, FastV is a significant step towards improving the computational efficiency and practical deployment of LVLMs, and a promising answer to the resource constraints of real-world applications.

