CVPR 2022 Oral | Divide-and-Conquer Self-Attention via Multi-Scale Token Aggregation - Code Released
**Introduction**
Vision Transformers (ViTs) have emerged as a powerful class of models for computer vision, achieving strong results across a wide range of benchmarks. However, the self-attention mechanism they rely on can be computationally expensive, since its cost grows quadratically with the number of tokens and therefore with image resolution.
This paper introduces a novel Divide-and-Conquer Self-Attention (DCSA) mechanism that addresses the computational challenges of self-attention in ViTs. DCSA decomposes self-attention into a hierarchy of local and global operations, enabling efficient and effective modeling of multi-scale dependencies in visual data.
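To make the motivation concrete, here is a quick back-of-the-envelope comparison of attention cost, measured in token-pair interactions, for full self-attention versus a local-window plus pooled-global split. The window size and pooling ratio below are illustrative choices, not values taken from the paper; they only show why restricting attention locally and aggregating tokens for the global step shrinks the pair count.

```python
# Illustrative comparison of attention cost (token-pair interactions) for
# full self-attention vs. a local + pooled-global decomposition.
# The window size and pooling ratio are illustrative, not from the paper.

def full_attention_pairs(n_tokens: int) -> int:
    """Full self-attention compares every token with every token: O(N^2)."""
    return n_tokens * n_tokens

def local_plus_global_pairs(n_tokens: int, window: int, pool: int) -> int:
    """Local attention within windows of `window` tokens, plus global
    attention over a token set pooled by a factor of `pool`."""
    local = n_tokens * window            # each token attends within its window
    global_ = (n_tokens // pool) ** 2    # attention over the pooled token set
    return local + global_

n = 56 * 56  # tokens for a 224x224 image with 4x4 patches
print(full_attention_pairs(n))               # 9,834,496
print(local_plus_global_pairs(n, 49, 4))     # 768,320
```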
**Method**
The proposed DCSA mechanism consists of three main components:
* **Local Self-Attention:** Computes attention within small, localized regions of the image, capturing local dependencies.
* **Global Self-Attention:** Aggregates local attention results and computes attention across the entire image, capturing global dependencies.
* **Multi-Scale Token Aggregation:** Fuses the local and global attention results using a multi-scale aggregation strategy, combining information from different scales.
By decomposing self-attention into a hierarchy of local and global operations, DCSA significantly reduces the computational cost while maintaining the ability to model multi-scale dependencies effectively.
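The following is a minimal PyTorch sketch of this decomposition, assuming square feature maps, window-based local attention, average-pooled tokens as keys and values for the global branch, and a simple learned fusion of the two branches. The module names, window size, and pooling ratio are illustrative assumptions; the released code should be consulted for the authors' actual implementation.

```python
# A minimal sketch of the three components described above, assuming windowed
# local attention, average-pooled global tokens, and a linear fusion layer.
# All names and hyperparameters are illustrative, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCSABlockSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, window: int = 7, pool: int = 4):
        super().__init__()
        self.window, self.pool = window, pool
        # Local branch: attention restricted to non-overlapping windows.
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Global branch: attention over spatially pooled (aggregated) tokens.
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Multi-scale aggregation: fuse the two branches with a linear projection.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map, with H and W divisible by `window` and `pool`.
        B, H, W, C = x.shape
        w = self.window

        # --- Local self-attention within w x w windows ---
        win = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(B * (H // w) * (W // w), w * w, C)
        local, _ = self.local_attn(win, win, win)
        local = local.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        local = local.reshape(B, H, W, C)

        # --- Global self-attention: all tokens query a pooled (aggregated) token set ---
        q = x.reshape(B, H * W, C)
        kv = F.avg_pool2d(x.permute(0, 3, 1, 2), self.pool)   # (B, C, H/p, W/p)
        kv = kv.flatten(2).transpose(1, 2)                     # (B, HW/p^2, C)
        global_, _ = self.global_attn(q, kv, kv)
        global_ = global_.view(B, H, W, C)

        # --- Multi-scale token aggregation: combine local and global results ---
        return self.fuse(torch.cat([local, global_], dim=-1))

x = torch.randn(2, 28, 28, 64)      # toy feature map
y = DCSABlockSketch(dim=64)(x)
print(y.shape)                       # torch.Size([2, 28, 28, 64])
```

In this sketch the global branch stays cheap because its keys and values come from the pooled token set, while the local branch stays cheap because attention never crosses window boundaries; the fusion layer then mixes the two scales per token.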
**Results**
The proposed DCSA mechanism was evaluated on various image classification and object detection benchmarks, including ImageNet-1K, COCO, and PASCAL VOC. DCSA consistently outperformed existing self-attention mechanisms, achieving state-of-the-art results on multiple tasks.
Specifically, on ImageNet-1K classification, DCSA achieved a top-1 accuracy of 84.2%, surpassing the previous state of the art by 0.6 percentage points. On the COCO object detection benchmark, DCSA reached a box AP of 52.3, an improvement of 1.5 points over the previous state of the art.
**Conclusion**
This paper presents a novel Divide-and-Conquer Self-Attention (DCSA) mechanism for Transformer-based computer vision models. DCSA decomposes self-attention into a hierarchy of local and global operations, enabling efficient and effective modeling of multi-scale dependencies in visual data. The proposed method has achieved state-of-the-art results on various image classification and object detection benchmarks, demonstrating its potential for advancing the field of computer vision.
**Code Availability**
The code for the proposed DCSA mechanism has been open-sourced and is available at [Github URL]. This will facilitate further research and exploration in the field of computer vision.