CVPR 2022 Oral | Divide-and-Conquer Self-Attention via Multi-Scale Token Aggregation - Code Released

    **Introduction** 

    Vision Transformers (ViTs) have emerged as a powerful class of models for computer vision, achieving remarkable results across a wide range of benchmarks. However, the self-attention mechanism they rely on is computationally expensive: its cost grows quadratically with the number of tokens, which becomes prohibitive for high-resolution images.

    This paper introduces a novel Divide-and-Conquer Self-Attention (DCSA) mechanism that addresses the computational challenges of self-attention in ViTs. DCSA decomposes self-attention into a hierarchy of local and global operations, enabling efficient and effective modeling of multi-scale dependencies in visual data.
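
    To make this cost concrete, the short script below (a generic back-of-the-envelope illustration, not code from the paper) counts the entries of the per-head N×N attention matrix for a ViT that splits its input into 16×16 patches; the `attention_cost` helper is hypothetical.

```python
def attention_cost(image_size: int, patch_size: int = 16):
    """Tokens and per-head attention-matrix entries for a plain ViT
    that splits an image into non-overlapping patch tokens."""
    n_tokens = (image_size // patch_size) ** 2   # token count grows with resolution squared
    return n_tokens, n_tokens ** 2               # full self-attention forms an N x N matrix

for size in (224, 384, 800):
    n, entries = attention_cost(size)
    print(f"{size}x{size} image -> {n} tokens, {entries:,} attention entries per head")
```

    Going from a 224×224 to an 800×800 input multiplies the per-head attention entries by more than 160×, which is exactly the regime where vanilla self-attention becomes a bottleneck.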

    **Method** 

    The proposed DCSA mechanism consists of three main components:

    * **Local Self-Attention:**  Computes attention within small, localized regions of the image, capturing local dependencies.
    * **Global Self-Attention:**  Aggregates local attention results and computes attention across the entire image, capturing global dependencies.
    * **Multi-Scale Token Aggregation:**  Fuses the local and global attention results using a multi-scale aggregation strategy, combining information from different scales.

    By decomposing self-attention into a hierarchy of local and global operations, DCSA significantly reduces the computational cost while maintaining the ability to model multi-scale dependencies effectively.
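
    The sketch below illustrates this divide-and-conquer pattern in PyTorch: window-local attention, global attention over pooled tokens, and a learned fusion of the two branches. It is only a minimal approximation under assumed design choices (the `DCSABlock` name, the window and pooling sizes, and the concat-plus-linear fusion are illustrative, not the authors' exact architecture); see the released code for the real implementation.

```python
import torch
import torch.nn as nn

class DCSABlock(nn.Module):
    """Illustrative local + global attention block with multi-scale fusion.
    All layer choices here are assumptions for exposition only."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 7, pool: int = 4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pool = nn.AvgPool2d(pool, stride=pool)   # coarse tokens for the global branch
        self.fuse = nn.Linear(2 * dim, dim)           # assumed fusion: concat + linear projection

    def forward(self, x):  # x: (B, H, W, C); H and W divisible by `window` and `pool`
        B, H, W, C = x.shape
        w = self.window

        # Local branch: self-attention restricted to non-overlapping w x w windows.
        win = x.reshape(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(B * (H // w) * (W // w), w * w, C)
        local, _ = self.local_attn(win, win, win)
        local = local.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        local = local.reshape(B, H, W, C)

        # Global branch: every token attends to a pooled (coarser) set of tokens.
        queries = x.reshape(B, H * W, C)
        coarse = self.pool(x.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)
        global_out, _ = self.global_attn(queries, coarse, coarse)
        global_out = global_out.reshape(B, H, W, C)

        # Multi-scale aggregation: fuse the fine-grained and coarse-grained results.
        return self.fuse(torch.cat([local, global_out], dim=-1))

x = torch.randn(2, 28, 28, 64)          # quick shape check
print(DCSABlock(dim=64)(x).shape)       # torch.Size([2, 28, 28, 64])
```

    The point of the split is efficiency: the local branch only forms small per-window attention maps and the global branch attends to a pooled token set, so neither branch pays the full O((HW)²) cost of dense self-attention, while the fusion step keeps both fine and coarse context.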

    **Results** 

    The proposed DCSA mechanism was evaluated on various image classification and object detection benchmarks, including ImageNet-1K, COCO, and PASCAL VOC. DCSA consistently outperformed existing self-attention mechanisms, achieving state-of-the-art results on multiple tasks.

    Specifically, on ImageNet-1K classification, DCSA achieved a top-1 accuracy of 84.2%, surpassing the previous state-of-the-art by 0.6%. On the COCO object detection benchmark, DCSA achieved a box AP of 52.3%, improving upon the previous state-of-the-art by 1.5%.

    **Conclusion** 

    This paper presents a novel Divide-and-Conquer Self-Attention (DCSA) mechanism for Transformer-based computer vision models. DCSA decomposes self-attention into a hierarchy of local and global operations, enabling efficient and effective modeling of multi-scale dependencies in visual data. The proposed method has achieved state-of-the-art results on various image classification and object detection benchmarks, demonstrating its potential for advancing the field of computer vision.

    **Code Availability** 

    The code for the proposed DCSA mechanism has been open-sourced and is available at [GitHub URL]. This should facilitate further research and exploration in the field of computer vision.