CVPR 2022 Oral | Divide-and-Conquer Self-Attention via Multi-Scale Token Aggregation - Code Released

    **Introduction** 

    Vision Transformers (ViTs) have emerged as a powerful class of models for computer vision, achieving remarkable results across a wide range of benchmarks. However, the self-attention mechanism they rely on is computationally expensive: its cost grows quadratically with the number of tokens, which becomes prohibitive for high-resolution images.

    This paper introduces a novel Divide-and-Conquer Self-Attention (DCSA) mechanism that addresses the computational challenges of self-attention in ViTs. DCSA decomposes self-attention into a hierarchy of local and global operations, enabling efficient and effective modeling of multi-scale dependencies in visual data.
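
    To make this cost concrete, the short script below (a generic back-of-the-envelope illustration, not code from the paper) counts the entries of the per-head N×N attention matrix for a ViT that splits its input into 16×16 patches; the `attention_cost` helper is hypothetical.

```python
def attention_cost(image_size: int, patch_size: int = 16):
    """Tokens and per-head attention-matrix entries for a plain ViT
    that splits an image into non-overlapping patch tokens."""
    n_tokens = (image_size // patch_size) ** 2   # token count grows with resolution squared
    return n_tokens, n_tokens ** 2               # full self-attention forms an N x N matrix

for size in (224, 384, 800):
    n, entries = attention_cost(size)
    print(f"{size}x{size} image -> {n} tokens, {entries:,} attention entries per head")
```

    Going from a 224×224 to an 800×800 input multiplies the per-head attention entries by more than 160×, which is exactly the regime where vanilla self-attention becomes a bottleneck.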

    **Method** 

    The proposed DCSA mechanism consists of three main components:

    * **Local Self-Attention:**  Computes attention within small, localized regions of the image, capturing local dependencies.
    * **Global Self-Attention:**  Aggregates local attention results and computes attention across the entire image, capturing global dependencies.
    * **Multi-Scale Token Aggregation:**  Fuses the local and global attention results using a multi-scale aggregation strategy, combining information from different scales.

    By decomposing self-attention into a hierarchy of local and global operations, DCSA significantly reduces the computational cost while maintaining the ability to model multi-scale dependencies effectively.
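
    The sketch below illustrates this divide-and-conquer pattern in PyTorch: window-local attention, global attention over pooled tokens, and a learned fusion of the two branches. It is only a minimal approximation under assumed design choices (the `DCSABlock` name, the window and pooling sizes, and the concat-plus-linear fusion are illustrative, not the authors' exact architecture); see the released code for the real implementation.

```python
import torch
import torch.nn as nn

class DCSABlock(nn.Module):
    """Illustrative local + global attention block with multi-scale fusion.
    All layer choices here are assumptions for exposition only."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 7, pool: int = 4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pool = nn.AvgPool2d(pool, stride=pool)   # coarse tokens for the global branch
        self.fuse = nn.Linear(2 * dim, dim)           # assumed fusion: concat + linear projection

    def forward(self, x):  # x: (B, H, W, C); H and W divisible by `window` and `pool`
        B, H, W, C = x.shape
        w = self.window

        # Local branch: self-attention restricted to non-overlapping w x w windows.
        win = x.reshape(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(B * (H // w) * (W // w), w * w, C)
        local, _ = self.local_attn(win, win, win)
        local = local.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        local = local.reshape(B, H, W, C)

        # Global branch: every token attends to a pooled (coarser) set of tokens.
        queries = x.reshape(B, H * W, C)
        coarse = self.pool(x.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)
        global_out, _ = self.global_attn(queries, coarse, coarse)
        global_out = global_out.reshape(B, H, W, C)

        # Multi-scale aggregation: fuse the fine-grained and coarse-grained results.
        return self.fuse(torch.cat([local, global_out], dim=-1))

x = torch.randn(2, 28, 28, 64)          # quick shape check
print(DCSABlock(dim=64)(x).shape)       # torch.Size([2, 28, 28, 64])
```

    The point of the split is efficiency: the local branch only forms small per-window attention maps and the global branch attends to a pooled token set, so neither branch pays the full O((HW)²) cost of dense self-attention, while the fusion step keeps both fine and coarse context.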

    **Results** 

    The proposed DCSA mechanism was evaluated on various image classification and object detection benchmarks, including ImageNet-1K, COCO, and PASCAL VOC. DCSA consistently outperformed existing self-attention mechanisms, achieving state-of-the-art results on multiple tasks.

    Specifically, on ImageNet-1K classification, DCSA achieved a top-1 accuracy of 84.2%, surpassing the previous state-of-the-art by 0.6%. On the COCO object detection benchmark, DCSA achieved a box AP of 52.3%, improving upon the previous state-of-the-art by 1.5%.

    **Conclusion** 

    This paper presents a novel Divide-and-Conquer Self-Attention (DCSA) mechanism for Transformer-based computer vision models. DCSA decomposes self-attention into a hierarchy of local and global operations, enabling efficient and effective modeling of multi-scale dependencies in visual data. The proposed method has achieved state-of-the-art results on various image classification and object detection benchmarks, demonstrating its potential for advancing the field of computer vision.

    **Code Availability** 

    The code for the proposed DCSA mechanism has been open-sourced and is available at [GitHub URL]. This should facilitate further research and exploration in the field of computer vision.