Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers

1Sichuan Changhong Electric Co., Ltd., 2UESTC, 3University of Glasgow,
4G42, 5ETH Zurich, 6CCST, NUAA, 7MBZUAI
CVPR 2023

Left: Visual comparison of COD in different challenging scenarios. Right: Different types of decoding structures
for object segmentation: (a) U-shaped decoding structure, (b) dense integration strategy, (c) feedback
refinement strategy, (d) separate decoding of low-level and high-level features, and (e) our decoding structure.

Abstract

Vision transformers have recently shown strong global context modeling capabilities in camouflaged object detection. However, they suffer from two major limitations: less effective locality modeling and insufficient feature aggregation in decoders. Both are detrimental to camouflaged object detection, which relies on subtle cues to separate objects from nearly indistinguishable backgrounds.

To address these issues, in this paper we propose a novel transformer-based Feature Shrinkage Pyramid Network (FSPNet), which hierarchically decodes locality-enhanced neighboring transformer features through progressive shrinking for camouflaged object detection. Specifically, we propose a non-local token enhancement module (NL-TEM) that employs the non-local mechanism to let neighboring tokens interact and explores graph-based high-order relations within tokens to enhance the local representations of transformers. Moreover, we design a feature shrinkage decoder (FSD) with adjacent interaction modules (AIMs), which progressively aggregates adjacent transformer features through a layer-by-layer shrinkage pyramid to accumulate as many imperceptible but effective cues as possible for object information decoding.

Extensive quantitative and qualitative experiments demonstrate that the proposed model significantly outperforms 24 existing competitors on three challenging COD benchmark datasets under six widely used evaluation metrics.

Method

FSPNet architecture

Overall architecture of the proposed FSPNet. It consists of three key components: a ViT-based encoder, a non-local token enhancement module (NL-TEM), and a feature shrinkage decoder (FSD) with adjacent interaction modules (AIMs). Specifically, the input image is first serialized into tokens and fed to the transformer encoder, which models global context with the self-attention mechanism. Then, to strengthen the local feature representation within tokens, the NL-TEM performs feature interaction and exploration between and within tokens and converts the enhanced tokens from the encoder space to the decoder space for decoding. Finally, to merge and retain subtle but critical cues as much as possible, the FSD progressively aggregates adjacent features through layer-by-layer shrinkage to decode object information.
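
To make the encoder side concrete, here is a minimal PyTorch sketch of the serialization and encoding step: the image is cut into non-overlapping patch tokens, run through a stack of self-attention blocks, and the token features of every block are kept for the decoder. The module name TokenEncoder and the patch size, depth, and width are illustrative assumptions, not the released implementation.

    import torch
    import torch.nn as nn

    class TokenEncoder(nn.Module):
        # Sketch only: sizes follow common ViT defaults, not FSPNet's exact config.
        def __init__(self, img_size=384, patch_size=16, dim=768, depth=12, heads=12):
            super().__init__()
            num_tokens = (img_size // patch_size) ** 2
            # Serialize the image into non-overlapping patch tokens.
            self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
            self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))
            self.blocks = nn.ModuleList(
                nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
                for _ in range(depth)
            )

        def forward(self, image):
            tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, C)
            tokens = tokens + self.pos_embed
            per_layer = []
            for block in self.blocks:  # global context via self-attention
                tokens = block(tokens)
                per_layer.append(tokens)  # keep every layer's tokens for the decoder
            return per_layer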
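
The NL-TEM is described above only in prose; the sketch below shows one plausible reading of the non-local interaction between neighboring token groups, in which tokens of one group attend to all tokens of the adjacent group and are enhanced residually. The graph-based high-order relation reasoning within tokens is omitted, and all names and shapes here are assumptions rather than the authors' design.

    import torch
    import torch.nn as nn

    class NonLocalTokenEnhancement(nn.Module):
        # Hypothetical reading of NL-TEM: cross-group non-local interaction.
        def __init__(self, dim):
            super().__init__()
            self.query = nn.Linear(dim, dim)
            self.key = nn.Linear(dim, dim)
            self.value = nn.Linear(dim, dim)
            self.proj = nn.Linear(dim, dim)

        def forward(self, tokens_a, tokens_b):
            # Non-local affinity: every token in group A attends to every
            # token in the neighboring group B.
            q = self.query(tokens_a)   # (B, N, C)
            k = self.key(tokens_b)     # (B, N, C)
            v = self.value(tokens_b)   # (B, N, C)
            affinity = torch.softmax(
                q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1
            )                          # (B, N, N)
            # Residual enhancement of group A with context gathered from B.
            return tokens_a + self.proj(affinity @ v)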
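
Similarly, the shrinkage pyramid can be sketched as pairwise fusion: each AIM merges two adjacent features, so the number of feature streams halves at every decoder level until a single prediction map remains (e.g., 8 → 4 → 2 → 1, so no encoder level is discarded). In this sketch the enhanced tokens are assumed to have already been reshaped into (B, C, H, W) maps, the AIM body is simplified to concatenation plus convolution, and a power-of-two number of inputs is assumed.

    import torch
    import torch.nn as nn

    class AdjacentInteractionModule(nn.Module):
        # Simplified AIM: fuse two adjacent features via concat + conv.
        def __init__(self, channels):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )

        def forward(self, feat_a, feat_b):
            return self.fuse(torch.cat([feat_a, feat_b], dim=1))

    class FeatureShrinkageDecoder(nn.Module):
        # Layer-by-layer shrinkage: n features -> n/2 -> ... -> 1.
        def __init__(self, channels, num_inputs):
            super().__init__()
            assert num_inputs > 0 and num_inputs & (num_inputs - 1) == 0  # power of two
            self.levels = nn.ModuleList()
            n = num_inputs
            while n > 1:
                self.levels.append(nn.ModuleList(
                    AdjacentInteractionModule(channels) for _ in range(n // 2)
                ))
                n //= 2
            self.head = nn.Conv2d(channels, 1, kernel_size=1)  # prediction map

        def forward(self, feats):
            for aims in self.levels:
                # Fuse each pair of adjacent features, halving the count.
                feats = [aim(feats[2 * i], feats[2 * i + 1]) for i, aim in enumerate(aims)]
            return self.head(feats[0])

For example, given eight features of shape (2, 64, 24, 24), FeatureShrinkageDecoder(64, 8) shrinks them over three levels and returns a (2, 1, 24, 24) prediction map.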



Experiments & Results

Quantitative Comparisons

Quantitative comparison with 24 SOTA methods on three benchmark datasets. The best and second-best results are highlighted in bold and underlined, respectively.

Visual Comparisons

Visual comparison with some representative SOTA models in challenging scenarios.

More visual results on different datasets.



Poster



Related Works

  • Youwei Pang, Xiaoqi Zhao, Tian-Zhu Xiang, Lihe Zhang, and Huchuan Lu. Zoom in and out: A mixed-scale triplet network for camouflaged object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  • Yujia Sun, Shuo Wang, Chenglizhao Chen, and Tian-Zhu Xiang. Boundary-guided camouflaged object detection. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI), 2022.

  • Deng-Ping Fan, Ge-Peng Ji, Ming-Ming Cheng, and Ling Shao. Concealed object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

  • Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

  • Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR), 2021.

  • Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. International Conference on Machine Learning (ICML), 2021.

BibTeX

    @inproceedings{Huang2023Feature,
        author    = {Huang, Zhou and Dai, Hang and Xiang, Tian-Zhu and Wang, Shuo and Chen, Huai-Xin and Qin, Jie and Xiong, Huan},
        title     = {Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers},
        booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        year      = {2023}
    }