Left: Visual comparison of camouflaged object detection (COD) in different challenging scenarios. Right: Different types of decoding structures
for object segmentation: (a) U-shaped decoding structure, (b) dense integration strategy, (c) feedback
refinement strategy, (d) separate decoding of low-level and high-level features, and (e) our decoding structure.
Vision transformers have recently shown strong global context modeling capabilities in camouflaged object detection.
However, they suffer from two major limitations: less effective locality modeling and insufficient feature aggregation
in decoders, both of which hinder camouflaged object detection, a task that relies on subtle cues to separate objects from indistinguishable backgrounds.
To address these issues, in this paper, we propose a novel transformer-based Feature Shrinkage Pyramid Network (FSPNet),
which aims to hierarchically decode locality-enhanced neighboring transformer features through progressive shrinking for camouflaged
object detection. Specifically, we propose a nonlocal token enhancement module (NL-TEM) that employs the non-local mechanism to
interact neighboring tokens and explore graph-based high-order relations within tokens to enhance local representations of transformers.
Moreover, we design a feature shrinkage decoder (FSD) with adjacent interaction modules (AIM), which progressively aggregates adjacent
transformer features through a layer-bylayer shrinkage pyramid to accumulate imperceptible but effective cues as much as possible for
object information decoding.
Extensive quantitative and qualitative experiments demonstrate that the proposed model significantly
outperforms 24 existing competitors on three challenging COD benchmark datasets under six widely used evaluation metrics.
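For intuition, the following is a minimal PyTorch sketch of the non-local interaction between neighboring token groups that NL-TEM builds on. The class and variable names (e.g. `NonLocalTokenInteraction`) are hypothetical, and the sketch omits the graph-based high-order reasoning of the full module; it only illustrates the attention-style token interaction under those assumptions.

```python
# Hypothetical sketch of non-local interaction between neighboring token
# groups (the NL-TEM idea); names are illustrative, not the authors' code.
import torch
import torch.nn as nn

class NonLocalTokenInteraction(nn.Module):
    """Non-local (attention-style) interaction between two neighboring
    groups of transformer tokens, each of shape (B, N, C)."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, tokens_a, tokens_b):
        # Queries come from one token group, keys/values from its neighbor,
        # so each token aggregates context from the adjacent group.
        q = self.query(tokens_a)                       # (B, N, C)
        k = self.key(tokens_b)                         # (B, N, C)
        v = self.value(tokens_b)                       # (B, N, C)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, N, N)
        attn = attn.softmax(dim=-1)
        out = attn @ v                                 # (B, N, C)
        return tokens_a + self.proj(out)               # residual enhancement

# Example: enhance 196 tokens (14x14 patches, dim 384) with a neighbor group.
x1 = torch.randn(2, 196, 384)
x2 = torch.randn(2, 196, 384)
enhanced = NonLocalTokenInteraction(384)(x1, x2)
print(enhanced.shape)  # torch.Size([2, 196, 384])
```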
Overall architecture of the proposed FSPNet. It consists of three key components: a ViT-based encoder, a non-local token enhancement module (NL-TEM), and a feature shrinkage decoder (FSD) with adjacent interaction modules (AIM). Specifically, the input image is first serialized into tokens that are fed to a transformer encoder, which models global contexts using the self-attention mechanism. After that, to strengthen the local feature representation within tokens, the non-local token enhancement module (NL-TEM) performs feature interaction and exploration between and within tokens and converts the enhanced tokens from the encoder space to the decoder space for decoding. In the decoder, to merge and retain subtle but critical cues as much as possible, the feature shrinkage decoder (FSD) progressively aggregates adjacent features through layer-by-layer shrinkage to decode object information.
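As a structural illustration of the layer-by-layer shrinkage in FSD, here is a simplified PyTorch sketch that fuses adjacent features pairwise until a single decoded feature remains. `AdjacentFusion` is a hypothetical stand-in for the paper's AIM, whose actual design is more elaborate; the sketch assumes a power-of-two number of input features purely for clarity.

```python
# Simplified sketch of the FSD shrinkage pyramid: adjacent features are
# merged pairwise (an AIM-like fusion) until one feature remains.
# Module names are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class AdjacentFusion(nn.Module):
    """Fuse two adjacent features of shape (B, C, H, W) into one."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_a, feat_b):
        return self.fuse(torch.cat([feat_a, feat_b], dim=1))

class ShrinkagePyramid(nn.Module):
    """Progressively halve the number of features via pairwise fusion."""
    def __init__(self, channels, num_features):
        super().__init__()
        self.levels = nn.ModuleList()
        n = num_features
        while n > 1:
            self.levels.append(
                nn.ModuleList(AdjacentFusion(channels) for _ in range(n // 2))
            )
            n //= 2

    def forward(self, feats):  # feats: list of (B, C, H, W), length 2^k
        for level in self.levels:
            feats = [m(feats[2 * i], feats[2 * i + 1])
                     for i, m in enumerate(level)]
        return feats[0]

# Example: shrink 8 encoder features down to one decoded feature.
feats = [torch.randn(2, 64, 22, 22) for _ in range(8)]
decoded = ShrinkagePyramid(64, 8)(feats)
print(decoded.shape)  # torch.Size([2, 64, 22, 22])
```

The pairwise tree structure is what lets subtle cues from every encoder level contribute to the final decoded feature, rather than being diluted through a single top-down path as in a U-shaped decoder.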
Quantitative comparison with 24 SOTA methods on three benchmark datasets. The best and second-best results are bolded and underlined, respectively.
Visual comparison with some representative SOTA models in challenging scenarios.
More visual results on different datasets.
@inproceedings{Huang2023Feature,
author = {Huang, Zhou and Dai, Hang and Xiang, Tian-Zhu and Wang, Shuo and Chen, Huai-Xin and Qin, Jie and Xiong, Huan},
title = {Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023}
}