The human visual perception system demonstrates exceptional capabilities in learning without explicit supervision and understanding the part-to-whole composition of objects. Drawing inspiration from these two abilities, we propose Hierarchical Adaptive Self-Supervised Object Detection (HASSOD), a novel approach that learns to detect objects and understand their compositions without human supervision.
HASSOD employs a hierarchical adaptive clustering strategy to group regions into object masks based on self-supervised visual representations, adaptively determining the number of objects per image. Furthermore, HASSOD identifies the hierarchical levels of objects in terms of composition, by analyzing coverage relations between masks and constructing tree structures. This additional self-supervised learning task leads to improved detection performance and enhanced interpretability. Lastly, we abandon the inefficient multi-round self-training process utilized in prior methods and instead adapt the Mean Teacher framework from semi-supervised learning, which leads to a smoother and more efficient training process.
Through extensive experiments on widely used image datasets, we demonstrate the superiority of HASSOD over existing methods, thereby advancing the state of the art in self-supervised object detection. Notably, we improve Mask AR from 20.2 to 22.5 on LVIS, and from 17.0 to 26.0 on SA-1B.
HASSOD adopts a two-stage discover-and-learn process to learn a self-supervised object detector. In the first stage, we discover objects from unlabeled images using self-supervised representations, and generate a set of initial pseudo-labels. Then in the second stage, we learn an object detector based on the initial pseudo-labels, and smoothly refine the model by self-training.
The first stage is based on pre-trained, fixed visual features, and the second stage learns an object detector to improve over the fixed visual features and pseudo-labels.
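The smooth refinement in the second stage relies on the Mean Teacher framework, in which a teacher model is kept as an exponential moving average (EMA) of the student and supplies the training targets. The snippet below is a minimal sketch of that EMA update only, not HASSOD's actual training code: the function name `ema_update`, the use of plain dictionaries of floats in place of network tensors, and the momentum values are all assumptions made for illustration.

```python
from typing import Dict


def ema_update(
    teacher: Dict[str, float],
    student: Dict[str, float],
    momentum: float = 0.999,  # illustrative value, not taken from the paper
) -> None:
    """Move each teacher weight a small step toward the student weight.

    teacher <- momentum * teacher + (1 - momentum) * student
    The teacher dictionary is updated in place.
    """
    for name, student_value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student_value


# Hypothetical single-weight "models" to show the behaviour.
teacher_weights = {"w": 1.0}
student_weights = {"w": 0.0}

ema_update(teacher_weights, student_weights, momentum=0.5)
print(teacher_weights)  # {'w': 0.5}

ema_update(teacher_weights, student_weights, momentum=0.5)
print(teacher_weights)  # {'w': 0.25}
```

Because the teacher changes slowly, the pseudo-labels it produces drift gradually instead of being regenerated in discrete rounds, which is the intuition behind replacing multi-round self-training.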
HASSOD creates a set of pseudo-labels as the initial self-supervision source. We propose a hierarchical adaptive clustering strategy to discover object masks as pseudo-labels, using only unlabeled images and a frozen self-supervised visual backbone. Most importantly, we incorporate the concept of hierarchical levels into object masks by leveraging the coverage relations between them.
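The idea of deriving hierarchical levels from coverage relations can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: masks are represented as sets of pixel coordinates, the coverage threshold of 0.9 is an assumed value, and the helper names (`covers`, `build_hierarchy`, `levels`, `rect`) are hypothetical. A mask is treated as covering another when it is strictly larger and contains most of the other's pixels; each mask's parent is the smallest mask that covers it, and its level is its depth in the resulting tree.

```python
from typing import List, Optional, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]


def covers(a: Mask, b: Mask, threshold: float = 0.9) -> bool:
    """True if `a` is strictly larger than `b` and holds >= threshold of b's pixels."""
    if not b or len(a) <= len(b):
        return False
    return len(a & b) / len(b) >= threshold


def build_hierarchy(masks: List[Mask], threshold: float = 0.9) -> List[Optional[int]]:
    """Return, for each mask, the index of the smallest mask covering it (or None)."""
    parents: List[Optional[int]] = []
    for i, child in enumerate(masks):
        candidates = [
            j for j, m in enumerate(masks) if j != i and covers(m, child, threshold)
        ]
        parents.append(
            min(candidates, key=lambda j: len(masks[j])) if candidates else None
        )
    return parents


def levels(parents: List[Optional[int]]) -> List[int]:
    """Depth of each mask in its tree: 0 for a root, 1 for its children, and so on."""
    def depth(i: int) -> int:
        d = 0
        while parents[i] is not None:  # parents are strictly larger, so no cycles
            i = parents[i]
            d += 1
        return d
    return [depth(i) for i in range(len(parents))]


def rect(r0: int, r1: int, c0: int, c1: int) -> Mask:
    """Hypothetical helper: a rectangular mask over rows [r0, r1) and cols [c0, c1)."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# Toy example: a large region, a region inside it, a tiny region inside that,
# and an unrelated region elsewhere in the image.
masks = [rect(0, 10, 0, 10), rect(2, 8, 2, 8), rect(4, 5, 4, 5), rect(20, 25, 20, 25)]
parents = build_hierarchy(masks)
print(parents)          # [None, 0, 1, None]
print(levels(parents))  # [0, 1, 2, 0]
```

In this toy example the depths 0, 1, and 2 play the role of increasingly fine levels of composition; how the levels are named and used as training targets is left to the method itself.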
HASSOD significantly outperforms previous self-supervised methods (e.g., FreeSOLO and CutLER) in terms of average recall (Mask AR) at all object scales (Small, Medium, and Large). For example, HASSOD improves Mask AR from 17.0 to 26.0 on SA-1B. HASSOD also narrows the gap between fully self-supervised models and the supervised model SAM. Notably, HASSOD uses only 1/5 of the training images and 1/12 of the training iterations that CutLER requires.
@inproceedings{cao2023hassod,
title={{HASSOD}: Hierarchical Adaptive Self-Supervised Object Detection},
author={Cao, Shengcao and Joshi, Dhiraj and Gui, Liangyan and Wang, Yu-Xiong},
booktitle={NeurIPS},
year={2023}
}