Authors:
(1) Hyeongjun Kwon, Yonsei University;
(2) Jinhyun Jang, Yonsei University;
(3) Jin Kim, Yonsei University;
(4) Kwonyoung Kim, Yonsei University;
(5) Kwanghoon Sohn, Yonsei University and Korea Institute of Science and Technology (KIST).
4. Method
4.2. Probabilistic hierarchy tree
4.3. Visual hierarchy decomposition
4.4. Learning hierarchy in hyperbolic space
4.5. Visual hierarchy encoding
5. Experiments and 5.1. Image classification
5.2. Object detection and Instance segmentation
6. Ablation studies and discussion
Hierarchy-aware visual recognition. Unsupervised image parsing has been a long-standing pursuit in visual recognition since the classical computer vision era [39–43]. Before deep learning, Zhouwen et al. [42] first introduced a framework that parses images into their constituents via a divide-and-conquer strategy. CapsuleNet-based methods have substantially improved image parsing through dynamic routing, which efficiently captures the compositional relationships among the activities of capsules that represent object parts. Recently, hierarchical semantic segmentation has been extensively studied, including human parsing [44, 45] based on the human-part hierarchy and unsupervised part segmentation [46–48].
Beyond image parsing, recent research on deep neural networks (DNNs) has attempted to exploit hierarchical relationships between detailed and global representations. CrossViT [16] utilizes a dual-branch transformer for multiscale feature extraction, enriching features through a fusion module that integrates inter-scale patch relationships. Quadtree [18] iteratively and hierarchically selects a subset of crucial finer patches within each coarse patch. DependencyViT [19] inverts the self-attention process to organize patches as parent and child nodes, enabling hierarchical exploration. More recently, CAST [49] employs superpixel-based patch generation and graph pooling for hierarchical patch merging to improve fine-grained recognition performance. While these methods define hierarchical relations through token similarities in Euclidean space, we predefine a hierarchical structure with probabilistic modeling and learn the relations in hyperbolic space.
Probabilistic modeling. Probabilistic representations were extensively explored in early NLP studies to handle the nuances of word semantics with probability distributions. For instance, Vilnis et al. [30] first introduced probability densities for representing word embeddings. Athiwaratkun et al. [31] discovered that an imbalance in word frequency leads to distortions in word order and mitigated the problem by representing word orders through the encapsulation of probability densities. Beyond word representation, abundant research has demonstrated the effectiveness of probabilistic modeling in visual representation [6, 50–52]. For example, Shi et al. [51] proposed penalizing low-quality face images by measuring the variance of each image distribution. Chun et al. [50] identified the limitations of deterministic modeling in vision-language domains and introduced probabilistic cross-modal embeddings to provide uncertainty estimates.
In this work, we use probabilistic modeling to define the hierarchical structure, where each distribution represents the inclusion relations of hierarchy nodes.
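To make the idea of inclusion between distributions concrete, the following minimal sketch uses the KL divergence between two diagonal Gaussians as an asymmetric score, in the spirit of Gaussian word embeddings. It is an illustration only and not the formulation used in this paper; the function name `kl_diag_gaussians` and the toy "part" and "whole" distributions are assumptions introduced for this example.

```python
import math

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for diagonal Gaussians given as lists of means and variances."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        # Per-dimension closed form of the Gaussian KL divergence.
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total

# Hypothetical 2-D example: a narrow "part" distribution lying inside a broad "whole" one.
part = ([0.5, 0.0], [0.1, 0.1])    # (means, variances)
whole = ([0.0, 0.0], [1.0, 1.0])

kl_part_in_whole = kl_diag_gaussians(*part, *whole)
kl_whole_in_part = kl_diag_gaussians(*whole, *part)

# The divergence is asymmetric: the narrow distribution is "covered" by the
# broad one at a much lower cost than the reverse direction.
print(kl_part_in_whole < kl_whole_in_part)
```

The asymmetry is what makes such a score usable for directed relations: a low KL(part || whole) together with a high KL(whole || part) suggests that the first distribution is contained in the second rather than the other way around.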
Hyperbolic manifold. Hyperbolic manifolds have gained increasing interest in deep learning due to their effectiveness in modeling hierarchical structures. Their success in NLP [24, 26, 27, 53] has inspired approaches that adopt hyperbolic manifolds in computer vision research such as image retrieval [1, 2], image segmentation [54, 55], and few-shot learning [25]. As a pioneering work, Khrulkov et al. [56] investigated an exponential map from Euclidean space to hyperbolic space for learning hierarchical image embeddings. Ermolov et al. [1] applied a pair-wise cross-entropy loss in hyperbolic space for ViTs. Kim et al. [2] extended this work by discovering the latent hierarchy of training data with learnable hierarchical proxies in hyperbolic space. Focusing on pixel-level analysis, [55] identified long-tail objects by embedding masked instance regions into hyperbolic manifolds. More recently, Desai et al. [29] proposed learning a joint image-text embedding space on a hyperbolic manifold. While these methods explore hyperbolic manifolds for representing categorical hierarchies, we identify the hierarchical structure of visual elements without part-level annotations through a novel hierarchical contrastive loss.
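The exponential map mentioned above can be sketched for the Poincaré ball model, one common model of hyperbolic space. The snippet below is a minimal pure-Python illustration under that assumption and is not taken from any of the cited implementations; the function names `expmap0` and `poincare_dist` and the curvature parameter `c` (curvature is -c) are chosen for this example.

```python
import math

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature -c.

    Maps a Euclidean (tangent) vector v to a point strictly inside the ball.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]

def poincare_dist(x, y, c=1.0):
    """Geodesic distance between two points of the Poincare ball with curvature -c."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    arg = 1.0 + 2.0 * c * sq_diff / ((1.0 - c * sq_x) * (1.0 - c * sq_y))
    return math.acosh(arg) / math.sqrt(c)

# Hypothetical Euclidean feature with norm 0.5.
v = [0.3, 0.4]
x = expmap0(v)
origin = [0.0, 0.0]

# With this convention the mapped point lies at hyperbolic distance 2 * ||v||
# from the origin, while its Euclidean norm tanh(||v||) stays below 1.
print(poincare_dist(origin, x))
```

Because distances grow rapidly toward the boundary of the ball, points mapped from large-norm vectors end up far apart from each other, which is the property that makes this geometry a natural fit for tree-like structures.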
This paper is available on arXiv under the CC BY 4.0 DEED license.