Authors:
(1) Hyeongjun Kwon, Yonsei University;
(2) Jinhyun Jang, Yonsei University;
(3) Jin Kim, Yonsei University;
(4) Kwonyoung Kim, Yonsei University;
(5) Kwanghoon Sohn, Yonsei University and Korea Institute of Science and Technology (KIST).
4. Method
4.2. Probabilistic hierarchy tree
4.3. Visual hierarchy decomposition
4.4. Learning hierarchy in hyperbolic space
4.5. Visual hierarchy encoding
5. Experiments and 5.1. Image classification
5.2. Object detection and Instance segmentation
6. Ablation studies and discussion
Our goal is to enhance the structured understanding of pre-trained deep neural networks (DNNs) by investigating the hierarchical organization of visual scenes. To this end, we introduce a Visual Hierarchy Mapper (Hi-Mapper), which serves as a plug-and-play module on top of any pre-trained DNN. An overview of Hi-Mapper is depicted in Fig. 2a.
Hi-Mapper identifies the visual hierarchy from the visual feature map v_map and encodes the identified visual hierarchy back into the global visual representation v_cls to enhance the recognition of the whole scene. To this end, we predefine a hierarchy tree with Gaussian distributions, where the relations between hierarchy nodes are defined through the inclusion of probability densities (Sec. 4.2). The pre-defined hierarchy tree interacts with v_map through the hierarchy decomposition module D such that the feature map is decomposed into the visual hierarchy (Sec. 4.3). Since the zero curvature of Euclidean space is not optimal for representing hierarchical structure, we map the visual hierarchy to hyperbolic space and optimize the relations with a novel hierarchical contrastive loss (Sec. 4.4). The visual hierarchy is then encoded back into the global visual representation v_cls through the hierarchy encoding module G, resulting in an enhanced global representation (Sec. 4.5).
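As a rough road map, the sketch below shows how these components could be composed in PyTorch-style pseudocode. It is not the authors' implementation: the function names `decompose`, `to_hyperbolic`, and `encode`, as well as the tensor shapes, are our own illustrative assumptions standing in for the modules detailed in the following subsections.

```python
def hi_mapper_forward(v_map, v_cls, tree, decompose, to_hyperbolic, encode):
    """Illustrative end-to-end flow of Hi-Mapper.
    v_map: (B, N, C) patch-level features, v_cls: (B, 1, C) global token.
    `decompose` (module D), `to_hyperbolic`, and `encode` (module G) stand in
    for the components sketched in the following subsections."""
    # Sec. 4.3: decompose the feature map into visual hierarchy nodes,
    # guided by the pre-defined probabilistic hierarchy tree (Sec. 4.2).
    hierarchy = decompose(tree, v_map)          # (B, M, C) node features
    # Sec. 4.4: lift the nodes into hyperbolic space, where the hierarchical
    # contrastive loss is computed at training time.
    hierarchy_hyp = to_hyperbolic(hierarchy)
    # Sec. 4.5: encode the visual hierarchy back into the global token.
    v_cls_enhanced = encode(v_cls, hierarchy)
    return v_cls_enhanced, hierarchy_hyp
```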
4.2. Probabilistic hierarchy tree

The main problem with recent hierarchy-aware ViTs [18–20] is that they define the hierarchical relations between image tokens mainly through self-attention scores. Such a symmetric measurement is suboptimal for representing the asymmetric, inclusive relation between parent and child nodes. To handle this problem, we propose to define an L-level hierarchy tree T by modeling each hierarchy node as a probability distribution.

Specifically, we first parameterize each leaf-level node (at the initial level) as a unique Gaussian distribution and subsequently define each higher-level node as a Mixture-of-Gaussians (MoG) of its corresponding child nodes. Accordingly, the mean vector represents the cluster center of a visual semantic and the covariance captures the scale of each semantic cluster.
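To make this construction concrete, here is a minimal sketch of such a probabilistic tree, assuming diagonal covariances, equally weighted mixtures, and a fixed branching factor; the class name, the grouping of children into parents, and all hyperparameters are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class ProbabilisticHierarchyTree(nn.Module):
    """Illustrative hierarchy tree T: each leaf node is a Gaussian
    N(mu, diag(sigma^2)); a parent node is the Mixture-of-Gaussians
    of its child nodes."""

    def __init__(self, num_leaves=64, dim=256, children_per_parent=4):
        super().__init__()
        self.k = children_per_parent
        self.mu = nn.Parameter(torch.randn(num_leaves, dim))         # cluster centers
        self.log_sigma = nn.Parameter(torch.zeros(num_leaves, dim))  # cluster scales

    def parent_moments(self, mu, sigma_sq):
        """Mean/variance of an equally weighted MoG over groups of k children.
        mu, sigma_sq: (num_nodes, dim) -> (num_nodes // k, dim)."""
        n, d = mu.shape
        mu_g = mu.view(n // self.k, self.k, d)
        var_g = sigma_sq.view(n // self.k, self.k, d)
        parent_mu = mu_g.mean(dim=1)
        # law of total variance for a uniform mixture of Gaussians
        parent_var = var_g.mean(dim=1) + (mu_g - parent_mu.unsqueeze(1)).pow(2).mean(dim=1)
        return parent_mu, parent_var

    def build(self, levels=3):
        """Return per-level (mu, var) pairs from the leaves up to the top level.
        Assumes num_leaves is divisible by k**(levels - 1)."""
        mu, var = self.mu, self.log_sigma.exp().pow(2)
        tree = [(mu, var)]
        for _ in range(levels - 1):
            mu, var = self.parent_moments(mu, var)
            tree.append((mu, var))
        return tree
```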
4.3. Visual hierarchy decomposition

Given the pre-defined hierarchy tree T and the visual feature map v_map, we decompose v_map into L levels of visual hierarchy through the hierarchy decomposition module D, as shown in Fig. 3a. We instantiate D as a stack of two transformer decoder layers.
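A minimal PyTorch sketch of such a decomposition module is given below. It assumes the node means of the pre-defined tree serve as queries and the patch features as keys/values; the class name, dimensions, and number of heads are illustrative.

```python
import torch.nn as nn

class HierarchyDecomposition(nn.Module):
    """Illustrative hierarchy decomposition module D: a stack of two
    transformer decoder layers in which the hierarchy-node embeddings act
    as queries and the visual feature map v_map as keys/values, so each
    node attends to (and pools) the image regions it explains."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, node_queries, v_map):
        # node_queries: (num_nodes, dim) means of the pre-defined tree nodes
        # v_map:        (B, N, dim) patch-level features from the backbone
        B = v_map.size(0)
        queries = node_queries.unsqueeze(0).expand(B, -1, -1)
        return self.decoder(queries, v_map)  # (B, num_nodes, dim) hierarchy nodes
```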
4.4. Learning hierarchy in hyperbolic space

A natural characteristic of a hierarchical structure is that the number of nodes grows exponentially with depth. In practice, representing this property in Euclidean space leads to distortions in semantic distances due to its flat geometry. We propose to handle this problem by learning the hierarchical relations in hyperbolic space, whose exponentially expanding volume can efficiently represent the visual hierarchy.
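For illustration, the sketch below lifts node features onto the Poincaré ball and scores each child against its parent with an InfoNCE-style objective over hyperbolic distances. The paper's actual manifold model and hierarchical contrastive loss may differ, so the curvature, temperature, and the way negatives are sampled should be read as assumptions.

```python
import torch
import torch.nn.functional as F

def expmap0(x, c=1.0, eps=1e-6):
    """Exponential map at the origin of the Poincare ball with curvature -c:
    lifts Euclidean node features onto the hyperbolic manifold."""
    sqrt_c = c ** 0.5
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * x / (sqrt_c * norm)

def poincare_dist(x, y, c=1.0, eps=1e-6):
    """Geodesic distance on the Poincare ball with curvature -c."""
    sqrt_c = c ** 0.5
    diff2 = (x - y).pow(2).sum(dim=-1)
    denom = (1 - c * x.pow(2).sum(dim=-1)) * (1 - c * y.pow(2).sum(dim=-1))
    arg = 1 + 2 * c * diff2 / denom.clamp_min(eps)
    return torch.acosh(arg.clamp_min(1 + eps)) / sqrt_c

def hierarchical_contrastive_loss(child, parent, negatives, c=1.0, tau=0.1):
    """Hedged sketch of a hierarchical contrastive objective: each child node
    is pulled toward its own parent and pushed away from other parents,
    with all distances measured in hyperbolic space.
    child, parent: (M, D); negatives: (M, K, D) Euclidean node features."""
    child_h = expmap0(child, c)
    pos = poincare_dist(child_h, expmap0(parent, c), c)                  # (M,)
    neg = poincare_dist(child_h.unsqueeze(1), expmap0(negatives, c), c)  # (M, K)
    logits = -torch.cat([pos.unsqueeze(1), neg], dim=1) / tau
    labels = torch.zeros(child.size(0), dtype=torch.long, device=child.device)
    return F.cross_entropy(logits, labels)
```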
This paper is available on arxiv under CC BY 4.0 DEED license.