
Why Hyperbolic Space Matters for AI Scene Recognition

by Hyperbole, February 27th, 2025

Too Long; Didn't Read

Hi-Mapper enhances AI’s ability to recognize visual hierarchies by mapping feature representations into hyperbolic space. It uses a probabilistic hierarchy tree and contrastive loss to refine deep learning models, improving structured scene understanding in pre-trained neural networks.


Authors:

(1) Hyeongjun Kwon, Yonsei University;

(2) Jinhyun Jang, Yonsei University;

(3) Jin Kim, Yonsei University;

(4) Kwonyoung Kim, Yonsei University;

(5) Kwanghoon Sohn, Yonsei University and Korea Institute of Science and Technology (KIST).

Abstract and 1 Introduction

2. Related Work

3. Hyperbolic Geometry

4. Method

4.1. Overview

4.2. Probabilistic hierarchy tree

4.3. Visual hierarchy decomposition

4.4. Learning hierarchy in hyperbolic space

4.5. Visual hierarchy encoding

5. Experiments and 5.1. Image classification

5.2. Object detection and Instance segmentation

5.3. Semantic segmentation

5.4. Visualization

6. Ablation studies and discussion

7. Conclusion and References

A. Network Architecture

B. Theoretical Baseline

C. Additional Results

D. Additional visualization

4.1. Overview

Our goal is to enhance the structured understanding of pre-trained deep neural networks (DNNs) by investigating the hierarchical organization of visual scenes. To this end, we introduce the Visual Hierarchy Mapper (Hi-Mapper), which serves as a plug-and-play module for any type of pre-trained DNN. An overview of Hi-Mapper is depicted in Fig. 2a.



Hi-Mapper identifies the visual hierarchy in the visual feature map vmap and encodes the identified hierarchy back into the global visual representation vcls, enhancing recognition of the whole scene. To this end, we predefine a hierarchy tree whose nodes are modeled as Gaussian distributions, with the relations between hierarchy nodes defined through the inclusion of probability densities (Sec. 4.2). The predefined hierarchy tree interacts with vmap through the hierarchy decomposition module D, which decomposes the feature map into the visual hierarchy (Sec. 4.3). Since the zero curvature of Euclidean space is suboptimal for representing hierarchical structure, we map the visual hierarchy to hyperbolic space and optimize the relations with a novel hierarchical contrastive loss (Sec. 4.4). The visual hierarchy is then encoded back into the global visual representation vcls through the hierarchy encoding module G, resulting in an enhanced global representation (Sec. 4.5).
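The data flow above can be sketched as a composition of the two modules. The following is a minimal, purely illustrative NumPy sketch; the function bodies are simple stand-ins (the paper's D and G are attention-based modules, and the hyperbolic-space loss of Sec. 4.4 is applied only during training), and all names are our own, not the authors' implementation.

```python
import numpy as np

def decompose(nodes, v_map):
    # Stand-in for module D: each hierarchy node softly gathers
    # feature-map tokens (Hi-Mapper uses transformer decoder layers).
    sims = nodes @ v_map.T
    sims -= sims.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(sims) / np.exp(sims).sum(axis=-1, keepdims=True)
    return weights @ v_map

def encode(v_cls, hierarchy):
    # Stand-in for module G: fold the hierarchy back into the global
    # token with a residual mean-pool in place of cross-attention.
    return v_cls + hierarchy.mean(axis=0)

rng = np.random.default_rng(0)
v_cls = rng.standard_normal(64)          # global [CLS]-style representation
v_map = rng.standard_normal((196, 64))   # patch-token feature map
nodes = rng.standard_normal((8, 64))     # predefined hierarchy-node queries

v_cls_enhanced = encode(v_cls, decompose(nodes, v_map))
print(v_cls_enhanced.shape)  # (64,)
```

The point of the sketch is only the shape of the pipeline: vmap is decomposed against the tree's nodes, and the result is folded back into vcls as an enhanced global representation.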


4.2. Probabilistic hierarchy tree

The main problem with recent hierarchy-aware ViTs [18-20] is that they define the hierarchical relations between image tokens mainly through self-attention scores. Such a symmetric measure is suboptimal for representing the asymmetric inclusion relation between parent and child nodes. To address this, we propose to define an L-level hierarchy tree T by modeling each hierarchy node as a probability distribution.


Specifically, we first parameterize each leaf node (at the initial level) as a unique Gaussian distribution, and subsequently define each higher-level node as a Mixture-of-Gaussians (MoG) over its child nodes. Accordingly, the mean vector represents the cluster center of a visual semantic, and the covariance captures the scale of each semantic cluster.
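One concrete way to read this construction: if a parent is the MoG of its children, its first two moments follow directly from the children's Gaussian parameters. The sketch below (our illustration, not the authors' code) moment-matches a parent Gaussian to the mixture, so the parent's covariance grows with the spread of its children's centers, capturing the broader scale of the higher-level semantic.

```python
import numpy as np

def parent_from_children(means, covs, weights=None):
    """Moment-matched Gaussian for a Mixture-of-Gaussians parent node.

    means:   (K, d) child mean vectors (cluster centers)
    covs:    (K, d, d) child covariances (semantic scales)
    weights: (K,) mixture weights; uniform if None
    """
    K, d = means.shape
    w = np.full(K, 1.0 / K) if weights is None else weights / weights.sum()
    mu = (w[:, None] * means).sum(axis=0)                  # mixture mean
    diff = means - mu                                      # child offsets
    # mixture covariance = weighted child covariances + between-child spread
    sigma = (w[:, None, None]
             * (covs + diff[:, :, None] * diff[:, None, :])).sum(axis=0)
    return mu, sigma

# two child nodes in 2-D, unit covariance, centers 2 apart
means = np.array([[0.0, 0.0], [2.0, 0.0]])
covs = np.stack([np.eye(2), np.eye(2)])
mu, sigma = parent_from_children(means, covs)
print(mu)     # [1. 0.]
print(sigma)  # [[2. 0.] [0. 1.]] -- inflated along the axis separating the children
```

Note how the parent's covariance exceeds either child's along the direction in which the children differ: the inclusion of probability densities is exactly this "parent density covers its children" behavior.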







4.3. Visual hierarchy decomposition

Given the predefined hierarchy tree T and the visual feature map vmap, we decompose vmap into an L-level visual hierarchy through the hierarchy decomposition module D, as shown in Fig. 3a. We instantiate D as a stack of two transformer decoder layers.
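The core mechanism of a transformer decoder layer in this role is cross-attention: the tree's node embeddings act as queries and the feature-map tokens as keys and values, so each node aggregates the tokens it best explains. A single-head sketch, with projection matrices and feed-forward sublayers omitted for brevity (our simplification, not the full module D):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, feature_map):
    """Single-head cross-attention: hierarchy-node queries attend to
    the visual feature map (learned projections omitted)."""
    d = queries.shape[-1]
    scores = queries @ feature_map.T / np.sqrt(d)  # (num_nodes, num_tokens)
    attn = softmax(scores, axis=-1)                # soft token assignment per node
    return attn @ feature_map                      # decomposed node features

rng = np.random.default_rng(0)
nodes = rng.standard_normal((8, 64))     # leaf-level node queries from T
v_map = rng.standard_normal((196, 64))   # ViT-style patch tokens
hierarchy = cross_attention(nodes, v_map)
print(hierarchy.shape)  # (8, 64)
```

Stacking two such layers lets the node features refine their token assignments once before being passed on to the hyperbolic-space stage.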



4.4. Learning hierarchy in hyperbolic space

A natural characteristic of hierarchical structure is that the number of nodes grows exponentially with depth. Representing this property in Euclidean space distorts semantic distances because of its flat geometry. We therefore propose to learn the hierarchical relations in hyperbolic space, whose exponentially expanding volume can efficiently represent the visual hierarchy.
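To make the "exponentially expanding volume" concrete: a common recipe (e.g., on the Poincaré ball; we are not asserting this is the paper's exact formulation) lifts Euclidean embeddings into hyperbolic space with the exponential map at the origin and measures geodesic distances there, which blow up near the boundary and so leave room for exponentially many leaves.

```python
import numpy as np

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball (curvature -c):
    lifts a Euclidean tangent vector into hyperbolic space."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True).clip(min=1e-9)
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def poincare_dist(x, y, c=1.0):
    """Geodesic distance on the Poincare ball."""
    diff2 = np.sum((x - y) ** 2, axis=-1)
    denom = (1 - c * np.sum(x**2, -1)) * (1 - c * np.sum(y**2, -1))
    return np.arccosh(1 + 2 * c * diff2 / denom) / np.sqrt(c)

parent = expmap0(np.array([0.1, 0.0]))  # near the origin: abstract node
child = expmap0(np.array([2.0, 0.0]))   # near the boundary: specific node
print(poincare_dist(parent, child))      # far exceeds the Euclidean gap
```

A hierarchical contrastive loss can then be built directly on these distances, e.g., pulling each child toward its parent while pushing it away from non-ancestor nodes; the paper's specific loss is defined in Sec. 4.4.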







4.5. Visual hierarchy encoding



This paper is available on arxiv under CC BY 4.0 DEED license.