HyRo

Semantic Alignment in Hyperbolic Space for Open-Vocabulary Semantic Segmentation

CVPR 2026 - PVUW Workshop

Hoang M. Truong   Hai Nguyen-Truong   Dang Huynh

Fulbright University Vietnam

HyRo Teaser

HyRo rotates the text embeddings to achieve a smaller angle (β) relative to the target image embeddings compared to the initial angle (α). This geometric adjustment enables the model to resolve semantic ambiguities and produce more accurate, fine-grained segmentations.

Abstract

Open-vocabulary semantic segmentation requires adapting image-level vision-language models such as CLIP to dense pixel-level prediction, which is challenging due to the mismatch between hierarchical structure and semantic alignment. While recent works leverage hyperbolic geometry to model hierarchical relationships, they primarily focus on aligning hierarchical depth and overlook semantic misalignment among embeddings at the same level. In this work, we propose HyRo, a hyperbolic fine-tuning framework that decouples hierarchical and semantic alignment in the Poincaré ball model. HyRo adjusts the hyperbolic radius to match hierarchical levels and refines semantic relationships by optimizing angular alignment through an orthogonal transformation that is theoretically proven to preserve the radius. Extensive experiments on standard open-vocabulary semantic segmentation benchmarks demonstrate that HyRo achieves competitive performance.

Methodology

HyRo Rotation Approach

Hyperbolic Rotation (HyRo). Embeddings are rotated around the origin in hyperbolic space using an orthogonal block matrix to minimize the angle between visual and textual features while preserving their hyperbolic radii, thereby enhancing cross-modal semantic alignment.
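
The radius-preserving property described above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the helper names (`block_rotation`, `hyperbolic_radius`, `angle`), the curvature value c = 1, the 4-dimensional toy vectors, and the hand-picked rotation angles are all assumptions made for illustration. In HyRo the rotation would be learned rather than fixed by hand.

```python
import numpy as np

def block_rotation(thetas):
    """Build a block-diagonal orthogonal matrix from 2x2 planar rotations."""
    d = 2 * len(thetas)
    R = np.zeros((d, d))
    for i, t in enumerate(thetas):
        c, s = np.cos(t), np.sin(t)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

def hyperbolic_radius(x, c=1.0):
    """Distance from the origin in the Poincare ball with curvature -c."""
    return 2.0 / np.sqrt(c) * np.arctanh(np.sqrt(c) * np.linalg.norm(x))

def angle(u, v):
    """Angle between two vectors, measured at the origin."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Toy embeddings inside the unit ball (illustrative values only).
text = np.array([0.30, 0.00, 0.10, 0.20])
image = np.array([0.00, 0.40, 0.20, 0.10])

# Hand-picked angles: rotate the first 2D block by 90 degrees, leave the second.
R = block_rotation([np.pi / 2, 0.0])
rotated = R @ text

alpha = angle(text, image)     # initial angle
beta = angle(rotated, image)   # angle after rotation

print("R is orthogonal:", np.allclose(R.T @ R, np.eye(4)))
print("radius before / after:", hyperbolic_radius(text), hyperbolic_radius(rotated))
print("alpha, beta (radians):", alpha, beta)
```

Because an orthogonal matrix leaves the Euclidean norm unchanged, and the distance from the origin in the Poincaré ball depends only on that norm, the hyperbolic radius is the same before and after the rotation while the angle to the image embedding shrinks.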


HyRo Overall Architecture

Overall Architecture. Given image and text inputs, Euclidean embeddings are mapped to the Poincaré ball via the exponential map. HyRo then decouples alignment into two stages:

  • Hierarchical Adjustment: Uses block-diagonal radius scaling matrices to align granularity.
  • Semantic Refinement: Uses orthogonal rotation matrices to adjust angular relationships without altering the radius.

The refined hyperbolic embeddings are mapped back to the tangent space for decoding.
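
The pipeline above can be sketched end to end as follows. This is a hedged illustration under stated assumptions, not the paper's code: it assumes curvature c = 1, assumes the matrices act on ball points through the standard Möbius matrix-vector product (apply the matrix in the tangent space at the origin, then map back), and uses hypothetical values for the scaling matrix `S` and rotation matrix `R`, which HyRo would learn.

```python
import numpy as np

def expmap0(v, c=1.0):
    """Exponential map at the origin: tangent vector -> Poincare ball."""
    n = np.linalg.norm(v)
    if n == 0:
        return v
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def logmap0(x, c=1.0):
    """Logarithmic map at the origin: Poincare ball -> tangent vector."""
    n = np.linalg.norm(x)
    if n == 0:
        return x
    return np.arctanh(np.sqrt(c) * n) * x / (np.sqrt(c) * n)

def mobius_matvec(M, x, c=1.0):
    """Mobius matrix-vector product, written as exp0(M @ log0(x))."""
    return expmap0(M @ logmap0(x, c), c)

def hyperbolic_radius(x, c=1.0):
    """Distance from the origin in the Poincare ball with curvature -c."""
    return 2.0 / np.sqrt(c) * np.arctanh(np.sqrt(c) * np.linalg.norm(x))

# A toy Euclidean embedding (illustrative values only).
euclid = np.array([0.3, 0.0, 0.1, 0.2])
x = expmap0(euclid)

# Stage 1, hierarchical adjustment: hypothetical block-diagonal scaling.
S = np.diag([1.5, 1.5, 0.8, 0.8])
scaled = mobius_matvec(S, x)

# Stage 2, semantic refinement: hypothetical rotation of the first 2D block.
theta = np.pi / 6
R = np.eye(4)
R[:2, :2] = [[np.cos(theta), -np.sin(theta)],
             [np.sin(theta), np.cos(theta)]]
refined = mobius_matvec(R, scaled)

# Map back to the tangent space for decoding.
out = logmap0(refined)

print("radius after exp map:  ", hyperbolic_radius(x))
print("radius after scaling:  ", hyperbolic_radius(scaled))
print("radius after rotation: ", hyperbolic_radius(refined))
```

In this sketch the scaling stage changes the hyperbolic radius and the rotation stage leaves it untouched, which mirrors the decoupling of hierarchical and semantic alignment described above.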

Qualitative Results

Qualitative comparison on the ADE20K dataset with 847 categories (A-847).

Qualitative comparison on A-847

Qualitative results on the ADE20K dataset with 847 categories.

Qualitative results on A-847