Hyperbolic Rotation (HyRo). Embeddings are rotated around the origin in hyperbolic space using an orthogonal block matrix to minimize the angle between visual and textual features while preserving their hyperbolic radii, thereby enhancing cross-modal semantic alignment.
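The rotation step can be illustrated with a minimal NumPy sketch. It rests on one standard fact: an orthogonal matrix preserves the Euclidean norm of a point in the Poincaré ball, and the hyperbolic distance from the origin depends only on that norm, so the hyperbolic radius is unchanged. Several choices here are hypothetical and not taken from the method. The block structure uses 2x2 Givens-style rotation blocks (`block_rotation`). Only the visual embedding is rotated. The per-block angles come from a closed-form heuristic (`align_angles`) rather than being learned. The curvature parameter `c` defaults to 1.

```python
import numpy as np

def block_rotation(thetas):
    """Block-diagonal orthogonal matrix built from 2x2 rotation blocks (hypothetical parameterization)."""
    d = 2 * len(thetas)
    R = np.zeros((d, d))
    for i, th in enumerate(thetas):
        c, s = np.cos(th), np.sin(th)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

def poincare_dist0(x, c=1.0):
    """Hyperbolic distance from the origin in the Poincare ball of curvature -c."""
    sc = np.sqrt(c)
    return 2.0 / sc * np.arctanh(sc * np.linalg.norm(x))

def align_angles(v, t):
    """Hypothetical closed-form angles: rotate each 2D slice of v onto the direction of the matching slice of t."""
    v2, t2 = v.reshape(-1, 2), t.reshape(-1, 2)
    return np.arctan2(t2[:, 1], t2[:, 0]) - np.arctan2(v2[:, 1], v2[:, 0])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Usage: a visual point with norm 0.5 and a textual point with norm 0.3, both inside the unit ball.
rng = np.random.default_rng(0)
v = rng.normal(size=8); v = 0.5 * v / np.linalg.norm(v)
t = rng.normal(size=8); t = 0.3 * t / np.linalg.norm(t)
R = block_rotation(align_angles(v, t))
v_rot = R @ v
print(np.allclose(R @ R.T, np.eye(8)))                        # orthogonality
print(np.isclose(poincare_dist0(v_rot), poincare_dist0(v)))   # hyperbolic radius preserved
print(cosine(v_rot, t) >= cosine(v, t))                       # angle to the text embedding not increased
```

Once each 2D slice of the rotated visual vector points in the same direction as the corresponding slice of the text vector, the cosine becomes a sum of non-negative per-block products. It therefore can never fall below the original cosine. In a trained model the angles would more plausibly be learned parameters.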
Overall Architecture. Given image and text inputs, Euclidean embeddings are mapped to the Poincaré ball via the exponential map. HyRo then decouples alignment into two stages:
The refined hyperbolic embeddings are mapped back to the tangent space for decoding.
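The mapping into the Poincaré ball and back to the tangent space can be sketched with the standard exponential and logarithmic maps at the origin. This is a minimal NumPy illustration under stated assumptions. The caption does not name the base point, so the origin is assumed. The curvature `-c` (default `c=1`) and the numerical guards `eps` and the clipping bound are hypothetical choices. `expmap0` and `logmap0` are illustrative names, not the authors' code.

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-15):
    """Exponential map at the origin: Euclidean (tangent) vector -> point in the Poincare ball of radius 1/sqrt(c)."""
    sc = np.sqrt(c)
    n = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(sc * n) * v / (sc * n)

def logmap0(x, c=1.0, eps=1e-15):
    """Logarithmic map at the origin: point in the ball -> tangent vector (inverse of expmap0)."""
    sc = np.sqrt(c)
    # Clip the norm strictly inside the ball so arctanh stays finite.
    n = np.clip(np.linalg.norm(x, axis=-1, keepdims=True), eps, (1.0 - 1e-7) / sc)
    return np.arctanh(sc * n) * x / (sc * n)

# Usage: a batch of hypothetical Euclidean features mapped into the ball and back.
rng = np.random.default_rng(0)
feats = 0.5 * rng.normal(size=(4, 16))
h = expmap0(feats)                          # hyperbolic embeddings
print(np.linalg.norm(h, axis=-1) < 1.0)     # all points lie inside the unit ball
print(np.allclose(logmap0(h), feats))       # round trip recovers the tangent vectors
```

Because both maps act only on the vector norm and keep the direction, they commute with any orthogonal matrix. A rotation applied in the ball therefore corresponds to the same rotation applied in the tangent space.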
Qualitative comparison on the ADE20K dataset with 847 categories.
Qualitative results on the ADE20K dataset with 847 categories.