VideoRoPE: What Makes for Good Video Rotary Position Embedding?


Abstract

While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, extending 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors to V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination.

VideoRoPE: Key Characteristics

| Method       | 2D/3D Structure | Frequency Allocation | Spatial Symmetry | Temporal Index Scaling |
|--------------|-----------------|----------------------|------------------|------------------------|
| Vanilla RoPE | ✗               | ✗                    | ✗                | ✗                      |
| TAD-RoPE     | ✗               | ✗                    | ✗                | ✔︎                      |
| RoPE-Tie     | ✔︎               | ✗                    | ✔︎                | ✗                      |
| M-RoPE       | ✔︎               | ✗                    | ✗                | ✗                      |
| VideoRoPE    | ✔︎               | ✔︎                    | ✔︎                | ✔︎                      |
The table compares RoPE variants for Video Large Language Models (Video LLMs) across four properties: 2D/3D Structure, Frequency Allocation, Spatial Symmetry, and Temporal Index Scaling.

VideoRoPE: Analysis

Left: To demonstrate the importance of frequency allocation, we build on V-NIAH (a) to present the more challenging V-NIAH-D task (b), in which similar images are inserted as distractors. Right: Compared to M-RoPE, our VideoRoPE is more robust in retrieval and less affected by distractors.
Attention-based frequency allocation analysis. Middle: M-RoPE's temporal dimension (t) is limited to local information, resulting in a diagonal layout. Bottom: VideoRoPE effectively retrieves the needle using the temporal dimension. The x and y coordinates represent the video frame number, e.g., 50 for 50 frames.
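To make the distractor effect concrete: a rotary channel with angle θ repeats its phase every 2π/θ positions, so a distractor placed one full period after the needle looks almost identical to that channel. Below is a minimal Python sketch, assuming the standard RoPE schedule with base 10000 and head dimension 128; the channel choice and frame indices are purely illustrative.

import numpy as np

# Standard RoPE angle schedule: theta_i = base ** (-2i / d).
# base = 10000 and head dim d = 128 are assumed defaults, not from this page.
base, d = 10000.0, 128
theta = base ** (-2 * np.arange(d // 2) / d)

ch = 15                                  # a high-frequency pair; per the paper,
period = 2 * np.pi / theta[ch]           # M-RoPE's temporal section spans the
needle_t = 100                           # first 16 pairs (~54-frame period here)
distractor_t = needle_t + round(period)  # distractor one full period later

# Smallest angular distance between the two phases in this channel.
raw = (theta[ch] * (distractor_t - needle_t)) % (2 * np.pi)
gap = min(raw, 2 * np.pi - raw)
print(f"period ~ {period:.1f} frames, phase gap ~ {gap:.3f} rad")
# ~0.05 rad: this channel places the distractor almost exactly on top of the
# needle, which is how period-spaced distractors mislead retrieval.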

VideoRoPE: Design

We present VideoRoPE, a video position embedding strategy that prioritizes temporal modeling. Low-frequency Temporal Allocation (LTA) reduces oscillations and ensures robustness, a Diagonal Layout (DL) preserves spatial symmetry, and Adjustable Temporal Spacing (ATS) provides a tunable scaling of temporal indices. Together, these components let VideoRoPE model spatio-temporal information effectively for robust video representation.
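As a concrete illustration, here is a minimal sketch of how the three components could translate into per-token position indices. The exact offsets, the default delta, and the function name are our assumptions based on the figures, not the reference implementation.

# Hedged sketch of VideoRoPE-style (t, x, y) indices; offsets and the
# default delta are assumptions read off the figures, not reference code.
def video_positions(t_start, n_frames, height, width, delta=2.0):
    """Return an illustrative (t, x, y) index for every visual token.

    ATS: the temporal index advances by a tunable delta per frame.
    DL:  spatial indices are centered on the temporal index, so they grow
         along the diagonal at the same rate as vanilla RoPE's 1D index.
    LTA: (applied at the channel level, not here) the t component is rotated
         with the lowest-frequency pairs, x/y with higher-frequency ones.
    """
    positions = []
    for f in range(n_frames):
        t = t_start + delta * f
        for h in range(height):
            for w in range(width):
                x = t + (w - (width - 1) / 2)    # horizontal, centered on t
                y = t + (h - (height - 1) / 2)   # vertical, centered on t
                positions.append((t, x, y))
    return positions

# Text tokens before/after the clip keep ordinary 1D indices; e.g. a 2-frame,
# 2x2-token clip starting right after text position 9:
for p in video_positions(t_start=10, n_frames=2, height=2, width=2):
    print(p)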

(a) M-RoPE models temporal dependencies using the first 16 rotary angles, which exhibit higher frequencies and more pronounced oscillations. (b) In contrast, VideoRoPE models temporal dependencies using the last 16 rotary angles, characterized by significantly wider, monotonic intervals. Our frequency allocation effectively mitigates the misleading influence of distractors in V-NIAH-D.
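A quick numeric check makes this contrast concrete (same assumed defaults as above, base 10000 and head dimension 128; the 1000-frame video length is hypothetical).

import numpy as np

# Full phase rotations each rotary pair completes over a 1000-frame video.
base, d, T = 10000.0, 128, 1000
theta = base ** (-2 * np.arange(d // 2) / d)
rotations = theta * T / (2 * np.pi)

print(f"pair  0 (first 16): {rotations[0]:6.1f} rotations")   # ~159.2, wraps constantly
print(f"pair 15 (first 16): {rotations[15]:6.1f} rotations")  # ~18.4, still periodic
print(f"pair 48 (last 16):  {rotations[48]:6.2f} rotations")  # ~0.16, near-monotonic
print(f"pair 63 (last 16):  {rotations[63]:6.2f} rotations")  # ~0.02, monotonic

The first 16 pairs wrap many times within a single video, which is exactly the periodicity V-NIAH-D exploits; the last 16 complete only a small fraction of one rotation, so their temporal phase stays monotonic across the whole context.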
3D visualization of different position embeddings. (a) Vanilla 1D RoPE does not incorporate spatial modeling. (b) M-RoPE, while adopting a 3D structure, introduces a discrepancy in index growth for visual tokens across frames, with some indices remaining constant. (c) In contrast, our VideoRoPE achieves the desired balance, maintaining the consistent index growth pattern of vanilla RoPE while simultaneously incorporating spatial modeling.
The position embeddings of adjacent text tokens for vanilla RoPE (top row), and of corresponding visual tokens in adjacent frames for M-RoPE (middle row) and our VideoRoPE (bottom row), which uses the interleaved spatial and temporal-last design.

📊 Performance

We visualize the retrieval results for V-NIAH and V-NIAH-D. VideoRoPE outperforms the other RoPE variants, particularly in extrapolation to longer test contexts; the color gradient from green to red represents retrieval performance from perfect to zero.


VideoRoPE consistently outperforms previous RoPE variants on diverse downstream benchmarks, including long video retrieval, video understanding, and video hallucination.

BibTeX

If you find our work helpful for your research, please consider citing it 📃


@article{wei2025videorope,
  title={VideoRoPE: What Makes for Good Video Rotary Position Embedding?},
  author={Wei, Xilin and Liu, Xiaoran and Zang, Yuhang and Dong, Xiaoyi and Zhang, Pan and Cao, Yuhang and Tong, Jian and Duan, Haodong and Guo, Qipeng and Wang, Jiaqi and others},
  journal={arXiv preprint arXiv:2502.05173},
  year={2025}
}
      

Acknowledgement

This website is adapted from Nerfies and LLaVA, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, as well as the open-source projects Alpaca and Vicuna.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. Their use is further restricted to purposes that comply with the license agreements of CLIP, LLaMA, Vicuna, and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset must not be used outside of research purposes.