RoPE (Rotary Position Embedding) is a positional encoding method that applies a rotation to the input embeddings based on their positions in the sequence. The key idea is to encode positional information by rotating the embedding vectors, which preserves the inner-product structure and allows the model to capture relative positions.
It can be seen as multiplicative, whereas adding absolute positional embeddings is additive.
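To make the contrast concrete, here is a minimal sketch (the symbols $p_t$ and $R_t$ are illustrative names; $R_t$ is defined formally below):

$$\underbrace{x'_t = x_t + p_t}_{\text{absolute positional embedding (additive)}} \qquad\qquad \underbrace{x'_t = R_t\, x_t}_{\text{RoPE (multiplicative)}}$$

where $x_t$ is the embedding of the token at position $t$, $p_t$ is a position vector, and $R_t$ is a position-dependent rotation matrix.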
Let $x \in \mathbb{R}^d$ be an input embedding vector of size $d$. For a sequence of tokens, each token is associated with a position index $t$.
The embedding vector (corresponding to one token in the sequence) is split into $d/2$ pairs of dimensions $(x_{2i}, x_{2i+1})$, for $i = 0, \dots, d/2 - 1$ (if $d$ is odd, we just pad the embedding).
For each pair of dimensions $(x_{2i}, x_{2i+1})$, a rotation matrix $R_{t,i}$ is applied based on the position $t$ of the token within the sequence, defined as:

$$R_{t,i} = \begin{pmatrix} \cos(t\,\theta_i) & -\sin(t\,\theta_i) \\ \sin(t\,\theta_i) & \cos(t\,\theta_i) \end{pmatrix}$$

where $\theta_i$ is a frequency parameter that depends on the dimension index $i$ (not the token position). These frequencies are chosen to form a geometric progression. We will say more about this frequency parameter in the next section.
Each pair is multiplied by the matrix $R_{t,i}$:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(t\,\theta_i) & -\sin(t\,\theta_i) \\ \sin(t\,\theta_i) & \cos(t\,\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$
After rotating all the pairs, the final rotated embedding $x'$ is obtained by concatenating them.
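As a quick sanity check, here is a minimal sketch (toy values chosen arbitrarily, not from the original text) that applies the pairwise rotation to a 4-dimensional vector at position $t = 3$ and verifies that the magnitude is preserved:

import math
import torch

d, t = 4, 3  # toy embedding size and token position
x = torch.tensor([1.0, 2.0, 3.0, 4.0])

x_rot = torch.empty_like(x)
for i in range(d // 2):
    theta_i = 10000 ** (-2 * i / d)  # frequency for this pair of dimensions
    c, s = math.cos(t * theta_i), math.sin(t * theta_i)
    x_rot[2 * i] = c * x[2 * i] - s * x[2 * i + 1]
    x_rot[2 * i + 1] = s * x[2 * i] + c * x[2 * i + 1]

# Rotation preserves the vector's magnitude
print(torch.linalg.norm(x).item(), torch.linalg.norm(x_rot).item())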
The rotation preserves the inner-product structure: the inner product between two rotated embeddings depends only on their relative position, which is crucial for the dot-product attention mechanism in transformers. For two embeddings $q$ and $k$ at respective positions $t_1$ and $t_2$ in the sequence, the inner product after rotation is:

$$\langle R_{t_1}\, q,\; R_{t_2}\, k \rangle = q^\top R_{t_1}^\top R_{t_2}\, k = q^\top R_{t_2 - t_1}\, k$$

where $R_t$ denotes the full $d \times d$ rotation built from the per-pair blocks $R_{t,i}$; the result depends on $t_1$ and $t_2$ only through their difference.
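The last step uses the composition rule for 2D rotations, which each $2 \times 2$ block satisfies. Spelling it out for one block (a short derivation added here for completeness):

$$R_{t_1,i}^\top R_{t_2,i} = \begin{pmatrix} \cos(t_1\theta_i) & \sin(t_1\theta_i) \\ -\sin(t_1\theta_i) & \cos(t_1\theta_i) \end{pmatrix} \begin{pmatrix} \cos(t_2\theta_i) & -\sin(t_2\theta_i) \\ \sin(t_2\theta_i) & \cos(t_2\theta_i) \end{pmatrix} = \begin{pmatrix} \cos((t_2 - t_1)\theta_i) & -\sin((t_2 - t_1)\theta_i) \\ \sin((t_2 - t_1)\theta_i) & \cos((t_2 - t_1)\theta_i) \end{pmatrix} = R_{t_2 - t_1,\, i}$$

by the angle-addition identities.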
The rotation operation is computationally efficient and can be implemented using simple matrix multiplications. It does not require additional parameters, unlike learned positional embeddings.
The frequency parameter $\theta_i$ determines how quickly a specific pair of dimensions rotates. It is typically defined as:

$$\theta_i = 10000^{-2i/d}, \qquad i = 0, 1, \dots, d/2 - 1$$
This ensures that the frequencies form a geometric progression, allowing the model to capture both short-range and long-range dependencies.
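As an illustration (a small sketch, not part of the original text), the frequencies and the corresponding wavelengths $2\pi/\theta_i$ for a toy dimension $d = 8$:

import math

d = 8
for i in range(d // 2):
    theta_i = 10000 ** (-2 * i / d)
    # Consecutive frequencies differ by a constant ratio (geometric progression);
    # the wavelength 2*pi/theta_i is the position difference after which the pair
    # completes a full rotation.
    print(f"i={i}  theta_i={theta_i:.6f}  wavelength={2 * math.pi / theta_i:.1f}")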
Using complex numbers makes it easier to reason about the rotation. First, group each pair of dimensions into a complex number (here $\mathrm{i}$ denotes the imaginary unit):

$$z_i = x_{2i} + \mathrm{i}\, x_{2i+1}$$

Apply the rotation:

$$z'_i = z_i\, e^{\mathrm{i}\, t\, \theta_i}$$

where $t$ is the token position.
We can see that, taking the dot product between two vectors $q$ and $k$ at positions $t_1$ and $t_2$ respectively (per pair of dimensions):

$$\langle q_i\, e^{\mathrm{i}\, t_1 \theta_i},\; k_i\, e^{\mathrm{i}\, t_2 \theta_i} \rangle = q_i\, \overline{k_i}\, e^{\mathrm{i}\, (t_1 - t_2)\, \theta_i}$$

(the minus sign appears because the inner product in complex space requires taking the conjugate of the second term)
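The equivalence between complex multiplication by $e^{\mathrm{i}\, t\, \theta}$ and the $2 \times 2$ rotation can be checked numerically; the following sketch (added here, not part of the original post) rotates the same pair both ways and compares the results:

import math
import torch

t, theta = 5, 0.3            # arbitrary position and frequency
x1, x2 = 1.5, -0.7           # one pair of embedding dimensions

# 2x2 rotation matrix applied to the pair
R = torch.tensor([[math.cos(t * theta), -math.sin(t * theta)],
                  [math.sin(t * theta),  math.cos(t * theta)]])
rotated_real = R @ torch.tensor([x1, x2])

# Same rotation as a complex multiplication z * exp(i*t*theta)
z = torch.complex(torch.tensor(x1), torch.tensor(x2))
rotated_complex = z * torch.exp(torch.tensor(t * theta * 1j))

print(torch.allclose(rotated_real,
                     torch.stack([rotated_complex.real, rotated_complex.imag])))  # True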
When $\theta_i$ increases, differences in positions lead to larger phase shifts, which means that the attention score changes quickly as the position difference increases. Since $\theta_i$ decreases with the dimension index $i$, lower dimensions rotate faster and thus put more importance on nearby tokens (their attention scores drop quickly with increasing position differences), attending to short-range dependencies (syntax, nearby words). Conversely, higher dimensions rotate more slowly, and their attention scores stay stable across long distances, attending to long-range dependencies (coreference, topic continuity).
Since slow-rotating dimensions remain stable over large distances, RoPE models naturally generalize to sequences longer than those seen during training.
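To illustrate the fast- versus slow-rotating behaviour, here is a small sketch (values assumed for illustration, not from the original text) comparing the phase factor $\cos(\Delta\, \theta_i)$ of the fastest and slowest pairs over a range of relative distances $\Delta$:

import torch

d = 128
i_fast, i_slow = 0, d // 2 - 1
theta_fast = 10000 ** (-2 * i_fast / d)  # ~1.0: completes a rotation every ~6 positions
theta_slow = 10000 ** (-2 * i_slow / d)  # ~1e-4: barely rotates over thousands of positions

deltas = torch.tensor([1.0, 10.0, 100.0, 1000.0])
print(torch.cos(deltas * theta_fast))  # oscillates: sensitive to small position differences
print(torch.cos(deltas * theta_slow))  # stays close to 1: stable over long distances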
The need to create pairs of dimensions in Rotary Position Embedding (RoPE) arises from the mathematical formulation of the rotation operation in high-dimensional spaces.
Rotation is a well-defined operation in 2D space. For a 2D vector, a rotation by an angle $\theta$ can be represented using a rotation matrix which rotates the vector by $\theta$ radians while preserving its magnitude.
In high-dimensional spaces (e.g. $\mathbb{R}^d$, where $d$ is the embedding dimension), rotation is not as straightforward as in 2D. However, we can still perform rotations by applying 2D rotations to pairs of dimensions. This is because a rotation in high-dimensional space can be decomposed into a series of independent 2D rotations applied to pairs of dimensions.
Without pairing, it would be unclear how to generalize the rotation operation to high-dimensional spaces in a way that preserves the desired properties (e.g., preserving inner products); we would need to define a more complex high-dimensional rotation, which would be computationally expensive and harder to implement. Pairing keeps the rotation efficient, since each pair only requires a small 2×2 matrix multiplication that is easy to parallelize, and makes the operation more interpretable.
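To make the decomposition concrete, here is a small sketch (added for illustration) that builds the full $d \times d$ rotation as a block-diagonal matrix of 2×2 rotations and checks that it is orthogonal, i.e. that it preserves inner products:

import math
import torch

d, t = 8, 4  # toy embedding size and token position

# One 2x2 rotation block per pair of dimensions
blocks = []
for i in range(d // 2):
    a = t * 10000 ** (-2 * i / d)
    blocks.append(torch.tensor([[math.cos(a), -math.sin(a)],
                                [math.sin(a),  math.cos(a)]]))
R_t = torch.block_diag(*blocks)  # full d x d rotation matrix

# Orthogonality: R_t^T R_t = I, so <R_t q, R_t k> = <q, k>
print(torch.allclose(R_t.T @ R_t, torch.eye(d), atol=1e-6))

q, k = torch.randn(d), torch.randn(d)
print(torch.allclose((R_t @ q) @ (R_t @ k), q @ k, atol=1e-4))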
Naive implementation:
import torch

def rope_naive(x, t):
    """Rotate each pair of dimensions of x by t * theta_i, with an explicit loop.

    x: tensor of shape [..., d]; t: tensor of position indices,
    broadcastable against x[..., 0] (e.g. shape [seq_len]).
    """
    d = x.shape[-1]
    x_rotated = torch.zeros_like(x)
    for i in range(d // 2):
        # Frequency for this pair of dimensions (geometric progression)
        theta_i = 10000 ** (-2 * i / d)
        cos_t = torch.cos(t * theta_i)
        sin_t = torch.sin(t * theta_i)
        x1 = x[..., 2 * i]
        x2 = x[..., 2 * i + 1]
        # 2x2 rotation applied to the pair (x1, x2)
        x_rotated[..., 2 * i] = x1 * cos_t - x2 * sin_t
        x_rotated[..., 2 * i + 1] = x1 * sin_t + x2 * cos_t
    return x_rotated
Efficient implementation:
def rope_efficient(x, t):
    """Same rotation as rope_naive, vectorized via complex multiplication.

    x: tensor of shape [..., seq_len, d] with d even; t: tensor of positions, shape [seq_len].
    """
    d = x.shape[-1]
    # View each consecutive pair of dimensions as one complex number: [..., seq_len, d//2]
    x_complex = torch.view_as_complex(x.reshape(*x.shape[:-1], -1, 2))
    # Precompute rotation factors (in practice cached during initialization)
    freqs = 10000 ** (-2 * torch.arange(d // 2) / d)
    theta = t.unsqueeze(-1) * freqs.unsqueeze(0)  # shape [seq_len, d//2]
    rotation_factors = torch.cos(theta) + 1j * torch.sin(theta)  # e^{i * t * theta_i}
    # Apply the rotation as an element-wise complex multiplication
    x_rotated_complex = x_complex * rotation_factors
    # Convert back to real numbers and interleave the pairs again
    x_rotated = torch.view_as_real(x_rotated_complex).flatten(-2)
    return x_rotated
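A quick usage example (added here as a sketch) checking that the two implementations agree on a random input:

seq_len, d = 16, 64
x = torch.randn(seq_len, d)
t = torch.arange(seq_len)  # one position index per token

out_naive = rope_naive(x, t)
out_efficient = rope_efficient(x, t)
print(torch.allclose(out_naive, out_efficient, atol=1e-5))  # True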
For a more formal derivation, see this blog post by Eleuther AI: https://blog.eleuther.ai/rotary-embeddings/