rotary-positional-embeddings

RoPE is a positional encoding method that applies a rotation operation to the input embeddings based on their positions in the sequence. The key idea is to encode positional information by rotating the embedding vectors, which preserves the inner product structure and allows the model to capture relative positions.

RoPE can be seen as a multiplicative positional encoding, whereas absolute positional embeddings are additive (they are summed with the token embeddings).
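Schematically, with $p_t$ an absolute position vector (learned or sinusoidal) and $R_t$ the rotation defined below:

$$\text{absolute PE:}\;\; x_t \mapsto x_t + p_t \qquad\qquad \text{RoPE:}\;\; x_t \mapsto R_t\, x_t$$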

Mathematical Formulation of RoPE

Let $x \in \mathbb{R}^d$ be an input embedding vector of size $d$. For a sequence of tokens, each token is associated with a position index $t$.

The embedding vector $x$ (corresponding to one token in the sequence) is split into pairs of dimensions $(x_1, x_2), (x_3, x_4), \dots$ (if $d$ is odd, we just pad the embedding).

For each pair of dimensions $(x_{2i-1}, x_{2i})$, a rotation matrix $R_t$ is applied based on the position $t$ of the token within the sequence, defined as:

$$R_t = \begin{bmatrix} \cos(t\theta_i) & -\sin(t\theta_i) \\ \sin(t\theta_i) & \cos(t\theta_i) \end{bmatrix}$$

where $\theta_i$ is a frequency parameter that depends on the dimension index $i$ (not on the token position). These frequencies are chosen to form a geometric progression. We will say more about this frequency parameter in the next section.

Each pair is multiplied by the matrix $R_t$: $(x_{2i-1}', x_{2i}') = R_t \cdot (x_{2i-1}, x_{2i})$

After rotating all the pairs, the final rotated embedding is obtained by concatenating the rotated pairs.
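Written out component-wise, each rotated pair is:

$$(x_{2i-1}', x_{2i}') = \big(x_{2i-1}\cos(t\theta_i) - x_{2i}\sin(t\theta_i),\; x_{2i-1}\sin(t\theta_i) + x_{2i}\cos(t\theta_i)\big)$$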

The rotation operation preserves the inner product between two embeddings, which is crucial for the dot-product attention mechanism in transformers. For two embeddings $x$ and $y$ at respective positions $t$ and $s$ in the sequence, the inner product after rotation is:

$$\langle x_t, y_s \rangle = \langle R_t x, R_s y \rangle = \langle x, R_{-t} R_s y \rangle = \langle x, R_{s-t} y \rangle$$
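A minimal numeric sanity check of this relative-position property (a sketch using a single 2D pair and a hypothetical helper `rot`):

```python
import math
import torch

def rot(angle):
    # Hypothetical helper: the 2x2 rotation matrix for a single RoPE pair
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s], [s, c]])

theta = 0.5                      # one frequency theta_i
t, s = 3, 7                      # two positions in the sequence
x = torch.tensor([1.0, 2.0])     # one pair of dimensions of x
y = torch.tensor([-0.5, 1.5])    # one pair of dimensions of y

lhs = torch.dot(rot(t * theta) @ x, rot(s * theta) @ y)
rhs = torch.dot(x, rot((s - t) * theta) @ y)
print(torch.allclose(lhs, rhs))  # True: the inner product depends only on s - t
```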

The rotation operation is computationally efficient and can be implemented using simple matrix multiplications. It does not require additional parameters, unlike learned positional embeddings.

Frequency Parameter $\theta_i$

The frequency parameter determines how quickly a specific dimension $i$ rotates. It is typically defined as:

$$\theta_i = \frac{1}{10000^{2i/d}}, \quad i \geq 0$$

This ensures that the frequencies form a geometric progression, allowing the model to capture both short-range and long-range dependencies.
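For example, for a head dimension of $d = 64$ (a small sketch of the resulting spectrum):

```python
import torch

d = 64
i = torch.arange(d // 2)
theta = 10000.0 ** (-2 * i / d)        # geometric progression from 1.0 down to ~1e-4
wavelength = 2 * torch.pi / theta      # positions needed for one full rotation of each pair
print(theta[0].item(), theta[-1].item())             # ~1.0 and ~1.3e-4
print(wavelength[0].item(), wavelength[-1].item())   # ~6.3 and ~4.7e4 positions
```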

Using complex numbers makes it easier to reason about the rotation. First define:

$$z_i = x_{2i} + i\, x_{2i+1}$$

Apply the rotation:

$$\tilde{z}_i = z_i e^{i t \theta_i}$$

where $t$ is the token position.

Taking the (complex) inner product between two vectors $q$ and $k$ at positions $m$ and $n$ respectively, and writing $q_j$, $k_j$ for their complex pairs as above, we can see that:

$$\langle \mathrm{RoPE}(q, m), \mathrm{RoPE}(k, n) \rangle = \sum_{j=1}^{d/2} q_j\, \overline{k_j}\; e^{i(m-n)\theta_j}$$

(the minus sign appears because the inner product in complex space conjugates the second argument), so the attention score depends on the positions only through the relative offset $m - n$.
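Spelled out for a single pair $j$, writing $\tilde{q}_j = q_j e^{i m \theta_j}$ and $\tilde{k}_j = k_j e^{i n \theta_j}$:

$$\tilde{q}_j\, \overline{\tilde{k}_j} = q_j e^{i m \theta_j} \cdot \overline{k_j}\, e^{-i n \theta_j} = q_j\, \overline{k_j}\, e^{i(m-n)\theta_j}$$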

When $\theta_i$ is large, differences in positions $m - n$ lead to large phase shifts, so the contribution to the attention score decays quickly as the positional distance grows. Since $\theta_i$ decreases with the dimension index $i$, lower dimensions rotate faster and put more weight on nearby tokens, attending to short-range dependencies (syntax, nearby words). Conversely, higher dimensions rotate more slowly, and their contribution to the attention score stays stable across long distances, attending to long-range dependencies (coreference, topic continuity).
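To make this concrete, here is a small sketch of the per-dimension factor $\cos(\Delta \theta_i)$ for a fixed offset $\Delta$ (assuming $d = 64$):

```python
import torch

d, delta = 64, 100                       # offset of 100 positions between two tokens
i = torch.arange(d // 2)
theta = 10000.0 ** (-2 * i / d)
phase = torch.cos(delta * theta)
print(phase[:4])    # fast (low-i) dimensions: the phase has wrapped around many times
print(phase[-4:])   # slow (high-i) dimensions: still close to 1, barely affected by the offset
```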

Since slow-rotating dimensions remain stable over large distances, RoPE gives models some ability to generalize to sequences longer than those seen during training.

Why do we need to pair dimensions?

The need to create pairs of dimensions in Rotary Position Embedding (RoPE) arises from the mathematical formulation of the rotation operation in high-dimensional spaces.

Rotation is a well-defined operation in 2D space. For a 2D vector, a rotation by an angle θ\theta can be represented using a rotation matrix which rotates the vector by θ\theta radians while preserving its magnitude.

In high-dimensional spaces (e.g. Rd\mathbb{R}^d, where dd is the embedding dimension), rotation is not as straightforward as in 2D. However, we can still perform rotations by applying 2D rotations to pairs of dimensions. This is because a rotation in high-dimensional space can be decomposed into a series of independent 2D rotations applied to pairs of dimensions.

Without pairing, it would be unclear how to generalize the rotation operation to high-dimensional spaces in a way that preserves the desired properties (e.g., preserving inner products). We would need to define a more complex high-dimensional rotation operation, which would be computationally expensive and harder to implement. Pairing ensures that the rotation is efficient (each pair only requires a small 2×2 matrix multiplication, which is easy to parallelize) and easier to interpret.
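Concretely, writing $R_t^{(i)}$ for the 2×2 block acting on pair $i$, the full rotation applied to a token at position $t$ is the block-diagonal matrix:

$$R_t = \begin{bmatrix} R_t^{(1)} & & & \\ & R_t^{(2)} & & \\ & & \ddots & \\ & & & R_t^{(d/2)} \end{bmatrix}, \qquad R_t^{(i)} = \begin{bmatrix} \cos(t\theta_i) & -\sin(t\theta_i) \\ \sin(t\theta_i) & \cos(t\theta_i) \end{bmatrix}$$

Each block is orthogonal, so the full matrix is orthogonal as well, which is why norms and inner products are preserved; in practice this matrix is never materialized, only the 2×2 rotations are applied.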

Implementation

Naive implementation:

  • split the embedding into pairs
  • generate the rotation matrix for each pair
  • apply the rotation matrix to each pair
  • concatenate all rotated pairs

```python
import torch

def rope_naive(x, t):
    """Apply RoPE to `x` of shape [..., d] for positions `t` (a float tensor broadcastable to x[..., 0])."""
    d = x.shape[-1]
    x_rotated = torch.zeros_like(x)
    for i in range(d // 2):
        # Frequency for this pair of dimensions
        theta_i = 10000 ** (-2 * i / d)
        cos_t = torch.cos(t * theta_i)
        sin_t = torch.sin(t * theta_i)
        x1 = x[..., 2 * i]
        x2 = x[..., 2 * i + 1]
        # 2D rotation of the pair (x1, x2) by angle t * theta_i
        x_rotated[..., 2 * i] = x1 * cos_t - x2 * sin_t
        x_rotated[..., 2 * i + 1] = x1 * sin_t + x2 * cos_t
    return x_rotated
```

Efficient implementation:

  • leverage vectorization and trigonometric identities to fuse operations, avoiding explicit splitting of dimensions and redundant computations
  • for position $t$, pre-compute a complex rotation factor $\Theta_t^j = e^{i t \theta_j} = \cos(t\theta_j) + i \sin(t\theta_j)$
  • apply the rotation via element-wise complex multiplication: $\tilde{z}_j = z_j \Theta_t^j$
  • decompose the complex numbers back into real and imaginary parts to recover the rotated embedding: $x_t = [\Re(x_t^{(1)}), \Im(x_t^{(1)}), \dots, \Re(x_t^{(d/2)}), \Im(x_t^{(d/2)})]$

```python
import torch

def rope_efficient(x, t):
    """Apply RoPE to `x` of shape [..., seq_len, d] for positions `t` of shape [seq_len]."""
    d = x.shape[-1]
    # View consecutive pairs (x_{2j}, x_{2j+1}) as complex numbers: shape [..., seq_len, d // 2]
    x_complex = torch.view_as_complex(x.reshape(*x.shape[:-1], -1, 2))
    # Rotation factors e^{i t theta_j} (in practice precomputed once and cached during initialization)
    freqs = 10000 ** (-2 * torch.arange(d // 2) / d)
    theta = t.unsqueeze(-1) * freqs.unsqueeze(0)  # shape: [seq_len, d // 2]
    rotation_factors = torch.cos(theta) + 1j * torch.sin(theta)
    # Apply the rotation by element-wise complex multiplication
    x_rotated_complex = x_complex * rotation_factors
    # Convert back to the interleaved real representation of shape [..., seq_len, d]
    x_rotated = torch.view_as_real(x_rotated_complex).flatten(-2)
    return x_rotated
```
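As a quick sanity check (a sketch assuming `x` of shape `[seq_len, d]` and float positions `t`), the two implementations should agree up to floating-point rounding:

```python
import torch

torch.manual_seed(0)
seq_len, d = 16, 64
x = torch.randn(seq_len, d)
t = torch.arange(seq_len, dtype=torch.float32)

out_naive = rope_naive(x, t)
out_efficient = rope_efficient(x, t)
print(torch.allclose(out_naive, out_efficient, atol=1e-4))  # expected: True
```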

For a more formal derivation, see this blog post by EleutherAI: https://blog.eleuther.ai/rotary-embeddings/