RoPE (Rotary Position Embedding) is a positional encoding method that applies a rotation to the input embeddings based on their positions in the sequence. The key idea is to encode positional information by rotating the embedding vectors, which preserves the inner-product structure and allows the model to capture relative positions.
It can be seen as multiplicative, whereas adding absolute positional embeddings is additive.
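To make the contrast concrete, here is a minimal sketch (the symbols $p_t$ and $R_t$ are illustrative names; $R_t$ is defined formally below):

$$\underbrace{x'_t = x_t + p_t}_{\text{absolute positional embedding (additive)}} \qquad\qquad \underbrace{x'_t = R_t\, x_t}_{\text{RoPE (multiplicative)}}$$

where $x_t$ is the embedding of the token at position $t$, $p_t$ is a position vector, and $R_t$ is a position-dependent rotation matrix.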
Let $x \in \mathbb{R}^d$ be an input embedding vector of size $d$. For a sequence of tokens, each token is associated with a position index $t$.
The embedding vector (corresponding to one token in the sequence) is split into $d/2$ pairs of dimensions $(x_{2i}, x_{2i+1})$, for $i = 0, \dots, d/2 - 1$ (if $d$ is odd, we just pad the embedding).
For each pair of dimensions $(x_{2i}, x_{2i+1})$, a rotation matrix $R_{t,i}$ is applied based on the position $t$ of the token within the sequence, defined as:

$$R_{t,i} = \begin{pmatrix} \cos(t\,\theta_i) & -\sin(t\,\theta_i) \\ \sin(t\,\theta_i) & \cos(t\,\theta_i) \end{pmatrix}$$

where $\theta_i$ is a frequency parameter that depends on the dimension index $i$ (not the token position). These frequencies are chosen to form a geometric progression. We will say more about this frequency parameter in the next section.
Each pair is multiplied by the matrix $R_{t,i}$:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(t\,\theta_i) & -\sin(t\,\theta_i) \\ \sin(t\,\theta_i) & \cos(t\,\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$
After rotating all the pairs, the final rotated embedding $x'$ is obtained by concatenating them.
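As a quick sanity check, here is a minimal sketch (toy values chosen arbitrarily, not from the original text) that applies the pairwise rotation to a 4-dimensional vector at position $t = 3$ and verifies that the magnitude is preserved:

import math
import torch

d, t = 4, 3  # toy embedding size and token position
x = torch.tensor([1.0, 2.0, 3.0, 4.0])

x_rot = torch.empty_like(x)
for i in range(d // 2):
    theta_i = 10000 ** (-2 * i / d)  # frequency for this pair of dimensions
    c, s = math.cos(t * theta_i), math.sin(t * theta_i)
    x_rot[2 * i] = c * x[2 * i] - s * x[2 * i + 1]
    x_rot[2 * i + 1] = s * x[2 * i] + c * x[2 * i + 1]

# Rotation preserves the vector's magnitude
print(torch.linalg.norm(x).item(), torch.linalg.norm(x_rot).item())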
The rotation preserves the inner-product structure: the inner product between two rotated embeddings depends only on their relative position, which is crucial for the dot-product attention mechanism in transformers. For two embeddings $q$ and $k$ at respective positions $t_1$ and $t_2$ in the sequence, the inner product after rotation is:

$$\langle R_{t_1}\, q,\; R_{t_2}\, k \rangle = q^\top R_{t_1}^\top R_{t_2}\, k = q^\top R_{t_2 - t_1}\, k$$

where $R_t$ denotes the full $d \times d$ rotation built from the per-pair blocks $R_{t,i}$; the result depends on $t_1$ and $t_2$ only through their difference.
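The last step uses the composition rule for 2D rotations, which each $2 \times 2$ block satisfies. Spelling it out for one block (a short derivation added here for completeness):

$$R_{t_1,i}^\top R_{t_2,i} = \begin{pmatrix} \cos(t_1\theta_i) & \sin(t_1\theta_i) \\ -\sin(t_1\theta_i) & \cos(t_1\theta_i) \end{pmatrix} \begin{pmatrix} \cos(t_2\theta_i) & -\sin(t_2\theta_i) \\ \sin(t_2\theta_i) & \cos(t_2\theta_i) \end{pmatrix} = \begin{pmatrix} \cos((t_2 - t_1)\theta_i) & -\sin((t_2 - t_1)\theta_i) \\ \sin((t_2 - t_1)\theta_i) & \cos((t_2 - t_1)\theta_i) \end{pmatrix} = R_{t_2 - t_1,\, i}$$

by the angle-addition identities.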
The rotation operation is computationally efficient and can be implemented using simple matrix multiplications. It does not require additional parameters, unlike learned positional embeddings.
The frequency parameter $\theta_i$ determines how quickly a specific pair of dimensions rotates. It is typically defined as:

$$\theta_i = 10000^{-2i/d}, \qquad i = 0, 1, \dots, d/2 - 1$$
This ensures that the frequencies form a geometric progression, allowing the model to capture both short-range and long-range dependencies.
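As an illustration (a small sketch, not part of the original text), the frequencies and the corresponding wavelengths $2\pi/\theta_i$ for a toy dimension $d = 8$:

import math

d = 8
for i in range(d // 2):
    theta_i = 10000 ** (-2 * i / d)
    # Consecutive frequencies differ by a constant ratio (geometric progression);
    # the wavelength 2*pi/theta_i is the position difference after which the pair
    # completes a full rotation.
    print(f"i={i}  theta_i={theta_i:.6f}  wavelength={2 * math.pi / theta_i:.1f}")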
Using complex numbers makes it easier to reason about the rotation. First, group each pair of dimensions into a complex number (here $\mathrm{i}$ denotes the imaginary unit):

$$z_i = x_{2i} + \mathrm{i}\, x_{2i+1}$$

Apply the rotation:

$$z'_i = z_i\, e^{\mathrm{i}\, t\, \theta_i}$$

where $t$ is the token position.
We can see that, taking the dot product between two vectors $q$ and $k$ at positions $t_1$ and $t_2$ respectively (per pair of dimensions):

$$\langle q_i\, e^{\mathrm{i}\, t_1 \theta_i},\; k_i\, e^{\mathrm{i}\, t_2 \theta_i} \rangle = q_i\, \overline{k_i}\, e^{\mathrm{i}\, (t_1 - t_2)\, \theta_i}$$

(the minus sign appears because the inner product in complex space requires taking the conjugate of the second term)
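The equivalence between complex multiplication by $e^{\mathrm{i}\, t\, \theta}$ and the $2 \times 2$ rotation can be checked numerically; the following sketch (added here, not part of the original post) rotates the same pair both ways and compares the results:

import math
import torch

t, theta = 5, 0.3            # arbitrary position and frequency
x1, x2 = 1.5, -0.7           # one pair of embedding dimensions

# 2x2 rotation matrix applied to the pair
R = torch.tensor([[math.cos(t * theta), -math.sin(t * theta)],
                  [math.sin(t * theta),  math.cos(t * theta)]])
rotated_real = R @ torch.tensor([x1, x2])

# Same rotation as a complex multiplication z * exp(i*t*theta)
z = torch.complex(torch.tensor(x1), torch.tensor(x2))
rotated_complex = z * torch.exp(torch.tensor(t * theta * 1j))

print(torch.allclose(rotated_real,
                     torch.stack([rotated_complex.real, rotated_complex.imag])))  # True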
When $\theta_i$ increases, differences in positions lead to larger phase shifts, which means that the attention score changes quickly as the position difference increases. Since $\theta_i$ decreases with the dimension index $i$, lower dimensions rotate faster and thus put more importance on nearby tokens (their attention scores drop quickly with increasing position differences), attending to short-range dependencies (syntax, nearby words). Conversely, higher dimensions rotate more slowly, and their attention scores stay stable across long distances, attending to long-range dependencies (coreference, topic continuity).
Since slow-rotating dimensions remain stable over large distances, RoPE models naturally generalize to sequences longer than those seen during training.
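To illustrate the fast- versus slow-rotating behaviour, here is a small sketch (values assumed for illustration, not from the original text) comparing the phase factor $\cos(\Delta\, \theta_i)$ of the fastest and slowest pairs over a range of relative distances $\Delta$:

import torch

d = 128
i_fast, i_slow = 0, d // 2 - 1
theta_fast = 10000 ** (-2 * i_fast / d)  # ~1.0: completes a rotation every ~6 positions
theta_slow = 10000 ** (-2 * i_slow / d)  # ~1e-4: barely rotates over thousands of positions

deltas = torch.tensor([1.0, 10.0, 100.0, 1000.0])
print(torch.cos(deltas * theta_fast))  # oscillates: sensitive to small position differences
print(torch.cos(deltas * theta_slow))  # stays close to 1: stable over long distances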
The need to create pairs of dimensions in Rotary Position Embedding (RoPE) arises from the mathematical formulation of the rotation operation in high-dimensional spaces.
Rotation is a well-defined operation in 2D space. For a 2D vector, a rotation by an angle $\theta$ can be represented using a rotation matrix which rotates the vector by $\theta$ radians while preserving its magnitude.
In high-dimensional spaces (e.g. $\mathbb{R}^d$, where $d$ is the embedding dimension), rotation is not as straightforward as in 2D. However, we can still perform rotations by applying 2D rotations to pairs of dimensions. This is because a rotation in high-dimensional space can be decomposed into a series of independent 2D rotations applied to pairs of dimensions.
Without pairing, it would be unclear how to generalize the rotation operation to high-dimensional spaces in a way that preserves the desired properties (e.g., preserving inner products); we would need to define a more complex high-dimensional rotation, which would be computationally expensive and harder to implement. Pairing keeps the rotation efficient, since each pair only requires a small 2×2 matrix multiplication that is easy to parallelize, and makes the operation more interpretable.
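To make the decomposition concrete, here is a small sketch (added for illustration) that builds the full $d \times d$ rotation as a block-diagonal matrix of 2×2 rotations and checks that it is orthogonal, i.e. that it preserves inner products:

import math
import torch

d, t = 8, 4  # toy embedding size and token position

# One 2x2 rotation block per pair of dimensions
blocks = []
for i in range(d // 2):
    a = t * 10000 ** (-2 * i / d)
    blocks.append(torch.tensor([[math.cos(a), -math.sin(a)],
                                [math.sin(a),  math.cos(a)]]))
R_t = torch.block_diag(*blocks)  # full d x d rotation matrix

# Orthogonality: R_t^T R_t = I, so <R_t q, R_t k> = <q, k>
print(torch.allclose(R_t.T @ R_t, torch.eye(d), atol=1e-6))

q, k = torch.randn(d), torch.randn(d)
print(torch.allclose((R_t @ q) @ (R_t @ k), q @ k, atol=1e-4))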
Naive implementation:
import torch

def rope_naive(x, t):
    """Rotate each pair of dimensions of x by t * theta_i, with an explicit loop.

    x: tensor of shape [..., d]; t: tensor of position indices,
    broadcastable against x[..., 0] (e.g. shape [seq_len]).
    """
    d = x.shape[-1]
    x_rotated = torch.zeros_like(x)
    for i in range(d // 2):
        # Frequency for this pair of dimensions (geometric progression)
        theta_i = 10000 ** (-2 * i / d)
        cos_t = torch.cos(t * theta_i)
        sin_t = torch.sin(t * theta_i)
        x1 = x[..., 2 * i]
        x2 = x[..., 2 * i + 1]
        # 2x2 rotation applied to the pair (x1, x2)
        x_rotated[..., 2 * i] = x1 * cos_t - x2 * sin_t
        x_rotated[..., 2 * i + 1] = x1 * sin_t + x2 * cos_t
    return x_rotated
Efficient implementation:
def rope_efficient(x, t):
    """Same rotation as rope_naive, vectorized via complex multiplication.

    x: tensor of shape [..., seq_len, d] with d even; t: tensor of positions, shape [seq_len].
    """
    d = x.shape[-1]
    # View each consecutive pair of dimensions as one complex number: [..., seq_len, d//2]
    x_complex = torch.view_as_complex(x.reshape(*x.shape[:-1], -1, 2))
    # Precompute rotation factors (in practice cached during initialization)
    freqs = 10000 ** (-2 * torch.arange(d // 2) / d)
    theta = t.unsqueeze(-1) * freqs.unsqueeze(0)  # shape [seq_len, d//2]
    rotation_factors = torch.cos(theta) + 1j * torch.sin(theta)  # e^{i * t * theta_i}
    # Apply the rotation as an element-wise complex multiplication
    x_rotated_complex = x_complex * rotation_factors
    # Convert back to real numbers and interleave the pairs again
    x_rotated = torch.view_as_real(x_rotated_complex).flatten(-2)
    return x_rotated
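A quick usage example (added here as a sketch) checking that the two implementations agree on a random input:

seq_len, d = 16, 64
x = torch.randn(seq_len, d)
t = torch.arange(seq_len)  # one position index per token

out_naive = rope_naive(x, t)
out_efficient = rope_efficient(x, t)
print(torch.allclose(out_naive, out_efficient, atol=1e-5))  # True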
For a more formal derivation, see this blog post by Eleuther AI: https://blog.eleuther.ai/rotary-embeddings/