Spectral Attention
spectrans.layers.attention.spectral
Spectral attention mechanisms using kernel approximations.
Implements attention mechanisms based on spectral methods and kernel approximations, particularly Random Fourier Features (RFF). These methods achieve linear complexity \(O(n)\) in the sequence length \(n\), instead of the quadratic \(O(n^2)\) complexity of standard attention.
Implementations follow the Performer architecture and related work on linearizing attention through kernel feature maps.
Classes:
| Name | Description |
|---|---|
| `SpectralAttention` | Multi-head spectral attention using RFF approximation. |
| `PerformerAttention` | Performer-style attention with positive random features. |
| `KernelAttention` | General kernel-based attention with various kernel options. |
Examples:
Basic spectral attention:
>>> import torch
>>> from spectrans.layers.attention.spectral import SpectralAttention
>>> attn = SpectralAttention(hidden_dim=512, num_heads=8, num_features=256)
>>> x = torch.randn(32, 100, 512) # (batch, seq_len, dim)
>>> output = attn(x)
>>> assert output.shape == x.shape
Performer attention:
>>> from spectrans.layers.attention.spectral import PerformerAttention
>>> performer = PerformerAttention(
...     hidden_dim=512,
...     num_heads=8,
...     num_features=256,
... )
>>> output = performer(x)
Notes
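The spectral attention approximates standard attention as:
\[
\text{Attention}(Q, K, V) \approx D^{-1}\,\varphi(Q)\left(\varphi(K)^\top V\right), \qquad D = \operatorname{diag}\left(\varphi(Q)\left(\varphi(K)^\top \mathbf{1}_n\right)\right)
\]
where \(\varphi\) is a feature map (such as RFF) that linearizes the computation. Standard attention requires \(O(n^2d)\) time and \(O(n^2)\) space, while spectral attention reduces this to \(O(nrd)\) time and \(O(nr)\) space for \(r\) features.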
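The saving comes from associativity: once a feature map replaces the softmax, \(\varphi(K)^\top V\) can be computed first and reused for every query, so the \(n \times n\) weight matrix is never formed. The sketch below is a dependency-free illustration of that reordering, not the library's implementation; the `elu(x) + 1` feature map and the function names are assumptions chosen only to keep the example deterministic.

```python
import math

def feature_map(x):
    """Illustrative positive feature map: elu(x) + 1, element-wise (an assumption)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def quadratic_attention(Q, K, V):
    """Forms all n x n kernel weights explicitly: O(n^2) time and space."""
    phi_q = [feature_map(q) for q in Q]
    phi_k = [feature_map(k) for k in K]
    out = []
    for fq in phi_q:
        weights = [sum(a * b for a, b in zip(fq, fk)) for fk in phi_k]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """Computes phi(K)^T V (r x d) and phi(K)^T 1 (r) once, then reuses them per query."""
    phi_q = [feature_map(q) for q in Q]
    phi_k = [feature_map(k) for k in K]
    r, d = len(phi_k[0]), len(V[0])
    kv = [[sum(fk[i] * v[j] for fk, v in zip(phi_k, V)) for j in range(d)]
          for i in range(r)]
    k_sum = [sum(fk[i] for fk in phi_k) for i in range(r)]
    out = []
    for fq in phi_q:
        denom = sum(a * b for a, b in zip(fq, k_sum))  # row normalizer, the D entry
        out.append([sum(fq[i] * kv[i][j] for i in range(r)) / denom
                    for j in range(d)])
    return out

Q = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.8]]
K = [[0.2, 0.1], [-0.3, 0.6], [0.7, -0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]

slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
max_diff = max(abs(a - b) for rs, rf in zip(slow, fast) for a, b in zip(rs, rf))
```

Both functions compute the same normalized kernel attention; only the order of the matrix products differs, which is what changes the cost from quadratic to linear in the sequence length.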
The approximation error decreases as \(O(1/\sqrt{r})\) in the number of random features \(r\).
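The effect of the feature count can be checked empirically. The following is a minimal sketch, assuming a unit-bandwidth Gaussian kernel and the classic cosine Random Fourier Features; it uses only the standard library, and the helper names (`rff_features`, `mean_abs_error`) are hypothetical rather than part of spectrans.

```python
import math
import random

def rff_features(x, weights, phases):
    """Random Fourier Features: sqrt(2/r) * cos(w_i . x + b_i)."""
    scale = math.sqrt(2.0 / len(weights))
    return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(weights, phases)]

def mean_abs_error(r, trials, rng, dim=4):
    """Average |phi(x) . phi(y) - exp(-|x - y|^2 / 2)| over fresh feature draws."""
    total = 0.0
    for _ in range(trials):
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        y = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        # w ~ N(0, I) and b ~ U(0, 2*pi) target the unit-bandwidth Gaussian kernel.
        weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(r)]
        phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(r)]
        fx = rff_features(x, weights, phases)
        fy = rff_features(y, weights, phases)
        approx = sum(a * b for a, b in zip(fx, fy))
        exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)
        total += abs(approx - exact)
    return total / trials

rng = random.Random(0)
errors = {r: mean_abs_error(r, trials=100, rng=rng) for r in (16, 256)}
```

Under the \(O(1/\sqrt{r})\) rate, going from 16 to 256 features should shrink the average error by roughly a factor of four, up to sampling noise.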
References
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. 2021. Rethinking attention with performers. In Proceedings of the International Conference on Learning Representations (ICLR).
Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, and Lingpeng Kong. 2021. Random feature attention. In Proceedings of the International Conference on Learning Representations (ICLR).
See Also
spectrans.kernels.rff : Random Fourier Features implementation.
spectrans.layers.attention.lst : Linear spectral transform attention.
Classes
SpectralAttention
SpectralAttention(hidden_dim: int, num_heads: int = 8, num_features: int | None = None, head_dim: int | None = None, kernel_type: Literal['gaussian', 'softmax'] = 'softmax', use_orthogonal: bool = True, feature_redraw: bool = False, dropout: float = 0.0, use_bias: bool = True)
Bases: AttentionLayer
Multi-head spectral attention using RFF approximation.
Implements attention using Random Fourier Features to approximate the softmax kernel with linear complexity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `hidden_dim` | `int` | Hidden dimension of the model. | required |
| `num_heads` | `int` | Number of attention heads. | `8` |
| `num_features` | `int \| None` | Number of random features. If None, uses hidden_dim. | `None` |
| `head_dim` | `int \| None` | Dimension per head. If None, uses hidden_dim // num_heads. | `None` |
| `kernel_type` | `Literal['gaussian', 'softmax']` | Type of kernel to approximate. | `"softmax"` |
| `use_orthogonal` | `bool` | Whether to use orthogonal random features. | `True` |
| `feature_redraw` | `bool` | Whether to redraw features at each forward pass. | `False` |
| `dropout` | `float` | Dropout probability. | `0.0` |
| `use_bias` | `bool` | Whether to use bias in projections. | `True` |
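The two `None` defaults above resolve to concrete sizes at construction time. As a rough illustration of the documented rule only, the hypothetical helper below (`resolve_dims` is not a spectrans function) reproduces it in plain Python; the divisibility check is an added assumption, not something the documentation states.

```python
def resolve_dims(hidden_dim, num_heads=8, num_features=None, head_dim=None):
    """Hypothetical helper mirroring the documented defaults; not part of spectrans."""
    if head_dim is None:
        # Assumption: an uneven split is treated as an error in this sketch.
        if hidden_dim % num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")
        head_dim = hidden_dim // num_heads
    if num_features is None:
        num_features = hidden_dim  # documented fallback
    return head_dim, num_features

default_dims = resolve_dims(512)                    # both sizes derived
explicit_dims = resolve_dims(512, num_features=256)  # feature count given
```

With `hidden_dim=512` and the default eight heads, the derived per-head dimension is 64, and the feature count falls back to 512 unless it is set explicitly.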
Attributes:
| Name | Type | Description |
|---|---|---|
| `head_dim` | `int` | Dimension per attention head. |
| `num_features` | `int` | Number of random features used. |
| `q_proj` | `Linear` | Query projection. |
| `k_proj` | `Linear` | Key projection. |
| `v_proj` | `Linear` | Value projection. |
| `out_proj` | `Linear` | Output projection. |
| `kernel` | `RandomFeatureMap \| KernelFunction` | Kernel for attention approximation. |
Methods:
| Name | Description |
|---|---|
| `forward` | Forward pass of spectral attention. |
Source code in spectrans/layers/attention/spectral.py
Functions
forward
forward(x: Tensor, mask: Tensor | None = None, return_attention: bool = False) -> Tensor | tuple[Tensor, ...]
Forward pass of spectral attention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor of shape (batch_size, seq_len, hidden_dim). | required |
| `mask` | `Tensor \| None` | Attention mask of shape (batch_size, seq_len). | `None` |
| `return_attention` | `bool` | Whether to return attention weights (not supported). | `False` |
Returns:
| Type | Description |
|---|---|
| `Tensor` or `tuple[Tensor, Tensor]` | Output tensor of shape (batch_size, seq_len, hidden_dim). If return_attention=True, also returns None (weights not available). |
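Because `forward` may return either a bare tensor or a tuple, calling code that toggles `return_attention` has to handle both shapes. One way to do that is sketched below; `split_output` is a hypothetical convenience function, not part of spectrans, and the strings stand in for tensors so the snippet runs without PyTorch.

```python
def split_output(result):
    """Normalize `Tensor | tuple[Tensor, ...]` into an (output, weights) pair.

    Hypothetical helper: `weights` is None when the layer returns only the
    output, or when it returns a tuple whose second entry is None.
    """
    if isinstance(result, tuple):
        output = result[0]
        weights = result[1] if len(result) > 1 else None
        return output, weights
    return result, None

# Stand-in values instead of real tensors, purely for illustration.
plain = split_output("output")
with_flag = split_output(("output", None))
```

Both calls yield the same pair, so downstream code can unpack the result without checking which mode the layer was called in.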
Source code in spectrans/layers/attention/spectral.py
PerformerAttention
PerformerAttention(hidden_dim: int, num_heads: int = 8, num_features: int | None = None, generalized: bool = False, dropout: float = 0.0)
Bases: SpectralAttention
Performer-style attention with the FAVOR+ algorithm.
Implements the Performer architecture with positive orthogonal random features (FAVOR+) for softmax kernel approximation.
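The positive random features behind FAVOR+ can be illustrated without any framework. The sketch below follows the feature map from the Performer paper, \(\varphi(x)_i = \exp(w_i^\top x - \lVert x \rVert^2 / 2) / \sqrt{r}\), whose inner products estimate the softmax kernel \(\exp(q^\top k)\). It draws independent Gaussian directions and omits the orthogonalization step for brevity; `positive_features` and the sample vectors are assumptions made for the example, not library code.

```python
import math
import random

def positive_features(x, weights):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(r); every entry is strictly positive."""
    r = len(weights)
    half_sq_norm = sum(v * v for v in x) / 2.0
    return [math.exp(sum(wi * xi for wi, xi in zip(w, x)) - half_sq_norm) / math.sqrt(r)
            for w in weights]

rng = random.Random(0)
dim, r = 4, 4096
# Independent N(0, I) directions; FAVOR+ would additionally orthogonalize them.
weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(r)]

q = [0.3, -0.1, 0.2, 0.0]
k = [0.1, 0.4, -0.2, 0.3]
phi_q = positive_features(q, weights)
phi_k = positive_features(k, weights)

approx = sum(a * b for a, b in zip(phi_q, phi_k))   # phi(q) . phi(k)
exact = math.exp(sum(a * b for a, b in zip(q, k)))  # softmax kernel exp(q . k)
```

Since every feature is positive, the estimated attention weights and their row normalizers stay positive, which is the stability argument the Performer paper makes for this map over trigonometric features.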
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `hidden_dim` | `int` | Hidden dimension. | required |
| `num_heads` | `int` | Number of attention heads. | `8` |
| `num_features` | `int \| None` | Number of random features. | `None` |
| `generalized` | `bool` | Whether to use generalized attention (without softmax). | `False` |
| `dropout` | `float` | Dropout probability. | `0.0` |
Attributes:
| Name | Type | Description |
|---|---|---|
| `generalized` | `bool` | Whether using generalized attention. |
Methods:
| Name | Description |
|---|---|
| `forward` | Forward pass of Performer attention. |
Source code in spectrans/layers/attention/spectral.py
Functions
forward
forward(x: Tensor, mask: Tensor | None = None, return_attention: bool = False) -> Tensor | tuple[Tensor, ...]
Forward pass of Performer attention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input of shape (batch_size, seq_len, hidden_dim). | required |
| `mask` | `Tensor \| None` | Attention mask. | `None` |
| `return_attention` | `bool` | Whether to return attention weights. | `False` |
Returns:
| Type | Description |
|---|---|
| `Tensor` or `tuple[Tensor, Tensor]` | Output tensor and optionally None for weights. |
Source code in spectrans/layers/attention/spectral.py
KernelAttention
KernelAttention(hidden_dim: int, num_heads: int = 8, kernel_type: Literal['gaussian', 'polynomial', 'spectral'] = 'gaussian', rank: int | None = None, num_features: int | None = None, dropout: float = 0.0)
Bases: AttentionLayer
General kernel-based attention with various kernel options.
Supports multiple kernel types including Gaussian, polynomial, and learnable spectral kernels.
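For reference, the two fixed kernels named above have simple closed forms. The sketch below writes them out in plain Python under the usual textbook definitions; the parameter names (`sigma`, `degree`, `c`) and their defaults are assumptions for illustration and may not match how the library parameterizes its kernels. The learnable spectral kernel is not covered.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-|x - y|^2 / (2 * sigma^2)); equals 1 when x == y."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def polynomial_kernel(x, y, degree=2, c=1.0):
    """k(x, y) = (x . y + c) ** degree."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** degree

same_point = gaussian_kernel([1.0, 2.0], [1.0, 2.0])
far_points = gaussian_kernel([0.0, 0.0], [3.0, 4.0])
poly_value = polynomial_kernel([1.0, 2.0], [3.0, 0.5])  # (3 + 1 + 1) ** 2
```

The Gaussian kernel depends only on the distance between query and key, while the polynomial kernel depends on their dot product, so the two choices weight nearby and aligned tokens differently.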
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `hidden_dim` | `int` | Hidden dimension. | required |
| `num_heads` | `int` | Number of heads. | `8` |
| `kernel_type` | `Literal['gaussian', 'polynomial', 'spectral']` | Type of kernel to use. | `"gaussian"` |
| `rank` | `int \| None` | Rank for low-rank approximations. | `None` |
| `num_features` | `int \| None` | Number of features for RFF kernels. | `None` |
| `dropout` | `float` | Dropout probability. | `0.0` |
Attributes:
| Name | Type | Description |
|---|---|---|
| `kernel_type` | `str` | Type of kernel being used. |
| `rank` | `int \| None` | Rank for approximations. |
Methods:
| Name | Description |
|---|---|
| `forward` | Forward pass of kernel attention. |
Source code in spectrans/layers/attention/spectral.py
Functions
forward
forward(x: Tensor, mask: Tensor | None = None, return_attention: bool = False) -> Tensor | tuple[Tensor, ...]
Forward pass of kernel attention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input of shape (batch_size, seq_len, hidden_dim). | required |
| `mask` | `Tensor \| None` | Attention mask. | `None` |
| `return_attention` | `bool` | Whether to return attention weights. | `False` |
Returns:
| Type | Description |
|---|---|
| `Tensor` or `tuple[Tensor, Tensor]` | Output and optionally attention weights. |
Source code in spectrans/layers/attention/spectral.py