Attention Layers
spectrans.layers.attention
Spectral attention layer implementations with linear complexity.
Provides attention mechanisms based on spectral methods and kernel approximations. These achieve linear or log-linear complexity in sequence length, compared to the quadratic complexity of standard attention. Implementations include Random Fourier Features, orthogonal transforms, and hybrid approaches.
Modules:
| Name | Description |
|---|---|
| lst | Linear Spectral Transform attention implementations. |
| spectral | Kernel-based spectral attention mechanisms. |
Classes:
| Name | Description |
|---|---|
| DCTAttention | Specialized LST attention using discrete cosine transform. |
| HadamardAttention | Fast attention using Hadamard transform operations. |
| KernelAttention | General kernel-based attention with various kernel options. |
| LSTAttention | Linear Spectral Transform attention with configurable transforms. |
| MixedSpectralAttention | Multi-transform attention combining multiple spectral methods. |
| PerformerAttention | Performer-style attention with FAVOR+ algorithm. |
| SpectralAttention | Multi-head spectral attention using random Fourier features. |
Examples:
Using spectral attention with RFF:
>>> import torch
>>> from spectrans.layers.attention import SpectralAttention
>>>
>>> attn = SpectralAttention(hidden_dim=512, num_heads=8, num_features=256)
>>> x = torch.randn(32, 100, 512)
>>> output = attn(x)
>>> assert output.shape == x.shape
Using LST attention with DCT:
>>> from spectrans.layers.attention import DCTAttention
>>>
>>> attn = DCTAttention(hidden_dim=512, num_heads=8)
>>> x = torch.randn(16, 128, 512)
>>> output = attn(x)
Using Performer attention:
>>> from spectrans.layers.attention import PerformerAttention
>>>
>>> attn = PerformerAttention(
...     hidden_dim=768,
...     num_heads=12,
...     num_features=256,
... )
>>> x = torch.randn(16, 128, 768)  # last dimension must match hidden_dim
>>> output = attn(x)
Notes
Complexity Analysis:
Standard attention requires \(O(n^2 d)\) time and \(O(n^2)\) memory. Spectral attention reduces this to \(O(n d k)\) time and \(O(n k)\) memory, where \(k\) is the number of random features. LST attention achieves \(O(n d \log n)\) time with \(O(n d)\) memory. Performer uses \(O(n d k)\) time with orthogonal features. Here \(n\) is sequence length and \(d\) is dimension.
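To make these orders of growth concrete, the sketch below plugs illustrative sizes into the leading terms only. The helper name `leading_costs` and the chosen sizes are assumptions for illustration; constants and lower-order terms are dropped, so the numbers are rough guides rather than measured costs.

```python
import math

def leading_costs(n: int, d: int, k: int) -> dict[str, int]:
    """Leading-order operation counts from the notes above (constants dropped)."""
    return {
        "standard": n * n * d,                   # O(n^2 d)
        "spectral": n * d * k,                   # O(n d k), same order for Performer
        "lst": n * d * math.ceil(math.log2(n)),  # O(n d log n)
    }

# Illustrative sizes (assumed): sequence length 4096, dimension 64, 256 features.
costs = leading_costs(n=4096, d=64, k=256)
for name, ops in costs.items():
    print(f"{name:>8}: {ops:,}")
```

With these sizes the quadratic term is 16 times the random-feature term, and the gap widens linearly as \(n\) grows with \(k\) held fixed.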
Kernel approximation quality scales as \(O(1/\sqrt{k})\) for random features.
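The following is a minimal plain-Python sketch of the random Fourier feature idea behind that error rate, not the library's implementation. It assumes a unit-bandwidth Gaussian kernel and the classic cosine feature map; the function names are hypothetical.

```python
import math
import random

def rff_features(x, ws, bs):
    """Random Fourier feature map z(x)_i = sqrt(2/k) * cos(w_i . x + b_i)."""
    k = len(ws)
    return [
        math.sqrt(2.0 / k) * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
        for w, b in zip(ws, bs)
    ]

def rff_kernel_estimate(x, y, k, rng):
    """Estimate exp(-|x - y|^2 / 2) as the dot product of k random features."""
    ws = [[rng.gauss(0.0, 1.0) for _ in x] for _ in range(k)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(k)]
    zx, zy = rff_features(x, ws, bs), rff_features(y, ws, bs)
    return sum(a * b for a, b in zip(zx, zy))

rng = random.Random(0)  # fixed seed so the run is repeatable
x, y = [0.3, -0.2, 0.5], [0.1, 0.4, -0.3]
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)

# Absolute error for increasing feature counts; it shrinks on average like 1/sqrt(k).
errors = {k: abs(rff_kernel_estimate(x, y, k, rng) - exact) for k in (16, 256, 4096)}
for k, err in errors.items():
    print(f"k={k:5d}  abs error={err:.4f}")
```

Individual runs fluctuate, so the error is not guaranteed to decrease at every step; the \(O(1/\sqrt{k})\) rate describes the typical trend.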
References
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. 2021. Rethinking attention with performers. In Proceedings of the International Conference on Learning Representations (ICLR).
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 5156-5165.
See Also
spectrans.kernels : Kernel functions used by attention mechanisms.
spectrans.transforms : Spectral transforms used by LST attention.
spectrans.layers : Parent module containing all layer implementations.
Classes
DCTAttention
DCTAttention(hidden_dim: int, num_heads: int = 8, dct_type: int = 2, learnable_scale: bool = True, dropout: float = 0.0)
Bases: LSTAttention
Attention using Discrete Cosine Transform.
Specialized LST attention that applies the DCT on every head. The DCT suits real-valued signals and offers energy compaction.
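Energy compaction can be illustrated with a naive orthonormal DCT-II in plain Python. This is only a sketch: the \(O(n^2)\) loop and the test signal are assumptions chosen for readability, and the library's own transform is not shown here.

```python
import math

def dct2_orthonormal(x):
    """Naive O(n^2) orthonormal DCT-II of a real sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(
            x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)
        ))
    return out

n = 64
# A smooth bump (assumed test signal): most of its energy sits at low frequencies.
signal = [math.exp(-((t - 32) / 12.0) ** 2) for t in range(n)]
coeffs = dct2_orthonormal(signal)

total = sum(c * c for c in coeffs)
fraction = sum(c * c for c in coeffs[:8]) / total
print(f"energy in first 8 of {n} coefficients: {fraction:.6f}")
```

Because the transform is orthonormal, `total` equals the signal's own energy, so the fraction measures how much of the signal a handful of coefficients captures.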
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_dim | int | Hidden dimension. | required |
| num_heads | int | Number of attention heads. | 8 |
| dct_type | int | DCT type (2 is most common). | 2 |
| learnable_scale | bool | Whether to use learnable scaling. | True |
| dropout | float | Dropout probability. | 0.0 |
Source code in spectrans/layers/attention/lst.py
HadamardAttention
HadamardAttention(hidden_dim: int, num_heads: int = 8, scale_by_sqrt: bool = True, learnable_scale: bool = True, dropout: float = 0.0)
Bases: LSTAttention
Attention using fast Hadamard transform.
Uses the fast Hadamard transform, whose matrix entries are all \(\pm 1\), for \(O(n \log n)\) attention computation.
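The \(O(n \log n)\) cost comes from the butterfly structure of the fast Walsh-Hadamard transform, which needs only additions and subtractions. The sketch below is a generic textbook version in plain Python, not the library's code; the function name `fwht` is an assumption.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:                      # log2(n) stages
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b   # butterfly: sums and differences only
        h *= 2
    return out

x = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
y = fwht(x)
# Applying the transform twice multiplies by n, so dividing by n inverts it.
roundtrip = [v / len(x) for v in fwht(y)]
```

Dividing each application by \(\sqrt{n}\) instead makes the transform its own inverse, which is the kind of normalization the `scale_by_sqrt` option refers to.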
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_dim | int | Hidden dimension. | required |
| num_heads | int | Number of attention heads. | 8 |
| scale_by_sqrt | bool | Whether to normalize by sqrt(n) so the transform is orthonormal. | True |
| learnable_scale | bool | Whether to use learnable diagonal scaling. | True |
| dropout | float | Dropout probability. | 0.0 |
Source code in spectrans/layers/attention/lst.py
LSTAttention
LSTAttention(hidden_dim: int, num_heads: int = 8, transform_type: Literal['dct', 'dst', 'hadamard', 'mixed'] = 'dct', learnable_scale: bool = True, normalize: bool = True, dropout: float = 0.0, use_bias: bool = True)
Bases: AttentionLayer
Linear Spectral Transform attention mechanism.
Implements attention using orthogonal transforms (DCT, DST, Hadamard) with learnable diagonal scaling.
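One plausible reading of "orthogonal transform with diagonal scaling" is the mixing rule \(y = C^\top \operatorname{diag}(s)\, C x\), where \(C\) is an orthonormal transform along the sequence. The sketch below illustrates that rule for a single scalar channel using a DCT-II matrix. It is an assumption-laden illustration: the helper names are hypothetical, and the actual layer also involves projections, heads, and normalization that this reference does not spell out.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (rows are basis vectors), so C^T C = I."""
    return [
        [
            (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
            * math.cos(math.pi * (t + 0.5) * k / n)
            for t in range(n)
        ]
        for k in range(n)
    ]

def lst_mix(x, scale):
    """Apply y = C^T diag(scale) C x to a sequence of scalars."""
    n = len(x)
    c = dct_matrix(n)
    spectrum = [sum(c[k][t] * x[t] for t in range(n)) for k in range(n)]   # C x
    scaled = [s * v for s, v in zip(scale, spectrum)]                      # diag(s) C x
    return [sum(c[k][t] * scaled[k] for k in range(n)) for t in range(n)]  # C^T (...)

x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 1.0]
identity = lst_mix(x, [1.0] * len(x))                          # all-ones scale: returns x
lowpass = lst_mix(x, [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # keep two lowest frequencies
```

With an all-ones scale the operation is the identity, so the learnable vector alone decides which frequencies are amplified or suppressed.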
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_dim | int | Hidden dimension of the model. | required |
| num_heads | int | Number of attention heads. | 8 |
| transform_type | Literal['dct', 'dst', 'hadamard', 'mixed'] | Type of transform to use. "mixed" uses different transforms per head. | "dct" |
| learnable_scale | bool | Whether to use learnable diagonal scaling matrix. | True |
| normalize | bool | Whether to normalize in transform domain. | True |
| dropout | float | Dropout probability. | 0.0 |
| use_bias | bool | Whether to use bias in projections. | True |
Attributes:
| Name | Type | Description |
|---|---|---|
| head_dim | int | Dimension per attention head. |
| transform_type | str | Type of transform being used. |
| transforms | ModuleList | List of transforms (one per head if mixed). |
| scale | Parameter \| None | Learnable diagonal scaling if enabled. |
Methods:
| Name | Description |
|---|---|
| forward | Forward pass of LST attention. |
Source code in spectrans/layers/attention/lst.py
Functions
forward
forward(x: Tensor, mask: Tensor | None = None, return_attention: bool = False) -> Tensor | tuple[Tensor, ...]
Forward pass of LST attention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Input tensor of shape (batch_size, seq_len, hidden_dim). | required |
| mask | Tensor \| None | Attention mask of shape (batch_size, seq_len). | None |
| return_attention | bool | Whether to return attention weights (not supported). | False |
Returns:
| Type | Description |
|---|---|
| Tensor or tuple[Tensor, Tensor] | Output tensor of shape (batch_size, seq_len, hidden_dim). If return_attention=True, returns (output, None). |
Source code in spectrans/layers/attention/lst.py
MixedSpectralAttention
MixedSpectralAttention(hidden_dim: int, num_heads: int = 8, use_fft: bool = True, use_dct: bool = True, use_hadamard: bool = True, dropout: float = 0.0)
Bases: AttentionLayer
Mixed spectral attention using multiple transform types.
Combines different spectral transforms across heads for diverse frequency representations.
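How heads are distributed across transform types is not spelled out in this reference. The sketch below shows one hypothetical scheme: an even split over the enabled transforms, with any leftover heads given to the earlier types. The function `assign_heads` and its tie-breaking rule are assumptions for illustration only.

```python
def assign_heads(num_heads, use_fft=True, use_dct=True, use_hadamard=True):
    """Hypothetical even split of heads across the enabled transform types."""
    enabled = [
        name
        for name, flag in (("fft", use_fft), ("dct", use_dct), ("hadamard", use_hadamard))
        if flag
    ]
    if not enabled:
        raise ValueError("at least one transform type must be enabled")
    base, extra = divmod(num_heads, len(enabled))
    plan = []
    for i, name in enumerate(enabled):
        # Assumed rule: the first `extra` transform types each take one leftover head.
        plan.extend([name] * (base + (1 if i < extra else 0)))
    return plan

print(assign_heads(9))   # three heads per transform type
print(assign_heads(8))   # uneven: two types get three heads, one gets two
```

Under this scheme a head count divisible by the number of enabled transforms gives every transform the same number of heads; other counts still work but split unevenly.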
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_dim | int | Hidden dimension. | required |
| num_heads | int | Number of attention heads (should be divisible by 3 for even split). | 8 |
| use_fft | bool | Whether to include FFT heads. | True |
| use_dct | bool | Whether to include DCT heads. | True |
| use_hadamard | bool | Whether to include Hadamard heads. | True |
| dropout | float | Dropout probability. | 0.0 |
Methods:
| Name | Description |
|---|---|
| forward | Forward pass of mixed spectral attention. |
Source code in spectrans/layers/attention/lst.py
Functions
forward
forward(x: Tensor, _mask: Tensor | None = None, return_attention: bool = False) -> Tensor | tuple[Tensor, ...]
Forward pass of mixed spectral attention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Input of shape (batch_size, seq_len, hidden_dim). | required |
| _mask | Tensor \| None | Attention mask (not implemented for spectral attention). | None |
| return_attention | bool | Whether to return attention weights. | False |
Returns:
| Type | Description |
|---|---|
| Tensor or tuple[Tensor, Tensor] | Output and optionally None for weights. |
Source code in spectrans/layers/attention/lst.py
KernelAttention
KernelAttention(hidden_dim: int, num_heads: int = 8, kernel_type: Literal['gaussian', 'polynomial', 'spectral'] = 'gaussian', rank: int | None = None, num_features: int | None = None, dropout: float = 0.0)
Bases: AttentionLayer
General kernel-based attention with various kernel options.
Supports multiple kernel types including Gaussian, polynomial, and learnable spectral kernels.
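As a reference point for what the kernel choice changes, the sketch below implements exact kernel attention in plain Python: each query's output is a kernel-weighted average of the values. It is the quadratic-cost baseline, not the library's approximation, and the function names and default hyperparameters (`sigma`, `degree`, `offset`) are assumptions.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """exp(-|x - y|^2 / (2 sigma^2)); always positive."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def polynomial_kernel(x, y, degree=2, offset=1.0):
    """(x . y + offset)^degree; non-negative for even degrees."""
    return (sum(a * b for a, b in zip(x, y)) + offset) ** degree

def kernel_attention(queries, keys, values, kernel):
    """Exact kernel attention: normalize each row of kernel scores to sum to 1."""
    out = []
    for q in queries:
        scores = [kernel(q, k) for k in keys]
        total = sum(scores)
        weights = [s / total for s in scores]
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out

queries = [[1.0, 0.0]]
keys = [[0.0, 1.0], [0.0, 1.0]]        # identical keys receive equal weight
values = [[2.0, 0.0], [4.0, 2.0]]
out = kernel_attention(queries, keys, values, gaussian_kernel)
```

Swapping `gaussian_kernel` for `polynomial_kernel` changes only the similarity scores; the normalization and averaging steps stay the same.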
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_dim | int | Hidden dimension. | required |
| num_heads | int | Number of heads. | 8 |
| kernel_type | Literal['gaussian', 'polynomial', 'spectral'] | Type of kernel to use. | "gaussian" |
| rank | int \| None | Rank for low-rank approximations. | None |
| num_features | int \| None | Number of features for RFF kernels. | None |
| dropout | float | Dropout probability. | 0.0 |
Attributes:
| Name | Type | Description |
|---|---|---|
| kernel_type | str | Type of kernel being used. |
| rank | int \| None | Rank for approximations. |
Methods:
| Name | Description |
|---|---|
| forward | Forward pass of kernel attention. |
Source code in spectrans/layers/attention/spectral.py
Functions
forward
forward(x: Tensor, mask: Tensor | None = None, return_attention: bool = False) -> Tensor | tuple[Tensor, ...]
Forward pass of kernel attention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Input of shape (batch_size, seq_len, hidden_dim). | required |
| mask | Tensor \| None | Attention mask. | None |
| return_attention | bool | Whether to return attention weights. | False |
Returns:
| Type | Description |
|---|---|
| Tensor or tuple[Tensor, Tensor] | Output and optionally attention weights. |
Source code in spectrans/layers/attention/spectral.py
PerformerAttention
PerformerAttention(hidden_dim: int, num_heads: int = 8, num_features: int | None = None, generalized: bool = False, dropout: float = 0.0)
Bases: SpectralAttention
Performer-style attention with FAVOR+ algorithm.
Implements the Performer architecture with positive orthogonal random features (FAVOR+) for softmax kernel approximation.
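The positive random feature map from the Performer paper can be sketched in a few lines of plain Python. This illustration uses i.i.d. Gaussian projections and omits the orthogonalization step of FAVOR+; the function name and sizes are assumptions, and it is not the library's implementation.

```python
import math
import random

def positive_features(x, ws):
    """Positive random features: exp(w . x - |x|^2 / 2) / sqrt(m) for each w."""
    m = len(ws)
    half_sq_norm = sum(v * v for v in x) / 2.0
    return [
        math.exp(sum(wi * xi for wi, xi in zip(w, x)) - half_sq_norm) / math.sqrt(m)
        for w in ws
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
dim, m = 4, 5000
# i.i.d. Gaussian projections; FAVOR+ additionally orthogonalizes them (omitted here).
ws = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]

q = [0.3, -0.1, 0.2, 0.4]
k = [0.1, 0.3, -0.2, 0.35]
phi_q, phi_k = positive_features(q, ws), positive_features(k, ws)

estimate = sum(a * b for a, b in zip(phi_q, phi_k))   # approximates exp(q . k)
exact = math.exp(sum(a * b for a, b in zip(q, k)))
print(f"estimate={estimate:.4f}  exact={exact:.4f}")
```

Every feature is strictly positive, so the approximate attention weights never change sign, which is the stability argument the Performer paper makes for this map over cosine features.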
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_dim | int | Hidden dimension. | required |
| num_heads | int | Number of attention heads. | 8 |
| num_features | int \| None | Number of random features. | None |
| generalized | bool | Whether to use generalized attention (without softmax). | False |
| dropout | float | Dropout probability. | 0.0 |
Attributes:
| Name | Type | Description |
|---|---|---|
| generalized | bool | Whether using generalized attention. |
Methods:
| Name | Description |
|---|---|
| forward | Forward pass of Performer attention. |
Source code in spectrans/layers/attention/spectral.py
Functions
forward
forward(x: Tensor, mask: Tensor | None = None, return_attention: bool = False) -> Tensor | tuple[Tensor, ...]
Forward pass of Performer attention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Input of shape (batch_size, seq_len, hidden_dim). | required |
| mask | Tensor \| None | Attention mask. | None |
| return_attention | bool | Whether to return attention weights. | False |
Returns:
| Type | Description |
|---|---|
| Tensor or tuple[Tensor, Tensor] | Output tensor and optionally None for weights. |
Source code in spectrans/layers/attention/spectral.py
SpectralAttention
SpectralAttention(hidden_dim: int, num_heads: int = 8, num_features: int | None = None, head_dim: int | None = None, kernel_type: Literal['gaussian', 'softmax'] = 'softmax', use_orthogonal: bool = True, feature_redraw: bool = False, dropout: float = 0.0, use_bias: bool = True)
Bases: AttentionLayer
Multi-head spectral attention using RFF approximation.
Implements attention using Random Fourier Features to approximate the softmax kernel with linear complexity.
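The linear complexity follows from reordering a matrix product: once queries and keys are mapped to features \(\phi(Q)\) and \(\phi(K)\), the output \((\phi(Q)\phi(K)^\top)V\) equals \(\phi(Q)(\phi(K)^\top V)\), and the second form never builds the \(n \times n\) score matrix. The plain-Python sketch below checks that equivalence on small hand-picked positive features; all names and values are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def normalize_rows(numerators, denominators):
    """Divide each output row by its scalar normalizer (an n x 1 matrix)."""
    return [[v / den[0] for v in num] for num, den in zip(numerators, denominators)]

def quadratic_attention(phi_q, phi_k, v):
    """(phi(Q) phi(K)^T) V: builds the full n x n score matrix."""
    scores = matmul(phi_q, transpose(phi_k))            # n x n
    ones = [[1.0] for _ in phi_k]
    return normalize_rows(matmul(scores, v), matmul(scores, ones))

def linear_attention(phi_q, phi_k, v):
    """phi(Q) (phi(K)^T V): only k x d and k x 1 summaries are formed."""
    kt = transpose(phi_k)
    kv = matmul(kt, v)                                  # k x d summary of keys and values
    k_sum = matmul(kt, [[1.0] for _ in phi_k])          # k x 1 normalizer summary
    return normalize_rows(matmul(phi_q, kv), matmul(phi_q, k_sum))

# Hand-picked positive features for n = 3 tokens and k = 2 features (assumed values).
phi_q = [[1.0, 0.5], [0.2, 1.5], [0.7, 0.7]]
phi_k = [[0.4, 1.0], [1.2, 0.3], [0.6, 0.6]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

q_out = quadratic_attention(phi_q, phi_k, v)
l_out = linear_attention(phi_q, phi_k, v)
```

Both functions return the same values up to floating-point rounding; only the order of multiplication, and therefore the cost in sequence length, differs.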
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_dim | int | Hidden dimension of the model. | required |
| num_heads | int | Number of attention heads. | 8 |
| num_features | int \| None | Number of random features. If None, uses hidden_dim. | None |
| head_dim | int \| None | Dimension per head. If None, uses hidden_dim // num_heads. | None |
| kernel_type | Literal['gaussian', 'softmax'] | Type of kernel to approximate. | "softmax" |
| use_orthogonal | bool | Whether to use orthogonal random features. | True |
| feature_redraw | bool | Whether to redraw features at each forward pass. | False |
| dropout | float | Dropout probability. | 0.0 |
| use_bias | bool | Whether to use bias in projections. | True |
Attributes:
| Name | Type | Description |
|---|---|---|
| head_dim | int | Dimension per attention head. |
| num_features | int | Number of random features used. |
| q_proj | Linear | Query projection. |
| k_proj | Linear | Key projection. |
| v_proj | Linear | Value projection. |
| out_proj | Linear | Output projection. |
| kernel | RandomFeatureMap \| KernelFunction | Kernel for attention approximation. |
Methods:
| Name | Description |
|---|---|
| forward | Forward pass of spectral attention. |
Source code in spectrans/layers/attention/spectral.py
Functions
forward
forward(x: Tensor, mask: Tensor | None = None, return_attention: bool = False) -> Tensor | tuple[Tensor, ...]
Forward pass of spectral attention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Input tensor of shape (batch_size, seq_len, hidden_dim). | required |
| mask | Tensor \| None | Attention mask of shape (batch_size, seq_len). | None |
| return_attention | bool | Whether to return attention weights (not supported). | False |
Returns:
| Type | Description |
|---|---|
| Tensor or tuple[Tensor, Tensor] | Output tensor of shape (batch_size, seq_len, hidden_dim). If return_attention=True, also returns None (weights not available). |