Spectral Attention Models
spectrans.models.spectral_attention
Spectral Attention transformer models using kernel approximations.
This module implements transformer models based on spectral attention mechanisms that use Random Fourier Features (RFF) to linearize attention computation. These models scale as \(O(n)\) in the sequence length \(n\), instead of the quadratic \(O(n^2)\) of standard attention, which makes them efficient for long sequences.
The spectral attention mechanism approximates the softmax kernel using random feature maps, so the \(n \times n\) attention matrix is never formed explicitly. This keeps much of the expressive power of attention at a substantially lower computational cost.
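The linear cost comes from reordering the matrix products. The following is a minimal NumPy sketch of that reordering, not the library's implementation: `phi_q` and `phi_k` are hypothetical stand-ins for feature-mapped queries and keys, and the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D = 64, 16, 32  # sequence length, value dimension, number of features

# Hypothetical stand-ins for feature-mapped queries and keys (non-negative entries).
phi_q = rng.random((n, D))
phi_k = rng.random((n, D))
v = rng.standard_normal((n, d))

# Quadratic order: build the n x n score matrix, then normalise each row.
scores = phi_q @ phi_k.T                                  # (n, n)
out_quadratic = (scores / scores.sum(axis=1, keepdims=True)) @ v

# Linear order: contract keys with values first, so no n x n matrix is formed.
kv = phi_k.T @ v                                          # (D, d)
z = phi_k.sum(axis=0)                                     # (D,)
out_linear = (phi_q @ kv) / (phi_q @ z)[:, None]

# Both orders give the same result by associativity of matrix multiplication.
assert np.allclose(out_quadratic, out_linear)
```

The second form touches each of the `n` positions a constant number of times, which is where the \(O(n)\) scaling in the sequence length comes from.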
Classes:
| Name | Description |
|---|---|
| `SpectralAttentionTransformer` | Complete transformer model using spectral attention layers. |
| `SpectralAttentionEncoder` | Encoder-only model for representation learning. |
| `PerformerTransformer` | Performer-style model with positive orthogonal random features. |
Examples:
Basic spectral attention transformer:
>>> import torch
>>> from spectrans.models.spectral_attention import SpectralAttentionTransformer
>>> model = SpectralAttentionTransformer(
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_features=256,
...     max_sequence_length=1024
... )
>>> x = torch.randn(32, 100, 512) # (batch, seq_len, dim)
>>> output = model(inputs_embeds=x)
>>> assert output.shape == x.shape
Using with token inputs and classification head:
>>> model = SpectralAttentionTransformer(
...     vocab_size=10000,
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_classes=10,
...     max_sequence_length=512
... )
>>> input_ids = torch.randint(0, 10000, (32, 100))
>>> logits = model(input_ids)
>>> assert logits.shape == (32, 10)
Performer model with orthogonal features:
>>> from spectrans.models.spectral_attention import PerformerTransformer
>>> performer = PerformerTransformer(
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_features=256
... )
Notes
Mathematical Foundation:
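The spectral attention mechanism approximates standard attention (up to row-wise normalization) as:
\[
\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \approx \boldsymbol{\Phi}(\mathbf{Q}) \left( \boldsymbol{\Phi}(\mathbf{K})^\top \mathbf{V} \right)
\]
where \(\boldsymbol{\Phi}\) is a random feature map applied to each row \(\mathbf{x}\):
\[
\boldsymbol{\Phi}(\mathbf{x}) = \sqrt{\frac{2}{D}} \left[ \cos(\omega_1^\top \mathbf{x} + b_1), \ldots, \cos(\omega_D^\top \mathbf{x} + b_D) \right]
\]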
With random frequencies \(\omega_i \sim \mathcal{N}(0, \sigma^2 I)\) and phases \(b_i \sim \text{Uniform}[0, 2\pi]\).
The approximation quality improves with more random features \(D\), with error decreasing as \(O(\frac{1}{\sqrt{D}})\). The linear complexity \(O(nDd)\) becomes favorable over standard attention \(O(n^2d)\) when \(D \ll n\).
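The dependence on \(D\) can be checked numerically. The sketch below is an illustration in NumPy, not code from this module: it uses cosine features with Gaussian frequencies and uniform phases to approximate the Gaussian kernel \(\exp(-\sigma^2 \lVert x - y \rVert^2 / 2)\) and measures the mean absolute error for a few values of \(D\). The helper names `rff_features` and `mean_abs_error` and all sizes are assumptions made for the example.

```python
import numpy as np

def rff_features(x, omega, b):
    """Map rows of x (n, d) to D cosine features, scaled by sqrt(2 / D)."""
    D = omega.shape[0]
    return np.sqrt(2.0 / D) * np.cos(x @ omega.T + b)

def mean_abs_error(D, sigma=1.0, n=50, d=8, seed=0):
    """Mean absolute gap between the RFF estimate and the exact Gaussian kernel."""
    rng = np.random.default_rng(seed)
    x = 0.5 * rng.standard_normal((n, d))
    omega = rng.normal(0.0, sigma, size=(D, d))    # frequencies ~ N(0, sigma^2 I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)      # phases ~ Uniform[0, 2*pi]
    phi = rff_features(x, omega, b)
    approx = phi @ phi.T
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    exact = np.exp(-0.5 * sigma**2 * sq_dists)
    return np.abs(approx - exact).mean()

# Error for a small, medium, and large number of random features.
errors = {D: mean_abs_error(D) for D in (16, 256, 4096)}
```

Under these assumptions the error for `D=4096` should be far smaller than for `D=16`, in line with the \(O(1/\sqrt{D})\) rate.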
For the Performer variant, orthogonal random features are used to reduce the variance of the approximation, which gives a more accurate kernel estimate for the same number of features.
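One common way to draw orthogonal random features is to orthogonalize Gaussian blocks with a QR decomposition and then rescale the rows. The sketch below shows that construction in NumPy. It is an assumption about the general technique, not a description of how this library builds its features, and `orthogonal_gaussian` is a hypothetical helper name.

```python
import numpy as np

def orthogonal_gaussian(D, d, rng):
    """Return a (D, d) matrix whose rows are mutually orthogonal within each
    block of d rows, with row lengths drawn like those of Gaussian vectors."""
    blocks = []
    for _ in range(-(-D // d)):                  # ceil(D / d) square blocks
        g = rng.standard_normal((d, d))
        q, r = np.linalg.qr(g)
        q = q * np.sign(np.diag(r))              # sign fix for a uniform rotation
        blocks.append(q)
    w = np.concatenate(blocks, axis=0)[:D]
    # Rescale unit rows to chi-distributed lengths, matching i.i.d. Gaussian rows.
    norms = np.linalg.norm(rng.standard_normal((D, d)), axis=1)
    return w * norms[:, None]

rng = np.random.default_rng(0)
omega = orthogonal_gaussian(D=8, d=4, rng=rng)
gram = omega[:4] @ omega[:4].T                   # off-diagonal entries vanish
```

Rows from different blocks are independent rather than orthogonal, since at most `d` vectors in `d` dimensions can be mutually orthogonal.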
References
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. 2021. Rethinking attention with performers. In Proceedings of the International Conference on Learning Representations (ICLR).
Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, and Lingpeng Kong. 2021. Random feature attention. In Proceedings of the International Conference on Learning Representations (ICLR).
Ali Rahimi and Benjamin Recht. 2007. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20 (NeurIPS 2007), pages 1177-1184.
See Also
spectrans.layers.attention.spectral : Spectral attention layer implementations.
spectrans.kernels.rff : Random Fourier Features kernel approximations.
spectrans.models.lst : Linear Spectral Transform models for comparison.
Classes
SpectralAttentionTransformer
SpectralAttentionTransformer(vocab_size: int | None = None, hidden_dim: int = 512, num_layers: int = 6, max_sequence_length: int = 1024, num_heads: int = 8, num_features: int | None = None, kernel_type: KernelType = 'softmax', use_orthogonal: bool = False, num_classes: int | None = None, ffn_hidden_dim: int | None = None, dropout: float = 0.0, use_positional_encoding: bool = True, positional_encoding_type: PositionalEncodingType = 'sinusoidal', gradient_checkpointing: bool = False)
Bases: BaseModel
Spectral Attention transformer using Random Fourier Features.
This model uses spectral attention layers with RFF approximation to achieve linear complexity attention computation. The model maintains the expressive power of standard transformers while being efficient for long sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int \| None` | Size of the vocabulary for token embeddings. If None, expects pre-embedded inputs. | `None` |
| `hidden_dim` | `int` | Hidden dimension size for the model. | `512` |
| `num_layers` | `int` | Number of transformer blocks. | `6` |
| `max_sequence_length` | `int` | Maximum sequence length the model can process. | `1024` |
| `num_heads` | `int` | Number of attention heads. | `8` |
| `num_features` | `int \| None` | Number of random features for RFF approximation. If None, uses hidden_dim. | `None` |
| `kernel_type` | `KernelType` | Type of kernel to approximate. | `"softmax"` |
| `use_orthogonal` | `bool` | Whether to use orthogonal random features. | `False` |
| `num_classes` | `int \| None` | Number of output classes for classification. | `None` |
| `ffn_hidden_dim` | `int \| None` | Hidden dimension of the feedforward network. Default is 4 * hidden_dim. | `None` |
| `dropout` | `float` | Dropout probability. | `0.0` |
| `use_positional_encoding` | `bool` | Whether to use positional encoding. | `True` |
| `positional_encoding_type` | `str` | Type of positional encoding ("sinusoidal" or "learned"). | `"sinusoidal"` |
| `gradient_checkpointing` | `bool` | Whether to use gradient checkpointing to save memory. | `False` |
Attributes:
| Name | Type | Description |
|---|---|---|
| `blocks` | `ModuleList` | Stack of spectral attention transformer blocks. |
Examples:
>>> model = SpectralAttentionTransformer(
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_features=256,
...     max_sequence_length=1024
... )
>>> x = torch.randn(32, 100, 512)
>>> output = model(inputs_embeds=x)
>>> assert output.shape == x.shape
Methods:
| Name | Description |
|---|---|
| `build_blocks` | Build transformer blocks with spectral attention layers. |
| `from_config` | Create model from configuration. |
Source code in spectrans/models/spectral_attention.py
Functions
build_blocks
Build transformer blocks with spectral attention layers.
Returns:
| Type | Description |
|---|---|
| `ModuleList` | List of spectral attention transformer blocks. |
from_config (classmethod)
from_config(config: SpectralAttentionModelConfig) -> SpectralAttentionTransformer
Create model from configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `SpectralAttentionModelConfig` | Model configuration object. | *required* |
Returns:
| Type | Description |
|---|---|
| `SpectralAttentionTransformer` | Configured model instance. |
SpectralAttentionEncoder
SpectralAttentionEncoder(vocab_size: int | None = None, hidden_dim: int = 512, num_layers: int = 6, max_sequence_length: int = 1024, num_heads: int = 8, num_features: int | None = None, kernel_type: KernelType = 'softmax', use_orthogonal: bool = False, ffn_hidden_dim: int | None = None, dropout: float = 0.0, use_positional_encoding: bool = True, positional_encoding_type: PositionalEncodingType = 'sinusoidal')
Bases: BaseModel
Encoder-only spectral attention model for representation learning.
This model uses spectral attention layers without a classification head, suitable for generating embeddings or as a component in larger architectures.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int \| None` | Size of the vocabulary for token embeddings. | `None` |
| `hidden_dim` | `int` | Hidden dimension size. | `512` |
| `num_layers` | `int` | Number of transformer blocks. | `6` |
| `max_sequence_length` | `int` | Maximum sequence length. | `1024` |
| `num_heads` | `int` | Number of attention heads. | `8` |
| `num_features` | `int \| None` | Number of random features. | `None` |
| `kernel_type` | `KernelType` | Kernel type. | `"softmax"` |
| `use_orthogonal` | `bool` | Use orthogonal features. | `False` |
| `ffn_hidden_dim` | `int \| None` | FFN hidden dimension. | `None` |
| `dropout` | `float` | Dropout probability. | `0.0` |
| `use_positional_encoding` | `bool` | Use positional encoding. | `True` |
| `positional_encoding_type` | `str` | Positional encoding type. | `"sinusoidal"` |
Methods:
| Name | Description |
|---|---|
| `build_blocks` | Build encoder blocks with spectral attention. |
Functions
build_blocks
Build encoder blocks with spectral attention.
Returns:
| Type | Description |
|---|---|
| `ModuleList` | List of spectral attention blocks. |
PerformerTransformer
PerformerTransformer(vocab_size: int | None = None, hidden_dim: int = 512, num_layers: int = 6, max_sequence_length: int = 1024, num_heads: int = 8, num_features: int | None = None, num_classes: int | None = None, ffn_hidden_dim: int | None = None, dropout: float = 0.0, use_positional_encoding: bool = True, positional_encoding_type: PositionalEncodingType = 'sinusoidal', gradient_checkpointing: bool = False)
Bases: BaseModel
Performer transformer with positive orthogonal random features.
This model implements the Performer architecture which uses positive orthogonal random features (PORF) to approximate the softmax kernel with improved variance reduction compared to standard RFF.
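The "positive" part of PORF refers to a feature map whose entries are all strictly positive, so the estimated attention weights cannot become negative. The sketch below illustrates the standard positive random feature estimator of the softmax kernel \(\exp(q^\top k)\) in NumPy. It is a hedged illustration rather than this class's implementation: `positive_features` is a hypothetical helper, the frequencies are plain i.i.d. Gaussian (no orthogonalization), and the sizes are arbitrary.

```python
import numpy as np

def positive_features(x, w):
    """phi(x) = exp(w x - ||x||^2 / 2) / sqrt(D); every entry is strictly positive."""
    D = w.shape[0]
    half_sq_norm = 0.5 * (x ** 2).sum(axis=-1, keepdims=True)
    return np.exp(x @ w.T - half_sq_norm) / np.sqrt(D)

rng = np.random.default_rng(0)
d, D = 8, 20000
q = 0.2 * rng.standard_normal((4, d))
k = 0.2 * rng.standard_normal((4, d))
w = rng.standard_normal((D, d))        # i.i.d. Gaussian rows, assumed for simplicity

# Inner products of the features estimate the unnormalised softmax kernel.
approx = positive_features(q, w) @ positive_features(k, w).T
exact = np.exp(q @ k.T)
```

With small-norm inputs and many features, `approx` should track `exact` closely; the estimate is unbiased because \(\mathbb{E}[\exp(\omega^\top (q + k))] = \exp(\lVert q + k \rVert^2 / 2)\) for \(\omega \sim \mathcal{N}(0, I)\).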
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int \| None` | Vocabulary size. | `None` |
| `hidden_dim` | `int` | Hidden dimension. | `512` |
| `num_layers` | `int` | Number of layers. | `6` |
| `max_sequence_length` | `int` | Maximum sequence length. | `1024` |
| `num_heads` | `int` | Number of heads. | `8` |
| `num_features` | `int \| None` | Number of random features. | `None` |
| `num_classes` | `int \| None` | Number of classes. | `None` |
| `ffn_hidden_dim` | `int \| None` | FFN dimension. | `None` |
| `dropout` | `float` | Dropout rate. | `0.0` |
| `use_positional_encoding` | `bool` | Use positional encoding. | `True` |
| `positional_encoding_type` | `str` | Positional encoding type. | `"sinusoidal"` |
| `gradient_checkpointing` | `bool` | Use gradient checkpointing. | `False` |
Examples:
>>> performer = PerformerTransformer(
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_features=256,
...     max_sequence_length=1024
... )
>>> x = torch.randn(32, 100, 512)
>>> output = performer(inputs_embeds=x)
Methods:
| Name | Description |
|---|---|
| `build_blocks` | Build Performer blocks with orthogonal features. |
Functions
build_blocks
Build Performer blocks with orthogonal features.
Returns:
| Type | Description |
|---|---|
| `ModuleList` | List of Performer blocks. |