Spectral Attention Models

spectrans.models.spectral_attention

Spectral Attention transformer models using kernel approximations.

This module implements transformer models based on spectral attention, which uses Random Fourier Features (RFF) to linearize the attention computation. The cost of attention scales as \(O(n)\) in the sequence length \(n\), rather than the \(O(n^2)\) of standard transformers, which makes these models efficient for long sequences.

The spectral attention mechanism approximates the softmax kernel with random feature maps. It retains much of the expressive power of attention at a much lower computational cost, subject to the approximation error described in the Notes.

Classes:

Name Description
SpectralAttentionTransformer

Complete transformer model using spectral attention layers.

SpectralAttentionEncoder

Encoder-only model for representation learning.

PerformerTransformer

Performer-style model with positive orthogonal random features.

Examples:

Basic spectral attention transformer:

>>> import torch
>>> from spectrans.models.spectral_attention import SpectralAttentionTransformer
>>> model = SpectralAttentionTransformer(
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_features=256,
...     max_sequence_length=1024
... )
>>> x = torch.randn(32, 100, 512)  # (batch, seq_len, dim)
>>> output = model(inputs_embeds=x)
>>> assert output.shape == x.shape

With token inputs and a classification head:

>>> model = SpectralAttentionTransformer(
...     vocab_size=10000,
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_classes=10,
...     max_sequence_length=512
... )
>>> input_ids = torch.randint(0, 10000, (32, 100))
>>> logits = model(input_ids)
>>> assert logits.shape == (32, 10)

Performer model with orthogonal features:

>>> from spectrans.models.spectral_attention import PerformerTransformer
>>> performer = PerformerTransformer(
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_features=256,  # no use_orthogonal argument: PerformerTransformer does not accept one
... )
Notes

Mathematical Foundation:

The spectral attention mechanism approximates standard attention as:

\[ \text{Attention}(Q, K, V) \approx \boldsymbol{\Phi}(Q) \left(\boldsymbol{\Phi}(K)^T V\right) / Z \]

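Here \(\boldsymbol{\Phi}\) applies the random feature map \(\boldsymbol{\varphi}\) to each row of its argument, and \(Z\) is the row-wise normalizer. The feature map is: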

\[ \boldsymbol{\varphi}(x) = \sqrt{\frac{2}{D}} \begin{bmatrix} \cos(\boldsymbol{\omega}_1^T x + b_1) \\ \cos(\boldsymbol{\omega}_2^T x + b_2) \\ \vdots \\ \cos(\boldsymbol{\omega}_D^T x + b_D) \end{bmatrix} \]

The frequencies are drawn as \(\boldsymbol{\omega}_i \sim \mathcal{N}(0, \sigma^2 I)\) and the phases as \(b_i \sim \text{Uniform}[0, 2\pi]\).
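
The feature map can be checked numerically. The following is a minimal NumPy sketch, not the library's implementation: the helper name `rff_features` and the chosen sizes are assumptions made for illustration. It relies on the fact that, for Gaussian frequencies with standard deviation \(\sigma\), the inner product of two feature vectors estimates the Gaussian kernel \(\exp(-\sigma^2 \lVert x - y \rVert^2 / 2)\).

```python
import numpy as np

def rff_features(x, omega, b):
    """Map rows of x (n, d) to D features: sqrt(2/D) * cos(x @ omega + b)."""
    D = omega.shape[1]
    return np.sqrt(2.0 / D) * np.cos(x @ omega + b)

rng = np.random.default_rng(0)
d, D, sigma = 8, 4096, 1.0
omega = rng.normal(0.0, sigma, size=(d, D))   # columns are omega_i ~ N(0, sigma^2 I)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)     # phases b_i ~ Uniform[0, 2*pi]

x = 0.5 * rng.normal(size=(1, d))
y = 0.5 * rng.normal(size=(1, d))

# Inner product of the two feature vectors estimates the Gaussian kernel.
approx = (rff_features(x, omega, b) @ rff_features(y, omega, b).T).item()
exact = float(np.exp(-0.5 * sigma**2 * np.sum((x - y) ** 2)))
print(f"approx={approx:.4f} exact={exact:.4f}")
```

The two printed values should agree more closely as `D` grows.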

The approximation error decreases as \(O(\frac{1}{\sqrt{D}})\) in the number of random features \(D\). With sequence length \(n\) and feature dimension \(d\), spectral attention costs \(O(nDd)\), which is cheaper than the \(O(n^2d)\) of standard attention when \(D \ll n\).
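
The complexity gain comes purely from reordering the matrix products. The sketch below illustrates this under stated assumptions: it swaps in a simple positive feature map (`elu(x) + 1`) instead of random features so that the normalizer stays positive, and the function names are hypothetical. Both functions return the same result, but only the first ever builds an \(n \times n\) matrix.

```python
import numpy as np

def feature_map(x):
    # Stand-in positive feature map elu(x) + 1; not the library's RFF map.
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_attention(q, k, v):
    scores = feature_map(q) @ feature_map(k).T        # (n, n) matrix
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def linear_attention(q, k, v):
    phi_q, phi_k = feature_map(q), feature_map(k)     # (n, D) each
    kv = phi_k.T @ v                                  # (D, d), summed over positions once
    z = phi_q @ phi_k.sum(axis=0)                     # (n,) row-wise normalizer
    return (phi_q @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 64, 16
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))

out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
print(np.allclose(out_quadratic, out_linear))
```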

For the Performer variant, orthogonal random features are used, which reduces the variance of the kernel estimate for a given number of features.
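
One common way to build orthogonal random features is to orthogonalize a Gaussian matrix with a QR decomposition and then rescale each row to the length of an independent Gaussian vector. The sketch below shows that construction; the function name `orthogonal_gaussian` is hypothetical, and whether the library builds its frequencies this way is an assumption.

```python
import numpy as np

def orthogonal_gaussian(d, D, rng):
    """Return a (D, d) frequency matrix with mutually orthogonal rows per d-row block."""
    blocks = []
    for _ in range(-(-D // d)):                               # ceil(D / d) blocks
        q, r = np.linalg.qr(rng.standard_normal((d, d)))      # q is an orthogonal matrix
        # Rescale rows so their lengths match those of Gaussian vectors.
        norms = np.linalg.norm(rng.standard_normal((d, d)), axis=1)
        blocks.append(norms[:, None] * q)
    return np.concatenate(blocks, axis=0)[:D]

rng = np.random.default_rng(0)
W = orthogonal_gaussian(d=16, D=40, rng=rng)

# Within the first block, off-diagonal inner products vanish.
gram = W[:16] @ W[:16].T
off_diagonal = gram - np.diag(np.diag(gram))
print(W.shape, np.abs(off_diagonal).max())
```

Rows are orthogonal only within each block of `d` rows; different blocks are drawn independently.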

References

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. 2021. Rethinking attention with performers. In Proceedings of the International Conference on Learning Representations (ICLR).

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, and Lingpeng Kong. 2021. Random feature attention. In Proceedings of the International Conference on Learning Representations (ICLR).

Ali Rahimi and Benjamin Recht. 2007. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20 (NeurIPS 2007), pages 1177-1184.

See Also

spectrans.layers.attention.spectral : Spectral attention layer implementations.

spectrans.kernels.rff : Random Fourier Features kernel approximations.

spectrans.models.lst : Linear Spectral Transform models for comparison.

Classes

SpectralAttentionTransformer

SpectralAttentionTransformer(vocab_size: int | None = None, hidden_dim: int = 512, num_layers: int = 6, max_sequence_length: int = 1024, num_heads: int = 8, num_features: int | None = None, kernel_type: KernelType = 'softmax', use_orthogonal: bool = False, num_classes: int | None = None, ffn_hidden_dim: int | None = None, dropout: float = 0.0, use_positional_encoding: bool = True, positional_encoding_type: PositionalEncodingType = 'sinusoidal', gradient_checkpointing: bool = False)

Bases: BaseModel

Spectral Attention transformer using Random Fourier Features.

This model uses spectral attention layers with an RFF approximation, so attention is computed in time linear in the sequence length. It approximates standard softmax attention rather than reproducing it exactly, trading some accuracy for efficiency on long sequences.

Parameters:

Name Type Description Default
vocab_size int | None

Size of the vocabulary for token embeddings. If None, expects pre-embedded inputs.

None
hidden_dim int

Hidden dimension size for the model.

512
num_layers int

Number of transformer blocks.

6
max_sequence_length int

Maximum sequence length the model can process.

1024
num_heads int

Number of attention heads.

8
num_features int | None

Number of random features for RFF approximation. If None, uses hidden_dim.

None
kernel_type KernelType

Type of kernel to approximate.

"softmax"
use_orthogonal bool

Whether to use orthogonal random features.

False
num_classes int | None

Number of output classes for classification.

None
ffn_hidden_dim int | None

Hidden dimension of the feedforward network. Default is 4 * hidden_dim.

None
dropout float

Dropout probability.

0.0
use_positional_encoding bool

Whether to use positional encoding.

True
positional_encoding_type PositionalEncodingType

Type of positional encoding ("sinusoidal" or "learned").

"sinusoidal"
gradient_checkpointing bool

Whether to use gradient checkpointing to save memory.

False

Attributes:

Name Type Description
blocks ModuleList

Stack of spectral attention transformer blocks.

Examples:

>>> model = SpectralAttentionTransformer(
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_features=256,
...     max_sequence_length=1024
... )
>>> x = torch.randn(32, 100, 512)
>>> output = model(inputs_embeds=x)
>>> assert output.shape == x.shape

Methods:

Name Description
build_blocks

Build transformer blocks with spectral attention layers.

from_config

Create model from configuration.

Source code in spectrans/models/spectral_attention.py
def __init__(
    self,
    vocab_size: int | None = None,
    hidden_dim: int = 512,
    num_layers: int = 6,
    max_sequence_length: int = 1024,
    num_heads: int = 8,
    num_features: int | None = None,
    kernel_type: KernelType = "softmax",
    use_orthogonal: bool = False,
    num_classes: int | None = None,
    ffn_hidden_dim: int | None = None,
    dropout: float = 0.0,
    use_positional_encoding: bool = True,
    positional_encoding_type: PositionalEncodingType = "sinusoidal",
    gradient_checkpointing: bool = False,
):
    # Store all parameters before calling super().__init__ since build_blocks needs them
    self.num_heads = num_heads
    self.num_features = num_features or hidden_dim
    self.kernel_type = kernel_type
    self.use_orthogonal = use_orthogonal
    self.dropout_rate = dropout  # Store as different name to avoid conflict

    super().__init__(
        vocab_size=vocab_size,
        hidden_dim=hidden_dim,
        num_layers=num_layers,
        max_sequence_length=max_sequence_length,
        num_classes=num_classes,
        ffn_hidden_dim=ffn_hidden_dim,
        dropout=dropout,
        use_positional_encoding=use_positional_encoding,
        positional_encoding_type=positional_encoding_type,
        gradient_checkpointing=gradient_checkpointing,
    )
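
The comment in the constructor implies that the base class calls `build_blocks` from inside its own `__init__`. The plain-Python sketch below illustrates why the attribute assignments must come first in that case. The classes `Base`, `Good`, and `Bad` are hypothetical stand-ins, not the library's `BaseModel`.

```python
class Base:
    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.blocks = self.build_blocks()   # called during base construction

    def build_blocks(self):
        raise NotImplementedError

class Good(Base):
    def __init__(self, num_layers, num_heads):
        self.num_heads = num_heads          # set first so build_blocks can read it
        super().__init__(num_layers)

    def build_blocks(self):
        return [f"block(heads={self.num_heads})" for _ in range(self.num_layers)]

class Bad(Good):
    def __init__(self, num_layers, num_heads):
        Base.__init__(self, num_layers)     # build_blocks runs before num_heads exists
        self.num_heads = num_heads

good = Good(num_layers=2, num_heads=8)

try:
    Bad(num_layers=2, num_heads=8)
    failed = False
except AttributeError:
    failed = True

print(good.blocks, failed)
```
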
Functions
build_blocks
build_blocks() -> ModuleList

Build transformer blocks with spectral attention layers.

Returns:

Type Description
ModuleList

List of spectral attention transformer blocks.

Source code in spectrans/models/spectral_attention.py
def build_blocks(self) -> nn.ModuleList:
    """Build transformer blocks with spectral attention layers.

    Returns
    -------
    nn.ModuleList
        List of spectral attention transformer blocks.
    """
    blocks = []
    for _ in range(self.num_layers):
        attention_layer = SpectralAttention(
            hidden_dim=self.hidden_dim,
            num_heads=self.num_heads,
            num_features=self.num_features,
            kernel_type=self.kernel_type,
            use_orthogonal=self.use_orthogonal,
            dropout=self.dropout_rate,
        )

        block = PreNormBlock(
            mixing_layer=attention_layer,
            hidden_dim=self.hidden_dim,
            ffn_hidden_dim=self.ffn_hidden_dim,
            dropout=self.dropout_rate,
            norm_eps=1e-12,
        )
        blocks.append(block)

    return nn.ModuleList(blocks)
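
For readers unfamiliar with pre-norm blocks, the sketch below shows the usual arrangement: normalize, apply the sublayer, then add the residual. It is a NumPy illustration under the assumption that `PreNormBlock` follows this standard pattern. Its actual internals are not shown on this page, and the function names here are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    """Normalize the last axis to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, mixing, ffn, eps=1e-12):
    x = x + mixing(layer_norm(x, eps))   # token-mixing sublayer with residual
    x = x + ffn(layer_norm(x, eps))      # feedforward sublayer with residual
    return x

x = np.random.default_rng(0).normal(size=(2, 5, 8))   # (batch, seq_len, dim)

def zero_sublayer(h):
    return np.zeros_like(h)

# With sublayers that output zeros, the residual path makes the block an identity.
out = pre_norm_block(x, mixing=zero_sublayer, ffn=zero_sublayer)
print(np.array_equal(out, x))
```
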
from_config classmethod

Create model from configuration.

Parameters:

Name Type Description Default
config SpectralAttentionModelConfig

Model configuration object.

required

Returns:

Type Description
SpectralAttentionTransformer

Configured model instance.

Source code in spectrans/models/spectral_attention.py
@classmethod
def from_config(cls, config: "SpectralAttentionModelConfig") -> "SpectralAttentionTransformer":  # type: ignore[override]
    """Create model from configuration.

    Parameters
    ----------
    config : SpectralAttentionModelConfig
        Model configuration object.

    Returns
    -------
    SpectralAttentionTransformer
        Configured model instance.
    """
    # Extract spectral attention specific config
    return cls(
        vocab_size=config.vocab_size,
        hidden_dim=config.hidden_dim,
        num_layers=config.num_layers,
        max_sequence_length=config.sequence_length,
        num_heads=config.num_heads,
        num_features=config.num_features,
        kernel_type=config.kernel_type,
        use_orthogonal=config.use_orthogonal,
        num_classes=config.num_classes,
        ffn_hidden_dim=config.ffn_hidden_dim,
        dropout=config.dropout,
        use_positional_encoding=config.use_positional_encoding,
        positional_encoding_type=config.positional_encoding_type,
        gradient_checkpointing=config.gradient_checkpointing,
    )
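
Note that `from_config` reads `config.sequence_length` and passes it as `max_sequence_length`. The sketch below makes that mapping explicit with a hypothetical stand-in dataclass, `DemoConfig`. The field defaults mirror the constructor defaults and are an assumption, since the real `SpectralAttentionModelConfig` is not shown on this page.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass

@dataclass
class DemoConfig:
    """Hypothetical stand-in listing the attributes that from_config reads."""
    vocab_size: int | None = None
    hidden_dim: int = 512
    num_layers: int = 6
    sequence_length: int = 1024          # passed on as max_sequence_length
    num_heads: int = 8
    num_features: int | None = None
    kernel_type: str = "softmax"
    use_orthogonal: bool = False
    num_classes: int | None = None
    ffn_hidden_dim: int | None = None
    dropout: float = 0.0
    use_positional_encoding: bool = True
    positional_encoding_type: str = "sinusoidal"
    gradient_checkpointing: bool = False

def to_constructor_kwargs(config):
    """Rename sequence_length to the constructor's max_sequence_length."""
    kwargs = asdict(config)
    kwargs["max_sequence_length"] = kwargs.pop("sequence_length")
    return kwargs

kwargs = to_constructor_kwargs(DemoConfig(sequence_length=2048, num_classes=10))
print(kwargs["max_sequence_length"], kwargs["num_classes"])
```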

SpectralAttentionEncoder

SpectralAttentionEncoder(vocab_size: int | None = None, hidden_dim: int = 512, num_layers: int = 6, max_sequence_length: int = 1024, num_heads: int = 8, num_features: int | None = None, kernel_type: KernelType = 'softmax', use_orthogonal: bool = False, ffn_hidden_dim: int | None = None, dropout: float = 0.0, use_positional_encoding: bool = True, positional_encoding_type: PositionalEncodingType = 'sinusoidal')

Bases: BaseModel

Encoder-only spectral attention model for representation learning.

This model uses spectral attention layers without a classification head, suitable for generating embeddings or as a component in larger architectures.
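
A common way to turn per-token encoder outputs into one embedding per sequence is masked mean pooling. The NumPy sketch below assumes the encoder returns hidden states of shape `(batch, seq_len, hidden_dim)`. The function `masked_mean_pool` is a hypothetical helper, not part of the library.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """hidden: (batch, seq_len, dim); mask: (batch, seq_len), 1 for real tokens."""
    mask = mask[..., None].astype(hidden.dtype)       # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)              # ignore padded positions
    counts = np.clip(mask.sum(axis=1), 1.0, None)     # avoid division by zero
    return summed / counts

hidden = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
mask = np.array([[1, 1, 0], [1, 1, 1]])               # first sequence has one padded token
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)
```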

Parameters:

Name Type Description Default
vocab_size int | None

Size of the vocabulary for token embeddings.

None
hidden_dim int

Hidden dimension size.

512
num_layers int

Number of transformer blocks.

6
max_sequence_length int

Maximum sequence length.

1024
num_heads int

Number of attention heads.

8
num_features int | None

Number of random features.

None
kernel_type KernelType

Kernel type.

"softmax"
use_orthogonal bool

Use orthogonal features.

False
ffn_hidden_dim int | None

FFN hidden dimension.

None
dropout float

Dropout probability.

0.0
use_positional_encoding bool

Use positional encoding.

True
positional_encoding_type PositionalEncodingType

Positional encoding type.

"sinusoidal"

Methods:

Name Description
build_blocks

Build encoder blocks with spectral attention.

Source code in spectrans/models/spectral_attention.py
def __init__(
    self,
    vocab_size: int | None = None,
    hidden_dim: int = 512,
    num_layers: int = 6,
    max_sequence_length: int = 1024,
    num_heads: int = 8,
    num_features: int | None = None,
    kernel_type: KernelType = "softmax",
    use_orthogonal: bool = False,
    ffn_hidden_dim: int | None = None,
    dropout: float = 0.0,
    use_positional_encoding: bool = True,
    positional_encoding_type: PositionalEncodingType = "sinusoidal",
):
    # Store parameters before calling super().__init__ since build_blocks needs them
    self.num_heads = num_heads
    self.num_features = num_features or hidden_dim
    self.kernel_type = kernel_type
    self.use_orthogonal = use_orthogonal
    self.dropout_rate = dropout

    # Initialize without classification head
    super().__init__(
        vocab_size=vocab_size,
        hidden_dim=hidden_dim,
        num_layers=num_layers,
        max_sequence_length=max_sequence_length,
        num_classes=None,  # No classification head
        ffn_hidden_dim=ffn_hidden_dim,
        dropout=dropout,
        use_positional_encoding=use_positional_encoding,
        positional_encoding_type=positional_encoding_type,
        gradient_checkpointing=False,
    )

    # Set output type to none for encoder
    self.output_type = "none"
Functions
build_blocks
build_blocks() -> ModuleList

Build encoder blocks with spectral attention.

Returns:

Type Description
ModuleList

List of spectral attention blocks.

Source code in spectrans/models/spectral_attention.py
def build_blocks(self) -> nn.ModuleList:
    """Build encoder blocks with spectral attention.

    Returns
    -------
    nn.ModuleList
        List of spectral attention blocks.
    """
    blocks = []
    for _ in range(self.num_layers):
        attention_layer = SpectralAttention(
            hidden_dim=self.hidden_dim,
            num_heads=self.num_heads,
            num_features=self.num_features,
            kernel_type=self.kernel_type,
            use_orthogonal=self.use_orthogonal,
            dropout=self.dropout_rate,
        )

        block = PreNormBlock(
            mixing_layer=attention_layer,
            hidden_dim=self.hidden_dim,
            ffn_hidden_dim=self.ffn_hidden_dim,
            dropout=self.dropout_rate,
            norm_eps=1e-12,
        )
        blocks.append(block)

    return nn.ModuleList(blocks)

PerformerTransformer

PerformerTransformer(vocab_size: int | None = None, hidden_dim: int = 512, num_layers: int = 6, max_sequence_length: int = 1024, num_heads: int = 8, num_features: int | None = None, num_classes: int | None = None, ffn_hidden_dim: int | None = None, dropout: float = 0.0, use_positional_encoding: bool = True, positional_encoding_type: PositionalEncodingType = 'sinusoidal', gradient_checkpointing: bool = False)

Bases: BaseModel

Performer transformer with positive orthogonal random features.

This model implements the Performer architecture, which uses positive orthogonal random features (PORF) to approximate the softmax kernel with lower variance than standard RFF.
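
The positive feature map from the Performer paper is \(\boldsymbol{\varphi}(x) = \frac{1}{\sqrt{D}} \exp\left(\boldsymbol{\omega}_i^T x - \lVert x \rVert^2 / 2\right)\) for \(i = 1, \dots, D\), whose inner products estimate the softmax kernel \(\exp(q^T k)\) without bias. The NumPy sketch below checks this numerically. It uses independent Gaussian frequencies for brevity, and whether `PerformerAttention` implements exactly this map is an assumption.

```python
import numpy as np

def positive_features(x, omega):
    """x: (n, d), omega: (d, D). Returns strictly positive features of shape (n, D)."""
    D = omega.shape[1]
    half_sq_norm = 0.5 * np.sum(x**2, axis=-1, keepdims=True)
    return np.exp(x @ omega - half_sq_norm) / np.sqrt(D)

rng = np.random.default_rng(0)
d, D = 8, 20000
omega = rng.standard_normal((d, D))          # omega_i ~ N(0, I)

# Small-norm query and key keep the estimator's variance low.
q = rng.standard_normal((1, d))
q *= 0.5 / np.linalg.norm(q)
k = rng.standard_normal((1, d))
k *= 0.5 / np.linalg.norm(k)

phi_q = positive_features(q, omega)
phi_k = positive_features(k, omega)
approx = (phi_q @ phi_k.T).item()
exact = np.exp(q @ k.T).item()               # softmax kernel exp(q . k)
print(f"approx={approx:.4f} exact={exact:.4f}")
```

Because every feature is positive, the estimated attention weights and the normalizer stay positive, unlike with the trigonometric map.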

Parameters:

Name Type Description Default
vocab_size int | None

Vocabulary size.

None
hidden_dim int

Hidden dimension.

512
num_layers int

Number of layers.

6
max_sequence_length int

Maximum sequence length.

1024
num_heads int

Number of heads.

8
num_features int | None

Number of random features.

None
num_classes int | None

Number of classes.

None
ffn_hidden_dim int | None

FFN dimension.

None
dropout float

Dropout rate.

0.0
use_positional_encoding bool

Use positional encoding.

True
positional_encoding_type PositionalEncodingType

Positional encoding type.

"sinusoidal"
gradient_checkpointing bool

Use gradient checkpointing.

False

Examples:

>>> performer = PerformerTransformer(
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_features=256,
...     max_sequence_length=1024
... )
>>> x = torch.randn(32, 100, 512)
>>> output = performer(inputs_embeds=x)

Methods:

Name Description
build_blocks

Build Performer blocks with orthogonal features.

Source code in spectrans/models/spectral_attention.py
def __init__(
    self,
    vocab_size: int | None = None,
    hidden_dim: int = 512,
    num_layers: int = 6,
    max_sequence_length: int = 1024,
    num_heads: int = 8,
    num_features: int | None = None,
    num_classes: int | None = None,
    ffn_hidden_dim: int | None = None,
    dropout: float = 0.0,
    use_positional_encoding: bool = True,
    positional_encoding_type: PositionalEncodingType = "sinusoidal",
    gradient_checkpointing: bool = False,
):
    # Store parameters before calling super().__init__ since build_blocks needs them
    self.num_heads = num_heads
    self.num_features = num_features or hidden_dim
    self.dropout_rate = dropout

    super().__init__(
        vocab_size=vocab_size,
        hidden_dim=hidden_dim,
        num_layers=num_layers,
        max_sequence_length=max_sequence_length,
        num_classes=num_classes,
        ffn_hidden_dim=ffn_hidden_dim,
        dropout=dropout,
        use_positional_encoding=use_positional_encoding,
        positional_encoding_type=positional_encoding_type,
        gradient_checkpointing=gradient_checkpointing,
    )
Functions
build_blocks
build_blocks() -> ModuleList

Build Performer blocks with orthogonal features.

Returns:

Type Description
ModuleList

List of Performer blocks.

Source code in spectrans/models/spectral_attention.py
def build_blocks(self) -> nn.ModuleList:
    """Build Performer blocks with orthogonal features.

    Returns
    -------
    nn.ModuleList
        List of Performer blocks.
    """
    blocks = []
    for _ in range(self.num_layers):
        attention_layer = PerformerAttention(
            hidden_dim=self.hidden_dim,
            num_heads=self.num_heads,
            num_features=self.num_features,
            dropout=self.dropout_rate,
        )

        block = PreNormBlock(
            mixing_layer=attention_layer,
            hidden_dim=self.hidden_dim,
            ffn_hidden_dim=self.ffn_hidden_dim,
            dropout=self.dropout_rate,
            norm_eps=1e-12,
        )
        blocks.append(block)

    return nn.ModuleList(blocks)

Functions