Spectral Attention Models¶

spectrans.models.spectral_attention ¶

Spectral Attention transformer models using kernel approximations.

This module implements transformer models based on spectral attention mechanisms that use Random Fourier Features (RFF) to linearize attention computation. These models achieve \(O(n)\) complexity instead of the quadratic \(O(n^2)\) complexity of standard transformers, making them efficient for long sequences.

The spectral attention mechanism approximates the softmax kernel using random feature maps, maintaining the expressive power of attention while dramatically reducing computational cost.

Classes:

Name	Description
`SpectralAttentionTransformer`	Complete transformer model using spectral attention layers.
`SpectralAttentionEncoder`	Encoder-only model for representation learning.
`PerformerTransformer`	Performer-style model with positive orthogonal random features.

Examples:

Basic spectral attention transformer:

>>> import torch
>>> from spectrans.models.spectral_attention import SpectralAttentionTransformer
>>> model = SpectralAttentionTransformer(
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_features=256,
...     max_sequence_length=1024
... )
>>> x = torch.randn(32, 100, 512)  # (batch, seq_len, dim)
>>> output = model(inputs_embeds=x)
>>> assert output.shape == x.shape

Using with token inputs and classification head:

>>> model = SpectralAttentionTransformer(
...     vocab_size=10000,
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_classes=10,
...     max_sequence_length=512
... )
>>> input_ids = torch.randint(0, 10000, (32, 100))
>>> logits = model(input_ids)
>>> assert logits.shape == (32, 10)

Performer model with orthogonal features:

>>> from spectrans.models.spectral_attention import PerformerTransformer
>>> performer = PerformerTransformer(
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_features=256,
...     use_orthogonal=True
... )

Notes

Mathematical Foundation:

The spectral attention mechanism approximates standard attention as:

\[ \text{Attention}(Q, K, V) \approx \boldsymbol{\Phi}(Q) \left(\boldsymbol{\Phi}(K)^T V\right) / Z \]

Where \(\boldsymbol{\Phi}\) is a random feature map:

\[ \boldsymbol{\varphi}(x) = \sqrt{\frac{2}{D}} \begin{bmatrix} \cos(\boldsymbol{\omega}_1^T x + b_1) \\ \cos(\boldsymbol{\omega}_2^T x + b_2) \\ \vdots \\ \cos(\boldsymbol{\omega}_D^T x + b_D) \end{bmatrix} \]

With random frequencies \(\omega_i \sim \mathcal{N}(0, \sigma^2 I)\) and phases \(b_i \sim \text{Uniform}[0, 2\pi]\).

The approximation quality improves with more random features \(D\), with error decreasing as \(O(\frac{1}{\sqrt{D}})\). The linear complexity \(O(nDd)\) becomes favorable over standard attention \(O(n^2d)\) when \(D \ll n\).

For the Performer variant, orthogonal random features are used to reduce the variance of the approximation, leading to better convergence.

References

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. 2021. Rethinking attention with performers. In Proceedings of the International Conference on Learning Representations (ICLR).

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, and Lingpeng Kong. 2021. Random feature attention. In Proceedings of the International Conference on Learning Representations (ICLR).

Ali Rahimi and Benjamin Recht. 2007. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20 (NeurIPS 2007), pages 1177-1184.

Classes¶

SpectralAttentionTransformer ¶

SpectralAttentionTransformer(vocab_size: int | None = None, hidden_dim: int = 512, num_layers: int = 6, max_sequence_length: int = 1024, num_heads: int = 8, num_features: int | None = None, kernel_type: KernelType = 'softmax', use_orthogonal: bool = False, num_classes: int | None = None, ffn_hidden_dim: int | None = None, dropout: float = 0.0, use_positional_encoding: bool = True, positional_encoding_type: PositionalEncodingType = 'sinusoidal', gradient_checkpointing: bool = False)

Bases: BaseModel

Spectral Attention transformer using Random Fourier Features.

This model uses spectral attention layers with RFF approximation to achieve linear complexity attention computation. The model maintains the expressive power of standard transformers while being efficient for long sequences.

Parameters:

Name	Type	Description	Default
`vocab_size`	`int \| None`	Size of the vocabulary for token embeddings. If None, expects pre-embedded inputs.	`None`
`hidden_dim`	`int`	Hidden dimension size for the model.	`512`
`num_layers`	`int`	Number of transformer blocks.	`6`
`max_sequence_length`	`int`	Maximum sequence length the model can process.	`1024`
`num_heads`	`int`	Number of attention heads.	`8`
`num_features`	`int \| None`	Number of random features for RFF approximation. If None, uses hidden_dim.	`None`
`kernel_type`	`KernelType`	Type of kernel to approximate.	`"softmax"`
`use_orthogonal`	`bool`	Whether to use orthogonal random features.	`False`
`num_classes`	`int \| None`	Number of output classes for classification.	`None`
`ffn_hidden_dim`	`int \| None`	Hidden dimension of the feedforward network. Default is 4 * hidden_dim.	`None`
`dropout`	`float`	Dropout probability.	`0.0`
`use_positional_encoding`	`bool`	Whether to use positional encoding.	`True`
`positional_encoding_type`	`str`	Type of positional encoding ("sinusoidal" or "learned").	`"sinusoidal"`
`gradient_checkpointing`	`bool`	Whether to use gradient checkpointing to save memory.	`False`

Attributes:

Name	Type	Description
`blocks`	`ModuleList`	Stack of spectral attention transformer blocks.

Examples:

>>> model = SpectralAttentionTransformer(
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_features=256,
...     max_sequence_length=1024
... )
>>> x = torch.randn(32, 100, 512)
>>> output = model(inputs_embeds=x)
>>> assert output.shape == x.shape

Methods:

Name	Description
`build_blocks`	Build transformer blocks with spectral attention layers.
`from_config`	Create model from configuration.

Source code in spectrans/models/spectral_attention.py

def __init__(
    self,
    vocab_size: int | None = None,
    hidden_dim: int = 512,
    num_layers: int = 6,
    max_sequence_length: int = 1024,
    num_heads: int = 8,
    num_features: int | None = None,
    kernel_type: KernelType = "softmax",
    use_orthogonal: bool = False,
    num_classes: int | None = None,
    ffn_hidden_dim: int | None = None,
    dropout: float = 0.0,
    use_positional_encoding: bool = True,
    positional_encoding_type: PositionalEncodingType = "sinusoidal",
    gradient_checkpointing: bool = False,
):
    # Store all parameters before calling super().__init__ since build_blocks needs them
    self.num_heads = num_heads
    self.num_features = num_features or hidden_dim
    self.kernel_type = kernel_type
    self.use_orthogonal = use_orthogonal
    self.dropout_rate = dropout  # Store as different name to avoid conflict

    super().__init__(
        vocab_size=vocab_size,
        hidden_dim=hidden_dim,
        num_layers=num_layers,
        max_sequence_length=max_sequence_length,
        num_classes=num_classes,
        ffn_hidden_dim=ffn_hidden_dim,
        dropout=dropout,
        use_positional_encoding=use_positional_encoding,
        positional_encoding_type=positional_encoding_type,
        gradient_checkpointing=gradient_checkpointing,
    )

Functions¶

build_blocks ¶

build_blocks() -> ModuleList

Build transformer blocks with spectral attention layers.

Returns:

Type	Description
`ModuleList`	List of spectral attention transformer blocks.

Source code in spectrans/models/spectral_attention.py

def build_blocks(self) -> nn.ModuleList:
    """Build transformer blocks with spectral attention layers.

    Returns
    -------
    nn.ModuleList
        List of spectral attention transformer blocks.
    """
    blocks = []
    for _ in range(self.num_layers):
        attention_layer = SpectralAttention(
            hidden_dim=self.hidden_dim,
            num_heads=self.num_heads,
            num_features=self.num_features,
            kernel_type=self.kernel_type,
            use_orthogonal=self.use_orthogonal,
            dropout=self.dropout_rate,
        )

        block = PreNormBlock(
            mixing_layer=attention_layer,
            hidden_dim=self.hidden_dim,
            ffn_hidden_dim=self.ffn_hidden_dim,
            dropout=self.dropout_rate,
            norm_eps=1e-12,
        )
        blocks.append(block)

    return nn.ModuleList(blocks)

from_config `classmethod` ¶

from_config(config: SpectralAttentionModelConfig) -> SpectralAttentionTransformer

Create model from configuration.

Parameters:

Name	Type	Description	Default
`config`	`SpectralAttentionModelConfig`	Model configuration object.	required

Returns:

Type	Description
`SpectralAttentionTransformer`	Configured model instance.

Source code in spectrans/models/spectral_attention.py

@classmethod
def from_config(cls, config: "SpectralAttentionModelConfig") -> "SpectralAttentionTransformer":  # type: ignore[override]
    """Create model from configuration.

    Parameters
    ----------
    config : SpectralAttentionModelConfig
        Model configuration object.

    Returns
    -------
    SpectralAttentionTransformer
        Configured model instance.
    """
    # Extract spectral attention specific config
    return cls(
        vocab_size=config.vocab_size,
        hidden_dim=config.hidden_dim,
        num_layers=config.num_layers,
        max_sequence_length=config.sequence_length,
        num_heads=config.num_heads,
        num_features=config.num_features,
        kernel_type=config.kernel_type,
        use_orthogonal=config.use_orthogonal,
        num_classes=config.num_classes,
        ffn_hidden_dim=config.ffn_hidden_dim,
        dropout=config.dropout,
        use_positional_encoding=config.use_positional_encoding,
        positional_encoding_type=config.positional_encoding_type,
        gradient_checkpointing=config.gradient_checkpointing,
    )

SpectralAttentionEncoder ¶

SpectralAttentionEncoder(vocab_size: int | None = None, hidden_dim: int = 512, num_layers: int = 6, max_sequence_length: int = 1024, num_heads: int = 8, num_features: int | None = None, kernel_type: KernelType = 'softmax', use_orthogonal: bool = False, ffn_hidden_dim: int | None = None, dropout: float = 0.0, use_positional_encoding: bool = True, positional_encoding_type: PositionalEncodingType = 'sinusoidal')

Bases: BaseModel

Encoder-only spectral attention model for representation learning.

This model uses spectral attention layers without a classification head, suitable for generating embeddings or as a component in larger architectures.

Parameters:

Name	Type	Description	Default
`vocab_size`	`int \| None`	Size of the vocabulary for token embeddings.	`None`
`hidden_dim`	`int`	Hidden dimension size.	`512`
`num_layers`	`int`	Number of transformer blocks.	`6`
`max_sequence_length`	`int`	Maximum sequence length.	`1024`
`num_heads`	`int`	Number of attention heads.	`8`
`num_features`	`int \| None`	Number of random features.	`None`
`kernel_type`	`KernelType`	Kernel type.	`"softmax"`
`use_orthogonal`	`bool`	Use orthogonal features.	`False`
`ffn_hidden_dim`	`int \| None`	FFN hidden dimension.	`None`
`dropout`	`float`	Dropout probability.	`0.0`
`use_positional_encoding`	`bool`	Use positional encoding.	`True`
`positional_encoding_type`	`str`	Positional encoding type.	`"sinusoidal"`

Methods:

Name	Description
`build_blocks`	Build encoder blocks with spectral attention.

Source code in spectrans/models/spectral_attention.py

def __init__(
    self,
    vocab_size: int | None = None,
    hidden_dim: int = 512,
    num_layers: int = 6,
    max_sequence_length: int = 1024,
    num_heads: int = 8,
    num_features: int | None = None,
    kernel_type: KernelType = "softmax",
    use_orthogonal: bool = False,
    ffn_hidden_dim: int | None = None,
    dropout: float = 0.0,
    use_positional_encoding: bool = True,
    positional_encoding_type: PositionalEncodingType = "sinusoidal",
):
    # Store parameters before calling super().__init__ since build_blocks needs them
    self.num_heads = num_heads
    self.num_features = num_features or hidden_dim
    self.kernel_type = kernel_type
    self.use_orthogonal = use_orthogonal
    self.dropout_rate = dropout

    # Initialize without classification head
    super().__init__(
        vocab_size=vocab_size,
        hidden_dim=hidden_dim,
        num_layers=num_layers,
        max_sequence_length=max_sequence_length,
        num_classes=None,  # No classification head
        ffn_hidden_dim=ffn_hidden_dim,
        dropout=dropout,
        use_positional_encoding=use_positional_encoding,
        positional_encoding_type=positional_encoding_type,
        gradient_checkpointing=False,
    )

    # Set output type to none for encoder
    self.output_type = "none"

Functions¶

build_blocks ¶

build_blocks() -> ModuleList

Build encoder blocks with spectral attention.

Returns:

Type	Description
`ModuleList`	List of spectral attention blocks.

Source code in spectrans/models/spectral_attention.py

def build_blocks(self) -> nn.ModuleList:
    """Build encoder blocks with spectral attention.

    Returns
    -------
    nn.ModuleList
        List of spectral attention blocks.
    """
    blocks = []
    for _ in range(self.num_layers):
        attention_layer = SpectralAttention(
            hidden_dim=self.hidden_dim,
            num_heads=self.num_heads,
            num_features=self.num_features,
            kernel_type=self.kernel_type,
            use_orthogonal=self.use_orthogonal,
            dropout=self.dropout_rate,
        )

        block = PreNormBlock(
            mixing_layer=attention_layer,
            hidden_dim=self.hidden_dim,
            ffn_hidden_dim=self.ffn_hidden_dim,
            dropout=self.dropout_rate,
            norm_eps=1e-12,
        )
        blocks.append(block)

    return nn.ModuleList(blocks)

PerformerTransformer ¶

PerformerTransformer(vocab_size: int | None = None, hidden_dim: int = 512, num_layers: int = 6, max_sequence_length: int = 1024, num_heads: int = 8, num_features: int | None = None, num_classes: int | None = None, ffn_hidden_dim: int | None = None, dropout: float = 0.0, use_positional_encoding: bool = True, positional_encoding_type: PositionalEncodingType = 'sinusoidal', gradient_checkpointing: bool = False)

Bases: BaseModel

Performer transformer with positive orthogonal random features.

This model implements the Performer architecture which uses positive orthogonal random features (PORF) to approximate the softmax kernel with improved variance reduction compared to standard RFF.

Parameters:

Name	Type	Description	Default
`vocab_size`	`int \| None`	Vocabulary size.	`None`
`hidden_dim`	`int`	Hidden dimension.	`512`
`num_layers`	`int`	Number of layers.	`6`
`max_sequence_length`	`int`	Maximum sequence length.	`1024`
`num_heads`	`int`	Number of heads.	`8`
`num_features`	`int \| None`	Number of random features.	`None`
`num_classes`	`int \| None`	Number of classes.	`None`
`ffn_hidden_dim`	`int \| None`	FFN dimension.	`None`
`dropout`	`float`	Dropout rate.	`0.0`
`use_positional_encoding`	`bool`	Use positional encoding.	`True`
`positional_encoding_type`	`str`	Positional encoding type.	`"sinusoidal"`
`gradient_checkpointing`	`bool`	Use gradient checkpointing.	`False`

Examples:

>>> performer = PerformerTransformer(
...     hidden_dim=512,
...     num_layers=6,
...     num_heads=8,
...     num_features=256,
...     max_sequence_length=1024
... )
>>> x = torch.randn(32, 100, 512)
>>> output = performer(inputs_embeds=x)

Methods:

Name	Description
`build_blocks`	Build Performer blocks with orthogonal features.

Source code in spectrans/models/spectral_attention.py

def __init__(
    self,
    vocab_size: int | None = None,
    hidden_dim: int = 512,
    num_layers: int = 6,
    max_sequence_length: int = 1024,
    num_heads: int = 8,
    num_features: int | None = None,
    num_classes: int | None = None,
    ffn_hidden_dim: int | None = None,
    dropout: float = 0.0,
    use_positional_encoding: bool = True,
    positional_encoding_type: PositionalEncodingType = "sinusoidal",
    gradient_checkpointing: bool = False,
):
    # Store parameters before calling super().__init__ since build_blocks needs them
    self.num_heads = num_heads
    self.num_features = num_features or hidden_dim
    self.dropout_rate = dropout

    super().__init__(
        vocab_size=vocab_size,
        hidden_dim=hidden_dim,
        num_layers=num_layers,
        max_sequence_length=max_sequence_length,
        num_classes=num_classes,
        ffn_hidden_dim=ffn_hidden_dim,
        dropout=dropout,
        use_positional_encoding=use_positional_encoding,
        positional_encoding_type=positional_encoding_type,
        gradient_checkpointing=gradient_checkpointing,
    )

Functions¶

build_blocks ¶

build_blocks() -> ModuleList

Build Performer blocks with orthogonal features.

Returns:

Type	Description
`ModuleList`	List of Performer blocks.

Source code in spectrans/models/spectral_attention.py

def build_blocks(self) -> nn.ModuleList:
    """Build Performer blocks with orthogonal features.

    Returns
    -------
    nn.ModuleList
        List of Performer blocks.
    """
    blocks = []
    for _ in range(self.num_layers):
        attention_layer = PerformerAttention(
            hidden_dim=self.hidden_dim,
            num_heads=self.num_heads,
            num_features=self.num_features,
            dropout=self.dropout_rate,
        )

        block = PreNormBlock(
            mixing_layer=attention_layer,
            hidden_dim=self.hidden_dim,
            ffn_hidden_dim=self.ffn_hidden_dim,
            dropout=self.dropout_rate,
            norm_eps=1e-12,
        )
        blocks.append(block)

    return nn.ModuleList(blocks)

Spectral Attention Models¶

spectrans.models.spectral_attention ¶

Classes¶

SpectralAttentionTransformer ¶

Functions¶

build_blocks ¶

from_config classmethod ¶

SpectralAttentionEncoder ¶

Functions¶

build_blocks ¶

PerformerTransformer ¶

Functions¶

build_blocks ¶

Functions¶

from_config `classmethod` ¶