Transformer Blocks

spectrans.blocks

Transformer block implementations for spectral architectures.

This module provides transformer blocks that combine spectral mixing or attention layers with feedforward networks, residual connections, and normalization. The blocks implement different architectural patterns including pre-norm, post-norm, parallel, and hybrid configurations for various spectral transformer models.

Modules:

Name      Description
base      Base classes for transformer blocks.
hybrid    Hybrid blocks combining multiple mixing strategies.
spectral  Spectral transformer blocks using frequency-domain methods.

Classes:

Name                    Description
AFNOBlock               Adaptive Fourier Neural Operator block with mode truncation.
AdaptiveBlock           Block with adaptive routing between components.
AlternatingBlock        Alternates between different mixing strategies.
CascadeBlock            Cascades multiple blocks with different configurations.
FeedForwardNetwork      Standard MLP feedforward network.
FNetBlock               FNet-style block with Fourier mixing.
FNO2DBlock              2D Fourier Neural Operator block for spatial data.
FNOBlock                1D Fourier Neural Operator block.
GFNetBlock              Global Filter Network block with learnable filters.
HybridBlock             Combines multiple mixing strategies in parallel.
LSTBlock                Linear Spectral Transform block.
MultiscaleBlock         Multi-resolution processing with wavelets.
ParallelBlock           Parallel execution of mixing and feedforward.
PostNormBlock           Post-normalization transformer block.
PreNormBlock            Pre-normalization transformer block.
SpectralAttentionBlock  Block using spectral attention mechanisms.
TransformerBlock        Base class for all transformer blocks.
WaveletBlock            Block using wavelet transforms for mixing.

Examples:

Using an FNet block:

>>> import torch
>>> from spectrans.blocks import FNetBlock
>>>
>>> block = FNetBlock(hidden_dim=768, ffn_hidden_dim=3072)
>>> x = torch.randn(32, 512, 768)
>>> output = block(x)
>>> assert output.shape == x.shape

Using a hybrid block with multiple mixing strategies:

>>> from spectrans.blocks import AlternatingBlock
>>> from spectrans.layers.mixing.fourier import FourierMixing
>>> from spectrans.layers.mixing.wavelet import WaveletMixing
>>>
>>> layer1 = FourierMixing(hidden_dim=512)
>>> layer2 = WaveletMixing(hidden_dim=512, wavelet='db4')
>>> block = AlternatingBlock(layer1=layer1, layer2=layer2, hidden_dim=512)
>>> x_512 = torch.randn(32, 512, 512)  # last dimension must equal hidden_dim=512
>>> output = block(x_512)

Using parallel execution:

>>> from spectrans.blocks import ParallelBlock
>>> from spectrans.layers.mixing.fourier import FourierMixing
>>>
>>> mixing = FourierMixing(hidden_dim=768)
>>> block = ParallelBlock(mixing_layer=mixing, hidden_dim=768)
>>> output = block(x)
Notes

Architectural Patterns:

  1. Pre-Norm: LayerNorm → Mixing → Residual → LayerNorm → FFN → Residual
  2. Post-Norm: Mixing → Residual → LayerNorm → FFN → Residual → LayerNorm
  3. Parallel: LayerNorm → Mixing and FFN computed side by side on the normalized input → single Residual
  4. Hybrid: Multiple mixing strategies combined with learnable or fixed weights
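
The first three patterns can be sketched in plain PyTorch as follows. This is an illustrative sketch under stated assumptions, not the spectrans implementation: `fourier_mix` is a hypothetical stand-in for a mixing layer (the parameter-free FNet-style transform), and `ffn`, `norm1`, and `norm2` are ordinary `torch.nn` modules created only for this example. Dropout is omitted for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 16


def fourier_mix(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in mixing layer: FFT over hidden and sequence dims, real part."""
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real


# Ordinary modules created for the sketch (assumed names, not spectrans objects).
ffn = nn.Sequential(
    nn.Linear(hidden_dim, 4 * hidden_dim),
    nn.GELU(),
    nn.Linear(4 * hidden_dim, hidden_dim),
)
norm1 = nn.LayerNorm(hidden_dim)
norm2 = nn.LayerNorm(hidden_dim)


def pre_norm(x: torch.Tensor) -> torch.Tensor:
    # Normalize before each sublayer; the residual path stays un-normalized.
    h = x + fourier_mix(norm1(x))
    return h + ffn(norm2(h))


def post_norm(x: torch.Tensor) -> torch.Tensor:
    # Normalize after each residual sum; the block output is itself normalized.
    h = norm1(x + fourier_mix(x))
    return norm2(h + ffn(h))


def parallel(x: torch.Tensor) -> torch.Tensor:
    # One normalization feeds both branches; their sum forms a single residual update.
    normed = norm1(x)
    return x + fourier_mix(normed) + ffn(normed)


x = torch.randn(2, 8, hidden_dim)
outputs = {f.__name__: f(x) for f in (pre_norm, post_norm, parallel)}
```

All three variants preserve the input shape. Only the post-norm variant ends with a LayerNorm, so only its output has per-token statistics fixed by normalization.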

Complexity Comparison:

  • Standard Transformer: \(O(n^2 d)\) per block
  • FNet Block: \(O(nd \log n)\) per block
  • GFNet Block: \(O(nd \log n)\) with learnable parameters
  • Wavelet Block: \(O(nd)\) with multi-resolution analysis
  • Hybrid Block: Weighted combination of component complexities
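
To give a feel for the gap between the first two entries, the leading terms can be evaluated for the shapes used in the examples above (n = 512 tokens, d = 768 features). This is a back-of-the-envelope sketch: big-O notation hides constant factors and the logarithm base, so the helper names and the base-2 choice are assumptions made for illustration, and the ratio is not a measured speedup.

```python
import math


def attention_cost(n: int, d: int) -> int:
    """Leading term of O(n^2 d) for standard self-attention."""
    return n * n * d


def fft_mixing_cost(n: int, d: int) -> float:
    """Leading term of O(n d log n) for FFT-based mixing (base-2 log assumed)."""
    return n * d * math.log2(n)


n, d = 512, 768
ratio = attention_cost(n, d) / fft_mixing_cost(n, d)
print(f"{ratio:.1f}")  # n / log2(n) = 512 / 9, i.e. about 56.9
```

The ratio reduces to n / log n, so the gap widens as the sequence length grows.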

All blocks maintain:

  • Residual connections for gradient flow
  • LayerNorm for training stability
  • Dropout for regularization
  • Optional activation checkpointing

References

James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontanon. 2022. FNet: Mixing tokens with Fourier transforms. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4296-4313, Seattle.

Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. 2021. Global filter networks for image classification. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), pages 980-993.

See Also

spectrans.layers : Layer implementations used in blocks.
spectrans.models : Models built from these blocks.
spectrans.blocks.base : Base classes and interfaces.

Classes

FeedForwardNetwork

FeedForwardNetwork(hidden_dim: int, ffn_hidden_dim: int, activation: str = 'gelu', dropout: float = 0.0)

Bases: Module

Standard feedforward network for transformer blocks.

A two-layer MLP with configurable activation function and dropout.

Parameters:

Name Type Description Default
hidden_dim int

Input and output dimension.

required
ffn_hidden_dim int

Hidden dimension of the FFN.

required
activation str

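Activation function name: one of 'gelu', 'relu', 'silu', 'tanh', 'sigmoid', 'elu', or 'leaky_relu'. Any other value raises ValueError. Default is 'gelu'.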

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0

Attributes:

Name Type Description
fc1 Linear

First linear layer.

fc2 Linear

Second linear layer.

activation Module

Activation function.

dropout Dropout

Dropout layer.

Methods:

Name Description
forward

Forward pass through the FFN.

Source code in spectrans/blocks/base.py
def __init__(
    self,
    hidden_dim: int,
    ffn_hidden_dim: int,
    activation: str = "gelu",
    dropout: float = 0.0,
):
    super().__init__()
    self.hidden_dim = hidden_dim
    self.ffn_hidden_dim = ffn_hidden_dim

    # Linear layers
    self.fc1 = nn.Linear(hidden_dim, ffn_hidden_dim)
    self.fc2 = nn.Linear(ffn_hidden_dim, hidden_dim)

    # Activation function
    activation_functions = {
        "gelu": nn.GELU(),
        "relu": nn.ReLU(),
        "silu": nn.SiLU(),
        "tanh": nn.Tanh(),
        "sigmoid": nn.Sigmoid(),
        "elu": nn.ELU(),
        "leaky_relu": nn.LeakyReLU(),
    }
    if activation not in activation_functions:
        raise ValueError(f"Unknown activation: {activation}")
    self.activation = activation_functions[activation]

    # Dropout
    self.dropout = nn.Dropout(dropout)
Functions
forward
forward(x: Tensor) -> Tensor

Forward pass through the FFN.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape (..., hidden_dim).

required

Returns:

Type Description
Tensor

Output tensor of shape (..., hidden_dim).

Source code in spectrans/blocks/base.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Forward pass through the FFN.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor of shape (..., hidden_dim).

    Returns
    -------
    torch.Tensor
        Output tensor of shape (..., hidden_dim).
    """
    x = self.fc1(x)
    x = self.activation(x)
    x = self.dropout(x)
    x = self.fc2(x)
    return x
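
Because the FFN consists of two `nn.Linear` layers created with their default bias terms, its parameter count follows directly from the two dimensions. The helper below is a hypothetical illustration (it is not part of spectrans) that works out the count for the sizes used in the FNet example above.

```python
def ffn_parameter_count(hidden_dim: int, ffn_hidden_dim: int) -> int:
    """Parameters in a two-layer MLP whose linear layers both have a bias."""
    fc1 = hidden_dim * ffn_hidden_dim + ffn_hidden_dim  # weight + bias
    fc2 = ffn_hidden_dim * hidden_dim + hidden_dim      # weight + bias
    return fc1 + fc2


count = ffn_parameter_count(hidden_dim=768, ffn_hidden_dim=3072)
print(count)  # 4722432
```

The activation and dropout layers add no parameters, so this is the full count for the FFN.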

ParallelBlock

ParallelBlock(mixing_layer: MixingLayer | Module, hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: SpectralComponent

Transformer block with parallel mixing and FFN branches.

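This block feeds the same normalized input to the mixing layer and the FFN, sums the two branch outputs, and adds the result to the input through a single residual connection. Because neither branch depends on the other's output, they can be computed independently, which can improve efficiency compared with the sequential layout.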

Parameters:

Name Type Description Default
mixing_layer MixingLayer | Module

The mixing or attention layer.

required
hidden_dim int

Hidden dimension of the model.

required
ffn_hidden_dim int | None

Hidden dimension of the FFN. Default is 4 * hidden_dim.

None
activation str

Activation function. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12

Attributes:

Name Type Description
mixing_layer MixingLayer | Module

The mixing or attention layer.

ffn FeedForwardNetwork

The feedforward network.

norm LayerNorm

Layer normalization.

dropout Dropout

Dropout layer.

Methods:

Name Description
forward

Forward pass through the parallel block.

Source code in spectrans/blocks/base.py
def __init__(
    self,
    mixing_layer: MixingLayer | nn.Module,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    super().__init__()
    self.hidden_dim = hidden_dim
    self.mixing_layer = mixing_layer

    # Default FFN dimension
    if ffn_hidden_dim is None:
        ffn_hidden_dim = 4 * hidden_dim

    # Components
    self.ffn = FeedForwardNetwork(
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
    )
    self.norm = nn.LayerNorm(hidden_dim, eps=norm_eps)
    self.dropout = nn.Dropout(dropout)
Functions
forward
forward(x: Tensor) -> Tensor

Forward pass through the parallel block.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape (batch_size, sequence_length, hidden_dim).

required

Returns:

Type Description
Tensor

Output tensor of shape (batch_size, sequence_length, hidden_dim).

Source code in spectrans/blocks/base.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Forward pass through the parallel block.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).

    Returns
    -------
    torch.Tensor
        Output tensor of shape (batch_size, sequence_length, hidden_dim).
    """
    # Normalize input
    normed = self.norm(x)

    # Process mixing and FFN in parallel
    mixed = self.mixing_layer(normed)
    ffn_out = self.ffn(normed)

    # Combine and add residual
    output: torch.Tensor = x + self.dropout(mixed + ffn_out)

    return output

PostNormBlock

PostNormBlock(mixing_layer: MixingLayer | Module, hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: TransformerBlock

Transformer block with post-layer normalization.

This block applies layer normalization after the mixing layer and FFN, following the original transformer architecture.

Parameters:

Name Type Description Default
mixing_layer MixingLayer | Module

The mixing or attention layer.

required
hidden_dim int

Hidden dimension of the model.

required
ffn_hidden_dim int | None

Hidden dimension of the FFN. Default is 4 * hidden_dim.

None
activation str

Activation function. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12
Source code in spectrans/blocks/base.py
def __init__(
    self,
    mixing_layer: MixingLayer | nn.Module,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    if ffn_hidden_dim is None:
        ffn_hidden_dim = 4 * hidden_dim
    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
        use_pre_norm=False,
        norm_eps=norm_eps,
    )

PreNormBlock

PreNormBlock(mixing_layer: MixingLayer | Module, hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: TransformerBlock

Transformer block with pre-layer normalization.

This block applies layer normalization before the mixing layer and FFN, which has been shown to improve training stability.

Parameters:

Name Type Description Default
mixing_layer MixingLayer | Module

The mixing or attention layer.

required
hidden_dim int

Hidden dimension of the model.

required
ffn_hidden_dim int | None

Hidden dimension of the FFN. Default is 4 * hidden_dim.

None
activation str

Activation function. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12
Source code in spectrans/blocks/base.py
def __init__(
    self,
    mixing_layer: MixingLayer | nn.Module,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    if ffn_hidden_dim is None:
        ffn_hidden_dim = 4 * hidden_dim
    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
        use_pre_norm=True,
        norm_eps=norm_eps,
    )

TransformerBlock

TransformerBlock(mixing_layer: MixingLayer | Module, hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, use_pre_norm: bool = True, norm_eps: float = 1e-12)

Bases: SpectralComponent

Base class for transformer blocks.

A transformer block combines a mixing/attention layer with a feedforward network, using residual connections and layer normalization.

Parameters:

Name Type Description Default
mixing_layer MixingLayer | Module

The mixing or attention layer for token interaction.

required
hidden_dim int

Hidden dimension of the model.

required
ffn_hidden_dim int | None

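Hidden dimension of the feedforward network. If None, this base class builds the block without an FFN or second LayerNorm (ffn and norm2 are None); the PreNormBlock and PostNormBlock subclasses substitute 4 * hidden_dim before calling it.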

None
activation str

Activation function for the FFN. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0
use_pre_norm bool

Whether to use pre-layer normalization. Default is True.

True
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12

Attributes:

Name Type Description
mixing_layer MixingLayer | Module

The mixing or attention layer.

ffn FeedForwardNetwork | None

The feedforward network, or None when ffn_hidden_dim is None.

norm1 LayerNorm

First layer normalization.

norm2 LayerNorm | None

Second layer normalization (if FFN is used).

dropout Dropout

Dropout layer.

use_pre_norm bool

Whether pre-normalization is used.

Methods:

Name Description
forward

Forward pass through the transformer block.

Source code in spectrans/blocks/base.py
def __init__(
    self,
    mixing_layer: MixingLayer | nn.Module,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    use_pre_norm: bool = True,
    norm_eps: float = 1e-12,
):
    super().__init__()
    self.hidden_dim = hidden_dim
    self.mixing_layer = mixing_layer
    self.use_pre_norm = use_pre_norm

    # Layer normalization
    self.norm1 = nn.LayerNorm(hidden_dim, eps=norm_eps)

    # Feedforward network
    if ffn_hidden_dim is not None:
        self.ffn = FeedForwardNetwork(
            hidden_dim=hidden_dim,
            ffn_hidden_dim=ffn_hidden_dim,
            activation=activation,
            dropout=dropout,
        )
        self.norm2 = nn.LayerNorm(hidden_dim, eps=norm_eps)
    else:
        self.ffn = None
        self.norm2 = None

    # Dropout
    self.dropout = nn.Dropout(dropout)
Functions
forward
forward(x: Tensor) -> Tensor

Forward pass through the transformer block.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape (batch_size, sequence_length, hidden_dim).

required

Returns:

Type Description
Tensor

Output tensor of shape (batch_size, sequence_length, hidden_dim).

Source code in spectrans/blocks/base.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Forward pass through the transformer block.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).

    Returns
    -------
    torch.Tensor
        Output tensor of shape (batch_size, sequence_length, hidden_dim).
    """
    output: torch.Tensor
    if self.use_pre_norm:
        # Pre-norm: normalize before mixing
        h = x + self.dropout(self.mixing_layer(self.norm1(x)))
        if self.ffn is not None and self.norm2 is not None:
            output = h + self.dropout(self.ffn(self.norm2(h)))
        else:
            output = h
    else:
        # Post-norm: normalize after mixing
        h = self.norm1(x + self.dropout(self.mixing_layer(x)))
        if self.ffn is not None and self.norm2 is not None:
            output = self.norm2(h + self.dropout(self.ffn(h)))
        else:
            output = h

    return output

AdaptiveBlock

AdaptiveBlock(layers: list[MixingLayer | Module], hidden_dim: int, gate_type: str = 'soft', ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: HybridBlock

Transformer block that adaptively selects between mixing strategies.

This block uses a gating mechanism to dynamically choose or blend between different mixing strategies based on the input.

Parameters:

Name Type Description Default
layers list[MixingLayer | Module]

List of mixing layers to choose from.

required
hidden_dim int

Hidden dimension of the model.

required
gate_type str

Type of gating: 'soft' blends all layers with softmax weights; any other value (conventionally 'hard') selects the single argmax layer per sample. Default is 'soft'.

'soft'
ffn_hidden_dim int | None

Hidden dimension of the FFN. Default is 4 * hidden_dim.

None
activation str

Activation function. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12

Attributes:

Name Type Description
layers ModuleList

List of mixing layers.

gate Linear

Gating network for layer selection.

gate_type str

Type of gating mechanism.

Methods:

Name Description
forward

Forward pass through the adaptive block.

Source code in spectrans/blocks/hybrid.py
def __init__(
    self,
    layers: list[MixingLayer | nn.Module],
    hidden_dim: int,
    gate_type: str = "soft",
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    super().__init__(
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
        norm_eps=norm_eps,
    )
    self.layers = nn.ModuleList(layers)
    self.num_layers = len(layers)
    self.gate_type = gate_type

    # Gating network
    self.gate = nn.Linear(hidden_dim, self.num_layers)

    # Initialize gate to uniform weights
    nn.init.constant_(self.gate.weight, 0)
    nn.init.constant_(self.gate.bias, 0)
Functions
forward
forward(x: Tensor) -> Tensor

Forward pass through the adaptive block.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape (batch_size, sequence_length, hidden_dim).

required

Returns:

Type Description
Tensor

Output tensor of shape (batch_size, sequence_length, hidden_dim).

Source code in spectrans/blocks/hybrid.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Forward pass through the adaptive block.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).

    Returns
    -------
    torch.Tensor
        Output tensor of shape (batch_size, sequence_length, hidden_dim).
    """
    # Normalize input for mixing
    normed = self.norm1(x)

    # Compute gate values
    gate_input = normed.mean(dim=1)  # (batch_size, hidden_dim)
    gate_logits = self.gate(gate_input)  # (batch_size, num_layers)

    if self.gate_type == "soft":
        # Soft gating: weighted sum of all layers
        gate_weights = F.softmax(gate_logits, dim=-1)  # (batch_size, num_layers)

        # Apply each layer and combine
        mixed = torch.zeros_like(x)
        for i, layer in enumerate(self.layers):
            weight = gate_weights[:, i : i + 1].unsqueeze(1)  # (batch_size, 1, 1)
            mixed = mixed + weight * layer(normed)
    else:  # hard gating
        # Hard gating: select single layer
        gate_idx = torch.argmax(gate_logits, dim=-1)  # (batch_size,)

        # Apply selected layer for each sample
        mixed = torch.zeros_like(x)
        for i in range(x.shape[0]):
            idx = int(gate_idx[i].item())
            mixed[i] = self.layers[idx](normed[i : i + 1])

    # Add residual
    h = x + self.dropout(mixed)

    # Apply FFN with pre-norm
    output: Tensor = h + self.dropout(self.ffn(self.norm2(h)))

    return output
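
The effect of the zero initialization on the gate can be checked in isolation. The sketch below rebuilds only the gating step with plain `torch.nn` modules; the tensor sizes are arbitrary assumptions, and the LayerNorm applied before pooling in the real block is skipped because it does not change the outcome when the gate weights are zero.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden_dim, num_layers = 16, 3  # arbitrary sizes for illustration

# Gate built and initialized the same way as in the constructor above.
gate = nn.Linear(hidden_dim, num_layers)
nn.init.constant_(gate.weight, 0)
nn.init.constant_(gate.bias, 0)

x = torch.randn(4, 10, hidden_dim)
gate_logits = gate(x.mean(dim=1))               # (batch, num_layers), all zeros
soft_weights = F.softmax(gate_logits, dim=-1)   # uniform: 1 / num_layers each
hard_choice = torch.argmax(gate_logits, dim=-1) # one layer index per sample
```

With soft gating, the block therefore starts as an equal-weight average of its layers and learns to deviate from it. With hard gating, the logits are tied at initialization, and because the selected index does not enter the output as a differentiable quantity, the gate receives no gradient through the selection in the code shown.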

AlternatingBlock

AlternatingBlock(layer1: MixingLayer | Module, layer2: MixingLayer | Module, hidden_dim: int, use_layer1: bool = True, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: HybridBlock

Transformer block that alternates between two mixing strategies.

This block can be used in alternating patterns, e.g., even layers use one type of mixing and odd layers use another.

Parameters:

Name Type Description Default
layer1 MixingLayer | Module

First mixing layer.

required
layer2 MixingLayer | Module

Second mixing layer.

required
hidden_dim int

Hidden dimension of the model.

required
use_layer1 bool

Whether to use layer1 (True) or layer2 (False). Default is True.

True
ffn_hidden_dim int | None

Hidden dimension of the FFN. Default is 4 * hidden_dim.

None
activation str

Activation function. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12

Attributes:

Name Type Description
layer1 MixingLayer | Module

First mixing layer.

layer2 MixingLayer | Module

Second mixing layer.

use_layer1 bool

Which layer to use for this block.

Methods:

Name Description
forward

Forward pass through the alternating block.

set_layer

Set which layer to use.

Source code in spectrans/blocks/hybrid.py
def __init__(
    self,
    layer1: MixingLayer | nn.Module,
    layer2: MixingLayer | nn.Module,
    hidden_dim: int,
    use_layer1: bool = True,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    super().__init__(
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
        norm_eps=norm_eps,
    )
    self.layer1 = layer1
    self.layer2 = layer2
    self.use_layer1 = use_layer1
Functions
forward
forward(x: Tensor) -> Tensor

Forward pass through the alternating block.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape (batch_size, sequence_length, hidden_dim).

required

Returns:

Type Description
Tensor

Output tensor of shape (batch_size, sequence_length, hidden_dim).

Source code in spectrans/blocks/hybrid.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Forward pass through the alternating block.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).

    Returns
    -------
    torch.Tensor
        Output tensor of shape (batch_size, sequence_length, hidden_dim).
    """
    # Select which layer to use
    mixing_layer = self.layer1 if self.use_layer1 else self.layer2

    # Apply mixing with pre-norm
    h = x + self.dropout(mixing_layer(self.norm1(x)))

    # Apply FFN with pre-norm
    output: Tensor = h + self.dropout(self.ffn(self.norm2(h)))

    return output
set_layer
set_layer(use_layer1: bool) -> None

Set which layer to use.

Parameters:

Name Type Description Default
use_layer1 bool

Whether to use layer1 (True) or layer2 (False).

required
Source code in spectrans/blocks/hybrid.py
def set_layer(self, use_layer1: bool) -> None:
    """Set which layer to use.

    Parameters
    ----------
    use_layer1 : bool
        Whether to use layer1 (True) or layer2 (False).
    """
    self.use_layer1 = use_layer1

CascadeBlock

CascadeBlock(layers: list[MixingLayer | Module], hidden_dim: int, share_norm: bool = False, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: HybridBlock

Transformer block that cascades multiple mixing strategies.

This block applies mixing layers sequentially, allowing each layer to refine the representations produced by the previous one.

Parameters:

Name Type Description Default
layers list[MixingLayer | Module]

List of mixing layers to cascade.

required
hidden_dim int

Hidden dimension of the model.

required
share_norm bool

Whether to share normalization across layers. If True, every mixing layer reuses the block's norm1; if False, each layer gets its own LayerNorm. Default is False.

False
ffn_hidden_dim int | None

Hidden dimension of the FFN. Default is 4 * hidden_dim.

None
activation str

Activation function. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12

Attributes:

Name Type Description
layers ModuleList

List of mixing layers to cascade.

norms ModuleList

Normalization layers for each mixing layer.

share_norm bool

Whether normalization is shared.

Methods:

Name Description
forward

Forward pass through the cascade block.

Source code in spectrans/blocks/hybrid.py
def __init__(
    self,
    layers: list[MixingLayer | nn.Module],
    hidden_dim: int,
    share_norm: bool = False,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    super().__init__(
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
        norm_eps=norm_eps,
    )
    self.layers = nn.ModuleList(layers)
    self.share_norm = share_norm

    # Create normalization layers
    if share_norm:
        # Use the same norm for all layers
        self.norms = nn.ModuleList([self.norm1] * len(layers))
    else:
        # Create separate norms for each layer
        self.norms = nn.ModuleList([nn.LayerNorm(hidden_dim, eps=norm_eps) for _ in layers])
Functions
forward
forward(x: Tensor) -> Tensor

Forward pass through the cascade block.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape (batch_size, sequence_length, hidden_dim).

required

Returns:

Type Description
Tensor

Output tensor of shape (batch_size, sequence_length, hidden_dim).

Source code in spectrans/blocks/hybrid.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Forward pass through the cascade block.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).

    Returns
    -------
    torch.Tensor
        Output tensor of shape (batch_size, sequence_length, hidden_dim).
    """
    # Cascade through mixing layers
    h = x
    for layer, norm in zip(self.layers, self.norms, strict=False):
        h = h + self.dropout(layer(norm(h)))

    # Apply FFN with pre-norm
    output: Tensor = h + self.dropout(self.ffn(self.norm2(h)))

    return output
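
The two `share_norm` settings differ in how many LayerNorm parameters the cascade owns. The following sketch reproduces just the `ModuleList` construction with stand-alone modules (the sizes and the `count_params` helper are assumptions for illustration) to show that repeating one module in the list shares its parameters rather than copying them.

```python
import torch.nn as nn

hidden_dim, num_layers = 16, 3  # arbitrary sizes for illustration

# share_norm=True: the same LayerNorm instance appears once per mixing layer.
shared = nn.LayerNorm(hidden_dim)
shared_norms = nn.ModuleList([shared] * num_layers)

# share_norm=False: an independent LayerNorm per mixing layer.
separate_norms = nn.ModuleList([nn.LayerNorm(hidden_dim) for _ in range(num_layers)])


def count_params(module: nn.Module) -> int:
    """Count unique parameters; Module.parameters() skips duplicates."""
    return sum(p.numel() for p in module.parameters())


shared_count = count_params(shared_norms)      # one weight + one bias vector
separate_count = count_params(separate_norms)  # num_layers weight + bias vectors
```

Sharing keeps the parameter count constant as layers are added, at the cost of forcing every cascade stage to use the same scale and shift.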

HybridBlock

HybridBlock(hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: SpectralComponent

Base class for hybrid transformer blocks.

This class provides the foundation for blocks that combine multiple mixing strategies in various ways.

Parameters:

Name Type Description Default
hidden_dim int

Hidden dimension of the model.

required
ffn_hidden_dim int | None

Hidden dimension of the FFN. Default is 4 * hidden_dim.

None
activation str

Activation function. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12

Attributes:

Name Type Description
hidden_dim int

Hidden dimension of the model.

ffn FeedForwardNetwork | None

The feedforward network.

dropout Dropout

Dropout layer.

Source code in spectrans/blocks/hybrid.py
def __init__(
    self,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    super().__init__()
    self.hidden_dim = hidden_dim

    # Default FFN dimension
    if ffn_hidden_dim is None:
        ffn_hidden_dim = 4 * hidden_dim

    # Feedforward network
    self.ffn = FeedForwardNetwork(
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
    )

    # Normalization layers (to be used by subclasses)
    self.norm1 = nn.LayerNorm(hidden_dim, eps=norm_eps)
    self.norm2 = nn.LayerNorm(hidden_dim, eps=norm_eps)
    self.norm3 = nn.LayerNorm(hidden_dim, eps=norm_eps)

    # Dropout
    self.dropout = nn.Dropout(dropout)

MultiscaleBlock

MultiscaleBlock(layers: list[MixingLayer | Module], hidden_dim: int, fusion_type: str = 'add', ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: HybridBlock

Transformer block that processes multiple scales in parallel.

This block applies different mixing strategies at different scales and combines their outputs, capturing both local and global patterns.

Parameters:

Name Type Description Default
layers list[MixingLayer | Module]

List of mixing layers for different scales.

required
hidden_dim int

Hidden dimension of the model.

required
fusion_type str

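How to fuse outputs: 'add' (unweighted mean of the layer outputs), 'weighted' (sum with softmax-normalized learnable weights), or 'concat' (concatenation along the feature dimension followed by a linear projection back to hidden_dim). Any other value raises ValueError in forward. Default is 'add'.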

'add'
ffn_hidden_dim int | None

Hidden dimension of the FFN. Default is 4 * hidden_dim.

None
activation str

Activation function. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12

Attributes:

Name Type Description
layers ModuleList

List of mixing layers for different scales.

fusion_type str

Type of fusion mechanism.

fusion_weights Parameter | None

Learnable weights for fusion (if fusion_type is 'weighted').

fusion_proj Linear | None

Projection for concatenation fusion.

Methods:

Name Description
forward

Forward pass through the multiscale block.

Source code in spectrans/blocks/hybrid.py
def __init__(
    self,
    layers: list[MixingLayer | nn.Module],
    hidden_dim: int,
    fusion_type: str = "add",
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    super().__init__(
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
        norm_eps=norm_eps,
    )
    self.layers = nn.ModuleList(layers)
    self.num_scales = len(layers)
    self.fusion_type = fusion_type

    # Type annotations for optional attributes
    self.fusion_weights: nn.Parameter | None
    self.fusion_proj: nn.Linear | None

    # Fusion mechanisms
    if fusion_type == "weighted":
        self.fusion_weights = nn.Parameter(torch.ones(self.num_scales) / self.num_scales)
    else:
        self.fusion_weights = None

    if fusion_type == "concat":
        self.fusion_proj = nn.Linear(hidden_dim * self.num_scales, hidden_dim)
    else:
        self.fusion_proj = None
Functions
forward
forward(x: Tensor) -> Tensor

Forward pass through the multiscale block.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape (batch_size, sequence_length, hidden_dim).

required

Returns:

Type Description
Tensor

Output tensor of shape (batch_size, sequence_length, hidden_dim).

Source code in spectrans/blocks/hybrid.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Forward pass through the multiscale block.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).

    Returns
    -------
    torch.Tensor
        Output tensor of shape (batch_size, sequence_length, hidden_dim).
    """
    # Normalize input
    normed = self.norm1(x)

    # Apply each scale
    outputs = []
    for layer in self.layers:
        outputs.append(layer(normed))

    # Fuse outputs
    if self.fusion_type == "add":
        mixed = sum(outputs) / self.num_scales
    elif self.fusion_type == "weighted":
        assert self.fusion_weights is not None, (
            "fusion_weights should not be None for weighted fusion"
        )
        weights = F.softmax(self.fusion_weights, dim=0)
        mixed = sum(w * out for w, out in zip(weights, outputs, strict=False))
    elif self.fusion_type == "concat":
        mixed = torch.cat(outputs, dim=-1)
        assert self.fusion_proj is not None, "fusion_proj should not be None for concat fusion"
        mixed = self.fusion_proj(mixed)
    else:
        raise ValueError(f"Unknown fusion type: {self.fusion_type}")

    # Add residual
    h = x + self.dropout(mixed)

    # Apply FFN with pre-norm
    output: Tensor = h + self.dropout(self.ffn(self.norm2(h)))

    return output
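
The three fusion branches in the forward pass above can be tried in isolation. The following is a minimal sketch in plain PyTorch, not the library's code: the helper name `fuse` and the tensor sizes are made up for illustration, while the branch logic follows the listing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fuse(outputs, fusion_type, fusion_weights=None, fusion_proj=None):
    """Combine per-scale outputs following the branches of the forward pass."""
    if fusion_type == "add":
        # Despite the name, this branch is an unweighted mean over the scales.
        return sum(outputs) / len(outputs)
    if fusion_type == "weighted":
        # Softmax keeps the learned weights positive and summing to one.
        weights = F.softmax(fusion_weights, dim=0)
        return sum(w * out for w, out in zip(weights, outputs))
    if fusion_type == "concat":
        # Concatenate along the feature axis, then project back to hidden_dim.
        return fusion_proj(torch.cat(outputs, dim=-1))
    raise ValueError(f"Unknown fusion type: {fusion_type}")


torch.manual_seed(0)
batch, seq_len, hidden_dim, num_scales = 2, 16, 8, 3  # illustrative sizes
outputs = [torch.randn(batch, seq_len, hidden_dim) for _ in range(num_scales)]

# Same initialisation as the constructor: uniform weights, one linear projection.
fusion_weights = nn.Parameter(torch.ones(num_scales) / num_scales)
fusion_proj = nn.Linear(hidden_dim * num_scales, hidden_dim)

added = fuse(outputs, "add")
weighted = fuse(outputs, "weighted", fusion_weights=fusion_weights)
concatenated = fuse(outputs, "concat", fusion_proj=fusion_proj)
```

Because the fusion weights start out uniform, "weighted" and "add" produce the same result at initialisation and only diverge once the weights are trained.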

AFNOBlock

AFNOBlock(hidden_dim: int, sequence_length: int, modes: int | None = None, mlp_hidden_dim: int | None = None, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: PreNormBlock

AFNO transformer block with adaptive Fourier neural operator.

This block uses adaptive Fourier mode selection with MLPs in the frequency domain for token mixing.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| hidden_dim | int | Hidden dimension of the model. | required |
| sequence_length | int | Maximum sequence length. | required |
| modes | int \| None | Number of Fourier modes to retain, passed to AFNOMixing as both modes_seq and modes_hidden. Default is sequence_length // 2. | None |
| mlp_hidden_dim | int \| None | Hidden dimension of the frequency-domain MLP, passed to AFNOMixing as mlp_ratio = mlp_hidden_dim / hidden_dim. If None, mlp_ratio is 2.0. | None |
| ffn_hidden_dim | int \| None | Hidden dimension of the FFN. Default is 4 * hidden_dim. | None |
| activation | str | Activation function. Default is 'gelu'. | 'gelu' |
| dropout | float | Dropout probability. Default is 0.0. | 0.0 |
| norm_eps | float | Epsilon for layer normalization. Default is 1e-12. | 1e-12 |

Source code in spectrans/blocks/spectral.py
def __init__(
    self,
    hidden_dim: int,
    sequence_length: int,
    modes: int | None = None,
    mlp_hidden_dim: int | None = None,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    # Determine mlp_ratio from mlp_hidden_dim if provided
    mlp_ratio = mlp_hidden_dim / hidden_dim if mlp_hidden_dim is not None else 2.0

    mixing_layer = AFNOMixing(
        hidden_dim=hidden_dim,
        max_sequence_length=sequence_length,
        modes_seq=modes,
        modes_hidden=modes,
        mlp_ratio=mlp_ratio,
        activation=cast(ActivationType, activation),
        dropout=dropout,
    )
    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=cast(ActivationType, activation),
        dropout=dropout,
        norm_eps=norm_eps,
    )
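
The first statement of the constructor converts an absolute MLP width into the ratio that AFNOMixing expects. A small standalone sketch of that conversion follows; `resolve_mlp_ratio` is a hypothetical helper name, and how AFNOMixing rounds a non-integer ratio is not shown in this listing.

```python
from typing import Optional


def resolve_mlp_ratio(hidden_dim: int, mlp_hidden_dim: Optional[int] = None) -> float:
    """Mirror the ratio computation at the top of AFNOBlock.__init__."""
    if mlp_hidden_dim is None:
        return 2.0  # fallback used by the constructor
    return mlp_hidden_dim / hidden_dim


default_ratio = resolve_mlp_ratio(768)                         # no width given
explicit_ratio = resolve_mlp_ratio(768, mlp_hidden_dim=3072)   # 3072 / 768
fractional_ratio = resolve_mlp_ratio(768, mlp_hidden_dim=1000) # not a whole multiple
```

A width that is not a whole multiple of hidden_dim yields a fractional ratio, so the effective MLP width then depends on how the mixing layer turns the ratio back into an integer.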

FNetBlock

FNetBlock(hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: PreNormBlock

FNet transformer block with Fourier mixing.

This block uses Fourier transforms for token mixing, providing an alternative to attention with \(O(n \log n)\) complexity.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| hidden_dim | int | Hidden dimension of the model. | required |
| ffn_hidden_dim | int \| None | Hidden dimension of the FFN. Default is 4 * hidden_dim. | None |
| activation | str | Activation function. Default is 'gelu'. | 'gelu' |
| dropout | float | Dropout probability. Default is 0.0. | 0.0 |
| norm_eps | float | Epsilon for layer normalization. Default is 1e-12. | 1e-12 |

Source code in spectrans/blocks/spectral.py
def __init__(
    self,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    mixing_layer = FourierMixing(hidden_dim=hidden_dim)
    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=cast(ActivationType, activation),
        dropout=dropout,
        norm_eps=norm_eps,
    )
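
For intuition about the token mixing, here is a minimal sketch of FNet-style Fourier mixing in plain PyTorch. It assumes the common formulation (an FFT over the hidden axis, an FFT over the sequence axis, then the real part); whether FourierMixing matches this exactly is an assumption, and `fourier_mix` is an illustrative name.

```python
import torch


def fourier_mix(x: torch.Tensor) -> torch.Tensor:
    """FFT over the hidden axis, then the sequence axis; keep the real part."""
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real


torch.manual_seed(0)
x = torch.randn(2, 16, 8)  # (batch_size, sequence_length, hidden_dim)
mixed = fourier_mix(x)     # same shape, real-valued, no learned parameters
```

The mixing step has no trainable weights, which is why the constructor above only needs hidden_dim to build it.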

FNO2DBlock

FNO2DBlock(hidden_dim: int, modes_h: int | None = None, modes_w: int | None = None, num_layers: int = 1, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: PreNormBlock

2D FNO transformer block for image or grid data.

This block uses 2D Fourier neural operators for spatial data processing.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| hidden_dim | int | Hidden dimension (number of channels). | required |
| modes_h | int \| None | Number of Fourier modes for height. Default is 16. | None |
| modes_w | int \| None | Number of Fourier modes for width. Default is 16. | None |
| num_layers | int | Number of FNO layers. Default is 1. | 1 |
| ffn_hidden_dim | int \| None | Hidden dimension of the FFN. Default is 4 * hidden_dim. | None |
| activation | str | Activation function. Default is 'gelu'. | 'gelu' |
| dropout | float | Dropout probability. Default is 0.0. | 0.0 |
| norm_eps | float | Epsilon for layer normalization. Default is 1e-12. | 1e-12 |

Source code in spectrans/blocks/spectral.py
def __init__(
    self,
    hidden_dim: int,
    modes_h: int | None = None,
    modes_w: int | None = None,
    num_layers: int = 1,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    if modes_h is None:
        modes_h = 16
    if modes_w is None:
        modes_w = 16

    # For 2D, we use FourierNeuralOperator with 2D mode specification
    modes_2d = (modes_h, modes_w)
    if num_layers == 1:
        mixing_layer: nn.Module = FourierNeuralOperator(
            hidden_dim=hidden_dim,
            modes=modes_2d,  # Use 2D mode tuple
            activation=cast(ActivationType, activation),
        )
    else:
        # Stack multiple FNO layers
        layers = []
        for _ in range(num_layers):
            layers.append(
                FourierNeuralOperator(
                    hidden_dim=hidden_dim,
                    modes=modes_2d,  # Use 2D mode tuple
                    activation=cast(ActivationType, activation),
                )
            )
        mixing_layer = nn.Sequential(*layers)

    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=cast(ActivationType, activation),
        dropout=dropout,
        norm_eps=norm_eps,
    )
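
To illustrate what a (modes_h, modes_w) pair controls, the sketch below keeps only the low-frequency 2D Fourier modes of a grid and discards the rest. It is a simplification under stated assumptions: the (batch, height, width, channels) layout and the helper name `truncate_modes_2d` are illustrative, and a Fourier neural operator layer would normally also multiply the retained modes by learned complex weights, which is omitted here.

```python
import torch


def truncate_modes_2d(x: torch.Tensor, modes_h: int, modes_w: int) -> torch.Tensor:
    """Zero every 2D Fourier mode outside the lowest modes_h x modes_w block."""
    height, width = x.shape[1], x.shape[2]
    spectrum = torch.fft.rfft2(x, dim=(1, 2))  # (batch, H, W // 2 + 1, channels)
    kept = torch.zeros_like(spectrum)
    # Low positive frequencies along height.
    kept[:, :modes_h, :modes_w] = spectrum[:, :modes_h, :modes_w]
    # Low negative frequencies along height (stored at the end of the axis).
    kept[:, -modes_h:, :modes_w] = spectrum[:, -modes_h:, :modes_w]
    return torch.fft.irfft2(kept, s=(height, width), dim=(1, 2))


torch.manual_seed(0)
grid = torch.randn(2, 32, 32, 4)
smooth = truncate_modes_2d(grid, modes_h=4, modes_w=4)

# A constant field lives entirely in the zero-frequency mode, so it survives.
constant = torch.ones(1, 32, 32, 4)
unchanged = truncate_modes_2d(constant, modes_h=4, modes_w=4)
```

Smaller mode counts act as a stronger low-pass filter; the constructor's fallback of 16 modes per axis is a fixed value and does not adapt to the grid size.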

FNOBlock

FNOBlock(hidden_dim: int, modes: int | None = None, num_layers: int = 1, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: PreNormBlock

FNO transformer block with Fourier neural operator.

This block uses Fourier neural operators for learning mappings between function spaces.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| hidden_dim | int | Hidden dimension of the model. | required |
| modes | int \| None | Number of Fourier modes. Default is 16. | None |
| num_layers | int | Number of FNO layers. Default is 1. | 1 |
| ffn_hidden_dim | int \| None | Hidden dimension of the FFN. Default is 4 * hidden_dim. | None |
| activation | str | Activation function. Default is 'gelu'. | 'gelu' |
| dropout | float | Dropout probability. Default is 0.0. | 0.0 |
| norm_eps | float | Epsilon for layer normalization. Default is 1e-12. | 1e-12 |

Source code in spectrans/blocks/spectral.py
def __init__(
    self,
    hidden_dim: int,
    modes: int | None = None,
    num_layers: int = 1,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    if modes is None:
        modes = 16

    # Use FourierNeuralOperator for mixing
    if num_layers == 1:
        mixing_layer: nn.Module = FourierNeuralOperator(
            hidden_dim=hidden_dim,
            modes=modes,
            activation=cast(ActivationType, activation),
        )
    else:
        # Stack multiple FNO layers
        layers = []
        for _ in range(num_layers):
            layers.append(
                FourierNeuralOperator(
                    hidden_dim=hidden_dim,
                    modes=modes,
                    activation=cast(ActivationType, activation),
                )
            )
        mixing_layer = nn.Sequential(*layers)

    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=cast(ActivationType, activation),
        dropout=dropout,
        norm_eps=norm_eps,
    )
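
The num_layers branch in the constructor follows a simple pattern: one layer is used directly, several layers are chained with nn.Sequential. The sketch below reproduces that pattern with nn.Linear standing in for FourierNeuralOperator, so it runs without the library; `build_mixing` is a hypothetical helper.

```python
import torch
import torch.nn as nn


def build_mixing(make_layer, num_layers: int) -> nn.Module:
    """Return one layer, or a sequential stack of independently built layers."""
    if num_layers == 1:
        return make_layer()
    return nn.Sequential(*(make_layer() for _ in range(num_layers)))


hidden_dim = 8


def make_layer() -> nn.Module:
    # Stand-in for FourierNeuralOperator(hidden_dim=..., modes=..., activation=...).
    return nn.Linear(hidden_dim, hidden_dim)


single = build_mixing(make_layer, num_layers=1)
stacked = build_mixing(make_layer, num_layers=3)

x = torch.randn(2, 16, hidden_dim)
y = stacked(x)  # layers are applied one after another
```

Each stacked layer is constructed separately, so the layers do not share parameters. With this pattern a num_layers of 0 would produce an empty nn.Sequential, which returns its input unchanged.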

GFNetBlock

GFNetBlock(hidden_dim: int, sequence_length: int, ffn_hidden_dim: int | None = None, filter_activation: str = 'sigmoid', filter_init_std: float = 0.02, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: PreNormBlock

GFNet transformer block with global filter mixing.

This block uses learnable frequency-domain filters for token mixing.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| hidden_dim | int | Hidden dimension of the model. | required |
| sequence_length | int | Maximum sequence length. | required |
| ffn_hidden_dim | int \| None | Hidden dimension of the FFN. Default is 4 * hidden_dim. | None |
| filter_activation | str | Activation for filters ('sigmoid', 'tanh', or 'identity'). Default is 'sigmoid'. | 'sigmoid' |
| filter_init_std | float | Initialization std for filters. Default is 0.02. | 0.02 |
| activation | str | Activation function. Default is 'gelu'. | 'gelu' |
| dropout | float | Dropout probability. Default is 0.0. | 0.0 |
| norm_eps | float | Epsilon for layer normalization. Default is 1e-12. | 1e-12 |

Source code in spectrans/blocks/spectral.py
def __init__(
    self,
    hidden_dim: int,
    sequence_length: int,
    ffn_hidden_dim: int | None = None,
    filter_activation: str = "sigmoid",
    filter_init_std: float = 0.02,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    mixing_layer = GlobalFilterMixing(
        hidden_dim=hidden_dim,
        sequence_length=sequence_length,
        activation=cast(ActivationType, filter_activation),
        filter_init_std=filter_init_std,
    )
    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=cast(ActivationType, activation),
        dropout=dropout,
        norm_eps=norm_eps,
    )
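
A global filter of this kind can be sketched as an element-wise product in the frequency domain. The code below is an illustration in plain PyTorch, assuming a real FFT along the sequence axis and a complex filter of shape (sequence_length // 2 + 1, hidden_dim); the filter activation is left out, and the actual layout inside GlobalFilterMixing may differ.

```python
import torch


def global_filter(x: torch.Tensor, filt: torch.Tensor) -> torch.Tensor:
    """Multiply the sequence-axis spectrum of x by a complex filter."""
    seq_len = x.shape[1]
    spectrum = torch.fft.rfft(x, dim=1)  # (batch, seq_len // 2 + 1, hidden_dim)
    return torch.fft.irfft(spectrum * filt, n=seq_len, dim=1)


torch.manual_seed(0)
batch, seq_len, hidden_dim = 2, 16, 8
x = torch.randn(batch, seq_len, hidden_dim)

# An all-ones filter leaves every frequency untouched.
identity_filter = torch.ones(seq_len // 2 + 1, hidden_dim, dtype=torch.cfloat)
same = global_filter(x, identity_filter)

# A small random filter, using the default filter_init_std of 0.02 as its scale.
random_filter = 0.02 * torch.randn(seq_len // 2 + 1, hidden_dim, dtype=torch.cfloat)
filtered = global_filter(x, random_filter)
```

The filter shape depends on the sequence length, which is consistent with sequence_length being a required constructor argument.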

LSTBlock

LSTBlock(hidden_dim: int, num_heads: int = 8, transform_type: str = 'dct', use_scaling: bool = True, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: PreNormBlock

LST transformer block with linear spectral transform attention.

This block uses orthogonal transforms (DCT, DST, or Hadamard) for attention computation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| hidden_dim | int | Hidden dimension of the model. | required |
| num_heads | int | Number of attention heads. Default is 8. | 8 |
| transform_type | str | Type of transform ('dct', 'dst', or 'hadamard'). Default is 'dct'. | 'dct' |
| use_scaling | bool | Whether to use learnable scaling. Default is True. | True |
| ffn_hidden_dim | int \| None | Hidden dimension of the FFN. Default is 4 * hidden_dim. | None |
| activation | str | Activation function. Default is 'gelu'. | 'gelu' |
| dropout | float | Dropout probability. Default is 0.0. | 0.0 |
| norm_eps | float | Epsilon for layer normalization. Default is 1e-12. | 1e-12 |

Source code in spectrans/blocks/spectral.py
def __init__(
    self,
    hidden_dim: int,
    num_heads: int = 8,
    transform_type: str = "dct",
    use_scaling: bool = True,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    mixing_layer = LSTAttention(
        hidden_dim=hidden_dim,
        num_heads=num_heads,
        transform_type=cast(TransformLSTType, transform_type),
        learnable_scale=use_scaling,
        dropout=dropout,
    )
    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=cast(ActivationType, activation),
        dropout=dropout,
        norm_eps=norm_eps,
    )
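
The transforms named above are orthogonal, which means the inverse is just the transpose. The standard-library sketch below builds an orthonormal DCT-II matrix and checks that property on a short signal. It only illustrates the transform itself; how LSTAttention applies it across heads is not covered, and the helper names are illustrative.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix with frequency index k as row, position i as column."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


C = dct_matrix(8)
gram = matmul(C, transpose(C))           # identity matrix for an orthogonal C

signal = [[float(i)] for i in range(8)]  # column vector 0, 1, ..., 7
coeffs = matmul(C, signal)               # forward transform
restored = matmul(transpose(C), coeffs)  # inverse transform via the transpose
```

Orthogonality also means the transform preserves the Euclidean norm of its input, so it mixes positions without rescaling them.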

SpectralAttentionBlock

SpectralAttentionBlock(hidden_dim: int, num_heads: int = 8, num_features: int | None = None, kernel_type: str = 'gaussian', ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: PreNormBlock

Spectral attention transformer block.

This block uses spectral attention with random Fourier features for kernel approximation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| hidden_dim | int | Hidden dimension of the model. | required |
| num_heads | int | Number of attention heads. Default is 8. | 8 |
| num_features | int \| None | Number of random features. Default is 256. | None |
| kernel_type | str | Type of kernel ('gaussian' or 'laplacian'). Default is 'gaussian'. | 'gaussian' |
| ffn_hidden_dim | int \| None | Hidden dimension of the FFN. Default is 4 * hidden_dim. | None |
| activation | str | Activation function. Default is 'gelu'. | 'gelu' |
| dropout | float | Dropout probability. Default is 0.0. | 0.0 |
| norm_eps | float | Epsilon for layer normalization. Default is 1e-12. | 1e-12 |

Source code in spectrans/blocks/spectral.py
def __init__(
    self,
    hidden_dim: int,
    num_heads: int = 8,
    num_features: int | None = None,
    kernel_type: str = "gaussian",
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    mixing_layer = SpectralAttention(
        hidden_dim=hidden_dim,
        num_heads=num_heads,
        num_features=num_features,
        kernel_type=cast(KernelType, kernel_type),
        dropout=dropout,
    )
    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=cast(ActivationType, activation),
        dropout=dropout,
        norm_eps=norm_eps,
    )
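
Random Fourier features replace a kernel evaluation with an inner product of randomized feature maps. The sketch below shows the idea for a Gaussian kernel with unit bandwidth, using the paired cosine and sine variant. It is an illustration in plain PyTorch, not the SpectralAttention implementation, and it uses 2048 features rather than the documented default of 256 to make the approximation tighter.

```python
import math

import torch


def random_fourier_features(x: torch.Tensor, omega: torch.Tensor) -> torch.Tensor:
    """Map x of shape (n, dim) to features whose inner products approximate a kernel."""
    proj = x @ omega  # (n, num_features)
    feats = torch.cat([proj.cos(), proj.sin()], dim=-1)
    return feats / math.sqrt(omega.shape[1])


torch.manual_seed(0)
dim, num_features = 8, 2048
omega = torch.randn(dim, num_features)  # standard normal frequencies: Gaussian kernel
x = 0.5 * torch.randn(4, dim)

phi = random_fourier_features(x, omega)
approx = phi @ phi.T                              # approximate kernel matrix
exact = torch.exp(-torch.cdist(x, x).pow(2) / 2)  # exp(-||x - y||^2 / 2)
```

Once the features are computed, attention-like products can be formed in time linear in the number of tokens, which is the usual motivation for this approximation.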

WaveletBlock

WaveletBlock(hidden_dim: int, wavelet: str = 'db4', levels: int = 3, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: PreNormBlock

Wavelet transformer block with wavelet mixing.

This block uses discrete wavelet transforms for multiscale token mixing.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| hidden_dim | int | Hidden dimension of the model. | required |
| wavelet | str | Type of wavelet. Default is 'db4'. | 'db4' |
| levels | int | Number of decomposition levels. Default is 3. | 3 |
| ffn_hidden_dim | int \| None | Hidden dimension of the FFN. Default is 4 * hidden_dim. | None |
| activation | str | Activation function. Default is 'gelu'. | 'gelu' |
| dropout | float | Dropout probability. Default is 0.0. | 0.0 |
| norm_eps | float | Epsilon for layer normalization. Default is 1e-12. | 1e-12 |

Source code in spectrans/blocks/spectral.py
def __init__(
    self,
    hidden_dim: int,
    wavelet: str = "db4",
    levels: int = 3,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    mixing_layer = WaveletMixing(
        hidden_dim=hidden_dim,
        wavelet=cast(WaveletType, wavelet),
        levels=levels,
        dropout=dropout,
    )
    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=cast(ActivationType, activation),
        dropout=dropout,
        norm_eps=norm_eps,
    )
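
To show what a multi-level decomposition produces, the sketch below runs three levels of a discrete wavelet transform on a short signal and reconstructs it. It deliberately swaps in the Haar wavelet for the default 'db4' because Haar fits in a few lines of standard-library Python. The function names are illustrative and WaveletMixing itself is not shown.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)


def haar_dwt(signal):
    """One Haar level: scaled pairwise sums (approximation) and differences (detail)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [SCALE * (a + b) for a, b in pairs]
    detail = [SCALE * (a - b) for a, b in pairs]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert one Haar level by recombining each approximation/detail pair."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([SCALE * (a + d), SCALE * (a - d)])
    return out


def haar_decompose(signal, levels):
    """Repeatedly split the approximation, collecting details from fine to coarse."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    return approx, details


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = haar_decompose(signal, levels=3)

# Reconstruct by undoing the levels from coarse to fine.
restored = approx
for detail in reversed(details):
    restored = haar_idwt(restored, detail)
```

Each level halves the length of the approximation, so with this simple scheme a sequence needs to be divisible by 2 ** levels; longer wavelets such as 'db4' additionally require a boundary-handling choice.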