Skip to content

Base Block Classes

spectrans.blocks.base

Base classes and interfaces for transformer blocks.

This module provides the base classes and interfaces for building transformer blocks in the spectrans library. Transformer blocks are composed of mixing/attention layers followed by feedforward networks, with residual connections and normalization.

Classes:

Name Description
TransformerBlock

Base class for all transformer blocks.

FeedForwardNetwork

Standard feedforward network with configurable activation.

PreNormBlock

Transformer block with pre-layer normalization.

PostNormBlock

Transformer block with post-layer normalization.

ParallelBlock

Transformer block with parallel mixing and FFN branches.

Examples:

Creating a custom transformer block:

>>> from spectrans.blocks.base import TransformerBlock
>>> from spectrans.layers.mixing.fourier import FourierMixing
>>> block = TransformerBlock(
...     mixing_layer=FourierMixing(hidden_dim=768),
...     hidden_dim=768,
...     use_pre_norm=True
... )
Notes

The transformer block architecture follows the standard pattern: - Mixing/Attention layer with residual connection - Feedforward network with residual connection - Layer normalization (pre-norm or post-norm) - Optional dropout for regularization

References

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages 5998-6008.

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. 2020. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), pages 10524-10533.

Classes

TransformerBlock

TransformerBlock(mixing_layer: MixingLayer | Module, hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, use_pre_norm: bool = True, norm_eps: float = 1e-12)

Bases: SpectralComponent

Base class for transformer blocks.

A transformer block combines a mixing/attention layer with a feedforward network, using residual connections and layer normalization.

Parameters:

Name Type Description Default
mixing_layer MixingLayer | Module

The mixing or attention layer for token interaction.

required
hidden_dim int

Hidden dimension of the model.

required
ffn_hidden_dim int | None

Hidden dimension of the feedforward network. Default is 4 * hidden_dim.

None
activation str

Activation function for the FFN. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0
use_pre_norm bool

Whether to use pre-layer normalization. Default is True.

True
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12

Attributes:

Name Type Description
mixing_layer MixingLayer | Module

The mixing or attention layer.

ffn FeedForwardNetwork | None

The feedforward network.

norm1 LayerNorm

First layer normalization.

norm2 LayerNorm | None

Second layer normalization (if FFN is used).

dropout Dropout

Dropout layer.

use_pre_norm bool

Whether pre-normalization is used.

Methods:

Name Description
forward

Forward pass through the transformer block.

Source code in spectrans/blocks/base.py
def __init__(
    self,
    mixing_layer: MixingLayer | nn.Module,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    use_pre_norm: bool = True,
    norm_eps: float = 1e-12,
):
    super().__init__()
    self.hidden_dim = hidden_dim
    self.mixing_layer = mixing_layer
    self.use_pre_norm = use_pre_norm

    # Layer normalization
    self.norm1 = nn.LayerNorm(hidden_dim, eps=norm_eps)

    # Feedforward network
    if ffn_hidden_dim is not None:
        self.ffn = FeedForwardNetwork(
            hidden_dim=hidden_dim,
            ffn_hidden_dim=ffn_hidden_dim,
            activation=activation,
            dropout=dropout,
        )
        self.norm2 = nn.LayerNorm(hidden_dim, eps=norm_eps)
    else:
        self.ffn = None
        self.norm2 = None

    # Dropout
    self.dropout = nn.Dropout(dropout)
Functions
forward
forward(x: Tensor) -> Tensor

Forward pass through the transformer block.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape (batch_size, sequence_length, hidden_dim).

required

Returns:

Type Description
Tensor

Output tensor of shape (batch_size, sequence_length, hidden_dim).

Source code in spectrans/blocks/base.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Forward pass through the transformer block.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).

    Returns
    -------
    torch.Tensor
        Output tensor of shape (batch_size, sequence_length, hidden_dim).
    """
    output: torch.Tensor
    if self.use_pre_norm:
        # Pre-norm: normalize before mixing
        h = x + self.dropout(self.mixing_layer(self.norm1(x)))
        if self.ffn is not None and self.norm2 is not None:
            output = h + self.dropout(self.ffn(self.norm2(h)))
        else:
            output = h
    else:
        # Post-norm: normalize after mixing
        h = self.norm1(x + self.dropout(self.mixing_layer(x)))
        if self.ffn is not None and self.norm2 is not None:
            output = self.norm2(h + self.dropout(self.ffn(h)))
        else:
            output = h

    return output

FeedForwardNetwork

FeedForwardNetwork(hidden_dim: int, ffn_hidden_dim: int, activation: str = 'gelu', dropout: float = 0.0)

Bases: Module

Standard feedforward network for transformer blocks.

A two-layer MLP with configurable activation function and dropout.

Parameters:

Name Type Description Default
hidden_dim int

Input and output dimension.

required
ffn_hidden_dim int

Hidden dimension of the FFN.

required
activation str

Activation function name. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0

Attributes:

Name Type Description
fc1 Linear

First linear layer.

fc2 Linear

Second linear layer.

activation Module

Activation function.

dropout Dropout

Dropout layer.

Methods:

Name Description
forward

Forward pass through the FFN.

Source code in spectrans/blocks/base.py
def __init__(
    self,
    hidden_dim: int,
    ffn_hidden_dim: int,
    activation: str = "gelu",
    dropout: float = 0.0,
):
    super().__init__()
    self.hidden_dim = hidden_dim
    self.ffn_hidden_dim = ffn_hidden_dim

    # Linear layers
    self.fc1 = nn.Linear(hidden_dim, ffn_hidden_dim)
    self.fc2 = nn.Linear(ffn_hidden_dim, hidden_dim)

    # Activation function
    activation_functions = {
        "gelu": nn.GELU(),
        "relu": nn.ReLU(),
        "silu": nn.SiLU(),
        "tanh": nn.Tanh(),
        "sigmoid": nn.Sigmoid(),
        "elu": nn.ELU(),
        "leaky_relu": nn.LeakyReLU(),
    }
    if activation not in activation_functions:
        raise ValueError(f"Unknown activation: {activation}")
    self.activation = activation_functions[activation]

    # Dropout
    self.dropout = nn.Dropout(dropout)
Functions
forward
forward(x: Tensor) -> Tensor

Forward pass through the FFN.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape (..., hidden_dim).

required

Returns:

Type Description
Tensor

Output tensor of shape (..., hidden_dim).

Source code in spectrans/blocks/base.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Forward pass through the FFN.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor of shape (..., hidden_dim).

    Returns
    -------
    torch.Tensor
        Output tensor of shape (..., hidden_dim).
    """
    x = self.fc1(x)
    x = self.activation(x)
    x = self.dropout(x)
    x = self.fc2(x)
    return x

PreNormBlock

PreNormBlock(mixing_layer: MixingLayer | Module, hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: TransformerBlock

Transformer block with pre-layer normalization.

This block applies layer normalization before the mixing layer and FFN, which has been shown to improve training stability.

Parameters:

Name Type Description Default
mixing_layer MixingLayer | Module

The mixing or attention layer.

required
hidden_dim int

Hidden dimension of the model.

required
ffn_hidden_dim int | None

Hidden dimension of the FFN. Default is 4 * hidden_dim.

None
activation str

Activation function. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12
Source code in spectrans/blocks/base.py
def __init__(
    self,
    mixing_layer: MixingLayer | nn.Module,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    if ffn_hidden_dim is None:
        ffn_hidden_dim = 4 * hidden_dim
    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
        use_pre_norm=True,
        norm_eps=norm_eps,
    )

PostNormBlock

PostNormBlock(mixing_layer: MixingLayer | Module, hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: TransformerBlock

Transformer block with post-layer normalization.

This block applies layer normalization after the mixing layer and FFN, following the original transformer architecture.

Parameters:

Name Type Description Default
mixing_layer MixingLayer | Module

The mixing or attention layer.

required
hidden_dim int

Hidden dimension of the model.

required
ffn_hidden_dim int | None

Hidden dimension of the FFN. Default is 4 * hidden_dim.

None
activation str

Activation function. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12
Source code in spectrans/blocks/base.py
def __init__(
    self,
    mixing_layer: MixingLayer | nn.Module,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    if ffn_hidden_dim is None:
        ffn_hidden_dim = 4 * hidden_dim
    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
        use_pre_norm=False,
        norm_eps=norm_eps,
    )

ParallelBlock

ParallelBlock(mixing_layer: MixingLayer | Module, hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: SpectralComponent

Transformer block with parallel mixing and FFN branches.

This block processes the mixing layer and FFN in parallel rather than sequentially, which can improve efficiency and has been shown to work well in practice.

Parameters:

Name Type Description Default
mixing_layer MixingLayer | Module

The mixing or attention layer.

required
hidden_dim int

Hidden dimension of the model.

required
ffn_hidden_dim int | None

Hidden dimension of the FFN. Default is 4 * hidden_dim.

None
activation str

Activation function. Default is 'gelu'.

'gelu'
dropout float

Dropout probability. Default is 0.0.

0.0
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12

Attributes:

Name Type Description
mixing_layer MixingLayer | Module

The mixing or attention layer.

ffn FeedForwardNetwork

The feedforward network.

norm LayerNorm

Layer normalization.

dropout Dropout

Dropout layer.

Methods:

Name Description
forward

Forward pass through the parallel block.

Source code in spectrans/blocks/base.py
def __init__(
    self,
    mixing_layer: MixingLayer | nn.Module,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    super().__init__()
    self.hidden_dim = hidden_dim
    self.mixing_layer = mixing_layer

    # Default FFN dimension
    if ffn_hidden_dim is None:
        ffn_hidden_dim = 4 * hidden_dim

    # Components
    self.ffn = FeedForwardNetwork(
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
    )
    self.norm = nn.LayerNorm(hidden_dim, eps=norm_eps)
    self.dropout = nn.Dropout(dropout)
Functions
forward
forward(x: Tensor) -> Tensor

Forward pass through the parallel block.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape (batch_size, sequence_length, hidden_dim).

required

Returns:

Type Description
Tensor

Output tensor of shape (batch_size, sequence_length, hidden_dim).

Source code in spectrans/blocks/base.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Forward pass through the parallel block.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).

    Returns
    -------
    torch.Tensor
        Output tensor of shape (batch_size, sequence_length, hidden_dim).
    """
    # Normalize input
    normed = self.norm(x)

    # Process mixing and FFN in parallel
    mixed = self.mixing_layer(normed)
    ffn_out = self.ffn(normed)

    # Combine and add residual
    output: torch.Tensor = x + self.dropout(mixed + ffn_out)

    return output

Functions