Base Block Classes

spectrans.blocks.base

Base classes and interfaces for transformer blocks.

This module provides the base classes and interfaces for building transformer blocks in the spectrans library. Transformer blocks are composed of mixing/attention layers followed by feedforward networks, with residual connections and normalization.

Classes:

Name	Description
`TransformerBlock`	Base class for sequential (pre-norm and post-norm) transformer blocks.
`FeedForwardNetwork`	Standard feedforward network with configurable activation.
`PreNormBlock`	Transformer block with pre-layer normalization.
`PostNormBlock`	Transformer block with post-layer normalization.
`ParallelBlock`	Transformer block with parallel mixing and FFN branches.

Examples:

Creating a transformer block around a Fourier mixing layer (`ffn_hidden_dim` is omitted, so this block has no FFN):

>>> from spectrans.blocks.base import TransformerBlock
>>> from spectrans.layers.mixing.fourier import FourierMixing
>>> block = TransformerBlock(
...     mixing_layer=FourierMixing(hidden_dim=768),
...     hidden_dim=768,
...     use_pre_norm=True
... )

Notes

The transformer block architecture follows the standard pattern:

- Mixing/Attention layer with residual connection
- Feedforward network with residual connection
- Layer normalization (pre-norm or post-norm)
- Optional dropout for regularization
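
The difference between the pre-norm and post-norm orderings in this pattern can be illustrated without any framework. The following is a minimal sketch on a single feature vector, assuming dropout is inactive and ignoring the learned scale and shift of `LayerNorm`; `layer_norm`, `pre_norm_sublayer`, `post_norm_sublayer`, and `double` are hypothetical names used only for illustration and are not part of spectrans.

```python
import math
from statistics import fmean, pvariance


def layer_norm(x, eps=1e-12):
    """Standardize one feature vector (biased variance, no learned scale/shift)."""
    mu = fmean(x)
    var = pvariance(x, mu)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def pre_norm_sublayer(x, sublayer):
    """x + sublayer(norm(x)): the residual path carries x through unnormalized."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_norm_sublayer(x, sublayer):
    """norm(x + sublayer(x)): the residual sum itself is normalized."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def double(v):
    # Hypothetical stand-in for a mixing layer or FFN.
    return [2.0 * t for t in v]


x = [1.0, 2.0, 3.0, 6.0]
pre = pre_norm_sublayer(x, double)    # keeps the mean of x (3.0)
post = post_norm_sublayer(x, double)  # zero mean, unit variance
```

In the pre-norm form the output is not standardized, because the residual adds the raw input back; in the post-norm form every sub-layer output is standardized.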

References

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages 5998-6008.

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. 2020. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), pages 10524-10533.

Classes

TransformerBlock

TransformerBlock(mixing_layer: MixingLayer | Module, hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, use_pre_norm: bool = True, norm_eps: float = 1e-12)

Bases: SpectralComponent

Base class for transformer blocks.

A transformer block combines a mixing/attention layer with an optional feedforward network, using residual connections and layer normalization.

Parameters:

Name	Type	Description	Default
`mixing_layer`	`MixingLayer \| Module`	The mixing or attention layer for token interaction.	required
`hidden_dim`	`int`	Hidden dimension of the model.	required
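`ffn_hidden_dim`	`int \| None`	Hidden dimension of the feedforward network. If None, the block is built without an FFN and without a second layer normalization.	`None`
`activation`	`str`	Activation function for the FFN: one of 'gelu', 'relu', 'silu', 'tanh', 'sigmoid', 'elu', 'leaky_relu'. Default is 'gelu'.	`'gelu'`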
`dropout`	`float`	Dropout probability. Default is 0.0.	`0.0`
`use_pre_norm`	`bool`	Whether to use pre-layer normalization. Default is True.	`True`
`norm_eps`	`float`	Epsilon for layer normalization. Default is 1e-12.	`1e-12`

Attributes:

Name	Type	Description
`mixing_layer`	`MixingLayer \| Module`	The mixing or attention layer.
`ffn`	`FeedForwardNetwork \| None`	The feedforward network, or None when `ffn_hidden_dim` is None.
`norm1`	`LayerNorm`	First layer normalization.
`norm2`	`LayerNorm \| None`	Second layer normalization, or None when no FFN is used.
`dropout`	`Dropout`	Dropout layer.
`use_pre_norm`	`bool`	Whether pre-normalization is used.

Methods:

Name	Description
`forward`	Forward pass through the transformer block.

Source code in spectrans/blocks/base.py

def __init__(
    self,
    mixing_layer: MixingLayer | nn.Module,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    use_pre_norm: bool = True,
    norm_eps: float = 1e-12,
):
    super().__init__()
    self.hidden_dim = hidden_dim
    self.mixing_layer = mixing_layer
    self.use_pre_norm = use_pre_norm

    # Layer normalization
    self.norm1 = nn.LayerNorm(hidden_dim, eps=norm_eps)

    # Feedforward network
    if ffn_hidden_dim is not None:
        self.ffn = FeedForwardNetwork(
            hidden_dim=hidden_dim,
            ffn_hidden_dim=ffn_hidden_dim,
            activation=activation,
            dropout=dropout,
        )
        self.norm2 = nn.LayerNorm(hidden_dim, eps=norm_eps)
    else:
        self.ffn = None
        self.norm2 = None

    # Dropout
    self.dropout = nn.Dropout(dropout)
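
The constructor above builds the FFN only when `ffn_hidden_dim` is given, whereas the `PreNormBlock`, `PostNormBlock`, and `ParallelBlock` constructors substitute `4 * hidden_dim` for None. A minimal sketch of that difference follows; `resolve_ffn_hidden_dim` and `default_to_4x` are hypothetical names for illustration and do not exist in spectrans.

```python
def resolve_ffn_hidden_dim(hidden_dim, ffn_hidden_dim=None, *, default_to_4x):
    """Return the FFN width a block would use, or None if no FFN is built."""
    if ffn_hidden_dim is not None:
        return ffn_hidden_dim
    # Subclasses fall back to 4 * hidden_dim; the base class skips the FFN.
    return 4 * hidden_dim if default_to_4x else None


# Base TransformerBlock behaviour: None means "no FFN sub-layer".
base_dim = resolve_ffn_hidden_dim(768, default_to_4x=False)

# PreNormBlock / PostNormBlock / ParallelBlock behaviour: None means 4 * hidden_dim.
subclass_dim = resolve_ffn_hidden_dim(768, default_to_4x=True)

# An explicit value is used unchanged in both cases.
explicit_dim = resolve_ffn_hidden_dim(768, 1024, default_to_4x=False)
```

To get the usual two-sub-layer block from the base class directly, pass `ffn_hidden_dim` explicitly, for example `ffn_hidden_dim=4 * hidden_dim`.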

Functions

forward

forward(x: Tensor) -> Tensor

Forward pass through the transformer block.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input tensor of shape (batch_size, sequence_length, hidden_dim).	required

Returns:

Type	Description
`Tensor`	Output tensor of shape (batch_size, sequence_length, hidden_dim).

Source code in spectrans/blocks/base.py

def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Forward pass through the transformer block.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).

    Returns
    -------
    torch.Tensor
        Output tensor of shape (batch_size, sequence_length, hidden_dim).
    """
    output: torch.Tensor
    if self.use_pre_norm:
        # Pre-norm: normalize before mixing
        h = x + self.dropout(self.mixing_layer(self.norm1(x)))
        if self.ffn is not None and self.norm2 is not None:
            output = h + self.dropout(self.ffn(self.norm2(h)))
        else:
            output = h
    else:
        # Post-norm: normalize after mixing
        h = self.norm1(x + self.dropout(self.mixing_layer(x)))
        if self.ffn is not None and self.norm2 is not None:
            output = self.norm2(h + self.dropout(self.ffn(h)))
        else:
            output = h

    return output

FeedForwardNetwork

FeedForwardNetwork(hidden_dim: int, ffn_hidden_dim: int, activation: str = 'gelu', dropout: float = 0.0)

Bases: Module

Standard feedforward network for transformer blocks.

A two-layer MLP with a configurable activation function. Dropout is applied after the activation, before the second linear layer.

Parameters:

Name	Type	Description	Default
`hidden_dim`	`int`	Input and output dimension.	required
`ffn_hidden_dim`	`int`	Hidden dimension of the FFN.	required
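`activation`	`str`	Activation function name: one of 'gelu', 'relu', 'silu', 'tanh', 'sigmoid', 'elu', 'leaky_relu'. Any other name raises ValueError. Default is 'gelu'.	`'gelu'`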
`dropout`	`float`	Dropout probability. Default is 0.0.	`0.0`

Attributes:

Name	Type	Description
`fc1`	`Linear`	First linear layer.
`fc2`	`Linear`	Second linear layer.
`activation`	`Module`	Activation function.
`dropout`	`Dropout`	Dropout layer.
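
As a worked example of how the two linear layers determine the size of the FFN, the sketch below counts weights and biases for `fc1` and `fc2`, assuming both layers keep the bias term that `nn.Linear` enables by default. `ffn_param_count` is a hypothetical helper, not a spectrans function.

```python
def ffn_param_count(hidden_dim, ffn_hidden_dim):
    """Number of trainable parameters in a two-layer FFN with biases."""
    fc1 = hidden_dim * ffn_hidden_dim + ffn_hidden_dim  # weight + bias
    fc2 = ffn_hidden_dim * hidden_dim + hidden_dim      # weight + bias
    return fc1 + fc2


# hidden_dim = 768 with the common 4x expansion (ffn_hidden_dim = 3072)
total = ffn_param_count(768, 4 * 768)  # 4722432
```

The activation and dropout modules add no parameters, so this is the full count for the FFN under the stated assumption.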

Methods:

Name	Description
`forward`	Forward pass through the FFN.

Source code in spectrans/blocks/base.py

def __init__(
    self,
    hidden_dim: int,
    ffn_hidden_dim: int,
    activation: str = "gelu",
    dropout: float = 0.0,
):
    super().__init__()
    self.hidden_dim = hidden_dim
    self.ffn_hidden_dim = ffn_hidden_dim

    # Linear layers
    self.fc1 = nn.Linear(hidden_dim, ffn_hidden_dim)
    self.fc2 = nn.Linear(ffn_hidden_dim, hidden_dim)

    # Activation function
    activation_functions = {
        "gelu": nn.GELU(),
        "relu": nn.ReLU(),
        "silu": nn.SiLU(),
        "tanh": nn.Tanh(),
        "sigmoid": nn.Sigmoid(),
        "elu": nn.ELU(),
        "leaky_relu": nn.LeakyReLU(),
    }
    if activation not in activation_functions:
        raise ValueError(f"Unknown activation: {activation}")
    self.activation = activation_functions[activation]

    # Dropout
    self.dropout = nn.Dropout(dropout)
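
The name-to-module lookup above can be mirrored with scalar reference functions. This is a sketch only: it assumes the PyTorch default arguments for each module (exact erf-based GELU, ELU with alpha 1.0, LeakyReLU with negative slope 0.01), and `ACTIVATIONS` and `get_activation` are hypothetical names rather than spectrans API.

```python
import math

# Scalar reference forms of the seven supported activations.
ACTIVATIONS = {
    "gelu": lambda x: 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))),
    "relu": lambda x: max(0.0, x),
    "silu": lambda x: x / (1.0 + math.exp(-x)),
    "tanh": math.tanh,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "elu": lambda x: x if x > 0 else math.expm1(x),      # alpha = 1.0
    "leaky_relu": lambda x: x if x > 0 else 0.01 * x,    # negative_slope = 0.01
}


def get_activation(name):
    """Look up an activation by exact name, mirroring the ValueError above."""
    if name not in ACTIVATIONS:
        raise ValueError(f"Unknown activation: {name}")
    return ACTIVATIONS[name]


relu_out = get_activation("relu")(-2.0)  # 0.0
gelu_out = get_activation("gelu")(0.0)   # 0.0
```

Because the lookup is an exact string match, names such as "GELU" or "swish" are rejected even though "gelu" and "silu" are accepted.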

Functions

forward

forward(x: Tensor) -> Tensor

Forward pass through the FFN.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input tensor of shape (..., hidden_dim).	required

Returns:

Type	Description
`Tensor`	Output tensor of shape (..., hidden_dim).

Source code in spectrans/blocks/base.py

def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Forward pass through the FFN.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor of shape (..., hidden_dim).

    Returns
    -------
    torch.Tensor
        Output tensor of shape (..., hidden_dim).
    """
    x = self.fc1(x)
    x = self.activation(x)
    x = self.dropout(x)
    x = self.fc2(x)
    return x

PreNormBlock

PreNormBlock(mixing_layer: MixingLayer | Module, hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: TransformerBlock

Transformer block with pre-layer normalization.

This block applies layer normalization to the input of the mixing layer and of the FFN, leaving the residual path unnormalized. This ordering has been shown to improve training stability (Xiong et al., 2020).

Parameters:

Name	Type	Description	Default
`mixing_layer`	`MixingLayer \| Module`	The mixing or attention layer.	required
`hidden_dim`	`int`	Hidden dimension of the model.	required
`ffn_hidden_dim`	`int \| None`	Hidden dimension of the FFN. Default is 4 * hidden_dim.	`None`
`activation`	`str`	Activation function. Default is 'gelu'.	`'gelu'`
`dropout`	`float`	Dropout probability. Default is 0.0.	`0.0`
`norm_eps`	`float`	Epsilon for layer normalization. Default is 1e-12.	`1e-12`

Source code in spectrans/blocks/base.py

def __init__(
    self,
    mixing_layer: MixingLayer | nn.Module,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    if ffn_hidden_dim is None:
        ffn_hidden_dim = 4 * hidden_dim
    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
        use_pre_norm=True,
        norm_eps=norm_eps,
    )

PostNormBlock

PostNormBlock(mixing_layer: MixingLayer | Module, hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: TransformerBlock

Transformer block with post-layer normalization.

This block applies layer normalization after each residual addition, that is, after the mixing layer and after the FFN, following the original transformer architecture (Vaswani et al., 2017).

Parameters:

Name	Type	Description	Default
`mixing_layer`	`MixingLayer \| Module`	The mixing or attention layer.	required
`hidden_dim`	`int`	Hidden dimension of the model.	required
`ffn_hidden_dim`	`int \| None`	Hidden dimension of the FFN. Default is 4 * hidden_dim.	`None`
`activation`	`str`	Activation function. Default is 'gelu'.	`'gelu'`
`dropout`	`float`	Dropout probability. Default is 0.0.	`0.0`
`norm_eps`	`float`	Epsilon for layer normalization. Default is 1e-12.	`1e-12`

Source code in spectrans/blocks/base.py

def __init__(
    self,
    mixing_layer: MixingLayer | nn.Module,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    if ffn_hidden_dim is None:
        ffn_hidden_dim = 4 * hidden_dim
    super().__init__(
        mixing_layer=mixing_layer,
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
        use_pre_norm=False,
        norm_eps=norm_eps,
    )

ParallelBlock

ParallelBlock(mixing_layer: MixingLayer | Module, hidden_dim: int, ffn_hidden_dim: int | None = None, activation: str = 'gelu', dropout: float = 0.0, norm_eps: float = 1e-12)

Bases: SpectralComponent

Transformer block with parallel mixing and FFN branches.

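This block feeds the same normalized input to the mixing layer and the FFN, sums the two outputs, applies dropout, and adds the residual. A single layer normalization serves both branches. Because neither branch depends on the other, they can be computed concurrently, which can improve efficiency and has been reported to work well in practice.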

Parameters:

Name	Type	Description	Default
`mixing_layer`	`MixingLayer \| Module`	The mixing or attention layer.	required
`hidden_dim`	`int`	Hidden dimension of the model.	required
`ffn_hidden_dim`	`int \| None`	Hidden dimension of the FFN. Default is 4 * hidden_dim.	`None`
`activation`	`str`	Activation function. Default is 'gelu'.	`'gelu'`
`dropout`	`float`	Dropout probability. Default is 0.0.	`0.0`
`norm_eps`	`float`	Epsilon for layer normalization. Default is 1e-12.	`1e-12`

Attributes:

Name	Type	Description
`mixing_layer`	`MixingLayer \| Module`	The mixing or attention layer.
`ffn`	`FeedForwardNetwork`	The feedforward network.
`norm`	`LayerNorm`	Layer normalization.
`dropout`	`Dropout`	Dropout layer.

Methods:

Name	Description
`forward`	Forward pass through the parallel block.

Source code in spectrans/blocks/base.py

def __init__(
    self,
    mixing_layer: MixingLayer | nn.Module,
    hidden_dim: int,
    ffn_hidden_dim: int | None = None,
    activation: str = "gelu",
    dropout: float = 0.0,
    norm_eps: float = 1e-12,
):
    super().__init__()
    self.hidden_dim = hidden_dim
    self.mixing_layer = mixing_layer

    # Default FFN dimension
    if ffn_hidden_dim is None:
        ffn_hidden_dim = 4 * hidden_dim

    # Components
    self.ffn = FeedForwardNetwork(
        hidden_dim=hidden_dim,
        ffn_hidden_dim=ffn_hidden_dim,
        activation=activation,
        dropout=dropout,
    )
    self.norm = nn.LayerNorm(hidden_dim, eps=norm_eps)
    self.dropout = nn.Dropout(dropout)

Functions

forward

forward(x: Tensor) -> Tensor

Forward pass through the parallel block.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input tensor of shape (batch_size, sequence_length, hidden_dim).	required

Returns:

Type	Description
`Tensor`	Output tensor of shape (batch_size, sequence_length, hidden_dim).

Source code in spectrans/blocks/base.py

def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Forward pass through the parallel block.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).

    Returns
    -------
    torch.Tensor
        Output tensor of shape (batch_size, sequence_length, hidden_dim).
    """
    # Normalize input
    normed = self.norm(x)

    # Process mixing and FFN in parallel
    mixed = self.mixing_layer(normed)
    ffn_out = self.ffn(normed)

    # Combine and add residual
    output: torch.Tensor = x + self.dropout(mixed + ffn_out)

    return output
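
To make the contrast with the sequential pre-norm block concrete, the following framework-free sketch applies both compositions to one small vector. It assumes dropout is inactive; `center`, `reverse`, and `square` are hypothetical stand-ins for the normalization, mixing layer, and FFN, chosen only so the arithmetic is easy to follow.

```python
def center(v):
    # Stand-in for layer normalization: subtract the mean.
    mean = sum(v) / len(v)
    return [t - mean for t in v]


def reverse(v):
    # Stand-in for a mixing layer.
    return v[::-1]


def square(v):
    # Stand-in for the FFN.
    return [t * t for t in v]


def parallel_block(x, norm, mixing, ffn):
    """x + mixing(norm(x)) + ffn(norm(x)): both branches read the same input."""
    normed = norm(x)
    return [a + m + f for a, m, f in zip(x, mixing(normed), ffn(normed))]


def sequential_pre_norm_block(x, norm1, mixing, norm2, ffn):
    """The FFN reads the output of the mixing sub-layer, not the block input."""
    h = [a + m for a, m in zip(x, mixing(norm1(x)))]
    return [a + f for a, f in zip(h, ffn(norm2(h)))]


x = [1.0, 2.0, 3.0]
parallel_out = parallel_block(x, center, reverse, square)                       # [3.0, 2.0, 3.0]
sequential_out = sequential_pre_norm_block(x, center, reverse, center, square)  # [2.0, 2.0, 2.0]
```

The two results differ because the parallel FFN never sees the mixing output. The parallel form also needs only one normalization per block.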