Skip to content

Base Model Classes

spectrans.models.base

Base model classes for spectral transformers.

This module provides the base model classes and common functionality for building complete spectral transformer models. The base classes handle common tasks like embeddings, positional encoding, and output projections, while allowing specific models to customize the core transformer blocks and mixing layers.

Classes:

Name Description
BaseModel

Abstract base class for all spectral transformer models.

PositionalEncoding

Sinusoidal positional encoding following the original Transformer paper.

LearnedPositionalEncoding

Learnable positional embeddings as an alternative to sinusoidal encoding.

ClassificationHead

Output head for classification tasks.

RegressionHead

Output head for regression tasks.

SequenceHead

Output head for sequence-to-sequence tasks.

Examples:

Creating a custom model by extending BaseModel:

>>> from spectrans.models.base import BaseModel
>>> from spectrans.layers.mixing.fourier import FourierMixing
>>> from spectrans.blocks.base import PreNormBlock
>>> class MyModel(BaseModel):
...     def build_blocks(self):
...         return nn.ModuleList([
...             PreNormBlock(
...                 mixing_layer=FourierMixing(self.hidden_dim),
...                 hidden_dim=self.hidden_dim,
...                 ffn_hidden_dim=self.ffn_hidden_dim,
...                 dropout=self.dropout
...             )
...             for _ in range(self.num_layers)
...         ])

Using positional encoding:

>>> from spectrans.models.base import PositionalEncoding
>>> pos_encoder = PositionalEncoding(hidden_dim=768, max_sequence_length=1024)
>>> embeddings = torch.randn(32, 512, 768)
>>> encoded = pos_encoder(embeddings)
Notes

The base model architecture follows the standard transformer pattern:

  1. Input embedding (optional)
  2. Positional encoding (optional)
  3. Stack of transformer blocks
  4. Output projection/head (task-specific)

All models support gradient checkpointing for memory-efficient training and can be easily configured through Pydantic configuration objects.

References

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages 5998-6008.

Classes

BaseModel

BaseModel(vocab_size: int | None = None, hidden_dim: int = 768, num_layers: int = 12, max_sequence_length: int = 512, num_classes: int | None = None, use_positional_encoding: bool = True, positional_encoding_type: PositionalEncodingType = 'sinusoidal', dropout: float = 0.1, ffn_hidden_dim: int | None = None, norm_eps: float = 1e-12, output_type: OutputHeadType = 'classification', gradient_checkpointing: bool = False)

Bases: SpectralComponent, ABC

Abstract base class for spectral transformer models.

This class provides the common functionality shared by all spectral transformer models, including embeddings, positional encoding, and output heads. Subclasses must implement the build_blocks method to define their specific architecture.

Parameters:

Name Type Description Default
vocab_size int | None

Size of the vocabulary for token embeddings. If None, no input embedding layer is created (assumes pre-embedded inputs).

None
hidden_dim int

Hidden dimension size for the model.

768
num_layers int

Number of transformer blocks in the model.

12
max_sequence_length int

Maximum sequence length the model can process.

512
num_classes int | None

Number of output classes for classification. If None, no classification head is added.

None
use_positional_encoding bool

Whether to use positional encoding. Default is True.

True
positional_encoding_type PositionalEncodingType

Type of positional encoding: 'sinusoidal', 'learned', 'rotary', 'alibi', or 'none'. Default is 'sinusoidal'.

'sinusoidal'
dropout float

Dropout probability. Default is 0.1.

0.1
ffn_hidden_dim int | None

Hidden dimension for feedforward networks. If None, defaults to 4 * hidden_dim.

None
norm_eps float

Epsilon for layer normalization. Default is 1e-12.

1e-12
output_type OutputHeadType

Type of output head: 'classification', 'regression', 'sequence', 'lm', or 'none'. Default is 'classification'.

'classification'
gradient_checkpointing bool

Whether to use gradient checkpointing for memory efficiency. Default is False.

False

Attributes:

Name Type Description
hidden_dim int

Hidden dimension size.

num_layers int

Number of transformer blocks.

max_sequence_length int

Maximum sequence length.

embedding Embedding | None

Token embedding layer (if vocab_size is provided).

positional_encoding PositionalEncoding | LearnedPositionalEncoding | None

Positional encoding module.

blocks ModuleList

List of transformer blocks.

output_head Module | None

Task-specific output head.

dropout Dropout

Dropout layer.

Methods:

Name Description
build_blocks

Build the transformer blocks for the model.

forward

Forward pass through the model.

from_config

Create model instance from configuration object.

Source code in spectrans/models/base.py
def __init__(
    self,
    vocab_size: int | None = None,
    hidden_dim: int = 768,
    num_layers: int = 12,
    max_sequence_length: int = 512,
    num_classes: int | None = None,
    use_positional_encoding: bool = True,
    positional_encoding_type: PositionalEncodingType = "sinusoidal",
    dropout: float = 0.1,
    ffn_hidden_dim: int | None = None,
    norm_eps: float = 1e-12,
    output_type: OutputHeadType = "classification",
    gradient_checkpointing: bool = False,
):
    super().__init__()
    self.hidden_dim = hidden_dim
    self.num_layers = num_layers
    self.max_sequence_length = max_sequence_length
    self.ffn_hidden_dim = ffn_hidden_dim or (4 * hidden_dim)
    self.norm_eps = norm_eps
    self.gradient_checkpointing = gradient_checkpointing
    self.output_type = output_type
    self.num_classes = num_classes

    # Input embedding
    if vocab_size is not None:
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self._initialize_embeddings()
    else:
        self.embedding = None

    # Positional encoding
    self.positional_encoding_type = positional_encoding_type
    if use_positional_encoding and positional_encoding_type != "none":
        if positional_encoding_type == "sinusoidal":
            self.positional_encoding = PositionalEncoding(
                hidden_dim=hidden_dim,
                max_sequence_length=max_sequence_length,
                dropout=dropout,
            )
        elif positional_encoding_type == "learned":
            self.positional_encoding = LearnedPositionalEncoding(
                hidden_dim=hidden_dim,
                max_sequence_length=max_sequence_length,
                dropout=dropout,
            )
        elif positional_encoding_type == "rotary":
            self.positional_encoding = RotaryPositionalEncoding(
                hidden_dim=hidden_dim,
                max_sequence_length=max_sequence_length,
            )
        elif positional_encoding_type == "alibi":
            # ALiBi is handled differently - it's added to attention scores
            # We'll need to pass num_heads, which we'll get from the first block
            self.positional_encoding = None  # Will be initialized after blocks
        else:
            raise ValueError(f"Unknown positional encoding type: {positional_encoding_type}")
    else:
        self.positional_encoding = None

    # Build transformer blocks (implemented by subclasses)
    self.blocks = self.build_blocks()

    # Layer norm before output
    self.final_norm = nn.LayerNorm(hidden_dim, eps=norm_eps)

    # Output head
    if output_type == "classification" and num_classes is not None:
        self.output_head = ClassificationHead(hidden_dim, num_classes, dropout)
    elif output_type == "regression":
        self.output_head = RegressionHead(hidden_dim, dropout)
    elif output_type == "sequence":
        vocab_size_out = (
            num_classes
            if num_classes is not None
            else (vocab_size if vocab_size is not None else hidden_dim)
        )
        self.output_head = SequenceHead(hidden_dim, vocab_size_out, dropout)
    elif output_type == "lm":
        # Language modeling head outputs vocab_size
        if num_classes is not None:
            vocab_size_out = num_classes
        elif vocab_size is not None:
            vocab_size_out = vocab_size
        else:
            raise ValueError(
                "Either num_classes or vocab_size must be specified for language modeling output"
            )
        self.output_head = SequenceHead(hidden_dim, vocab_size_out, dropout)
    else:
        self.output_head = None

    # Dropout
    self.dropout = nn.Dropout(dropout)
Functions
build_blocks abstractmethod
build_blocks() -> ModuleList

Build the transformer blocks for the model.

This method must be implemented by subclasses to define the specific architecture using appropriate mixing layers.

Returns:

Type Description
ModuleList

List of transformer blocks.

Source code in spectrans/models/base.py
@abstractmethod
def build_blocks(self) -> nn.ModuleList:
    """Build the transformer blocks for the model.

    This method must be implemented by subclasses to define
    the specific architecture using appropriate mixing layers.

    Returns
    -------
    nn.ModuleList
        List of transformer blocks.
    """
    pass
forward
forward(input_ids: Tensor | None = None, inputs_embeds: Tensor | None = None, attention_mask: Tensor | None = None) -> Tensor

Forward pass through the model.

Parameters:

Name Type Description Default
input_ids Tensor | None

Input token IDs of shape (batch_size, sequence_length). Required if embedding layer exists.

None
inputs_embeds Tensor | None

Pre-embedded inputs of shape (batch_size, sequence_length, hidden_dim). Used if no embedding layer or to bypass embedding.

None
attention_mask Tensor | None

Attention mask of shape (batch_size, sequence_length). Values should be 0 or 1 (1 for tokens to attend to).

None

Returns:

Type Description
Tensor

Output tensor. Shape depends on the output head: - Classification: (batch_size, num_classes) - Regression: (batch_size, 1) - Sequence: (batch_size, sequence_length, vocab_size) - None: (batch_size, sequence_length, hidden_dim)

Raises:

Type Description
ValueError

If neither input_ids nor inputs_embeds is provided.

Source code in spectrans/models/base.py
def forward(
    self,
    input_ids: Tensor | None = None,
    inputs_embeds: Tensor | None = None,
    attention_mask: Tensor | None = None,
) -> Tensor:
    """Forward pass through the model.

    Parameters
    ----------
    input_ids : Tensor | None, optional
        Input token IDs of shape (batch_size, sequence_length).
        Required if embedding layer exists.
    inputs_embeds : Tensor | None, optional
        Pre-embedded inputs of shape (batch_size, sequence_length, hidden_dim).
        Used if no embedding layer or to bypass embedding.
    attention_mask : Tensor | None, optional
        Attention mask of shape (batch_size, sequence_length).
        Values should be 0 or 1 (1 for tokens to attend to).

    Returns
    -------
    Tensor
        Output tensor. Shape depends on the output head:
        - Classification: (batch_size, num_classes)
        - Regression: (batch_size, 1)
        - Sequence: (batch_size, sequence_length, vocab_size)
        - None: (batch_size, sequence_length, hidden_dim)

    Raises
    ------
    ValueError
        If neither input_ids nor inputs_embeds is provided.
    """
    # Get embeddings
    if inputs_embeds is not None:
        hidden_states = inputs_embeds
    elif input_ids is not None and self.embedding is not None:
        hidden_states = self.embedding(input_ids)
    elif input_ids is not None:
        raise ValueError("Model has no embedding layer but input_ids was provided")
    else:
        raise ValueError("Either input_ids or inputs_embeds must be provided")

    # Add positional encoding
    if self.positional_encoding is not None:
        hidden_states = self.positional_encoding(hidden_states)

    # Apply dropout
    hidden_states = self.dropout(hidden_states)

    # Process through transformer blocks
    for block in self.blocks:
        if self.gradient_checkpointing and self.training:
            hidden_states = checkpoint.checkpoint(block, hidden_states, use_reentrant=False)
        else:
            hidden_states = block(hidden_states)

    # Final layer norm
    hidden_states = self.final_norm(hidden_states)

    # Apply output head if present
    if self.output_head is not None:
        output: Tensor = self.output_head(hidden_states, attention_mask)
    else:
        output = hidden_states

    return output
from_config classmethod
from_config(config: ModelConfig) -> BaseModel

Create model instance from configuration object.

Parameters:

Name Type Description Default
config ModelConfig

Configuration object with model parameters.

required

Returns:

Type Description
BaseModel

Configured model instance.

Source code in spectrans/models/base.py
@classmethod
def from_config(cls, config: "ModelConfig") -> "BaseModel":
    """Create model instance from configuration object.

    Parameters
    ----------
    config : ModelConfig
        Configuration object with model parameters.

    Returns
    -------
    BaseModel
        Configured model instance.
    """
    # Build model directly from config attributes
    return cls(
        vocab_size=getattr(config, "vocab_size", None),
        hidden_dim=config.hidden_dim,
        num_layers=config.num_layers,
        max_sequence_length=config.sequence_length,
        num_classes=getattr(config, "num_classes", None),
        use_positional_encoding=getattr(config, "use_positional_encoding", True),
        positional_encoding_type=getattr(config, "positional_encoding_type", "sinusoidal"),
        dropout=config.dropout,
        ffn_hidden_dim=getattr(config, "ffn_hidden_dim", None),
        norm_eps=getattr(config, "norm_eps", 1e-12),
        output_type=getattr(config, "output_type", "classification"),
        gradient_checkpointing=getattr(config, "gradient_checkpointing", False),
    )

PositionalEncoding

PositionalEncoding(hidden_dim: int, max_sequence_length: int = 5000, dropout: float = 0.1)

Bases: Module

Sinusoidal positional encoding.

This module adds sinusoidal positional encodings to embeddings, following the approach in "Attention is All You Need".

Parameters:

Name Type Description Default
hidden_dim int

Dimension of the embeddings.

required
max_sequence_length int

Maximum sequence length to encode.

5000
dropout float

Dropout probability. Default is 0.1.

0.1

Attributes:

Name Type Description
dropout Dropout

Dropout layer.

pe Tensor

Precomputed positional encodings.

Methods:

Name Description
forward

Add positional encoding to input tensor.

Source code in spectrans/models/base.py
def __init__(
    self,
    hidden_dim: int,
    max_sequence_length: int = 5000,
    dropout: float = 0.1,
):
    super().__init__()
    self.dropout = nn.Dropout(p=dropout)
    self.pe: Tensor  # Type annotation for buffer

    # Create positional encodings
    pe = torch.zeros(max_sequence_length, hidden_dim)
    position = torch.arange(0, max_sequence_length).unsqueeze(1).float()

    # Create div_term for the sinusoidal pattern
    div_term = torch.exp(
        torch.arange(0, hidden_dim, 2).float() * -(math.log(10000.0) / hidden_dim)
    )

    # Apply sinusoidal functions
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)

    # Register as buffer (not a parameter)
    self.register_buffer("pe", pe.unsqueeze(0))
Functions
forward
forward(x: Tensor) -> Tensor

Add positional encoding to input tensor.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape (batch_size, sequence_length, hidden_dim).

required

Returns:

Type Description
Tensor

Tensor with positional encoding added.

Source code in spectrans/models/base.py
def forward(self, x: Tensor) -> Tensor:
    """Add positional encoding to input tensor.

    Parameters
    ----------
    x : Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).

    Returns
    -------
    Tensor
        Tensor with positional encoding added.
    """
    # Add positional encoding
    seq_length = x.size(1)
    x = x + self.pe[:, :seq_length, :]
    result: Tensor = self.dropout(x)
    return result

LearnedPositionalEncoding

LearnedPositionalEncoding(hidden_dim: int, max_sequence_length: int = 5000, dropout: float = 0.1)

Bases: Module

Learned positional embeddings.

This module uses learnable positional embeddings instead of fixed sinusoidal encodings.

Parameters:

Name Type Description Default
hidden_dim int

Dimension of the embeddings.

required
max_sequence_length int

Maximum sequence length to encode.

5000
dropout float

Dropout probability. Default is 0.1.

0.1

Attributes:

Name Type Description
position_embeddings Embedding

Learnable position embeddings.

dropout Dropout

Dropout layer.

Methods:

Name Description
forward

Add learned positional embeddings to input tensor.

Source code in spectrans/models/base.py
def __init__(
    self,
    hidden_dim: int,
    max_sequence_length: int = 5000,
    dropout: float = 0.1,
):
    super().__init__()
    self.position_embeddings = nn.Embedding(max_sequence_length, hidden_dim)
    self.dropout = nn.Dropout(p=dropout)

    # Initialize embeddings
    nn.init.xavier_uniform_(self.position_embeddings.weight)
Functions
forward
forward(x: Tensor) -> Tensor

Add learned positional embeddings to input tensor.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape (batch_size, sequence_length, hidden_dim).

required

Returns:

Type Description
Tensor

Tensor with positional embeddings added.

Source code in spectrans/models/base.py
def forward(self, x: Tensor) -> Tensor:
    """Add learned positional embeddings to input tensor.

    Parameters
    ----------
    x : Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).

    Returns
    -------
    Tensor
        Tensor with positional embeddings added.
    """
    seq_length = x.size(1)
    position_ids = torch.arange(seq_length, dtype=torch.long, device=x.device)
    position_ids = position_ids.unsqueeze(0).expand(x.size(0), -1)

    position_embeddings = self.position_embeddings(position_ids)
    x = x + position_embeddings
    result: Tensor = self.dropout(x)
    return result

RotaryPositionalEncoding

RotaryPositionalEncoding(hidden_dim: int, max_sequence_length: int = 5000, base: float = 10000.0)

Bases: Module

Rotary Position Embedding (RoPE).

This module implements Rotary Position Embeddings as described in the RoFormer paper. RoPE encodes absolute position with rotation matrix and naturally incorporates relative position dependency in self-attention formulation.

Parameters:

Name Type Description Default
hidden_dim int

Dimension of the embeddings. Must be even.

required
max_sequence_length int

Maximum sequence length to encode.

5000
base float

Base for the frequency calculation. Default is 10000.

10000.0

Attributes:

Name Type Description
inv_freq Tensor

Inverse frequencies for computing rotary embeddings.

cos_cached Tensor | None

Cached cosine values for positions.

sin_cached Tensor | None

Cached sine values for positions.

References

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2024. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.

Methods:

Name Description
forward

Apply rotary position embedding to input tensor.

Source code in spectrans/models/base.py
def __init__(
    self,
    hidden_dim: int,
    max_sequence_length: int = 5000,
    base: float = 10000.0,
):
    super().__init__()
    if hidden_dim % 2 != 0:
        raise ValueError(f"hidden_dim must be even for RoPE, got {hidden_dim}")

    self.hidden_dim = hidden_dim
    self.max_sequence_length = max_sequence_length
    self.base = base

    # Compute inverse frequencies
    inv_freq = 1.0 / (base ** (torch.arange(0, hidden_dim, 2).float() / hidden_dim))
    self.register_buffer("inv_freq", inv_freq)

    # Cache for precomputed cos/sin
    self.cos_cached: Tensor | None = None
    self.sin_cached: Tensor | None = None
    self._build_cache(max_sequence_length)
Functions
forward
forward(x: Tensor, offset: int = 0) -> Tensor

Apply rotary position embedding to input tensor.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape (batch_size, num_heads, sequence_length, head_dim) or (batch_size, sequence_length, hidden_dim).

required
offset int

Position offset for incremental decoding. Default is 0.

0

Returns:

Type Description
Tensor

Tensor with rotary position embeddings applied.

Source code in spectrans/models/base.py
def forward(self, x: Tensor, offset: int = 0) -> Tensor:
    """Apply rotary position embedding to input tensor.

    Parameters
    ----------
    x : Tensor
        Input tensor of shape (batch_size, num_heads, sequence_length, head_dim)
        or (batch_size, sequence_length, hidden_dim).
    offset : int, optional
        Position offset for incremental decoding. Default is 0.

    Returns
    -------
    Tensor
        Tensor with rotary position embeddings applied.
    """
    # Handle both 3D and 4D inputs
    if x.ndim == 3:
        _, seq_len, _ = x.shape
        # For simplicity, we'll apply RoPE directly to the hidden dimension
        # In practice, this would be applied separately to Q and K in attention
        was_3d = True
    else:
        x.shape[0]
        seq_len = x.shape[2] if x.ndim == 4 else x.shape[1]
        was_3d = False

    # Rebuild cache if needed
    self._build_cache(seq_len + offset)

    if was_3d:
        # For 3D tensor, apply rotation directly
        assert self.cos_cached is not None
        assert self.sin_cached is not None
        cos = self.cos_cached[:, :, offset : offset + seq_len, :].squeeze(1)
        sin = self.sin_cached[:, :, offset : offset + seq_len, :].squeeze(1)

        # Split x into two halves for rotation
        x1, x2 = x.chunk(2, dim=-1)

        # Apply rotation
        rotated = torch.cat(
            [
                x1 * cos[:, :, : x1.shape[-1]] - x2 * sin[:, :, : x2.shape[-1]],
                x1 * sin[:, :, : x1.shape[-1]] + x2 * cos[:, :, : x2.shape[-1]],
            ],
            dim=-1,
        )
    else:
        # For 4D tensor (batch, heads, seq, head_dim)
        assert self.cos_cached is not None
        assert self.sin_cached is not None
        cos = self.cos_cached[:, :, offset : offset + seq_len, :]
        sin = self.sin_cached[:, :, offset : offset + seq_len, :]

        # Split x into two halves for rotation
        x1, x2 = x.chunk(2, dim=-1)

        # Apply rotation
        rotated = torch.cat(
            [
                x1 * cos[:, :, :, : x1.shape[-1]] - x2 * sin[:, :, :, : x2.shape[-1]],
                x1 * sin[:, :, :, : x1.shape[-1]] + x2 * cos[:, :, :, : x2.shape[-1]],
            ],
            dim=-1,
        )

    return rotated

ALiBiPositionalBias

ALiBiPositionalBias(num_heads: int, max_sequence_length: int = 5000)

Bases: Module

Attention with Linear Biases (ALiBi) positional encoding.

This module implements ALiBi, which adds a linear bias to attention scores based on the relative distance between tokens. Unlike traditional position embeddings, ALiBi enables extrapolation to longer sequences.

Parameters:

Name Type Description Default
num_heads int

Number of attention heads.

required
max_sequence_length int

Maximum sequence length to encode.

5000

Attributes:

Name Type Description
num_heads int

Number of attention heads.

slopes Tensor

Head-specific slope parameters.

alibi Tensor | None

Cached linear bias matrix.

References

Ofir Press, Noah A. Smith, and Mike Lewis. 2022. Train short, test long: Attention with linear biases enables input length extrapolation. In Proceedings of the International Conference on Learning Representations (ICLR).

Methods:

Name Description
forward

Add ALiBi bias to attention scores.

get_bias

Get ALiBi bias matrix for a given sequence length.

Source code in spectrans/models/base.py
def __init__(
    self,
    num_heads: int,
    max_sequence_length: int = 5000,
):
    super().__init__()
    self.num_heads = num_heads
    self.max_sequence_length = max_sequence_length

    # Compute slopes for each head
    slopes = self._get_slopes(num_heads)
    self.register_buffer("slopes", slopes)

    # Cache for bias matrix
    self.alibi: Tensor | None = None
    self._build_alibi_tensor(max_sequence_length)
Functions
forward
forward(attention_scores: Tensor) -> Tensor

Add ALiBi bias to attention scores.

Parameters:

Name Type Description Default
attention_scores Tensor

Attention scores of shape (batch_size, num_heads, seq_len, seq_len).

required

Returns:

Type Description
Tensor

Attention scores with ALiBi bias added.

Source code in spectrans/models/base.py
def forward(self, attention_scores: Tensor) -> Tensor:
    """Add ALiBi bias to attention scores.

    Parameters
    ----------
    attention_scores : Tensor
        Attention scores of shape (batch_size, num_heads, seq_len, seq_len).

    Returns
    -------
    Tensor
        Attention scores with ALiBi bias added.
    """
    _, _, seq_len, _ = attention_scores.shape

    # Rebuild cache if needed
    self._build_alibi_tensor(seq_len)

    # Add ALiBi bias
    assert self.alibi is not None
    alibi_bias = self.alibi[:, :, :seq_len, :seq_len].to(attention_scores.device)
    return attention_scores + alibi_bias
get_bias
get_bias(seq_len: int, device: device | None = None) -> Tensor

Get ALiBi bias matrix for a given sequence length.

Parameters:

Name Type Description Default
seq_len int

Sequence length.

required
device device | None

Device to place the bias tensor.

None

Returns:

Type Description
Tensor

ALiBi bias of shape (1, num_heads, seq_len, seq_len).

Source code in spectrans/models/base.py
def get_bias(self, seq_len: int, device: torch.device | None = None) -> Tensor:
    """Get ALiBi bias matrix for a given sequence length.

    Parameters
    ----------
    seq_len : int
        Sequence length.
    device : torch.device | None, optional
        Device to place the bias tensor.

    Returns
    -------
    Tensor
        ALiBi bias of shape (1, num_heads, seq_len, seq_len).
    """
    self._build_alibi_tensor(seq_len)
    assert self.alibi is not None
    bias = self.alibi[:, :, :seq_len, :seq_len]
    if device is not None:
        bias = bias.to(device)
    return bias

ClassificationHead

ClassificationHead(hidden_dim: int, num_classes: int, dropout: float = 0.1, pooling: PoolingType = 'cls')

Bases: Module

Classification output head.

This module pools sequence outputs and projects to class logits.

Parameters:

Name Type Description Default
hidden_dim int

Input hidden dimension.

required
num_classes int

Number of output classes.

required
dropout float

Dropout probability. Default is 0.1.

0.1
pooling PoolingType

Pooling strategy: 'cls', 'mean', or 'max'. Default is 'cls'.

'cls'

Attributes:

Name Type Description
pooling PoolingType

Pooling strategy.

dropout Dropout

Dropout layer.

classifier Linear

Output projection layer.

Methods:

Name Description
forward

Forward pass through classification head.

Source code in spectrans/models/base.py
def __init__(
    self,
    hidden_dim: int,
    num_classes: int,
    dropout: float = 0.1,
    pooling: PoolingType = "cls",
):
    super().__init__()
    self.pooling = pooling
    self.dropout = nn.Dropout(dropout)
    self.classifier = nn.Linear(hidden_dim, num_classes)
Functions
forward
forward(hidden_states: Tensor, attention_mask: Tensor | None = None) -> Tensor

Forward pass through classification head.

Parameters:

Name Type Description Default
hidden_states Tensor

Input tensor of shape (batch_size, sequence_length, hidden_dim).

required
attention_mask Tensor | None

Attention mask for pooling operations.

None

Returns:

Type Description
Tensor

Classification logits of shape (batch_size, num_classes).

Source code in spectrans/models/base.py
def forward(
    self,
    hidden_states: Tensor,
    attention_mask: Tensor | None = None,
) -> Tensor:
    """Forward pass through classification head.

    Parameters
    ----------
    hidden_states : Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).
    attention_mask : Tensor | None, optional
        Attention mask for pooling operations.

    Returns
    -------
    Tensor
        Classification logits of shape (batch_size, num_classes).
    """
    # Pool the sequence
    if self.pooling == "cls":
        # Use first token (CLS token)
        pooled = hidden_states[:, 0, :]
    elif self.pooling == "mean":
        # Mean pooling
        if attention_mask is not None:
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
        else:
            pooled = hidden_states.mean(dim=1)
    elif self.pooling == "max":
        # Max pooling
        if attention_mask is not None:
            mask = attention_mask.unsqueeze(-1).float()
            hidden_states = hidden_states.masked_fill(mask == 0, -1e9)
        pooled = hidden_states.max(dim=1)[0]
    else:
        raise ValueError(f"Unknown pooling strategy: {self.pooling}")

    # Apply dropout and classifier
    pooled = self.dropout(pooled)
    logits: Tensor = self.classifier(pooled)
    return logits

RegressionHead

RegressionHead(hidden_dim: int, dropout: float = 0.1, pooling: PoolingType = 'mean')

Bases: Module

Regression output head.

This module pools sequence outputs and projects to a scalar value.

Parameters:

Name Type Description Default
hidden_dim int

Input hidden dimension.

required
dropout float

Dropout probability. Default is 0.1.

0.1
pooling PoolingType

Pooling strategy: 'cls', 'mean', or 'max'. Default is 'mean'.

'mean'

Attributes:

Name Type Description
pooling PoolingType

Pooling strategy.

dropout Dropout

Dropout layer.

regressor Linear

Output projection layer.

Methods:

Name Description
forward

Forward pass through regression head.

Source code in spectrans/models/base.py
def __init__(
    self,
    hidden_dim: int,
    dropout: float = 0.1,
    pooling: PoolingType = "mean",
):
    super().__init__()
    self.pooling = pooling
    self.dropout = nn.Dropout(dropout)
    self.regressor = nn.Linear(hidden_dim, 1)
Functions
forward
forward(hidden_states: Tensor, attention_mask: Tensor | None = None) -> Tensor

Forward pass through regression head.

Parameters:

Name Type Description Default
hidden_states Tensor

Input tensor of shape (batch_size, sequence_length, hidden_dim).

required
attention_mask Tensor | None

Attention mask for pooling operations.

None

Returns:

Type Description
Tensor

Regression output of shape (batch_size, 1).

Source code in spectrans/models/base.py
def forward(
    self,
    hidden_states: Tensor,
    attention_mask: Tensor | None = None,
) -> Tensor:
    """Forward pass through regression head.

    Parameters
    ----------
    hidden_states : Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).
    attention_mask : Tensor | None, optional
        Attention mask for pooling operations.

    Returns
    -------
    Tensor
        Regression output of shape (batch_size, 1).
    """
    # Pool the sequence (same logic as classification)
    if self.pooling == "cls":
        pooled = hidden_states[:, 0, :]
    elif self.pooling == "mean":
        if attention_mask is not None:
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
        else:
            pooled = hidden_states.mean(dim=1)
    elif self.pooling == "max":
        if attention_mask is not None:
            mask = attention_mask.unsqueeze(-1).float()
            hidden_states = hidden_states.masked_fill(mask == 0, -1e9)
        pooled = hidden_states.max(dim=1)[0]
    else:
        raise ValueError(f"Unknown pooling strategy: {self.pooling}")

    # Apply dropout and regressor
    pooled = self.dropout(pooled)
    output: Tensor = self.regressor(pooled)
    return output

SequenceHead

SequenceHead(hidden_dim: int, vocab_size: int, dropout: float = 0.1)

Bases: Module

Sequence-to-sequence output head.

This module projects hidden states to vocabulary logits for sequence generation tasks.

Parameters:

Name Type Description Default
hidden_dim int

Input hidden dimension.

required
vocab_size int

Output vocabulary size.

required
dropout float

Dropout probability. Default is 0.1.

0.1

Attributes:

Name Type Description
dropout Dropout

Dropout layer.

lm_head Linear

Language modeling head.

Methods:

Name Description
forward

Forward pass through sequence head.

Source code in spectrans/models/base.py
def __init__(
    self,
    hidden_dim: int,
    vocab_size: int,
    dropout: float = 0.1,
):
    super().__init__()
    self.dropout = nn.Dropout(dropout)
    self.lm_head = nn.Linear(hidden_dim, vocab_size)
Functions
forward
forward(hidden_states: Tensor, attention_mask: Tensor | None = None) -> Tensor

Forward pass through sequence head.

Parameters:

Name Type Description Default
hidden_states Tensor

Input tensor of shape (batch_size, sequence_length, hidden_dim).

required
attention_mask Tensor | None

Not used, kept for interface consistency.

None

Returns:

Type Description
Tensor

Vocabulary logits of shape (batch_size, sequence_length, vocab_size).

Source code in spectrans/models/base.py
def forward(
    self,
    hidden_states: Tensor,
    attention_mask: Tensor | None = None,  # noqa: ARG002
) -> Tensor:
    """Forward pass through sequence head.

    Parameters
    ----------
    hidden_states : Tensor
        Input tensor of shape (batch_size, sequence_length, hidden_dim).
    attention_mask : Tensor | None, optional
        Not used, kept for interface consistency.

    Returns
    -------
    Tensor
        Vocabulary logits of shape (batch_size, sequence_length, vocab_size).
    """
    hidden_states = self.dropout(hidden_states)
    logits: Tensor = self.lm_head(hidden_states)
    return logits