HAL

Hexamer Additive Linear model for predicting alternative splicing from sequence.

Disclaimer

This is an UNOFFICIAL implementation of Learning the Sequence Determinants of Alternative Splicing from Millions of Random Sequences by Alexander B. Rosenberg et al.

The OFFICIAL repository of HAL is at Alex-Rosenberg/cell-2015.

Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team that released HAL did not write a model card for it, so this model card was written by the MultiMolecule team.

Model Details

HAL is a linear (additive) model that scores alternative 5’ splice-site usage from normalized hexamer (6-mer) frequencies across a 160-nucleotide donor-region window. It was learned from massively parallel reporter assays measuring splicing of millions of random synthetic sequences. The published coefficient table contains a (4096, 8) matrix of hexamer effects; the model averages the eight coefficient columns into one effect per hexamer and applies those effects to normalized hexamer frequencies.
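
The scoring rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the MultiMolecule implementation: the lexicographic hexamer ordering, the normalization (counts divided by the number of valid hexamers), and all function names are hypothetical, and the coefficients are random placeholders rather than the published table.

```python
from itertools import product
import random

K = 6
BASES = "ACGT"
# Assumed ordering: lexicographic over ACGT (AAAAAA, AAAAAC, ...).
# The published table may order its rows differently.
HEXAMERS = ["".join(p) for p in product(BASES, repeat=K)]
INDEX = {h: i for i, h in enumerate(HEXAMERS)}


def hexamer_frequencies(sequence):
    """Count overlapping hexamers and normalize so the frequencies sum to 1."""
    counts = [0.0] * len(HEXAMERS)
    for start in range(len(sequence) - K + 1):
        idx = INDEX.get(sequence[start:start + K].upper())
        if idx is not None:  # hexamers containing N or other symbols are skipped
            counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def average_columns(table):
    """Collapse a (4096, 8) coefficient table into one effect per hexamer."""
    return [sum(row) / len(row) for row in table]


def hal_score(sequence, effects):
    """Dot product of normalized hexamer frequencies and per-hexamer effects."""
    freqs = hexamer_frequencies(sequence)
    return sum(f * e for f, e in zip(freqs, effects))


# Placeholder coefficients: random numbers standing in for the published table.
rng = random.Random(0)
toy_table = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in HEXAMERS]
toy_effects = average_columns(toy_table)
toy_score = hal_score("ACGT" * 40, toy_effects)
```

Because the placeholder coefficients are random, `toy_score` has no biological meaning; the sketch only shows how the averaged effects and the frequency vector combine into a single scalar.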

Model Specification

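| Window | Published Coefficient Columns | Hexamer Features | Num Parameters | FLOPs | MACs |
| ------ | ----------------------------- | ---------------- | -------------- | ----- | ----- |
| 160 nt | 8 (averaged)                  | 4,096            | 4,096          | 8,192 | 4,096 |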
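
The figures in the specification can be reproduced with a few lines of arithmetic. The sketch below assumes that only the averaged weight vector is counted as parameters and that each multiply-accumulate is counted as two floating-point operations; both are common conventions but are not stated explicitly in the source. The variable names are illustrative.

```python
KMER_SIZE = 6
NUCLEOBASE_SIZE = 4
REGION_LENGTH = 160

num_features = NUCLEOBASE_SIZE ** KMER_SIZE        # 4096 possible hexamers
num_parameters = num_features                      # one averaged effect per hexamer
macs = num_features                                # one multiply-accumulate per feature
flops = 2 * macs                                   # a MAC counted as one multiply plus one add
hexamer_positions = REGION_LENGTH - KMER_SIZE + 1  # overlapping hexamers in one window

print(num_features, num_parameters, macs, flops, hexamer_positions)
```

A 160-nucleotide window therefore contains at most 155 overlapping hexamers, far fewer than the 4,096 possible features, so the frequency vector for any single window is sparse.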

Usage

The model file depends on the multimolecule library. You can install it using pip:

Bash
pip install multimolecule

Direct Use

Alternative Splicing Prediction

You can use this model directly to predict a splicing score for a 160-nucleotide DNA sequence window:

Python
>>> import torch
>>> from multimolecule import DnaTokenizer, HalForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/hal")
>>> model = HalForSequencePrediction.from_pretrained("multimolecule/hal")
>>> sequence = "ACGT" * 40
>>> inputs = tokenizer(sequence, add_special_tokens=False, return_tensors="pt")
>>> output = model(**inputs)

>>> output.logits.shape
torch.Size([1, 1])

Interface

  • Input length: 160 nt fixed donor-region window
  • Alphabet: ACGT only; any hexamer spanning an unknown / N token is ignored
  • Special tokens: do not add (add_special_tokens=False)
  • Output: single scalar splicing score per window
  • Variant effect: subtract two window scores and apply sigmoid externally for paired donor comparisons
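
The paired comparison in the last bullet can be sketched as follows. This is an illustration only: the function name is hypothetical, and reading the result as a relative preference for the first window is an assumption rather than something the source states. The two scores would come from `output.logits` for each 160-nt window.

```python
import math


def paired_donor_preference(score_a, score_b):
    """Sigmoid of the difference between two window scores (numerically stable)."""
    diff = score_a - score_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    z = math.exp(diff)  # diff < 0, so exp cannot overflow
    return z / (1.0 + z)


# Toy scores standing in for two model outputs.
reference_vs_variant = paired_donor_preference(0.8, 0.3)
```

Equal scores give exactly 0.5, and swapping the two arguments gives the complementary value, so the comparison is symmetric around 0.5.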

Training Details

HAL was learned from massively parallel splicing reporter assays in which millions of random synthetic sequences were inserted into an alternatively spliced reporter minigene. Splicing outcomes were measured by high-throughput sequencing of the resulting mRNA isoforms.

Training Data

The model was trained on the splicing measurements of millions of degenerate (random) sequences from the reporter library described in the HAL paper. Hexamer coefficients were estimated by regressing the measured splicing index against the hexamer composition of each sequence.

Training Procedure

Pre-training

HAL is a linear regression model. The published hexamer coefficient table is fit to the measured splicing index, and the model prediction is the linear combination of normalized hexamer frequencies with the averaged hexamer effects.

The HAL model uses the published HAL_mer_scores.npz hexamer coefficient table from Rosenberg et al. The table stores 4,096 hexamer rows and eight coefficient columns; the eight columns are averaged into the single per-hexamer effect used by the HAL formula.

Citation

BibTeX
@article{rosenberg2015learning,
  author    = {Rosenberg, Alexander B. and Patwardhan, Rupali P. and Shendure, Jay and Seelig, Georg},
  journal   = {Cell},
  number    = 3,
  pages     = {698--711},
  publisher = {Elsevier BV},
  title     = {Learning the Sequence Determinants of Alternative Splicing from Millions of Random Sequences},
  volume    = 163,
  year      = 2015,
  doi       = {10.1016/j.cell.2015.09.054}
}

Note

The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

BibTeX
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact

Please use the MultiMolecule GitHub issues for any questions or comments on the model card.

Please contact the authors of the HAL paper for questions or comments on the paper/model.

License

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

Text Only
SPDX-License-Identifier: AGPL-3.0-or-later

multimolecule.models.hal

DnaTokenizer

Bases: Tokenizer

Tokenizer for DNA sequences.

Parameters:

Name Type Description Default

alphabet

Alphabet | str | List[str] | None

alphabet to use for tokenization.

  • If None, the standard DNA alphabet will be used.
  • If a string, it should be the name of a predefined alphabet. The options are
    • standard
    • iupac
    • streamline
    • nucleobase
  • If an alphabet or a list of characters, that specific alphabet will be used.
None

nmers

int

Size of kmer to tokenize.

1

codon

bool

Whether to tokenize into codons.

False

replace_U_with_T

bool

Whether to replace U with T.

True

do_upper_case

bool

Whether to convert input to uppercase.

True

Examples:

Python Console Session
>>> from multimolecule import DnaTokenizer
>>> tokenizer = DnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHVX|.*-?')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = DnaTokenizer(replace_U_with_T=False)
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = DnaTokenizer(nmers=3)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 21, 81, 6, 8, 19, 71, 2]
>>> tokenizer = DnaTokenizer(codon=True)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 6, 71, 2]
>>> tokenizer('tataaagtaa')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
Python
class DnaTokenizer(Tokenizer):
    """
    Tokenizer for DNA sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If `None`, the standard DNA alphabet will be used.
            - If a `string`, it should be the name of a predefined alphabet. The options are
                + `standard`
                + `iupac`
                + `streamline`
                + `nucleobase`
            - If an alphabet or a list of characters, that specific alphabet will be used.
        nmers: Size of kmer to tokenize.
        codon: Whether to tokenize into codons.
        replace_U_with_T: Whether to replace U with T.
        do_upper_case: Whether to convert input to uppercase.

    Examples:
        >>> from multimolecule import DnaTokenizer
        >>> tokenizer = DnaTokenizer()
        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHVX|.*-?')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer = DnaTokenizer(replace_U_with_T=False)
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 3, 2]
        >>> tokenizer = DnaTokenizer(nmers=3)
        >>> tokenizer('tataaagta')["input_ids"]
        [1, 84, 21, 81, 6, 8, 19, 71, 2]
        >>> tokenizer = DnaTokenizer(codon=True)
        >>> tokenizer('tataaagta')["input_ids"]
        [1, 84, 6, 71, 2]
        >>> tokenizer('tataaagtaa')["input_ids"]
        Traceback (most recent call last):
        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        nmers: int = 1,
        codon: bool = False,
        replace_U_with_T: bool = True,
        do_upper_case: bool = True,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if codon and (nmers > 1 and nmers != 3):
            raise ValueError("Codon and nmers cannot be used together.")
        if codon:
            nmers = 3  # set to 3 to get correct vocab
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet, nmers=nmers)
        super().__init__(
            alphabet=alphabet,
            nmers=nmers,
            codon=codon,
            replace_U_with_T=replace_U_with_T,
            do_upper_case=do_upper_case,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )
        self.replace_U_with_T = replace_U_with_T
        self.nmers = nmers
        self.codon = codon

    def _tokenize(self, text: str, **kwargs):
        if self.do_upper_case:
            text = text.upper()
        if self.replace_U_with_T:
            text = text.replace("U", "T")
        if self.codon:
            if len(text) % 3 != 0:
                raise ValueError(
                    f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
                )
            return [text[i : i + 3] for i in range(0, len(text), 3)]
        if self.nmers > 1:
            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203
        return list(text)

HalConfig

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a HalModel. It is used to instantiate a HAL model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the HAL model from Learning the Sequence Determinants of Alternative Splicing from Millions of Random Sequences.

HAL (Hexamer Additive Linear model) is a linear model over hexamer (k-mer) features that predicts alternative splicing outcomes such as 5’ splice-site usage. The model weights are a published table of hexamer coefficients.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Parameters:

Name Type Description Default

vocab_size

int

Vocabulary size of the HAL model. Defines the number of different tokens that can be represented by the input_ids passed when calling HalModel. Only the four canonical nucleotides contribute hexamer features; remaining ids are ignored when counting hexamers. Defaults to 5.

5

kmer_size

int

The k-mer (hexamer) size used for feature extraction. The published HAL model uses hexamers (kmer_size=6).

6

nucleobase_size

int

Number of canonical nucleotides used to enumerate k-mers. The number of k-mer features is nucleobase_size ** kmer_size.

4

region_length

int

The length of the sequence region scored by the model. The published HAL/Kipoi model scores a fixed 160-nucleotide 5’ splice-site window.

160

hidden_size

int

Size of the scalar feature consumed by the optional sequence prediction loss wrapper. HAL emits one score, so this must be 1.

1

num_labels

int

Number of output labels. HAL is a single-output regression model, so this defaults to 1.

1

Examples:

Python Console Session
>>> from multimolecule import HalConfig, HalModel
>>> # Initializing a HAL multimolecule/hal style configuration
>>> configuration = HalConfig()
>>> # Initializing a model (with random weights) from the multimolecule/hal style configuration
>>> model = HalModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in multimolecule/models/hal/configuration_hal.py
Python
class HalConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of a
    [`HalModel`][multimolecule.models.HalModel]. It is used to instantiate a HAL model according to the specified
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
    configuration to that of the HAL model from
    [Learning the Sequence Determinants of Alternative Splicing from Millions of Random
    Sequences](https://doi.org/10.1016/j.cell.2015.09.054).

    HAL (Hexamer Additive Linear model) is a linear model over hexamer (k-mer) features that predicts alternative
    splicing outcomes such as 5' splice-site usage. The model weights are a published table of hexamer coefficients.

    Configuration objects inherit from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig] and can be used to
    control the model outputs. Read the documentation from [`PreTrainedConfig`][multimolecule.models.PreTrainedConfig]
    for more information.

    Args:
        vocab_size:
            Vocabulary size of the HAL model. Defines the number of different tokens that can be represented by the
            `input_ids` passed when calling [`HalModel`]. Only the four canonical nucleotides contribute hexamer
            features; remaining ids are ignored when counting hexamers.
            Defaults to 5.
        kmer_size:
            The k-mer (hexamer) size used for feature extraction. The published HAL model uses hexamers (`kmer_size=6`).
        nucleobase_size:
            Number of canonical nucleotides used to enumerate k-mers. The number of k-mer features is
            `nucleobase_size ** kmer_size`.
        region_length:
            The length of the sequence region scored by the model. The published HAL/Kipoi model scores a fixed
            160-nucleotide 5' splice-site window.
        hidden_size:
            Size of the scalar feature consumed by the optional sequence prediction loss wrapper. HAL emits one score,
            so this must be 1.
        num_labels:
            Number of output labels. HAL is a single-output regression model, so this defaults to 1.

    Examples:
        >>> from multimolecule import HalConfig, HalModel
        >>> # Initializing a HAL multimolecule/hal style configuration
        >>> configuration = HalConfig()
        >>> # Initializing a model (with random weights) from the multimolecule/hal style configuration
        >>> model = HalModel(configuration)
        >>> # Accessing the model configuration
        >>> configuration = model.config
    """

    model_type = "hal"

    def __init__(
        self,
        vocab_size: int = 5,
        kmer_size: int = 6,
        nucleobase_size: int = 4,
        region_length: int = 160,
        hidden_size: int = 1,
        head: HeadConfig | None = None,
        num_labels: int = 1,
        bos_token_id: int | None = None,
        eos_token_id: int | None = None,
        pad_token_id: int = 4,
        **kwargs,
    ):
        super().__init__(num_labels=num_labels, pad_token_id=pad_token_id, **kwargs)
        if vocab_size < 5:
            raise ValueError(f"vocab_size ({vocab_size}) must cover the streamline DNA alphabet `ACGTN`")
        if kmer_size != 6:
            raise ValueError(f"The published HAL checkpoint is a hexamer model; `kmer_size` must be 6, got {kmer_size}")
        if nucleobase_size != 4:
            raise ValueError(
                f"The published HAL checkpoint enumerates four canonical nucleotides; "
                f"`nucleobase_size` must be 4, got {nucleobase_size}"
            )
        if region_length != 160:
            raise ValueError(
                f"The published HAL checkpoint scores a fixed 160-nucleotide window; "
                f"`region_length` must be 160, got {region_length}"
            )
        if hidden_size != 1:
            raise ValueError(f"HAL emits a single scalar feature; `hidden_size` must be 1, got {hidden_size}")
        if num_labels != 1:
            raise ValueError(f"HAL emits a single score; `num_labels` must be 1, got {num_labels}")
        self.bos_token_id = bos_token_id  # type: ignore[assignment]
        self.eos_token_id = eos_token_id  # type: ignore[assignment]
        self.vocab_size = vocab_size
        self.kmer_size = kmer_size
        self.nucleobase_size = nucleobase_size
        self.region_length = region_length
        self.hidden_size = hidden_size
        self.num_labels = num_labels
        self.problem_type = "regression"
        self.head = HeadConfig(head) if head is not None else HeadConfig(num_labels=1, problem_type="regression")

    @property
    def num_kmers(self) -> int:
        r"""Number of distinct k-mer features (`nucleobase_size ** kmer_size`)."""
        return self.nucleobase_size**self.kmer_size

    @property
    def num_regions(self) -> int:
        r"""Number of position-specific HAL coefficient regions in the published artifact."""
        return 8

    @property
    def num_features(self) -> int:
        r"""Number of normalized k-mer frequency features consumed by the HAL linear layer."""
        return self.num_kmers

    @property
    def feature_size(self) -> int:
        r"""Alias for `num_features`, matching the feature vector consumed by the HAL linear layer."""
        return self.num_features

num_kmers property

Python
num_kmers: int

Number of distinct k-mer features (nucleobase_size ** kmer_size).

num_regions property

Python
num_regions: int

Number of position-specific HAL coefficient regions in the published artifact.

num_features property

Python
num_features: int

Number of normalized k-mer frequency features consumed by the HAL linear layer.

feature_size property

Python
feature_size: int

Alias for num_features, matching the feature vector consumed by the HAL linear layer.

HalForSequencePrediction

Bases: HalPreTrainedModel

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import HalConfig, HalForSequencePrediction
>>> config = HalConfig()
>>> model = HalForSequencePrediction(config)
>>> output = model(torch.randint(4, (1, config.region_length)), labels=torch.tensor([[1.0]]))
>>> output["logits"].shape
torch.Size([1, 1])
Source code in multimolecule/models/hal/modeling_hal.py
Python
class HalForSequencePrediction(HalPreTrainedModel):
    """
    Examples:
        >>> import torch
        >>> from multimolecule import HalConfig, HalForSequencePrediction
        >>> config = HalConfig()
        >>> model = HalForSequencePrediction(config)
        >>> output = model(torch.randint(4, (1, config.region_length)), labels=torch.tensor([[1.0]]))
        >>> output["logits"].shape
        torch.Size([1, 1])
    """

    def __init__(self, config: HalConfig):
        super().__init__(config)
        self.model = HalModel(config)
        head = config.head
        if head is None:
            raise ValueError("HalForSequencePrediction requires `config.head` to be set")
        self.criterion = Criterion(head)
        # Initialize weights and apply final processing
        self.post_init()

    @can_return_tuple
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        labels: Tensor | None = None,
        **kwargs: Any,
    ) -> Tuple[Tensor, ...] | SequencePredictorOutput:
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            return_dict=True,
            **kwargs,
        )

        logits = outputs.pooler_output
        if logits is None:
            raise RuntimeError("HalModel did not return `pooler_output`")
        loss = self.criterion(logits, labels) if labels is not None else None
        return SequencePredictorOutput(loss=loss, logits=logits)

HalModel

Bases: HalPreTrainedModel

Examples:

Python Console Session
>>> import torch
>>> from multimolecule import HalConfig, HalModel
>>> config = HalConfig()
>>> model = HalModel(config)
>>> output = model(torch.randint(4, (1, config.region_length)))
>>> output["pooler_output"].shape
torch.Size([1, 1])
Source code in multimolecule/models/hal/modeling_hal.py
Python
class HalModel(HalPreTrainedModel):
    """
    Examples:
        >>> import torch
        >>> from multimolecule import HalConfig, HalModel
        >>> config = HalConfig()
        >>> model = HalModel(config)
        >>> output = model(torch.randint(4, (1, config.region_length)))
        >>> output["pooler_output"].shape
        torch.Size([1, 1])
    """

    def __init__(self, config: HalConfig):
        super().__init__(config)
        self.embeddings = HalEmbedding(config)
        self.prediction = HalModule(config)
        # Initialize weights and apply final processing
        self.post_init()

    @merge_with_config_defaults
    @capture_outputs
    def forward(
        self,
        input_ids: Tensor | NestedTensor | None = None,
        attention_mask: Tensor | None = None,
        inputs_embeds: Tensor | NestedTensor | None = None,
        **kwargs: Any,
    ) -> HalModelOutput:
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        if input_ids is None and inputs_embeds is None:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        if isinstance(input_ids, NestedTensor):
            if attention_mask is None:
                attention_mask = input_ids.mask
            input_ids = input_ids.tensor
        if isinstance(inputs_embeds, NestedTensor):
            inputs_embeds = inputs_embeds.tensor

        if inputs_embeds is None:
            assert input_ids is not None
            inputs_embeds = self.embeddings(
                input_ids, attention_mask=attention_mask, dtype=self.prediction.prediction.weight.dtype
            )
        else:
            if inputs_embeds.dim() == 1:
                inputs_embeds = inputs_embeds.unsqueeze(0)
            inputs_embeds = inputs_embeds.to(dtype=self.prediction.prediction.weight.dtype)

        score = self.prediction(inputs_embeds)

        return HalModelOutput(pooler_output=score, hexamer_frequencies=inputs_embeds)

HalModelOutput dataclass

Bases: ModelOutput

Base class for outputs of the HAL model.

Parameters:

Name Type Description Default

pooler_output

`torch.FloatTensor` of shape `(batch_size, num_labels)`

The HAL splicing score predicted by the linear hexamer model.

None

hexamer_frequencies

`torch.FloatTensor` of shape `(batch_size, num_kmers)`, *optional*

The normalized hexamer (k-mer) frequency features derived from the input sequence region.

None

hidden_states

Tuple[FloatTensor, ...] | None

Always None; HAL is a single linear layer and has no intermediate hidden states. Provided for compatibility with the Transformers output convention.

None

attentions

Tuple[FloatTensor, ...] | None

Always None; HAL has no attention layers. Provided for compatibility with the Transformers output convention.

None
Source code in multimolecule/models/hal/modeling_hal.py
Python
@dataclass
class HalModelOutput(ModelOutput):
    """
    Base class for outputs of the HAL model.

    Args:
        pooler_output (`torch.FloatTensor` of shape `(batch_size, num_labels)`):
            The HAL splicing score predicted by the linear hexamer model.
        hexamer_frequencies (`torch.FloatTensor` of shape `(batch_size, num_kmers)`, *optional*):
            The normalized hexamer (k-mer) frequency features derived from the input sequence region.
        hidden_states:
            Always `None`; HAL is a single linear layer and has no intermediate hidden states. Provided for
            compatibility with the Transformers output convention.
        attentions:
            Always `None`; HAL has no attention layers. Provided for compatibility with the Transformers output
            convention.
    """

    pooler_output: Tensor | None = None
    hexamer_frequencies: Tensor | None = None
    hidden_states: Tuple[torch.FloatTensor, ...] | None = None
    attentions: Tuple[torch.FloatTensor, ...] | None = None

HalPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in multimolecule/models/hal/modeling_hal.py
Python
class HalPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = HalConfig
    base_model_prefix = "model"
    _can_record_outputs: dict[str, Any] | None = None
    _no_split_modules = ["HalModule"]

    @torch.no_grad()
    def _init_weights(self, module):
        super()._init_weights(module)
        # The HAL hexamer-coefficient layer is the actual published model weight. It is
        # zero-initialized here so a freshly constructed model is well-defined before the
        # converter loads the published coefficient table.
        if isinstance(module, nn.Linear):
            init.zeros_(module.weight)
            if module.bias is not None:
                init.zeros_(module.bias)