DotBracketTokenizer

DotBracketTokenizer provides a simple way to tokenize secondary structure in dot-bracket notation. It also supports tokenization into nmers and codons, so you don’t have to write complex code to preprocess your data.

By default, DotBracketTokenizer uses the standard alphabet. If `nmers` is greater than 1 or `codon` is `True`, it uses the streamline alphabet instead.
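The defaulting rule can be written as a small function. This is a sketch of the rule as stated here, not the library's implementation; `default_alphabet_name` is a hypothetical helper name.

```python
def default_alphabet_name(nmers: int = 1, codon: bool = False) -> str:
    """Hypothetical helper: name of the alphabet used when none is given."""
    # Multi-symbol tokens (n-mers or codons) fall back to the smaller alphabet.
    if codon or nmers > 1:
        return "streamline"
    return "standard"


print(default_alphabet_name())            # standard
print(default_alphabet_name(nmers=3))     # streamline
print(default_alphabet_name(codon=True))  # streamline
```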

multimolecule.tokenisers.DotBracketTokenizer

Bases: Tokenizer

Tokenizer for Secondary Structure sequences.

Parameters:

Name	Type	Description	Default
`alphabet`	`Alphabet | str | List[str] | None`	Alphabet to use for tokenization. If `None`, the standard Secondary Structure alphabet is used. If a string, it should be the name of a predefined alphabet; the options include `standard`, `iupac`, `streamline`, and `nucleobase`. If an `Alphabet` or a list of characters, that specific alphabet is used.	`None`
`nmers`	`int`	Size of each n-mer (k-mer) token.	`1`
`codon`	`bool`	Whether to tokenize into codons (non-overlapping groups of three symbols).	`False`

Examples:

>>> from multimolecule import DotBracketTokenizer
>>> tokenizer = DotBracketTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>.()+,[]{}|<>-_:~$@^%*')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
>>> tokenizer('(.)')["input_ids"]
[1, 7, 6, 8, 2]
>>> tokenizer('+(.)')["input_ids"]
[1, 9, 7, 6, 8, 2]
>>> tokenizer = DotBracketTokenizer(nmers=3)
>>> tokenizer('(((((+..........)))))')["input_ids"]
[1, 27, 27, 27, 29, 34, 54, 6, 6, 6, 6, 6, 6, 6, 6, 8, 16, 48, 48, 48, 2]
>>> tokenizer = DotBracketTokenizer(codon=True)
>>> tokenizer('(((((+..........)))))')["input_ids"]
[1, 27, 29, 6, 6, 6, 16, 48, 2]
>>> tokenizer('(((((+...........)))))')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 22
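
The two modes differ in how the input is split: `nmers` uses an overlapping sliding window, while `codon` uses non-overlapping chunks of three. The following standalone sketch mirrors the `_tokenize` method shown in the source listing below. The function names `sliding_nmers` and `split_codons` are illustrative and not part of MultiMolecule.

```python
from typing import List


def sliding_nmers(text: str, n: int) -> List[str]:
    """Overlapping windows of length n, advancing one character at a time."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]


def split_codons(text: str) -> List[str]:
    """Non-overlapping chunks of three characters."""
    if len(text) % 3 != 0:
        raise ValueError(f"length must be a multiple of 3, but got {len(text)}")
    return [text[i : i + 3] for i in range(0, len(text), 3)]


structure = "(((((+..........)))))"
nmer_tokens = sliding_nmers(structure, 3)
codon_tokens = split_codons(structure)
print(len(nmer_tokens), nmer_tokens[:6])
print(len(codon_tokens), codon_tokens)
```

A 21-character input yields 19 overlapping 3-mers but only 7 codons, which matches the token counts in the examples above once the `<cls>` and `<eos>` ids are excluded.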

Source code in multimolecule/tokenisers/dot_bracket/tokenization_db.py

class DotBracketTokenizer(Tokenizer):
    """
    Tokenizer for Secondary Structure sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If is `None`, the standard Secondary Structure alphabet will be used.
            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include
                + `standard`
                + `iupac`
                + `streamline`
                + `nucleobase`
            - If is an alphabet or a list of characters, that specific alphabet will be used.
        nmers: Size of kmer to tokenize.
        codon: Whether to tokenize into codons.

    Examples:
        >>> from multimolecule import DotBracketTokenizer
        >>> tokenizer = DotBracketTokenizer()
        >>> tokenizer('<pad><cls><eos><unk><mask><null>.()+,[]{}|<>-_:~$@^%*')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
        >>> tokenizer('(.)')["input_ids"]
        [1, 7, 6, 8, 2]
        >>> tokenizer('+(.)')["input_ids"]
        [1, 9, 7, 6, 8, 2]
        >>> tokenizer = DotBracketTokenizer(nmers=3)
        >>> tokenizer('(((((+..........)))))')["input_ids"]
        [1, 27, 27, 27, 29, 34, 54, 6, 6, 6, 6, 6, 6, 6, 6, 8, 16, 48, 48, 48, 2]
        >>> tokenizer = DotBracketTokenizer(codon=True)
        >>> tokenizer('(((((+..........)))))')["input_ids"]
        [1, 27, 29, 6, 6, 6, 16, 48, 2]
        >>> tokenizer('(((((+...........)))))')["input_ids"]
        Traceback (most recent call last):
        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 22
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        nmers: int = 1,
        codon: bool = False,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if codon and (nmers > 1 and nmers != 3):
            raise ValueError("Codon and nmers cannot be used together.")
        if codon:
            nmers = 3  # set to 3 to get correct vocab
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet, nmers=nmers)
        super().__init__(
            alphabet=alphabet,
            nmers=nmers,
            codon=codon,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )
        self.nmers = nmers
        self.codon = codon

    def _tokenize(self, text: str, **kwargs):
        if self.codon:
            if len(text) % 3 != 0:
                raise ValueError(
                    f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
                )
            return [text[i : i + 3] for i in range(0, len(text), 3)]
        if self.nmers > 1:
            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203
        return list(text)

MultiMolecule provides a set of predefined alphabets for tokenization.

Standard Alphabet

The standard alphabet is an extended version of the Extended Dot-Bracket Notation. This extension includes most symbols from the WUSS notation for better compatibility with existing tools.

Code	Represents
.	unpaired
(	internal helices of all terminal stems
)	internal helices of all terminal stems
+	nick between strands
,	unpaired in multibranch loops
[	internal helices that include at least one annotated () stem
]	internal helices that include at least one annotated () stem
{	all internal helices of deeper multifurcations
}	all internal helices of deeper multifurcations
|	mostly paired
<	simple terminal stems
>	simple terminal stems
-	bulges and interior loops
_	unpaired
:	single stranded in the exterior loop
~	regions of target and query left unaligned by a local structural alignment
$	Not Used
@	Not Used
^	Not Used
%	Not Used
*	Not Used
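
The ids in the examples above are consistent with a vocabulary that lists six special tokens first and then the symbols of the standard alphabet in table order. The sketch below works under that assumption. `SPECIAL_TOKENS`, `VOCAB`, and `encode` are illustrative names rather than MultiMolecule API, mapping unknown characters to `<unk>` is also an assumption, and special tokens embedded in the input string are not handled.

```python
from typing import Dict, List

# Assumed ordering, inferred from the example ids on this page.
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>", "<null>"]
STANDARD_ALPHABET = list(".()+,[]{}|<>-_:~$@^%*")

VOCAB: Dict[str, int] = {
    token: index for index, token in enumerate(SPECIAL_TOKENS + STANDARD_ALPHABET)
}


def encode(structure: str) -> List[int]:
    """Map each character to its id and wrap the result in <cls> ... <eos>."""
    unk = VOCAB["<unk>"]  # assumption: unknown characters map to <unk>
    body = [VOCAB.get(symbol, unk) for symbol in structure]
    return [VOCAB["<cls>"], *body, VOCAB["<eos>"]]


print(encode("(.)"))   # [1, 7, 6, 8, 2]
print(encode("+(.)"))  # [1, 9, 7, 6, 8, 2]
```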

Extended Alphabet

Extended Dot-Bracket Notation is a generalization of the original Dot-Bracket notation. It may use additional pairs of brackets to annotate pseudoknots, since different pairs of brackets are not required to be nested within each other.

Code	Represents
.	unpaired
(	internal helices of all terminal stems
)	internal helices of all terminal stems
+	nick between strands
,	unpaired in multibranch loops
[	internal helices that include at least one annotated () stem
]	internal helices that include at least one annotated () stem
{	all internal helices of deeper multifurcations
}	all internal helices of deeper multifurcations
|	mostly paired
<	simple terminal stems
>	simple terminal stems

Note that we use . to represent a gap in the sequence.
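
Because each bracket type is matched independently, an extended dot-bracket string can be decoded into base pairs with one stack per bracket type. The following is a minimal sketch of that general technique. The function `parse_pairs` is illustrative and not part of MultiMolecule, and any symbol other than the four bracket pairs is treated as an unpaired position.

```python
from typing import Dict, List, Tuple

# Closing bracket -> matching opening bracket.
BRACKETS: Dict[str, str] = {")": "(", "]": "[", "}": "{", ">": "<"}


def parse_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return sorted (i, j) index pairs, matching each bracket type on its own stack."""
    stacks: Dict[str, List[int]] = {opener: [] for opener in BRACKETS.values()}
    pairs: List[Tuple[int, int]] = []
    for j, symbol in enumerate(structure):
        if symbol in stacks:  # opening bracket: remember its position
            stacks[symbol].append(j)
        elif symbol in BRACKETS:  # closing bracket: pair with the latest opener
            stack = stacks[BRACKETS[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {j}")
            pairs.append((stack.pop(), j))
    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {opener!r} at position {stack[-1]}")
    return sorted(pairs)


# A pseudoknot: the [ ] pair crosses the ( ) pairs, which plain nesting cannot express.
print(parse_pairs("((..[..))..]"))  # [(0, 8), (1, 7), (4, 11)]
```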

Streamline Alphabet

The streamline alphabet adds one symbol to the dot-bracket alphabet: `+`, which represents a nick between strands.

Code	Represents
.	unpaired
(	internal helices of all terminal stems
)	internal helices of all terminal stems
+	nick between strands
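
With `nmers=3` or `codon=True`, the ids in the examples above are consistent with a vocabulary made of the six special tokens followed by every 3-mer over the streamline alphabet, enumerated in `.()+` order. This is an inference from those example outputs, not a documented guarantee. `NMER_VOCAB` and `encode_tokens` below are illustrative names.

```python
from itertools import product
from typing import Dict, List

SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>", "<null>"]
STREAMLINE_ALPHABET = ".()+"

# Assumed layout: special tokens first, then all 4**3 = 64 three-symbol combinations.
three_mers = ["".join(combo) for combo in product(STREAMLINE_ALPHABET, repeat=3)]
NMER_VOCAB: Dict[str, int] = {
    token: index for index, token in enumerate(SPECIAL_TOKENS + three_mers)
}


def encode_tokens(tokens: List[str]) -> List[int]:
    """Look up each 3-mer and wrap the ids in <cls> ... <eos>."""
    return [NMER_VOCAB["<cls>"], *(NMER_VOCAB[t] for t in tokens), NMER_VOCAB["<eos>"]]


structure = "(((((+..........)))))"
codons = [structure[i : i + 3] for i in range(0, len(structure), 3)]
print(encode_tokens(codons))  # [1, 27, 29, 6, 6, 6, 16, 48, 2]
```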

Dot-Bracket Alphabet

Code	Represents
.	unpaired
(	internal helices of all terminal stems
)	internal helices of all terminal stems
