
DotBracketTokenizer

DotBracketTokenizer provides a simple way to tokenize secondary structure in dot-bracket notation. It also supports tokenization into nmers and codons, so you don’t have to write complex code to preprocess your data.

By default, DotBracketTokenizer uses the standard alphabet. If nmers is greater than 1, or codon is set to True, it will instead use the streamline alphabet.

multimolecule.tokenisers.DotBracketTokenizer

Bases: Tokenizer

Tokenizer for Secondary Structure sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None, default: None)

Alphabet to use for tokenization.

  • If None, the standard Secondary Structure alphabet is used.
  • If a string, it must be the name of a predefined alphabet. The options are:
    • standard
    • iupac
    • streamline
    • nucleobase
  • If an Alphabet or a list of characters, that specific alphabet is used.

nmers (int, default: 1)

Size of the k-mers to tokenize into.

codon (bool, default: False)

Whether to tokenize into codons.

Examples:

Python Console Session
>>> from multimolecule import DotBracketTokenizer
>>> tokenizer = DotBracketTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>.()+,[]{}|<>-_:~$@^%*')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
>>> tokenizer('(.)')["input_ids"]
[1, 7, 6, 8, 2]
>>> tokenizer('+(.)')["input_ids"]
[1, 9, 7, 6, 8, 2]
>>> tokenizer = DotBracketTokenizer(nmers=3)
>>> tokenizer('(((((+..........)))))')["input_ids"]
[1, 27, 27, 27, 29, 34, 54, 6, 6, 6, 6, 6, 6, 6, 6, 8, 16, 48, 48, 48, 2]
>>> tokenizer = DotBracketTokenizer(codon=True)
>>> tokenizer('(((((+..........)))))')["input_ids"]
[1, 27, 29, 6, 6, 6, 16, 48, 2]
>>> tokenizer('(((((+...........)))))')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 22
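The IDs in the session above suggest a simple layout: six special tokens first, then the symbols of the standard alphabet in order. The sketch below rebuilds that mapping in plain Python to make the numbers easier to read. It is inferred from the documented outputs only. The names `SPECIAL_TOKENS`, `STANDARD_SYMBOLS`, `VOCAB`, and `encode` are hypothetical and not part of MultiMolecule, and mapping unknown symbols to `<unk>` is an assumption.

```python
from __future__ import annotations

# Assumption: IDs are assigned in order, special tokens first, then the
# standard alphabet symbols, as implied by the console session above.
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>", "<null>"]
STANDARD_SYMBOLS = list(".()+,[]{}|<>-_:~$@^%*")

# Hypothetical vocabulary: token -> integer ID (0..26).
VOCAB = {token: index for index, token in enumerate(SPECIAL_TOKENS + STANDARD_SYMBOLS)}


def encode(structure: str) -> list[int]:
    """Map each symbol to its ID and wrap the result in <cls> ... <eos>."""
    unk = VOCAB["<unk>"]  # assumption: unknown symbols fall back to <unk>
    ids = [VOCAB.get(symbol, unk) for symbol in structure]
    return [VOCAB["<cls>"], *ids, VOCAB["<eos>"]]


print(encode("(.)"))   # [1, 7, 6, 8, 2]
print(encode("+(.)"))  # [1, 9, 7, 6, 8, 2]
```

This only covers single-character tokenization (`nmers=1`); the IDs for n-mer and codon vocabularies depend on the alphabet the library builds and are not reproduced here.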
Source code in multimolecule/tokenisers/dot_bracket/tokenization_db.py
Python
class DotBracketTokenizer(Tokenizer):
    """
    Tokenizer for Secondary Structure sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If `None`, the standard Secondary Structure alphabet is used.
            - If a `string`, it must be the name of a predefined alphabet. The options are:
                + `standard`
                + `iupac`
                + `streamline`
                + `nucleobase`
            - If an alphabet or a list of characters, that specific alphabet is used.
        nmers: Size of kmer to tokenize.
        codon: Whether to tokenize into codons.

    Examples:
        >>> from multimolecule import DotBracketTokenizer
        >>> tokenizer = DotBracketTokenizer()
        >>> tokenizer('<pad><cls><eos><unk><mask><null>.()+,[]{}|<>-_:~$@^%*')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
        >>> tokenizer('(.)')["input_ids"]
        [1, 7, 6, 8, 2]
        >>> tokenizer('+(.)')["input_ids"]
        [1, 9, 7, 6, 8, 2]
        >>> tokenizer = DotBracketTokenizer(nmers=3)
        >>> tokenizer('(((((+..........)))))')["input_ids"]
        [1, 27, 27, 27, 29, 34, 54, 6, 6, 6, 6, 6, 6, 6, 6, 8, 16, 48, 48, 48, 2]
        >>> tokenizer = DotBracketTokenizer(codon=True)
        >>> tokenizer('(((((+..........)))))')["input_ids"]
        [1, 27, 29, 6, 6, 6, 16, 48, 2]
        >>> tokenizer('(((((+...........)))))')["input_ids"]
        Traceback (most recent call last):
        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 22
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        nmers: int = 1,
        codon: bool = False,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if codon and (nmers > 1 and nmers != 3):
            raise ValueError("Codon and nmers cannot be used together.")
        if codon:
            nmers = 3  # set to 3 to get correct vocab
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet, nmers=nmers)
        super().__init__(
            alphabet=alphabet,
            nmers=nmers,
            codon=codon,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )
        self.nmers = nmers
        self.codon = codon

    def _tokenize(self, text: str, **kwargs):
        if self.codon:
            if len(text) % 3 != 0:
                raise ValueError(
                    f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
                )
            return [text[i : i + 3] for i in range(0, len(text), 3)]
        if self.nmers > 1:
            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203
        return list(text)
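The two splitting modes in `_tokenize` differ in stride: n-mer tokenization slides a window one position at a time, so tokens overlap, while codon tokenization cuts the string into consecutive, non-overlapping triplets. The standalone sketch below mirrors that logic without importing the library. The function name `split_structure` is hypothetical and only illustrates the splitting step, not vocabulary lookup or special tokens.

```python
from __future__ import annotations


def split_structure(text: str, nmers: int = 1, codon: bool = False) -> list[str]:
    """Split a dot-bracket string the way the listing above does."""
    if codon:
        # Codons are consecutive triplets, so the length must divide by 3.
        if len(text) % 3 != 0:
            raise ValueError(
                f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
            )
        return [text[i : i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # Sliding window with stride 1: len(text) - nmers + 1 overlapping tokens.
        return [text[i : i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


print(split_structure("((..))", nmers=3))     # ['((.', '(..', '..)', '.))']
print(split_structure("((..))", codon=True))  # ['((.', '.))']
```

For a string of length 21, as in the docstring examples, the sliding window yields 19 tokens and the codon split yields 7, which matches the lengths of the documented `input_ids` once the leading and trailing special tokens are subtracted.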

MultiMolecule provides a set of predefined alphabets for tokenization.

Standard Alphabet

The standard alphabet is an extended version of the Extended Dot-Bracket Notation. This extension includes most symbols from the WUSS notation for better compatibility with existing tools.

Code Represents
. unpaired
( internal helices of all terminal stems
) internal helices of all terminal stems
+ nick between strands
, unpaired in multibranch loops
[ internal helices that include at least one annotated () stem
] internal helices that include at least one annotated () stem
{ all internal helices of deeper multifurcations
} all internal helices of deeper multifurcations
| mostly paired
< simple terminal stems
> simple terminal stems
- bulges and interior loops
_ unpaired
: single stranded in the exterior loop
~ regions of target and query left unaligned by a local structural alignment
$ Not Used
@ Not Used
^ Not Used
% Not Used
* Not Used

Extended Alphabet

Extended Dot-Bracket Notation is a more generalized version of the original Dot-Bracket notation. It may use additional pairs of brackets to annotate pseudoknots, since different pairs of brackets are not required to be nested.

Code Represents
. unpaired
( internal helices of all terminal stems
) internal helices of all terminal stems
+ nick between strands
, unpaired in multibranch loops
[ internal helices that include at least one annotated () stem
] internal helices that include at least one annotated () stem
{ all internal helices of deeper multifurcations
} all internal helices of deeper multifurcations
| mostly paired
< simple terminal stems
> simple terminal stems

Note that we use . to represent a gap in the sequence.
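Because each bracket type is matched independently, a pseudoknotted structure can be decoded with one stack per bracket type. The sketch below is a minimal illustration of that idea, not MultiMolecule code: the function name `base_pairs` is hypothetical, it assumes that `()`, `[]`, `{}`, and `<>` all denote base pairs, it uses 0-based positions, and it counts every other symbol (including `+`) as an ordinary position.

```python
from __future__ import annotations

# Closing bracket -> matching opening bracket.
BRACKETS = {")": "(", "]": "[", "}": "{", ">": "<"}


def base_pairs(structure: str) -> list[tuple[int, int]]:
    """Return sorted (open, close) index pairs for an extended dot-bracket string."""
    # One stack per bracket type, so different types may cross each other.
    stacks: dict[str, list[int]] = {opener: [] for opener in BRACKETS.values()}
    pairs: list[tuple[int, int]] = []
    for index, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(index)
        elif symbol in BRACKETS:
            stack = stacks[BRACKETS[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {index}")
            pairs.append((stack.pop(), index))
    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {opener!r} at position {stack[-1]}")
    return sorted(pairs)


# The [] pairs cross the () pairs, which plain dot-bracket cannot express.
print(base_pairs("((..[[..))..]]"))  # [(0, 9), (1, 8), (4, 13), (5, 12)]
```

Within a single bracket type the pairs still have to nest; only pairs of different types are allowed to cross.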

Streamline Alphabet

The streamline alphabet adds one symbol to the nucleobase alphabet: N, which represents an unknown nucleobase.

Code Represents
. unpaired
( internal helices of all terminal stems
) internal helices of all terminal stems
+ nick between strands