ProteinTokenizer

ProteinTokenizer converts raw amino acid sequences into tokens. It accepts uppercase or lowercase input, with or without special tokens.

By default, ProteinTokenizer uses the standard alphabet.

multimolecule.tokenisers.ProteinTokenizer

Bases: Tokenizer

Tokenizer for Protein sequences.

Parameters:

Name	Type	Description	Default
`alphabet`	`Alphabet | str | List[str] | None`	Alphabet to use for tokenization. If `None`, the standard protein alphabet is used. If a string, it must be the name of a predefined alphabet: `standard`, `iupac`, or `streamline`. If an `Alphabet` or a list of characters, that specific alphabet is used.	`None`
`do_upper_case`	`bool`	Whether to convert input to uppercase.	`True`

Examples:

Python Console Session
>>> from multimolecule import ProteinTokenizer
>>> tokenizer = ProteinTokenizer()
>>> tokenizer('ACDEFGHIKLMNPQRSTVWYXZBJUO')["input_ids"]
[1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]
>>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]
>>> tokenizer('manlgcwmlv')["input_ids"]
[1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]
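The ids in the session above follow a simple layout: six special tokens come first, followed by the symbols of the standard alphabet in order. The following is a minimal pure-Python sketch of that mapping, not the library's implementation. The vocabulary layout is inferred from the example outputs, and `encode`, `VOCAB`, and `TOKEN_PATTERN` are hypothetical names introduced only for illustration.

```python
import re
from typing import List

# Assumed vocabulary layout, inferred from the example outputs above:
# six special tokens first, then the standard alphabet in order.
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>", "<null>"]
STANDARD_ALPHABET = list("ACDEFGHIKLMNPQRSTVWYXZBJUO.*-")
VOCAB = {token: idx for idx, token in enumerate(SPECIAL_TOKENS + STANDARD_ALPHABET)}

# Match a whole special token such as "<mask>" first, otherwise any single character.
TOKEN_PATTERN = re.compile(
    "|".join(re.escape(token) for token in SPECIAL_TOKENS) + "|.", re.DOTALL
)


def encode(sequence: str, do_upper_case: bool = True) -> List[int]:
    """Hypothetical helper: map a protein string to ids wrapped in <cls> ... <eos>."""
    tokens = TOKEN_PATTERN.findall(sequence)
    if do_upper_case:
        # Upper-case residues only, so special tokens such as "<pad>" stay intact.
        tokens = [t if t in SPECIAL_TOKENS else t.upper() for t in tokens]
    # Symbols outside the vocabulary fall back to <unk>.
    ids = [VOCAB.get(t, VOCAB["<unk>"]) for t in tokens]
    return [VOCAB["<cls>"], *ids, VOCAB["<eos>"]]


print(encode("manlgcwmlv"))
```

Under these assumptions the sketch reproduces the three outputs shown in the session; the real tokenizer may handle edge cases (for example additional special tokens) differently.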

Source code in multimolecule/tokenisers/protein/tokenization_protein.py

Python
class ProteinTokenizer(Tokenizer):
    """
    Tokenizer for Protein sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If `None`, the standard protein alphabet will be used.
            - If a `string`, it should correspond to the name of a predefined alphabet. The options include
                + `standard`
                + `iupac`
                + `streamline`
            - If an alphabet or a list of characters, that specific alphabet will be used.
        do_upper_case: Whether to convert input to uppercase.

    Examples:
        >>> from multimolecule import ProteinTokenizer
        >>> tokenizer = ProteinTokenizer()
        >>> tokenizer('ACDEFGHIKLMNPQRSTVWYXZBJUO')["input_ids"]
        [1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]
        >>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]
        >>> tokenizer('manlgcwmlv')["input_ids"]
        [1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        do_upper_case: bool = True,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet)
        super().__init__(
            alphabet=alphabet,
            additional_special_tokens=additional_special_tokens,
            do_upper_case=do_upper_case,
            **kwargs,
        )

    def _tokenize(self, text: str, **kwargs):
        if self.do_upper_case:
            text = text.upper()
        return list(text)

MultiMolecule provides a set of predefined alphabets for tokenization.

Standard Alphabet

The standard alphabet extends the IUPAC alphabet with six additional symbols: J, U, O, ., -, and *.

J: Xle; Leucine (L) or Isoleucine (I)
U: Sec; Selenocysteine
O: Pyl; Pyrrolysine
.: is not used in MultiMolecule and is reserved for future use.
-: is not used in MultiMolecule and is reserved for future use.
*: is not used in MultiMolecule and is reserved for future use.

Amino Acid Code	Three letter Code	Amino Acid
A	Ala	Alanine
C	Cys	Cysteine
D	Asp	Aspartic Acid
E	Glu	Glutamic Acid
F	Phe	Phenylalanine
G	Gly	Glycine
H	His	Histidine
I	Ile	Isoleucine
K	Lys	Lysine
L	Leu	Leucine
M	Met	Methionine
N	Asn	Asparagine
P	Pro	Proline
Q	Gln	Glutamine
R	Arg	Arginine
S	Ser	Serine
T	Thr	Threonine
V	Val	Valine
W	Trp	Tryptophan
Y	Tyr	Tyrosine
X	Xaa	Any amino acid
Z	Glx	Glutamine (Q) or Glutamic acid (E)
B	Asx	Aspartic acid (D) or Asparagine (N)
J	Xle	Leucine (L) or Isoleucine (I)
U	Sec	Selenocysteine
O	Pyl	Pyrrolysine
.	...	Not Used
*	***	Not Used
-	---	Not Used
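The ambiguity codes in the table (B, Z, J, and X) each stand for more than one residue. As an illustration of what that means in practice, the sketch below enumerates the concrete sequences an ambiguous sequence can represent. `expand` and `AMBIGUITY` are hypothetical names, and treating X as "any of the 20 canonical amino acids" is an assumption of this example rather than something MultiMolecule specifies.

```python
from itertools import product
from typing import Dict, List

CANONICAL = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids

# Ambiguity codes from the table above; X is assumed to mean any canonical residue.
AMBIGUITY: Dict[str, str] = {
    "B": "DN",  # Aspartic acid or Asparagine
    "Z": "QE",  # Glutamine or Glutamic acid
    "J": "LI",  # Leucine or Isoleucine
    "X": CANONICAL,
}


def expand(sequence: str) -> List[str]:
    """Hypothetical helper: list every concrete sequence an ambiguous one can stand for."""
    choices = [AMBIGUITY.get(residue, residue) for residue in sequence.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(expand("AB"))  # ['AD', 'AN']
```

The number of expansions grows multiplicatively with each ambiguous position, so this is only practical for short sequences or ones with few ambiguity codes.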

IUPAC Alphabet

The IUPAC amino acid code is a standard notation proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent protein sequences.

The IUPAC alphabet extends the streamline alphabet with the ambiguity codes B and Z. X (any amino acid) is already part of the streamline alphabet.

Amino Acid Code	Three letter Code	Amino Acid
A	Ala	Alanine
B	Asx	Aspartic acid (D) or Asparagine (N)
C	Cys	Cysteine
D	Asp	Aspartic Acid
E	Glu	Glutamic Acid
F	Phe	Phenylalanine
G	Gly	Glycine
H	His	Histidine
I	Ile	Isoleucine
K	Lys	Lysine
L	Leu	Leucine
M	Met	Methionine
N	Asn	Asparagine
P	Pro	Proline
Q	Gln	Glutamine
R	Arg	Arginine
S	Ser	Serine
T	Thr	Threonine
V	Val	Valine
W	Trp	Tryptophan
Y	Tyr	Tyrosine
X	Xaa	Any amino acid
Z	Glx	Glutamine (Q) or Glutamic acid (E)

Streamline Alphabet

The streamline alphabet is a simplified version of the standard alphabet: the 20 canonical amino acids plus X for any amino acid.

Amino Acid Code	Three letter Code	Amino Acid
A	Ala	Alanine
C	Cys	Cysteine
D	Asp	Aspartic Acid
E	Glu	Glutamic Acid
F	Phe	Phenylalanine
G	Gly	Glycine
H	His	Histidine
I	Ile	Isoleucine
K	Lys	Lysine
L	Leu	Leucine
M	Met	Methionine
N	Asn	Asparagine
P	Pro	Proline
Q	Gln	Glutamine
R	Arg	Arginine
S	Ser	Serine
T	Thr	Threonine
V	Val	Valine
W	Trp	Tryptophan
Y	Tyr	Tyrosine
X	Xaa	Any amino acid
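The three tables describe nested alphabets, which can be checked with plain set operations. The sketch below transcribes the symbols from the tables on this page (they are not read from the MultiMolecule package) and adds a hypothetical helper, `smallest_alphabet`, that reports the smallest alphabet able to represent a given sequence.

```python
from typing import Optional

# Transcribed from the three tables above, ordered from smallest to largest.
ALPHABETS = {
    "streamline": set("ACDEFGHIKLMNPQRSTVWYX"),
    "iupac": set("ACDEFGHIKLMNPQRSTVWYXZB"),
    "standard": set("ACDEFGHIKLMNPQRSTVWYXZBJUO.*-"),
}

# Symbols each alphabet adds over the previous one.
ADDED_BY_IUPAC = sorted(ALPHABETS["iupac"] - ALPHABETS["streamline"])
ADDED_BY_STANDARD = sorted(ALPHABETS["standard"] - ALPHABETS["iupac"])


def smallest_alphabet(sequence: str) -> Optional[str]:
    """Hypothetical helper: name of the smallest alphabet covering `sequence`, or None."""
    symbols = set(sequence.upper())
    for name, alphabet in ALPHABETS.items():  # dicts preserve insertion order
        if symbols <= alphabet:
            return name
    return None


print(ADDED_BY_IUPAC)            # ['B', 'Z']
print(smallest_alphabet("MKU"))  # standard
```

A check like this can help decide which `alphabet` name to pass to the tokenizer, for example when a dataset contains selenocysteine (U), which only the standard alphabet covers.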
