ProteinTokenizer

ProteinTokenizer converts raw amino acid sequences into tokens. It accepts input in uppercase or lowercase, with or without special tokens.

By default, ProteinTokenizer uses the standard alphabet.

multimolecule.tokenisers.ProteinTokenizer

Bases: Tokenizer

Tokenizer for Protein sequences.

Parameters:

alphabet (Alphabet | str | List[str] | None, default: None)

Alphabet to use for tokenization.

  • If None, the standard protein alphabet is used.
  • If a string, it should be the name of a predefined alphabet. The options are:
    • standard
    • iupac
    • streamline
  • If an Alphabet or a list of characters, that specific alphabet is used.

do_upper_case (bool, default: True)

Whether to convert input to uppercase.

Examples:

>>> from multimolecule import ProteinTokenizer
>>> tokenizer = ProteinTokenizer()
>>> tokenizer('ACDEFGHIKLMNPQRSTVWYXZBJUO')["input_ids"]
[1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]
>>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]
>>> tokenizer('manlgcwmlv')["input_ids"]
[1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]
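
The ids in the examples above follow a simple pattern, which the sketch below tries to reproduce without installing MultiMolecule. It is not the library's implementation: the vocabulary layout (six special tokens at ids 0 to 5, then the standard alphabet in the order shown) is an assumption inferred from the example outputs, `encode` is a hypothetical helper, and mapping unknown characters to `<unk>` is assumed rather than confirmed by the source.

```python
import re
from typing import List

# Assumed vocabulary layout, inferred from the example outputs above:
# six special tokens take ids 0-5, followed by the standard alphabet.
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>", "<null>"]
STANDARD_ALPHABET = "ACDEFGHIKLMNPQRSTVWYXZBJUO.*-"

VOCAB = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS + list(STANDARD_ALPHABET))}

# Try to match a whole special token first; otherwise consume one character.
_TOKEN_PATTERN = re.compile(
    "|".join(re.escape(tok) for tok in SPECIAL_TOKENS) + "|.", re.DOTALL
)


def encode(sequence: str, do_upper_case: bool = True) -> List[int]:
    """Hypothetical stand-in for ProteinTokenizer()(sequence)["input_ids"]."""
    ids = [VOCAB["<cls>"]]
    for token in _TOKEN_PATTERN.findall(sequence):
        # Special tokens are kept as written; residues are upper-cased on request.
        if do_upper_case and token not in SPECIAL_TOKENS:
            token = token.upper()
        # Assumption: anything outside the vocabulary becomes <unk>.
        ids.append(VOCAB.get(token, VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    return ids


print(encode("manlgcwmlv"))
```

Under these assumptions, `encode` returns the same ids as the three documented examples. With `do_upper_case=False`, lowercase residues are not in the assumed vocabulary and fall back to `<unk>`.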
Source code in multimolecule/tokenisers/protein/tokenization_protein.py
class ProteinTokenizer(Tokenizer):
    """
    Tokenizer for Protein sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If is `None`, the standard protein alphabet will be used.
            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include
                + `standard`
                + `iupac`
                + `streamline`
            - If is an alphabet or a list of characters, that specific alphabet will be used.
        do_upper_case: Whether to convert input to uppercase.

    Examples:
        >>> from multimolecule import ProteinTokenizer
        >>> tokenizer = ProteinTokenizer()
        >>> tokenizer('ACDEFGHIKLMNPQRSTVWYXZBJUO')["input_ids"]
        [1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]
        >>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]
        >>> tokenizer('manlgcwmlv')["input_ids"]
        [1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        do_upper_case: bool = True,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet)
        super().__init__(
            alphabet=alphabet,
            additional_special_tokens=additional_special_tokens,
            do_upper_case=do_upper_case,
            **kwargs,
        )

    def _tokenize(self, text: str, **kwargs):
        if self.do_upper_case:
            text = text.upper()
        return list(text)

MultiMolecule provides a set of predefined alphabets for tokenization.

Standard Alphabet

The standard alphabet extends the IUPAC alphabet with six additional symbols: J, U, O, ., -, and *.

  • J: Xle; Leucine (L) or Isoleucine (I)
  • U: Sec; Selenocysteine
  • O: Pyl; Pyrrolysine
  • .: is not used in MultiMolecule and is reserved for future use.
  • -: is not used in MultiMolecule and is reserved for future use.
  • *: is not used in MultiMolecule and is reserved for future use.
| Amino Acid Code | Three-letter Code | Amino Acid |
| --- | --- | --- |
| A | Ala | Alanine |
| C | Cys | Cysteine |
| D | Asp | Aspartic Acid |
| E | Glu | Glutamic Acid |
| F | Phe | Phenylalanine |
| G | Gly | Glycine |
| H | His | Histidine |
| I | Ile | Isoleucine |
| K | Lys | Lysine |
| L | Leu | Leucine |
| M | Met | Methionine |
| N | Asn | Asparagine |
| P | Pro | Proline |
| Q | Gln | Glutamine |
| R | Arg | Arginine |
| S | Ser | Serine |
| T | Thr | Threonine |
| V | Val | Valine |
| W | Trp | Tryptophan |
| Y | Tyr | Tyrosine |
| X | Xaa | Any amino acid |
| Z | Glx | Glutamine (Q) or Glutamic acid (E) |
| B | Asx | Aspartic acid (D) or Asparagine (N) |
| J | Xle | Leucine (L) or Isoleucine (I) |
| U | Sec | Selenocysteine |
| O | Pyl | Pyrrolysine |
| . |  | Not Used |
| * | *** | Not Used |
| - |  | Not Used |
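
The ambiguity codes B, Z, and J each stand for one of two concrete residues. As a rough illustration of what that means in practice, the sketch below enumerates the concrete sequences an ambiguous sequence could represent. `AMBIGUOUS` and `expand` are hypothetical names that are not part of MultiMolecule, and X (any amino acid) is deliberately left unexpanded to keep the output small.

```python
from itertools import product
from typing import Dict, Iterator

# Two-way ambiguity codes taken from the table above.
AMBIGUOUS: Dict[str, str] = {
    "B": "DN",  # Asx: Aspartic acid or Asparagine
    "Z": "QE",  # Glx: Glutamine or Glutamic acid
    "J": "LI",  # Xle: Leucine or Isoleucine
}


def expand(sequence: str) -> Iterator[str]:
    """Yield every concrete sequence matching the B/Z/J codes in `sequence`."""
    # Each position contributes either its own letter or its two alternatives.
    pools = [AMBIGUOUS.get(residue, residue) for residue in sequence.upper()]
    for combination in product(*pools):
        yield "".join(combination)


print(list(expand("ABZ")))
```

A sequence with n ambiguous positions expands to 2^n candidates, so this approach is only practical for short sequences.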

IUPAC Alphabet

The IUPAC amino acid code is a standard code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent protein sequences.

The IUPAC alphabet extends the streamline alphabet, which already includes X, with the ambiguity codes B and Z.

| Amino Acid Code | Three-letter Code | Amino Acid |
| --- | --- | --- |
| A | Ala | Alanine |
| B | Asx | Aspartic acid (D) or Asparagine (N) |
| C | Cys | Cysteine |
| D | Asp | Aspartic Acid |
| E | Glu | Glutamic Acid |
| F | Phe | Phenylalanine |
| G | Gly | Glycine |
| H | His | Histidine |
| I | Ile | Isoleucine |
| K | Lys | Lysine |
| L | Leu | Leucine |
| M | Met | Methionine |
| N | Asn | Asparagine |
| P | Pro | Proline |
| Q | Gln | Glutamine |
| R | Arg | Arginine |
| S | Ser | Serine |
| T | Thr | Threonine |
| V | Val | Valine |
| W | Trp | Tryptophan |
| Y | Tyr | Tyrosine |
| X | Xaa | Any amino acid |
| Z | Glx | Glutamine (Q) or Glutamic acid (E) |

Streamline Alphabet

The streamline alphabet is a simplified version of the standard alphabet. It contains the 20 canonical amino acids plus X for any amino acid.

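| Amino Acid Code | Three-letter Code | Amino Acid |
| --- | --- | --- |
| A | Ala | Alanine |
| C | Cys | Cysteine |
| D | Asp | Aspartic Acid |
| E | Glu | Glutamic Acid |
| F | Phe | Phenylalanine |
| G | Gly | Glycine |
| H | His | Histidine |
| I | Ile | Isoleucine |
| K | Lys | Lysine |
| L | Leu | Leucine |
| M | Met | Methionine |
| N | Asn | Asparagine |
| P | Pro | Proline |
| Q | Gln | Glutamine |
| R | Arg | Arginine |
| S | Ser | Serine |
| T | Thr | Threonine |
| V | Val | Valine |
| W | Trp | Tryptophan |
| Y | Tyr | Tyrosine |
| X | Xaa | Any amino acid |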
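
Going by the three tables, the predefined alphabets nest: streamline is contained in IUPAC, which is contained in standard. The sketch below encodes that relationship as plain strings and picks the smallest alphabet that covers a given sequence. The constants and `smallest_alphabet` are hypothetical and not part of the MultiMolecule API, and the symbol order within each string is an assumption.

```python
from typing import Dict, Optional

# Symbol sets taken from the tables above; the order is an assumption.
STREAMLINE = "ACDEFGHIKLMNPQRSTVWYX"  # 20 canonical residues + X
IUPAC = STREAMLINE + "ZB"             # adds the ambiguity codes Z and B
STANDARD = IUPAC + "JUO.*-"           # adds J, U, O and three reserved symbols

# Listed from smallest to largest; dicts preserve insertion order.
ALPHABETS: Dict[str, str] = {
    "streamline": STREAMLINE,
    "iupac": IUPAC,
    "standard": STANDARD,
}


def smallest_alphabet(sequence: str) -> Optional[str]:
    """Return the name of the smallest alphabet covering `sequence`, or None."""
    residues = set(sequence.upper())
    for name, alphabet in ALPHABETS.items():
        if residues <= set(alphabet):
            return name
    return None


print(smallest_alphabet("manlgcwmlv"))
```

A check like this can help decide which name to pass as `alphabet` before constructing a tokenizer, for example falling back to `standard` only when a dataset contains selenocysteine (U) or pyrrolysine (O).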