ProteinTokenizer
ProteinTokenizer tokenizes raw amino acid sequences into tokens. It accepts input in uppercase or lowercase, with or without special tokens.
By default, ProteinTokenizer uses the standard alphabet.
multimolecule.tokenisers.ProteinTokenizer
Bases: Tokenizer
Tokenizer for Protein sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `Alphabet \| str \| List[str] \| None` | Alphabet to use for tokenization. | None |
| | `bool` | Whether to convert input to uppercase. | True |
Examples:
>>> from multimolecule import ProteinTokenizer
>>> tokenizer = ProteinTokenizer()
>>> tokenizer('ACDEFGHIKLMNPQRSTVWYXZBJUO')["input_ids"]
[1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 2]
>>> tokenizer('<pad><cls><eos><unk><mask><null>.*-')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 32, 33, 34, 2]
>>> tokenizer('manlgcwmlv')["input_ids"]
[1, 16, 6, 17, 15, 11, 7, 24, 16, 15, 23, 2]
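The outputs above are consistent with a vocabulary in which six special tokens take ids 0 to 5 and the standard alphabet follows in table order. The sketch below reproduces that behaviour in plain Python to illustrate case folding and special-token handling. It is not the library's implementation: `toy_tokenize`, `VOCAB`, `TOKEN_PATTERN`, and the `uppercase` argument are hypothetical names, and the vocabulary layout is an assumption inferred from the examples.

```python
import re
from typing import List

# Assumed layout, inferred from the doctest outputs: special tokens first
# (ids 0-5), then the standard alphabet in table order (ids 6-34).
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>", "<null>"]
STANDARD_ALPHABET = list("ACDEFGHIKLMNPQRSTVWYXZBJUO.*-")
VOCAB = {token: idx for idx, token in enumerate(SPECIAL_TOKENS + STANDARD_ALPHABET)}

# Try to match a whole special token first, otherwise take one character.
TOKEN_PATTERN = re.compile("|".join(map(re.escape, SPECIAL_TOKENS)) + "|.")


def toy_tokenize(sequence: str, uppercase: bool = True) -> List[int]:
    """Map a protein sequence to ids, wrapped in <cls> ... <eos>."""
    ids = [VOCAB["<cls>"]]
    for piece in TOKEN_PATTERN.findall(sequence):
        # Special tokens are case-sensitive; only residues are upper-cased.
        if uppercase and piece not in SPECIAL_TOKENS:
            piece = piece.upper()
        # Anything outside the vocabulary falls back to <unk>.
        ids.append(VOCAB.get(piece, VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    return ids


print(toy_tokenize("manlgcwmlv"))
```

Splitting before upper-casing keeps special tokens such as `<pad>` intact while lowercase residues still map to the same ids as their uppercase forms.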
Source code in multimolecule/tokenisers/protein/tokenization_protein.py
MultiMolecule provides a set of predefined alphabets for tokenization.
Standard Alphabet
The standard alphabet is an extended version of the IUPAC alphabet.
This extension adds six symbols to the IUPAC alphabet: `J`, `U`, `O`, `.`, `-`, and `*`.
`J`: Xle; Leucine (L) or Isoleucine (I)
`U`: Sec; Selenocysteine
`O`: Pyl; Pyrrolysine
`.`: not used in MultiMolecule; reserved for future use.
`-`: not used in MultiMolecule; reserved for future use.
`*`: not used in MultiMolecule; reserved for future use.
Amino Acid Code | Three letter Code | Amino Acid |
---|---|---|
A | Ala | Alanine |
C | Cys | Cysteine |
D | Asp | Aspartic Acid |
E | Glu | Glutamic Acid |
F | Phe | Phenylalanine |
G | Gly | Glycine |
H | His | Histidine |
I | Ile | Isoleucine |
K | Lys | Lysine |
L | Leu | Leucine |
M | Met | Methionine |
N | Asn | Asparagine |
P | Pro | Proline |
Q | Gln | Glutamine |
R | Arg | Arginine |
S | Ser | Serine |
T | Thr | Threonine |
V | Val | Valine |
W | Trp | Tryptophan |
Y | Tyr | Tyrosine |
X | Xaa | Any amino acid |
Z | Glx | Glutamine (Q) or Glutamic acid (E) |
B | Asx | Aspartic acid (D) or Asparagine (N) |
J | Xle | Leucine (L) or Isoleucine (I) |
U | Sec | Selenocysteine |
O | Pyl | Pyrrolysine |
. | ... | Not Used |
* | *** | Not Used |
- | --- | Not Used |
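Several codes in the table stand for more than one residue. As a minimal sketch, the mapping below expands each code into the residues it can represent. The names `AMBIGUITY_CODES` and `possible_residues` are hypothetical and not part of MultiMolecule. Treating `X` as the 20 canonical residues only, and returning an empty set for the reserved symbols, are assumptions made for this illustration.

```python
from typing import Dict, FrozenSet

# The 20 canonical amino acids from the table above.
CANONICAL: FrozenSet[str] = frozenset("ACDEFGHIKLMNPQRSTVWY")

# Codes that represent more than one residue, as listed in the table.
AMBIGUITY_CODES: Dict[str, FrozenSet[str]] = {
    "B": frozenset("DN"),  # Asx: Aspartic acid or Asparagine
    "Z": frozenset("QE"),  # Glx: Glutamine or Glutamic acid
    "J": frozenset("LI"),  # Xle: Leucine or Isoleucine
    "X": CANONICAL,        # Xaa: any amino acid (assumption: canonical 20 only)
}

# Symbols the table marks as "Not Used".
RESERVED: FrozenSet[str] = frozenset(".*-")


def possible_residues(code: str) -> FrozenSet[str]:
    """Return the set of residues a one-letter code may stand for."""
    code = code.upper()
    if code in AMBIGUITY_CODES:
        return AMBIGUITY_CODES[code]
    if code in CANONICAL or code in {"U", "O"}:
        return frozenset({code})
    if code in RESERVED:
        return frozenset()  # reserved symbols carry no residue meaning here
    raise ValueError(f"{code!r} is not in the standard alphabet")


print(sorted(possible_residues("J")))
```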
IUPAC Alphabet
The IUPAC amino acid code is a standard code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent protein sequences.
The IUPAC alphabet adds two ambiguity symbols to the streamline alphabet, `B` and `Z`. `X` is already part of the streamline alphabet.
Amino Acid Code | Three letter Code | Amino Acid |
---|---|---|
A | Ala | Alanine |
B | Asx | Aspartic acid (D) or Asparagine (N) |
C | Cys | Cysteine |
D | Asp | Aspartic Acid |
E | Glu | Glutamic Acid |
F | Phe | Phenylalanine |
G | Gly | Glycine |
H | His | Histidine |
I | Ile | Isoleucine |
K | Lys | Lysine |
L | Leu | Leucine |
M | Met | Methionine |
N | Asn | Asparagine |
P | Pro | Proline |
Q | Gln | Glutamine |
R | Arg | Arginine |
S | Ser | Serine |
T | Thr | Threonine |
V | Val | Valine |
W | Trp | Tryptophan |
Y | Tyr | Tyrosine |
X | Xaa | Any amino acid |
Z | Glx | Glutamine (Q) or Glutamic acid (E) |
Streamline Alphabet
The streamline alphabet is a simplified version of the standard alphabet: the 20 canonical amino acids plus `X` for any amino acid.
Amino Acid Code | Three letter Code | Amino Acid |
---|---|---|
A | Ala | Alanine |
C | Cys | Cysteine |
D | Asp | Aspartic Acid |
E | Glu | Glutamic Acid |
F | Phe | Phenylalanine |
G | Gly | Glycine |
H | His | Histidine |
I | Ile | Isoleucine |
K | Lys | Lysine |
L | Leu | Leucine |
M | Met | Methionine |
N | Asn | Asparagine |
P | Pro | Proline |
Q | Gln | Glutamine |
R | Arg | Arginine |
S | Ser | Serine |
T | Thr | Threonine |
V | Val | Valine |
W | Trp | Tryptophan |
Y | Tyr | Tyrosine |
X | Xaa | Any amino acid |
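Read together, the three tables nest: every streamline symbol is in the IUPAC table, and every IUPAC symbol is in the standard table. The sketch below encodes the tables as plain strings and picks the smallest alphabet that covers a sequence. `STREAMLINE`, `IUPAC`, `STANDARD`, and `smallest_alphabet` are hypothetical names for this illustration, not MultiMolecule identifiers, and the symbol order within each string is an assumption.

```python
# Symbols copied from the tables above; the ordering is an assumption.
STREAMLINE = "ACDEFGHIKLMNPQRSTVWYX"   # 20 canonical residues plus X
IUPAC = STREAMLINE + "ZB"              # adds the Glx and Asx ambiguity codes
STANDARD = IUPAC + "JUO.*-"            # adds J, U, O and three reserved symbols

ALPHABETS = (
    ("streamline", STREAMLINE),
    ("iupac", IUPAC),
    ("standard", STANDARD),
)


def smallest_alphabet(sequence: str) -> str:
    """Name the smallest of the three alphabets that covers the sequence."""
    symbols = set(sequence.upper())
    for name, alphabet in ALPHABETS:
        if symbols <= set(alphabet):
            return name
    unknown = sorted(symbols - set(STANDARD))
    raise ValueError(f"symbols outside the standard alphabet: {unknown}")


print(len(STREAMLINE), len(IUPAC), len(STANDARD))
print(smallest_alphabet("manlgcwmlv"))
```

A check like this can help decide whether a dataset needs the full standard alphabet or fits within one of the smaller ones.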