DnaTokenizer¶
DnaTokenizer tokenizes raw DNA nucleotide sequences regardless of whether the input is uppercase or lowercase, uses T (Thymine) or U (Uracil), or includes special tokens. It also supports tokenization into nmers and codons, so you don't have to write complex code to preprocess your data.
By default, DnaTokenizer uses the standard alphabet.
If nmers is greater than 1, or codon is set to True, it will instead use the streamline alphabet.
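The default-alphabet rule above can be written as a small decision function. This is a minimal sketch: `select_alphabet_name` is a hypothetical helper introduced for illustration, not part of the MultiMolecule API, and it only applies when no explicit alphabet is passed.

```python
def select_alphabet_name(nmers: int = 1, codon: bool = False) -> str:
    """Hypothetical helper mirroring the documented default-alphabet rule."""
    # k-mer and codon tokenization fall back to the smaller streamline alphabet.
    if nmers > 1 or codon:
        return "streamline"
    # Single-nucleotide tokenization uses the standard alphabet.
    return "standard"


print(select_alphabet_name())            # standard
print(select_alphabet_name(nmers=3))     # streamline
print(select_alphabet_name(codon=True))  # streamline
```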
multimolecule.tokenisers.DnaTokenizer¶
Bases: Tokenizer
Tokenizer for DNA sequences.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
alphabet | Alphabet \| str \| List[str] \| None | Alphabet to use for tokenization. | None |
nmers | int | Size of kmer to tokenize. | 1 |
codon | bool | Whether to tokenize into codons. | False |
replace_U_with_T | bool | Whether to replace U with T. | True |
do_upper_case | bool | Whether to convert input to uppercase. | True |
Examples:
>>> from multimolecule import DnaTokenizer
>>> tokenizer = DnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHV.X*-')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = DnaTokenizer(replace_U_with_T=False)
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = DnaTokenizer(nmers=3)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 21, 81, 6, 8, 19, 71, 2]
>>> tokenizer = DnaTokenizer(codon=True)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 6, 71, 2]
>>> tokenizer('tataaagtaa')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
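To make the nmers and codon examples above easier to follow, here is a standalone sketch that reproduces the same IDs without importing MultiMolecule. It rests on assumptions inferred from the example outputs, not on the library source: the six special tokens occupy IDs 0 to 5, and the k-mer vocabulary is the Cartesian product of the streamline alphabet `ACGTN` in that order. `build_kmer_vocab` and `kmer_ids` are hypothetical names.

```python
from itertools import product

# Assumed layout, inferred from the example outputs above.
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>", "<null>"]
STREAMLINE = "ACGTN"


def build_kmer_vocab(k: int) -> dict:
    """Map special tokens, then every k-mer over the streamline alphabet, to IDs."""
    kmers = ["".join(p) for p in product(STREAMLINE, repeat=k)]
    return {token: i for i, token in enumerate(SPECIAL_TOKENS + kmers)}


def kmer_ids(sequence: str, k: int = 3, codon: bool = False) -> list:
    """Tokenize into overlapping k-mers, or non-overlapping codons if codon=True."""
    seq = sequence.upper().replace("U", "T")
    if codon and len(seq) % k != 0:
        raise ValueError(
            f"length of input sequence must be a multiple of {k} "
            f"for codon tokenization, but got {len(seq)}"
        )
    vocab = build_kmer_vocab(k)
    step = k if codon else 1  # stride 1 overlaps k-mers, stride k does not
    pieces = [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]
    unk = vocab["<unk>"]
    return [vocab["<cls>"]] + [vocab.get(p, unk) for p in pieces] + [vocab["<eos>"]]


print(kmer_ids("tataaagta", k=3))              # [1, 84, 21, 81, 6, 8, 19, 71, 2]
print(kmer_ids("tataaagta", k=3, codon=True))  # [1, 84, 6, 71, 2]
```

The only difference between the two modes in this sketch is the stride of the sliding window, which is why codon tokenization requires the sequence length to be a multiple of 3.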
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
MultiMolecule provides a set of predefined alphabets for tokenization.
Standard Alphabet¶
The standard alphabet is an extended version of the IUPAC alphabet.
This extension adds two symbols to the IUPAC alphabet, X and *.
X: Any base. It is slightly different from N, which represents an unknown base. In automatic word embedding conversion, X will be initialized as the mean of A, C, G, and T, while N will not be further processed.
*: Not used in MultiMolecule; reserved for future use.
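The mean initialization described for X can be illustrated with toy vectors. This is a sketch only: the 4-dimensional embeddings and the `mean_embedding` helper are made up for illustration and do not reflect how MultiMolecule stores or converts embeddings.

```python
def mean_embedding(vectors: list) -> list:
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy embeddings for the four canonical bases (illustrative values).
embeddings = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
}

# X ("any base") is initialized as the mean of A, C, G, and T.
embeddings["X"] = mean_embedding([embeddings[base] for base in "ACGT"])
print(embeddings["X"])  # [0.25, 0.25, 0.25, 0.25]
```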
gap
Note that we use . to represent a gap in the sequence.
While - exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.
Code | Represents |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
N | Unknown |
R | A or G |
Y | C or T |
S | C or G |
W | A or T |
K | G or T |
M | A or C |
B | C, G, or T |
D | A, G, or T |
H | A, C, or T |
V | A, C, or G |
. | Gap |
X | Any |
* | Not Used |
- | Not Used |
IUPAC Alphabet¶
The IUPAC nucleotide code is a standard nucleotide code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent DNA sequences.
In addition to the streamline alphabet, it includes 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap.
Code | Represents |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
R | A or G |
Y | C or T |
S | C or G |
W | A or T |
K | G or T |
M | A or C |
B | C, G, or T |
D | A, G, or T |
H | A, C, or T |
V | A, C, or G |
N | A, C, G, or T |
. | Gap |
Note that we use . to represent a gap in the sequence.
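The ambiguity codes in the table above can be expanded into the concrete bases they stand for. The following sketch encodes that table as a dictionary and uses it to test whether a concrete sequence is compatible with an ambiguous pattern. `IUPAC_EXPANSION` and `matches` are hypothetical names, and the gap symbol is deliberately left out of the mapping, so it matches no base here.

```python
# Each IUPAC code mapped to the bases it may represent (taken from the table above).
IUPAC_EXPANSION = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(pattern: str, sequence: str) -> bool:
    """Return True if every base in `sequence` is allowed by the code at that position."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC_EXPANSION.get(code, "")  # unknown codes match nothing
        for code, base in zip(pattern.upper(), sequence.upper())
    )


print(matches("TATAWR", "TATAAG"))  # True: W allows A, R allows G
print(matches("TATAWR", "TATACG"))  # False: W does not allow C
```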
Streamline Alphabet¶
The streamline alphabet adds one symbol to the nucleobase alphabet: N, which represents an unknown nucleobase.
Code | Nucleotide |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
N | Unknown |
Nucleobase Alphabet¶
The nucleobase alphabet is a minimal version of the DNA alphabet that includes only the four canonical nucleotides A, C, G, and T.
Code | Nucleotide |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
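The four alphabets described on this page nest inside one another, which the sketch below checks using plain strings built from the tables. The constant names, the `ALPHABETS` dictionary, and the `smallest_alphabet` helper are illustrative assumptions and are not how MultiMolecule defines its alphabets internally.

```python
# Symbols copied from the tables on this page.
NUCLEOBASE = "ACGT"
STREAMLINE = NUCLEOBASE + "N"
IUPAC = "ACGTRYSWKMBDHVN."
STANDARD = "ACGTNRYSWKMBDHV.X*-"

# Ordered from smallest to largest.
ALPHABETS = {
    "nucleobase": NUCLEOBASE,
    "streamline": STREAMLINE,
    "iupac": IUPAC,
    "standard": STANDARD,
}

# Each alphabet is a strict subset of the next one.
assert set(NUCLEOBASE) < set(STREAMLINE) < set(IUPAC) < set(STANDARD)


def smallest_alphabet(sequence: str):
    """Return the name of the smallest alphabet covering `sequence`, or None."""
    symbols = set(sequence.upper())
    for name, alphabet in ALPHABETS.items():
        if symbols <= set(alphabet):
            return name
    return None


print(smallest_alphabet("acgt"))   # nucleobase
print(smallest_alphabet("ACGTN"))  # streamline
print(smallest_alphabet("ACGR"))   # iupac
print(smallest_alphabet("ACGX"))   # standard
```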