DnaTokenizer

DnaTokenizer tokenizes raw DNA nucleotide sequences into tokens regardless of whether the input is uppercase or lowercase, uses T (Thymine) or U (Uracil), or includes special tokens. It also supports tokenization into nmers and codons, so you don’t have to write your own preprocessing code.

By default, DnaTokenizer uses the standard alphabet. If nmers is greater than 1, or codon is set to True, it will instead use the streamline alphabet.
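
The source does not state why the alphabet changes for nmers and codons. One plausible reason is vocabulary size, since the number of possible k-mers grows as the alphabet size raised to the power k. The sketch below only counts k-mers over the symbols listed in the alphabet tables of this document; `kmer_count` and `enumerate_kmers` are hypothetical helpers, not part of MultiMolecule.

```python
from itertools import product

# Symbols copied from the alphabet tables in this document.
STANDARD = "ACGTNRYSWKMBDHV.X*-"  # 19 symbols
STREAMLINE = "ACGTN"              # 5 symbols


def kmer_count(alphabet: str, k: int) -> int:
    """Number of distinct k-mers over `alphabet` (special tokens excluded)."""
    return len(alphabet) ** k


def enumerate_kmers(alphabet: str, k: int) -> list:
    """List every k-mer explicitly; only practical for small alphabets."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


print(kmer_count(STANDARD, 3))             # 6859
print(kmer_count(STREAMLINE, 3))           # 125
print(enumerate_kmers(STREAMLINE, 2)[:6])  # ['AA', 'AC', 'AG', 'AT', 'AN', 'CA']
```

With 3-mers, the streamline alphabet yields 125 tokens, while the standard alphabet would yield 6859.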

multimolecule.tokenisers.DnaTokenizer

Bases: Tokenizer

Tokenizer for DNA sequences.

Parameters:

Name Type Description Default
alphabet Alphabet | str | List[str] | None

Alphabet to use for tokenization.

  • If None, the standard DNA alphabet is used.
  • If a string, it must be the name of a predefined alphabet. The options are:
    • standard
    • iupac
    • streamline
    • nucleobase
  • If an Alphabet or a list of characters, that specific alphabet is used.
None
nmers int

Size of kmer to tokenize.

1
codon bool

Whether to tokenize into codons.

False
replace_U_with_T bool

Whether to replace U with T.

True
do_upper_case bool

Whether to convert input to uppercase.

True

Examples:

Python Console Session
>>> from multimolecule import DnaTokenizer
>>> tokenizer = DnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHV.X*-')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = DnaTokenizer(replace_U_with_T=False)
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = DnaTokenizer(nmers=3)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 21, 81, 6, 8, 19, 71, 2]
>>> tokenizer = DnaTokenizer(codon=True)
>>> tokenizer('tataaagta')["input_ids"]
[1, 84, 6, 71, 2]
>>> tokenizer('tataaagtaa')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
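
The ids in the nmers=3 and codon examples above are consistent with a simple layout: six special tokens occupy ids 0 to 5, followed by every 3-mer over the streamline alphabet ACGTN in product order. This layout is inferred from the example outputs and is not confirmed from the library source. `kmer_id` is a hypothetical helper written for illustration.

```python
# Assumption: ids 0-5 are the special tokens shown in the first example.
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>", "<null>"]
STREAMLINE = "ACGTN"


def kmer_id(kmer: str, alphabet: str = STREAMLINE, offset: int = len(SPECIAL_TOKENS)) -> int:
    """Read the k-mer as a base-len(alphabet) number and shift it past the special tokens."""
    value = 0
    for base in kmer:
        value = value * len(alphabet) + alphabet.index(base)
    return offset + value


seq = "TATAAAGTA"
overlapping = [seq[i:i + 3] for i in range(len(seq) - 2)]  # nmers=3
codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]     # codon=True

print([kmer_id(k) for k in overlapping])  # [84, 21, 81, 6, 8, 19, 71]
print([kmer_id(c) for c in codons])       # [84, 6, 71]
```

The printed ids match the example outputs once the leading `<cls>` (1) and trailing `<eos>` (2) are added.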
Source code in multimolecule/tokenisers/dna/tokenization_dna.py
Python
class DnaTokenizer(Tokenizer):
    """
    Tokenizer for DNA sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If is `None`, the standard DNA alphabet will be used.
            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include
                + `standard`
                + `iupac`
                + `streamline`
                + `nucleobase`
            - If is an alphabet or a list of characters, that specific alphabet will be used.
        nmers: Size of kmer to tokenize.
        codon: Whether to tokenize into codons.
        replace_U_with_T: Whether to replace U with T.
        do_upper_case: Whether to convert input to uppercase.

    Examples:
        >>> from multimolecule import DnaTokenizer
        >>> tokenizer = DnaTokenizer()
        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGTNRYSWKMBDHV.X*-')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 2]
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer = DnaTokenizer(replace_U_with_T=False)
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 3, 2]
        >>> tokenizer = DnaTokenizer(nmers=3)
        >>> tokenizer('tataaagta')["input_ids"]
        [1, 84, 21, 81, 6, 8, 19, 71, 2]
        >>> tokenizer = DnaTokenizer(codon=True)
        >>> tokenizer('tataaagta')["input_ids"]
        [1, 84, 6, 71, 2]
        >>> tokenizer('tataaagtaa')["input_ids"]
        Traceback (most recent call last):
        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        nmers: int = 1,
        codon: bool = False,
        replace_U_with_T: bool = True,
        do_upper_case: bool = True,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if codon and (nmers > 1 and nmers != 3):
            raise ValueError("Codon and nmers cannot be used together.")
        if codon:
            nmers = 3  # set to 3 to get correct vocab
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet, nmers=nmers)
        super().__init__(
            alphabet=alphabet,
            nmers=nmers,
            codon=codon,
            replace_U_with_T=replace_U_with_T,
            do_upper_case=do_upper_case,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )
        self.replace_U_with_T = replace_U_with_T
        self.nmers = nmers
        self.codon = codon

    def _tokenize(self, text: str, **kwargs):
        if self.do_upper_case:
            text = text.upper()
        if self.replace_U_with_T:
            text = text.replace("U", "T")
        if self.codon:
            if len(text) % 3 != 0:
                raise ValueError(
                    f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
                )
            return [text[i : i + 3] for i in range(0, len(text), 3)]
        if self.nmers > 1:
            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203
        return list(text)
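
To experiment with the splitting behavior without installing the library, the logic of `_tokenize` can be mirrored in a standalone function. This is a minimal sketch: `tokenize_dna` is a hypothetical name, it returns string tokens only, and it does not add special tokens or map tokens to ids.

```python
def tokenize_dna(text, nmers=1, codon=False, replace_u_with_t=True, do_upper_case=True):
    """Standalone mirror of the splitting steps in the listing above (illustrative only)."""
    if do_upper_case:
        text = text.upper()
    if replace_u_with_t:
        # Only uppercase "U" is replaced, as in the listing above.
        text = text.replace("U", "T")
    if codon:
        if len(text) % 3 != 0:
            raise ValueError(
                f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
            )
        # Codons are non-overlapping windows of width 3.
        return [text[i:i + 3] for i in range(0, len(text), 3)]
    if nmers > 1:
        # k-mers are overlapping windows with stride 1.
        return [text[i:i + nmers] for i in range(len(text) - nmers + 1)]
    return list(text)


print(tokenize_dna("acgu"))                   # ['A', 'C', 'G', 'T']
print(tokenize_dna("tataaagta", nmers=3))     # ['TAT', 'ATA', 'TAA', 'AAA', 'AAG', 'AGT', 'GTA']
print(tokenize_dna("tataaagta", codon=True))  # ['TAT', 'AAA', 'GTA']
```

A sequence of length n produces n - k + 1 overlapping k-mers but only n / 3 codons, which is why the two modes give different token counts for the same input.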

MultiMolecule provides a set of predefined alphabets for tokenization.

Standard Alphabet

The standard alphabet is an extended version of the IUPAC alphabet. It adds two symbols to the IUPAC alphabet, X and *.

  • X: Any base. It differs slightly from N, which represents an unknown base. In automatic word embedding conversion, X is initialized as the mean of A, C, G, and T, while N is not processed further.
  • *: Not used in MultiMolecule; reserved for future use.
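
The mean initialization described for X can be illustrated with plain Python lists. The embedding values below are made up for the example, and `mean_embedding` is a hypothetical helper; the library's actual conversion code is not shown in this document.

```python
from statistics import fmean

# Hypothetical 3-dimensional embeddings for the four canonical bases.
embeddings = {
    "A": [1.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0],
    "G": [0.0, 0.0, 1.0],
    "T": [1.0, 1.0, 1.0],
}


def mean_embedding(table, bases="ACGT"):
    """Element-wise mean of the embeddings of `bases`."""
    vectors = [table[b] for b in bases]
    return [fmean(column) for column in zip(*vectors)]


# X is initialized from A, C, G, and T; N is deliberately left untouched.
embeddings["X"] = mean_embedding(embeddings)
print(embeddings["X"])  # [0.5, 0.5, 0.5]
```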

gap

Note that we use . to represent a gap in the sequence.

While - exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.

Code Represents
A Adenine
C Cytosine
G Guanine
T Thymine
N Unknown
R A or G
Y C or T
S C or G
W A or T
K G or T
M A or C
B C, G, or T
D A, G, or T
H A, C, or T
V A, C, or G
. Gap
X Any
* Not Used
- Not Used

IUPAC Alphabet

The IUPAC nucleotide code is a standard notation proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent DNA sequences.

In addition to the symbols of the streamline alphabet, it includes 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap.

Code Represents
A Adenine
C Cytosine
G Guanine
T Thymine
R A or G
Y C or T
S C or G
W A or T
K G or T
M A or C
B C, G, or T
D A, G, or T
H A, C, or T
V A, C, or G
N A, C, G, or T
. Gap

Note that we use . to represent a gap in the sequence.
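
The ambiguity codes in the table can be read as sets of concrete bases. The sketch below encodes the table as a dictionary and expands an ambiguous sequence into every concrete sequence it could stand for. `IUPAC_CODES` and `expand` are hypothetical names for this example and are not part of the MultiMolecule API.

```python
from itertools import product

# Mapping copied from the table above; the gap symbol maps to itself.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
    ".": ".",
}


def expand(seq: str) -> list:
    """Return every concrete sequence matched by an IUPAC-coded sequence."""
    choices = [IUPAC_CODES[symbol] for symbol in seq.upper()]
    return ["".join(p) for p in product(*choices)]


print(expand("ART"))       # ['AAT', 'AGT']
print(len(expand("AYN")))  # 8
```

The number of expansions is the product of the set sizes, so this is only practical for sequences with a few ambiguous positions.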

Streamline Alphabet

The streamline alphabet adds one symbol to the nucleobase alphabet: N, which represents an unknown nucleobase.

Code Nucleotide
A Adenine
C Cytosine
G Guanine
T Thymine
N Unknown

Nucleobase Alphabet

The nucleobase alphabet is a minimal version of the DNA alphabet that includes only the four canonical nucleotides A, C, G, and T.

Code Nucleotide
A Adenine
C Cytosine
G Guanine
T Thymine
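
Since the four alphabets are nested, a quick check can tell which is the smallest one that covers a given sequence. This is a sketch under the assumption that the symbol sets are exactly those in the tables above; `smallest_alphabet` is a hypothetical helper and not a MultiMolecule function.

```python
# Symbol sets copied from the tables in this document, ordered smallest to largest.
ALPHABETS = {
    "nucleobase": "ACGT",
    "streamline": "ACGTN",
    "iupac": "ACGTRYSWKMBDHVN.",
    "standard": "ACGTNRYSWKMBDHV.X*-",
}


def smallest_alphabet(seq: str):
    """Name of the smallest predefined alphabet covering `seq`, or None if none does."""
    # Normalize like the tokenizer defaults: uppercase, then U -> T.
    symbols = set(seq.upper().replace("U", "T"))
    for name, alphabet in ALPHABETS.items():
        if symbols <= set(alphabet):
            return name
    return None


print(smallest_alphabet("acgu"))   # nucleobase
print(smallest_alphabet("ACGTN"))  # streamline
print(smallest_alphabet("ACRYT"))  # iupac
print(smallest_alphabet("ACGX"))   # standard
print(smallest_alphabet("ACGZ"))   # None
```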