RnaTokenizer¶

RnaTokenizer is smart, it tokenizes raw RNA nucleotides into tokens, no matter if the input is in uppercase or lowercase, uses U (Uracil) or U (Thymine), and with or without special tokens. It also supports tokenization into nmers and codons, so you don’t have to write complex code to preprocess your data.

By default, RnaTokenizer uses the standard alphabet. If nmers is greater than 1, or codon is set to True, it will instead use the streamline alphabet.

multimolecule.tokenisers.RnaTokenizer ¶

Bases: Tokenizer

Tokenizer for RNA sequences.

Parameters:

Name	Type	Description	Default
`alphabet` ¶	`Alphabet \| str \| List[str] \| None`	alphabet to use for tokenization. If is `None`, the standard RNA alphabet will be used. If is a `string`, it should correspond to the name of a predefined alphabet. The options include `standard` `extended` `streamline` `nucleobase` If is an alphabet or a list of characters, that specific alphabet will be used.	`None`
`nmers` ¶	`int`	Size of kmer to tokenize.	`1`
`codon` ¶	`bool`	Whether to tokenize into codons.	`False`
`replace_T_with_U` ¶	`bool`	Whether to replace T with U.	`True`
`do_upper_case` ¶	`bool`	Whether to convert input to uppercase.	`True`

Examples:

Python Console Session
>>> from multimolecule import RnaTokenizer
>>> tokenizer = RnaTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]
>>> tokenizer('acgu')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 9, 2]
>>> tokenizer = RnaTokenizer(replace_T_with_U=False)
>>> tokenizer('acgt')["input_ids"]
[1, 6, 7, 8, 3, 2]
>>> tokenizer = RnaTokenizer(nmers=3)
>>> tokenizer('uagcuuauc')["input_ids"]
[1, 83, 17, 64, 49, 96, 84, 22, 2]
>>> tokenizer = RnaTokenizer(codon=True)
>>> tokenizer('uagcuuauc')["input_ids"]
[1, 83, 49, 22, 2]
>>> tokenizer('uagcuuauca')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10

Source code in multimolecule/tokenisers/rna/tokenization_rna.py

Python
class RnaTokenizer(Tokenizer):
    """
    Tokenizer for RNA sequences.

    Args:
        alphabet: alphabet to use for tokenization.

            - If is `None`, the standard RNA alphabet will be used.
            - If is a `string`, it should correspond to the name of a predefined alphabet. The options include
                + `standard`
                + `extended`
                + `streamline`
                + `nucleobase`
            - If is an alphabet or a list of characters, that specific alphabet will be used.
        nmers: Size of kmer to tokenize.
        codon: Whether to tokenize into codons.
        replace_T_with_U: Whether to replace T with U.
        do_upper_case: Whether to convert input to uppercase.

    Examples:
        >>> from multimolecule import RnaTokenizer
        >>> tokenizer = RnaTokenizer()
        >>> tokenizer('<pad><cls><eos><unk><mask><null>ACGUNRYSWKMBDHV.X*-I')["input_ids"]
        [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2]
        >>> tokenizer('acgu')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 9, 2]
        >>> tokenizer = RnaTokenizer(replace_T_with_U=False)
        >>> tokenizer('acgt')["input_ids"]
        [1, 6, 7, 8, 3, 2]
        >>> tokenizer = RnaTokenizer(nmers=3)
        >>> tokenizer('uagcuuauc')["input_ids"]
        [1, 83, 17, 64, 49, 96, 84, 22, 2]
        >>> tokenizer = RnaTokenizer(codon=True)
        >>> tokenizer('uagcuuauc')["input_ids"]
        [1, 83, 49, 22, 2]
        >>> tokenizer('uagcuuauca')["input_ids"]
        Traceback (most recent call last):
        ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 10
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        alphabet: Alphabet | str | List[str] | None = None,
        nmers: int = 1,
        codon: bool = False,
        replace_T_with_U: bool = True,
        do_upper_case: bool = True,
        additional_special_tokens: List | Tuple | None = None,
        **kwargs,
    ):
        if codon and (nmers > 1 and nmers != 3):
            raise ValueError("Codon and nmers cannot be used together.")
        if codon:
            nmers = 3  # set to 3 to get correct vocab
        if not isinstance(alphabet, Alphabet):
            alphabet = get_alphabet(alphabet, nmers=nmers)
        super().__init__(
            alphabet=alphabet,
            nmers=nmers,
            codon=codon,
            replace_T_with_U=replace_T_with_U,
            do_upper_case=do_upper_case,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )
        self.replace_T_with_U = replace_T_with_U
        self.nmers = nmers
        self.codon = codon

    def _tokenize(self, text: str, **kwargs):
        if self.do_upper_case:
            text = text.upper()
        if self.replace_T_with_U:
            text = text.replace("T", "U")
        if self.codon:
            if len(text) % 3 != 0:
                raise ValueError(
                    f"length of input sequence must be a multiple of 3 for codon tokenization, but got {len(text)}"
                )
            return [text[i : i + 3] for i in range(0, len(text), 3)]
        if self.nmers > 1:
            return [text[i : i + self.nmers] for i in range(len(text) - self.nmers + 1)]  # noqa: E203
        return list(text)

MultiMolecule provides a set of predefined alphabets for tokenization.

Standard Alphabet¶

The standard alphabet is an extended version of the IUPAC alphabet. This extension includes three additional symbols to the IUPAC alphabet, I, X and *.

I: Inosine; is a post-transcriptional modification that is not a standard RNA base. Inosine is the result of a deamination reaction of adenines that is catalyzed by adenosine deaminases acting on tRNAs (ADATs)
X: Any base; is slightly different from N which represents Unknown base. In automatic word embedding conversion, the X will be initialized as the mean of A, C, G, and U, while N will not be further processed.
*: is not used in MultiMolecule and is reserved for future use.

gap

Note that we use . to represent a gap in the sequence.

While - exists in the standard alphabet, it is not used in MultiMolecule and is reserved for future use.

Code	Represents
A	Adenine
C	Cytosine
G	Guanine
U	Uracil
N	Unknown
R	A or G
Y	C or U
S	C or G
W	A or U
K	G or U
M	A or C
B	C, G, or U
D	A, G, or U
H	A, C, or U
V	A, C, or G
.	Gap
X	Any
*	Not Used
-	Not Used
I	Inosine

IUPAC Alphabet¶

IUPAC nucleotide code is a standard nucleotide code proposed by the International Union of Pure and Applied Chemistry (IUPAC) to represent RNA sequences.

It consists of 10 symbols that represent ambiguity in the nucleotide sequence and 1 symbol that represents a gap in addition to the streamline alphabet.

Code	Represents
A	Adenine
C	Cytosine
G	Guanine
U	Uracil
R	A or G
Y	C or U
S	G or C
W	A or U
K	G or U
M	A or C
B	C, G, or U
D	A, G, or U
H	A, C, or U
V	A, C, or G
N	A, C, G, or U
.	Gap

Note that we use . to represent a gap in the sequence.

Streamline Alphabet¶

The streamline alphabet includes one additional symbol to the nucleobase alphabet, N to represent unknown nucleobase.

Code	Nucleotide
A	Adenine
C	Cytosine
G	Guanine
U	Uracil
N	Unknown

Nucleobase Alphabet¶

The nucleobase alphabet is a minimal version of the RNA alphabet that includes only the four canonical nucleotides A, C, G, and U.

Code	Nucleotide
A	Adenine
C	Cytosine
G	Guanine
U	Uracil

RnaTokenizer¶

multimolecule.tokenisers.RnaTokenizer ¶

`alphabet` ¶

`nmers` ¶

`codon` ¶

`replace_T_with_U` ¶

`do_upper_case` ¶

Standard Alphabet¶

IUPAC Alphabet¶

Streamline Alphabet¶

Nucleobase Alphabet¶

RnaTokenizer¶

multimolecule.tokenisers.RnaTokenizer ¶

alphabet ¶

nmers ¶

codon ¶

replace_T_with_U ¶

do_upper_case ¶

Standard Alphabet¶

IUPAC Alphabet¶

Streamline Alphabet¶

Nucleobase Alphabet¶

`alphabet` ¶

`nmers` ¶

`codon` ¶

`replace_T_with_U` ¶

`do_upper_case` ¶