DotBracketTokenizer¶
DotBracketTokenizer provides a simple way to tokenize secondary structure in dot-bracket notation. It also supports tokenization into nmers and codons, so you don’t have to write complex code to preprocess your data.
By default, `DotBracketTokenizer` uses the standard alphabet. If `nmers` is greater than 1, or `codon` is set to `True`, it will instead use the streamline alphabet.
multimolecule.tokenisers.DotBracketTokenizer¶
Bases: Tokenizer
Tokenizer for Secondary Structure sequences.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`alphabet` | `Alphabet \| str \| List[str] \| None` | Alphabet to use for tokenization. | `None` |
`nmers` | `int` | Size of kmer to tokenize. | `1` |
`codon` | `bool` | Whether to tokenize into codons. | `False` |
Examples:
>>> from multimolecule import DotBracketTokenizer
>>> tokenizer = DotBracketTokenizer()
>>> tokenizer('<pad><cls><eos><unk><mask><null>.()+,[]{}|<>-_:~$@^%*')["input_ids"]
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 2]
>>> tokenizer('(.)')["input_ids"]
[1, 7, 6, 8, 2]
>>> tokenizer('+(.)')["input_ids"]
[1, 9, 7, 6, 8, 2]
>>> tokenizer = DotBracketTokenizer(nmers=3)
>>> tokenizer('(((((+..........)))))')["input_ids"]
[1, 27, 27, 27, 29, 34, 54, 6, 6, 6, 6, 6, 6, 6, 6, 8, 16, 48, 48, 48, 2]
>>> tokenizer = DotBracketTokenizer(codon=True)
>>> tokenizer('(((((+..........)))))')["input_ids"]
[1, 27, 29, 6, 6, 6, 16, 48, 2]
>>> tokenizer('(((((+...........)))))')["input_ids"]
Traceback (most recent call last):
ValueError: length of input sequence must be a multiple of 3 for codon tokenization, but got 22
Source code in multimolecule/tokenisers/dot_bracket/tokenization_db.py
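The ids in the `nmers` and `codon` examples above are consistent with a vocabulary made of the six special tokens followed by every n-mer over the streamline alphabet `.()+`, enumerated in product order. The following is a minimal pure-Python sketch under that assumption. `build_vocab` and `encode` are hypothetical helper names used only for illustration; they are not part of MultiMolecule, and the library's actual implementation may differ.

```python
from itertools import product

# Special tokens, in the id order implied by the first example above.
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>", "<null>"]
# Streamline alphabet, as listed in the Streamline Alphabet table.
STREAMLINE = ".()+"


def build_vocab(alphabet, nmers):
    """Map the special tokens, then every n-mer over `alphabet`, to consecutive ids."""
    kmers = ["".join(p) for p in product(alphabet, repeat=nmers)]
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + kmers)}


def encode(structure, nmers=1, codon=False):
    """Hypothetical re-implementation of the tokenization shown in the examples."""
    if codon:
        if len(structure) % 3 != 0:
            raise ValueError(
                "length of input sequence must be a multiple of 3 "
                f"for codon tokenization, but got {len(structure)}"
            )
        nmers, step = 3, 3  # assumption: codons are non-overlapping 3-mers
    else:
        step = 1  # assumption: n-mers come from a sliding window with stride 1
    vocab = build_vocab(STREAMLINE, nmers)
    windows = [
        structure[i : i + nmers]
        for i in range(0, len(structure) - nmers + 1, step)
    ]
    ids = [vocab.get(window, vocab["<unk>"]) for window in windows]
    # Wrap the sequence in <cls> ... <eos>, as the example outputs do.
    return [vocab["<cls>"]] + ids + [vocab["<eos>"]]


print(encode("(((((+..........)))))", nmers=3))
print(encode("(((((+..........)))))", codon=True))
```

Under these assumptions, the two calls print the same id lists as the `nmers=3` and `codon=True` examples above, which makes the difference between the two modes visible: a sliding window yields one token per position, while codon mode yields one token per three positions.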
MultiMolecule provides a set of predefined alphabets for tokenization.
Standard Alphabet¶
The standard alphabet is an extended version of the Extended Dot-Bracket Notation. This extension includes most symbols from the WUSS notation for better compatibility with existing tools.
Code | Represents |
---|---|
. | unpaired |
( | internal helices of all terminal stems |
) | internal helices of all terminal stems |
+ | nick between strands |
, | unpaired in multibranch loops |
[ | internal helices that include at least one annotated () stem |
] | internal helices that include at least one annotated () stem |
{ | all internal helices of deeper multifurcations |
} | all internal helices of deeper multifurcations |
\| | mostly paired |
< | simple terminal stems |
> | simple terminal stems |
- | bulges and interior loops |
_ | unpaired |
: | single stranded in the exterior loop |
~ | regions of target and query left unaligned by a local structural alignment |
$ | Not Used |
@ | Not Used |
^ | Not Used |
% | Not Used |
* | Not Used |
Extended Alphabet¶
Extended Dot-Bracket Notation is a generalization of the original Dot-Bracket notation. It may use additional pairs of brackets to annotate pseudo-knots, since different pairs of brackets are not required to be nested.
Code | Represents |
---|---|
. | unpaired |
( | internal helices of all terminal stems |
) | internal helices of all terminal stems |
+ | nick between strands |
, | unpaired in multibranch loops |
[ | internal helices that include at least one annotated () stem |
] | internal helices that include at least one annotated () stem |
{ | all internal helices of deeper multifurcations |
} | all internal helices of deeper multifurcations |
\| | mostly paired |
< | simple terminal stems |
> | simple terminal stems |
Note that we use `.` to represent a gap in the sequence.
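Because different bracket types are not required to be nested with each other, base pairs in an extended dot-bracket string can be recovered by keeping one stack per bracket type. The sketch below illustrates this idea; `parse_pairs` is a hypothetical helper and not part of MultiMolecule. It assumes 0-based positions and, as a simplification, treats every non-bracket symbol (including `+`) as a single unpaired position.

```python
# Closing bracket -> opening bracket, for the four bracket types in the table above.
BRACKETS = {")": "(", "]": "[", "}": "{", ">": "<"}


def parse_pairs(structure):
    """Return sorted (open_index, close_index) pairs from an extended dot-bracket string."""
    # One independent stack per bracket type, so types may cross each other.
    stacks = {opener: [] for opener in BRACKETS.values()}
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(index)
        elif symbol in BRACKETS:
            stack = stacks[BRACKETS[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {index}")
            pairs.append((stack.pop(), index))
        # Assumption: any other symbol counts as one unpaired position.
    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {opener!r} at position {stack[-1]}")
    return sorted(pairs)


# A pseudoknot: the [] helix crosses the () helix.
print(parse_pairs("((..[[..))..]]"))
```

A single shared stack would reject the example string, since `)` would be matched against the most recent `[`; separate stacks are what allow the crossing helices.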
Streamline Alphabet¶
The streamline alphabet is restricted to the four symbols listed below, which keeps the vocabulary small when tokenizing into nmers or codons.
Code | Represents |
---|---|
. | unpaired |
( | internal helices of all terminal stems |
) | internal helices of all terminal stems |
+ | nick between strands |
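To decide which of the three alphabets a given structure string fits into, its symbols can be compared against the tables above. The following is a small sketch; `smallest_alphabet` is a hypothetical helper, and the dictionary keys are labels chosen for illustration that are not guaranteed to be values accepted by the `alphabet` parameter.

```python
# Symbol sets transcribed from the tables above, ordered from smallest to largest.
ALPHABETS = {
    "streamline": ".()+",
    "extended": ".()+,[]{}|<>",
    "standard": ".()+,[]{}|<>-_:~$@^%*",
}


def smallest_alphabet(structure):
    """Return the name of the smallest alphabet that covers every symbol in `structure`."""
    symbols = set(structure)
    # Dicts preserve insertion order, so the smallest alphabet is tried first.
    for name, alphabet in ALPHABETS.items():
        if symbols <= set(alphabet):
            return name
    unknown = "".join(sorted(symbols - set(ALPHABETS["standard"])))
    raise ValueError(f"symbols not in any predefined alphabet: {unknown!r}")


print(smallest_alphabet("((..))"))
print(smallest_alphabet("((..[[..))..]]"))
print(smallest_alphabet("<<<___>>>"))
```

A check like this can be useful before choosing `nmers` or `codon` tokenization, since a structure that uses symbols outside the streamline alphabet would contain n-mers that the streamline vocabulary cannot represent.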