Accelerate Molecular Biology Research with Machine Learning

MultiMolecule

MultiMolecule provides ready-to-use pipelines, pretrained model checkpoints, curated datasets, and training tools for RNA, DNA, and protein sequence research.

53 model families
16 datasets
10 task pipelines

What are you trying to do?

Start from the task you need: predict from a sequence, fine-tune on your data, load a pretrained model, or use a curated dataset.

Prediction

Predict from a sequence

Registered pipelines take a biological task name and an input sequence and return structured predictions, with no manual model assembly.

Python
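import multimolecule  # imported for its side effect: makes the MultiMolecule pipelines available
from transformers import pipeline

# Build the RNA secondary-structure pipeline from a pretrained checkpoint.
predict = pipeline(
    "rna-secondary-structure",
    model="multimolecule/ernierna-ss",
)
# Predict the structure of a single RNA sequence.
structure = predict("AUCAGCCUUCGUUCUGUAAACGG")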

Training

Fine-tune on your data

The runner connects pretrained checkpoints with Hugging Face datasets or labelled local tables, using sequence and label columns to start supervised training.

Python
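import multimolecule as mm

# Pair a pretrained checkpoint with a dataset and name the columns to learn from.
config = mm.Config(
    pretrained="multimolecule/ernierna",
    data={
        "root": "multimolecule/chanrg",
        "feature_cols": ["sequence"],  # model input column(s)
        "label_cols": ["secondary_structure"],  # supervised target column(s)
    },
)

# Start supervised training from the configuration.
runner = mm.Runner(config)
runner.train()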
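
Before pointing the runner at a labelled local table, it can help to check that each row is internally consistent. The sketch below is illustrative and uses only the Python standard library. It reuses the `sequence` and `secondary_structure` column names from the config above, and assumes the labels are written in plain dot-bracket notation and that the table is stored as CSV. The helper names `is_balanced` and `validate_row` and the toy records are hypothetical and not part of MultiMolecule; the table formats the runner actually accepts are not stated here.

```python
import csv
import io


def is_balanced(structure):
    """Return True if every '(' in a dot-bracket string has a matching ')'."""
    depth = 0
    for char in structure:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


def validate_row(row):
    """Collect human-readable problems for one sequence/label row."""
    problems = []
    sequence, structure = row["sequence"], row["secondary_structure"]
    # Toy alphabet (assumption): widen this set if your data uses IUPAC codes.
    if set(sequence) - set("ACGU"):
        problems.append("unexpected nucleotide")
    # Dot-bracket notation carries one symbol per nucleotide.
    if len(sequence) != len(structure):
        problems.append("length mismatch")
    if not is_balanced(structure):
        problems.append("unbalanced brackets")
    return problems


# Hypothetical toy records; the second one is deliberately one nucleotide short.
records = [
    {"sequence": "GGGAAACCC", "secondary_structure": "(((...)))"},
    {"sequence": "GGGAAACC", "secondary_structure": "(((...)))"},
]

# Write the table to an in-memory CSV, then read it back as a runner-style table.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sequence", "secondary_structure"])
writer.writeheader()
writer.writerows(records)
buffer.seek(0)

report = {index: validate_row(row) for index, row in enumerate(csv.DictReader(buffer))}
print(report)
```

Rows with an empty problem list are safe to keep; the rest are worth fixing or dropping before training. Pseudoknot brackets such as `[` and `]` are ignored by this check.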

Models

Load a pretrained model

Model cards give checkpoint IDs, expected inputs, citations, and licenses, while Python APIs support direct model control beyond task pipelines.

Python
import multimolecule as mm

# Load the tokenizer and the task-specific model from the same checkpoint.
tokenizer = mm.RnaTokenizer.from_pretrained("multimolecule/ernierna-ss")
model = mm.AutoModelForRnaSecondaryStructurePrediction.from_pretrained(
    "multimolecule/ernierna-ss",
)

# Tokenise the sequence into PyTorch tensors and run a forward pass.
inputs = tokenizer("AUCAGCCUUCGUUCUGUAAACGG", return_tensors="pt")
outputs = model(**inputs)

Datasets

Use a curated dataset

Curated biological datasets include sequence and label fields, task metadata, source information, citations, and licenses for benchmarks, examples, and fine-tuning.

Python
from datasets import load_dataset

dataset = load_dataset("multimolecule/chanrg", split="train")

One stack underneath

When you need more control, the same ecosystem exposes documented resources, biological input handling, reusable model components, and execution tools for prediction, training, evaluation, and scripted use.

Execution

Pipelines, runner, and API

Pipelines provide ready task predictions, the runner manages supervised training and evaluation, and API entry points support scripts and applications.

Resources

Models and datasets with provenance

Dataset cards and model cards collect supported inputs, task names, checkpoint IDs, citations, licenses, and training metadata.

Data layer

Biological data to model-ready inputs

IO, tokenisers, and data utilities turn biological sequences, structures, and annotations into consistent inputs for pipelines, training, and evaluation.
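
To make "consistent inputs" concrete, here is a toy sketch of what a nucleotide tokeniser does in principle: map each base to an integer ID, then pad every sequence to a common length with a matching attention mask. The vocabulary, the `encode` helper, and the special-token names are assumptions made for illustration; they are not the vocabulary or behaviour of `mm.RnaTokenizer`.

```python
# Illustrative vocabulary (assumption); NOT the vocabulary used by mm.RnaTokenizer.
PAD, UNK = "<pad>", "<unk>"
VOCAB = {token: index for index, token in enumerate([PAD, UNK, "A", "C", "G", "U"])}


def encode(sequence, max_length):
    """Map an RNA/DNA string to fixed-length token IDs plus an attention mask."""
    # Normalise case and treat DNA 'T' as RNA 'U' so both alphabets share IDs.
    normalised = sequence.upper().replace("T", "U")
    ids = [VOCAB.get(base, VOCAB[UNK]) for base in normalised][:max_length]
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [VOCAB[PAD]] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,  # 1 = real token, 0 = padding
    }


# Two sequences of the same raw length but different alphabets and case.
batch = [encode(sequence, max_length=8) for sequence in ["AUCAG", "acgtn"]]
for item in batch:
    print(item)
```

Every item in `batch` has the same length, which is what lets sequences of different sizes be stacked into one tensor. In practice the library's own tokenisers should be used so that IDs match the pretrained checkpoint.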

Model layer

Reusable model building blocks

Models provide pretrained configs, AutoModel classes, checkpoints, and output contracts; modules provide backbones, heads, losses, and embeddings for custom architectures.

Community

  • Google Group

    Receive release announcements, migration notes, and design RFCs without following every issue.

    Subscribe to announcements

  • Discourse

    Ask which pipeline, model, or dataset fits a biological problem; share configs, request models, and discuss model components.

    Join discussion