
Accelerate Molecular Biology Research with Machine Learning

MultiMolecule

MultiMolecule is a modular Python ecosystem for end-to-end biomolecular machine learning. It combines task-aware pipelines; integrated model and dataset hubs for DNA, RNA, and protein tasks; reusable neural modules with sequence tokenisers and biological I/O; and a unified runner, CLI, and API for training, evaluation, and inference.

48 model families
16 datasets
10 task pipelines


What are you trying to do?

MultiMolecule covers four common entry points: task-level prediction, model fine-tuning, direct use of pretrained models, and curated biological datasets.

Prediction

Run task-level predictions

Registered pipelines turn biological task names and input sequences into structured predictions without manual model assembly.

Python
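# Importing multimolecule registers its pipelines and models with transformers.
import multimolecule
from transformers import pipeline

# Build a predictor from a task name and a pretrained checkpoint.
predict = pipeline(
    "rna-secondary-structure",
    model="multimolecule/ernierna-ss",
)
# Run the prediction on a single RNA sequence.
structure = predict("AUCAGCCUUCGUUCUGUAAACGG")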

Training

Fine-tune pretrained models

The runner connects pretrained models with Hugging Face datasets or labelled local tables, using sequence and label columns to build task-aware batches.

Python
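import multimolecule as mm

config = mm.Config(
    # Pretrained checkpoint to fine-tune.
    pretrained="multimolecule/ernierna",
    data={
        # Dataset to train on, here a Hugging Face dataset ID.
        "root": "multimolecule/chanrg",
        # Columns holding the input sequences and the training labels.
        "feature_cols": ["sequence"],
        "label_cols": ["secondary_structure"],
    },
)

# The runner builds task-aware batches from the config and trains the model.
runner = mm.Runner(config)
runner.train()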
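For a labelled local table, it can help to check the sequence and label columns before training. The sketch below uses only the standard library and is not part of MultiMolecule: the helper names (`is_balanced`, `validate_rows`, `to_csv`, `from_csv`) and the toy rows are hypothetical, and it assumes the labels are written in plain dot-bracket notation without pseudoknot symbols. How such a table is then passed to the runner is not covered here.

```python
import csv
import io

# Toy rows for illustration only; the column names mirror the config above.
ROWS = [
    {"sequence": "GGGAAACCC", "secondary_structure": "(((...)))"},
    {"sequence": "GGGGAAAACCCC", "secondary_structure": "((((....))))"},
]


def is_balanced(structure: str) -> bool:
    """Check that a dot-bracket string uses only '(', ')', '.' and closes every pair."""
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closing bracket with no matching opener
                return False
        elif ch != ".":  # assumption: no pseudoknot symbols such as '[' or ']'
            return False
    return depth == 0


def validate_rows(rows, feature_col="sequence", label_col="secondary_structure"):
    """Return (row_index, problem) tuples; an empty list means no problem was found."""
    problems = []
    for i, row in enumerate(rows):
        seq, struct = row[feature_col], row[label_col]
        if len(seq) != len(struct):
            problems.append((i, "length mismatch"))
        elif not is_balanced(struct):
            problems.append((i, "unbalanced brackets"))
    return problems


def to_csv(rows) -> str:
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["sequence", "secondary_structure"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv(text: str):
    """Parse CSV text back into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


table_text = to_csv(ROWS)
problems = validate_rows(from_csv(table_text))
```

An empty `problems` list means every row has a label as long as its sequence and with matched brackets; anything else points at the row to fix before training.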

Models

Use pretrained models

When a task pipeline is not enough, documented pretrained models can be loaded directly for Python-level control. Model cards list checkpoint IDs, expected inputs, citations, and licenses.

Python
import multimolecule as mm

# Load the tokenizer and the model from the same checkpoint.
tokenizer = mm.RnaTokenizer.from_pretrained("multimolecule/ernierna-ss")
model = mm.AutoModelForRnaSecondaryStructurePrediction.from_pretrained(
    "multimolecule/ernierna-ss",
)

# Encode one RNA sequence as PyTorch tensors and run a forward pass.
inputs = tokenizer("AUCAGCCUUCGUUCUGUAAACGG", return_tensors="pt")
outputs = model(**inputs)

Datasets

Use curated datasets

Curated biological datasets include sequence and label fields, task metadata, source information, citations, and licenses for benchmarks, examples, and fine-tuning.

Python
from datasets import load_dataset

dataset = load_dataset("multimolecule/chanrg", split="train")

One stack underneath

The same layers sit behind all of these entry points: documented resources, biological input handling, reusable model components, and execution tools for prediction, training, evaluation, and scripted use.

Execution

Task-level entry points

Pipelines provide ready-to-use task predictions, the runner manages supervised training and evaluation, and API entry points support scripts and applications.

Resources

Documented resources

Dataset cards and model cards collect supported inputs, task names, model checkpoints, citations, licenses, and training metadata.

Data layer

Biological files to trainable batches

IO reads biological sequence and structure formats, tokenisers encode molecules, and data abstractions infer task fields and prepare runner-ready batches.
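The path from a sequence file to a padded batch can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not MultiMolecule's implementation: the vocabulary, the special tokens, and the function names (`read_fasta`, `encode`, `pad_batch`) are all hypothetical, and the library's own tokenisers and readers define their own behaviour.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical vocabulary for illustration; real tokenisers ship their own.
VOCAB: Dict[str, int] = {"<pad>": 0, "<unk>": 1, "A": 2, "C": 3, "G": 4, "U": 5}


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA text, joining wrapped sequence lines."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def encode(sequence: str) -> List[int]:
    """Map each residue to an integer ID; unknown symbols become <unk>."""
    return [VOCAB.get(base, VOCAB["<unk>"]) for base in sequence.upper()]


def pad_batch(encoded: List[List[int]]) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest one and build a matching attention mask."""
    width = max(len(ids) for ids in encoded)
    input_ids = [ids + [VOCAB["<pad>"]] * (width - len(ids)) for ids in encoded]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in encoded]
    return input_ids, attention_mask


FASTA = """>toy_1
AUCG
>toy_2
GGN
AC
"""

records = list(read_fasta(FASTA))
input_ids, attention_mask = pad_batch([encode(seq) for _, seq in records])
```

The attention mask marks real residues with 1 and padding with 0, so a model can ignore the padded positions when sequences in a batch differ in length.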

Model layer

Reusable model components

Models provide pretrained configs, AutoModel classes, checkpoints, and output contracts; modules provide backbones, heads, losses, and embeddings for custom architectures.

Community

  • Google Group

    Receive release announcements, migration notes, and design RFCs without following every issue.

    Subscribe to announcements

  • Discourse

    Ask which pipeline, model, or dataset fits a biological problem; share configs, request models, and discuss model components.

    Join discussion