跳转至

models

models 提供了一系列预训练模型。

模型类

🤗 transformers 库当中,模型类的名字有时可以引起误解。 尽管这些类支持回归和分类任务,但它们的名字通常包含 xxxForSequenceClassification,这可能暗示它们只能用于分类。

为了避免这种歧义,MultiMolecule 提供了一系列模型类,这些类的名称清晰、直观,反映了它们的预期用途:

  • multimolecule.AutoModelForSequencePrediction: 序列预测
  • multimolecule.AutoModelForTokenPrediction: 令牌预测
  • multimolecule.AutoModelForContactPrediction: 接触预测

每个模型都支持回归和分类任务,为广泛的应用提供了灵活性和精度。

接触预测

接触预测为序列中的每一对令牌分配一个标签。 最常见的接触预测任务之一是蛋白质距离图预测。 蛋白质距离图预测试图找到三维蛋白质结构中所有可能的氨基酸残基对之间的距离

核苷酸预测

Token Classification 类似,但如果模型配置中定义了 <bos><eos> 令牌,则将其移除。

<bos><eos> 令牌

在 MultiMolecule 提供的分词器中,<bos> 令牌指向 <cls> 令牌,<sep> 令牌指向 <eos> 令牌。

使用

使用 multimolecule.AutoModel 构建

Python
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

# For additional terms and clarifications, please refer to our License FAQ at:
# <https://multimolecule.danling.org/about/license-faq>.


from transformers import AutoTokenizer

from multimolecule import AutoModelForSequencePrediction

model = AutoModelForSequencePrediction.from_pretrained("multimolecule/rnafm")
tokenizer = AutoTokenizer.from_pretrained("multimolecule/rnafm")

sequence = "UAGCGUAUCAGACUGAUGUUG"
output = model(**tokenizer(sequence, return_tensors="pt"))

直接访问

所有模型可以通过 from_pretrained 方法直接加载。

Python
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

# For additional terms and clarifications, please refer to our License FAQ at:
# <https://multimolecule.danling.org/about/license-faq>.


from multimolecule.models import RnaFmForTokenPrediction, RnaTokenizer

model = RnaFmForTokenPrediction.from_pretrained("multimolecule/rnafm")
tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnafm")

sequence = "UAGCGUAUCAGACUGAUGUUG"
output = model(**tokenizer(sequence, return_tensors="pt"))

使用 transformers.AutoModel 构建

虽然我们为模型类使用了不同的命名约定,但模型仍然注册到相应的 transformers.AutoModel 中。

Python
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

# For additional terms and clarifications, please refer to our License FAQ at:
# <https://multimolecule.danling.org/about/license-faq>.


from transformers import AutoModelForSequenceClassification, AutoTokenizer

import multimolecule  # noqa: F401

model = AutoModelForSequenceClassification.from_pretrained("multimolecule/mrnafm")
tokenizer = AutoTokenizer.from_pretrained("multimolecule/mrnafm")

sequence = "UAGCGUAUCAGACUGAUGUUG"
output = model(**tokenizer(sequence, return_tensors="pt"))

使用前先 import multimolecule

请注意,在使用 transformers.AutoModel 构建模型之前,必须先 import multimolecule。 模型的注册在 multimolecule 包中完成,模型在 transformers 包中不可用。

如果在使用 transformers.AutoModel 之前未 import multimolecule,将会引发以下错误:

Python
ValueError: The checkpoint you are trying to load has model type `rnafm` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

初始化一个香草模型

你也可以使用模型类初始化一个基础模型。

Python
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

# For additional terms and clarifications, please refer to our License FAQ at:
# <https://multimolecule.danling.org/about/license-faq>.


from multimolecule.models import RnaFmConfig, RnaFmForTokenPrediction, RnaTokenizer

config = RnaFmConfig()
model = RnaFmForTokenPrediction(config)
tokenizer = RnaTokenizer()

sequence = "UAGCGUAUCAGACUGAUGUUG"
output = model(**tokenizer(sequence, return_tensors="pt"))

可用模型

脱氧核糖核酸(DNA)

核糖核酸(RNA)