Enformer
Enformer is a large model trained on bulk ENCODE and FANTOM DNase, ChIP-seq, and CAGE data from a wide variety of human and mouse tissues. It predicts 896 bins of 128 bp each, corresponding to the core 114688 bp (896 × 128) of the 196608 bp input sequence.
It was originally implemented with the Sonnet package, and its weights and architecture have been ported to CREsted.
The model was trained on sequences tiled across the genome, which can be downloaded from the original authors’ Google Cloud bucket.
The original model has a shared trunk and two organism-specific heads. Here it is provided as two separate models, one per organism: enformer_human and enformer_mouse.
The model is a CNN+Transformer model using the enformer() architecture.
Details of the data and the model can be found in the original publication.
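The bin geometry described above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of CREsted: it assumes the predicted core is centered in the input sequence, and the helper names bin_to_input_range and position_to_bin are hypothetical, introduced only for illustration.

```python
# Sketch of the Enformer output-bin geometry.
# Assumption: the 114688 bp core is centered within the 196608 bp input.
INPUT_LENGTH = 196_608  # input length used in the usage example
N_BINS = 896            # number of output bins
BIN_SIZE = 128          # bp covered by each bin

CORE_LENGTH = N_BINS * BIN_SIZE            # bp covered by all bins together
FLANK = (INPUT_LENGTH - CORE_LENGTH) // 2  # unpredicted context on each side


def bin_to_input_range(bin_index):
    """Return the half-open (start, end) input offsets covered by one output bin."""
    if not 0 <= bin_index < N_BINS:
        raise IndexError(f"bin_index must be in [0, {N_BINS})")
    start = FLANK + bin_index * BIN_SIZE
    return start, start + BIN_SIZE


def position_to_bin(position):
    """Return the output bin for a 0-based input position, or None outside the core."""
    offset = position - FLANK
    if not 0 <= offset < CORE_LENGTH:
        return None
    return offset // BIN_SIZE


print(CORE_LENGTH, FLANK)
print(bin_to_input_range(0), bin_to_input_range(N_BINS - 1))
```

Under this assumption, positions in the first and last 40960 bp of the input serve only as context and have no corresponding output bin.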
Warning
The Enformer architecture uses custom layers that are defined and registered for serialization inside the CREsted package. To ensure that the model loads correctly, import CREsted before loading the model.
If it still refuses to load, pass AttentionPool1D and MultiheadAttention as custom_objects to keras.models.load_model, as in the example.
Citation
Avsec, Ž., Agarwal, V., Visentin, D. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 18, 1196–1203 (2021). https://doi.org/10.1038/s41592-021-01252-x
License
The original model is licensed under the Apache License, version 2.0.
Usage
import crested
import keras

# download model
model_path, output_names = crested.get_model("enformer_human")

# load model
model = keras.models.load_model(model_path, compile=False)

# load the model with custom_objects as fallback
# (assumption: the custom layers are importable from crested.tl.zoo.utils;
# adjust the import to match your CREsted version)
# from crested.tl.zoo.utils import AttentionPool1D, MultiheadAttention
# model = keras.models.load_model(
#     model_path,
#     custom_objects={
#         "AttentionPool1D": AttentionPool1D,
#         "MultiheadAttention": MultiheadAttention,
#     },
#     compile=False,
# )

# make predictions
sequence = "A" * 196608
predictions = crested.tl.predict(sequence, model)
print(predictions.shape)
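Because the model predicts many tracks at once, it is often useful to pick out a subset by name. The sketch below is one possible way to do that, not a CREsted function: find_tracks is a hypothetical helper, the example names are invented for illustration, and it assumes that output_names is a sequence of strings in the same order as the last axis of the predictions.

```python
def find_tracks(output_names, keyword):
    """Return the indices of tracks whose name contains `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return [i for i, name in enumerate(output_names) if keyword in str(name).lower()]


# Invented track names, for illustration only; real names come from crested.get_model.
example_names = [
    "DNASE:cell type A",
    "CHIP:H3K27ac:cell type B",
    "CAGE:cell type A",
]

cage_idx = find_tracks(example_names, "cage")
print(cage_idx)
```

With the real outputs, indexing such as predictions[..., cage_idx] would then select the matching tracks, provided the track-ordering assumption above holds.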