Enhancer code analysis#
In this notebook we will go over how to obtain cell type characteristic sequence patterns from CREsted models, or any other model, using CREsted functionality and the extra packages from crested[motif]: tfmodisco-lite and memesuite-lite.
Obtaining contribution scores per model class and running tfmodisco-lite#
Before we can do any analysis, we need to calculate the contribution scores for cell type-specific regions. These regions hold the most sequence information relevant for cell type identity. From those, we can run tfmodisco-lite.
import anndata as ad
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import crested
import keras
import pickle
Load data and CREsted model#
We’ll load the dataset and model from the the introduction notebook:
# Set the genome
genome = crested.Genome("mm10/genome.fa", "mm10/genome.chrom.sizes")
crested.register_genome(genome) # Register the genome so that it's automatically used in every function
2026-02-20T13:22:01.718526+0100 INFO Genome genome registered.
# Read in the anndata
adata = ad.read_h5ad("crested/mouse_cortex.h5ad")
# Load a trained model
model_path = "mouse_biccn/finetuned_model/checkpoints/02.keras" # change to your model path
model = keras.models.load_model(model_path, compile=False)
Select the most informative regions per cell type#
To obtain cell type-characteristic patterns, we need to calculate contribution scores on highly specific regions. For this purpose, we’ve included a preprocessing function crested.pp.sort_and_filter_regions_on_specificity() to keep the top k most specific regions per cell type that you can use to filter your data before running modisco.
There are three options to select the top regions: either purely based on peak height, purely based on predictions, or on their combination. Here we show how to use the combination of both (which we recommend, see Johansen & Kempynck et al., 2025).
# Store predictions for all our regions in the anndata object
predictions = crested.tl.predict(adata, model)
adata.layers["biccn_model"] = predictions.T
2026-02-18T22:04:41.666517+0100 INFO Lazily importing module crested.tl. This could take a second...
7/4274 ━━━━━━━━━━━━━━━━━━━━ 1:55 27ms/step
4272/4274 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step
4274/4274 ━━━━━━━━━━━━━━━━━━━━ 145s 29ms/step
# Calculate the average of the ground truth and predictions
adata.layers['combined'] = (adata.X + adata.layers["biccn_model"])/2
We can use crested.pl.qc.sort_and_filter_cutoff() to look at the gini score distributions for the different regions per class.
Here, we take the top 2000 most specific regions, but top 500 or 1000 regions would give similar results.
%matplotlib inline
crested.pl.qc.sort_and_filter_cutoff(adata, model_name="combined", cutoffs=[500, 1000, 2000], max_k=5000)
2026-02-18T22:07:53.298189+0100 INFO Lazily importing module crested.pl. This could take a second...
# Filter most informative regions per class
top_k = 2000
adata_filtered = crested.pp.sort_and_filter_regions_on_specificity(adata, model_name="combined", top_k=top_k, method="gini", inplace=False)
adata_filtered
2026-02-18T22:09:56.457403+0100 INFO After sorting and filtering, kept 38000 regions.
AnnData object with n_obs × n_vars = 19 × 38000
obs: 'file_path'
var: 'chr', 'start', 'end', 'target_start', 'target_end', 'split', 'Class name', 'rank', 'gini_score'
obsm: 'weights'
layers: 'biccn_model', 'combined'
adata_filtered.var
| chr | start | end | target_start | target_end | split | Class name | rank | gini_score | |
|---|---|---|---|---|---|---|---|---|---|
| region | |||||||||
| chr10:45499432-45501546 | chr10 | 45499432 | 45501546 | 45499989 | 45500989 | val | Astro | 1 | 0.830273 |
| chr5:9868290-9870404 | chr5 | 9868290 | 9870404 | 9868847 | 9869847 | train | Astro | 2 | 0.819487 |
| chr2:65274604-65276718 | chr2 | 65274604 | 65276718 | 65275161 | 65276161 | train | Astro | 3 | 0.818493 |
| chrX:23135863-23137977 | chrX | 23135863 | 23137977 | 23136420 | 23137420 | train | Astro | 4 | 0.816922 |
| chr3:115410072-115412186 | chr3 | 115410072 | 115412186 | 115410629 | 115411629 | train | Astro | 5 | 0.815808 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| chr3:49982863-49984977 | chr3 | 49982863 | 49984977 | 49983420 | 49984420 | train | Vip | 1996 | 0.547537 |
| chr5:97165626-97167740 | chr5 | 97165626 | 97167740 | 97166183 | 97167183 | train | Vip | 1997 | 0.547428 |
| chr1:19158560-19160674 | chr1 | 19158560 | 19160674 | 19159117 | 19160117 | train | Vip | 1998 | 0.547314 |
| chr19:12142425-12144539 | chr19 | 12142425 | 12144539 | 12142982 | 12143982 | train | Vip | 1999 | 0.547256 |
| chr17:49808507-49810621 | chr17 | 49808507 | 49810621 | 49809064 | 49810064 | train | Vip | 2000 | 0.547255 |
38000 rows × 9 columns
Calculating contribution scores per class#
Now you can calculate the contribution scores for all the regions in your filtered AnnData.
By default, the contribution scores are calculated using the expected integrated gradients method, but you can change this to simple integrated gradients to speed up the calculation. We’ve found anecdotally that this has a very minor effect on the quality of the contribution scores, while speeding up the calculation significantly.
crested.tl.contribution_scores_specific(
input=adata_filtered,
target_idx=None, # We calculate for all classes
model=model,
output_dir="modisco_results_ft_2000",
method="integrated_grad"
)
Running tfmodisco-lite#
When this is done, you can run TF-MoDISco-lite on the saved contribution scores to find motifs that are important for the classification/regression task.
You could use the tfmodisco package directly to do this, or you could use the crested.tl.modisco.tfmodisco() function which is essentially a wrapper around the tfmodisco package.
Note that from here on, you can use contribution scores from any model trained in any framework, as this analysis just requires a set of one hot encoded sequences and contribution scores per cell type stored in the same directory.
If you don’t have tomtom available on the command line, set report=False in crested.tl.modisco.tfmodisco(). You won’t get an html report matching the motifs to their closest match in the motif database, but you’ll still get a collection of clustered seqlets for use in the rest of the notebook.
meme_db, motif_to_tf_file = crested.get_motif_db()
# run tfmodisco on the contribution scores
crested.tl.modisco.tfmodisco(
window=1000,
output_dir="modisco_results_ft_2000",
contrib_dir="modisco_results_ft_2000",
report=False, # Optional, will match patterns to motif MEME database - deactivate if you don't have TOMTOM read on the command line
meme_db=meme_db, # File to MEME database - not needed if not generating reports
max_seqlets=20000,
)
2026-02-18T22:48:10.031481+0100 INFO No class names provided, using all found in the contribution directory: ['Pvalb', 'Lamp5', 'L5ET', 'Oligo', 'L6IT', 'L2_3IT', 'Endo', 'Sst', 'L5IT', 'Sncg', 'Astro', 'VLMC', 'OPC', 'L6CT', 'Vip', 'L5_6NP', 'L6b', 'Micro_PVM', 'SstChodl']
2026-02-18T22:48:10.527970+0100 INFO Running modisco for class: Pvalb
Using 11070 positive seqlets
Extracted 1602 negative seqlets
2026-02-18T22:55:24.463964+0100 INFO Running modisco for class: Lamp5
Using 11127 positive seqlets
Extracted 1617 negative seqlets
2026-02-18T23:02:18.775590+0100 INFO Running modisco for class: L5ET
Using 12954 positive seqlets
Extracted 1638 negative seqlets
2026-02-18T23:10:50.207266+0100 INFO Running modisco for class: Oligo
Using 13245 positive seqlets
Extracted 3037 negative seqlets
2026-02-18T23:20:09.298092+0100 INFO Running modisco for class: L6IT
Using 15015 positive seqlets
2026-02-18T23:29:28.185590+0100 INFO Running modisco for class: L2_3IT
Using 15115 positive seqlets
Extracted 2734 negative seqlets
2026-02-18T23:39:27.206960+0100 INFO Running modisco for class: Endo
Using 7876 positive seqlets
Extracted 1529 negative seqlets
2026-02-18T23:44:16.357842+0100 INFO Running modisco for class: Sst
Using 13196 positive seqlets
Extracted 2422 negative seqlets
2026-02-18T23:53:55.514966+0100 INFO Running modisco for class: L5IT
Using 14035 positive seqlets
Extracted 2655 negative seqlets
2026-02-19T00:04:25.113868+0100 INFO Running modisco for class: Sncg
Using 11203 positive seqlets
Analysis of cell-type specific sequence patterns#
Once you have obtained your modisco results, you can plot the the found patterns using the crested.pl.modisco.modisco_results() function.
%matplotlib inline
top_k = 1000
crested.pl.modisco.modisco_results(
classes=["Astro", "L5ET", "Vip", "Oligo"],
contribution="positive",
contribution_dir="modisco_results_ft_2000",
num_seq=top_k,
y_max=0.15,
viz="pwm",
) # You can also visualize in 'pwm' format
2026-02-19T10:17:00.047868+0100 INFO Starting genomic contributions plot for classes: ['Astro', 'L5ET', 'Vip', 'Oligo']
Matching patterns across cell types#
Since we have calculated per cell type the patterns independently of each other, we do not know quantitatively how and if they overlap. It can be interesting to get an overview of which patterns are found across multiple cell types, how important they are, and if there are unique patterns only found in a small selection of classes. Therefore, we have made a pattern clustering algorithm, which starts from the results of tfmodisco-lite, and returns a pattern matrix, which contains the importance of the clustered patterns per cell type, and a pattern dictionary, describing all clustered patterns.
First, we’ll obtain the modisco files per class by using crested.tl.modisco.match_h5_files_to_classes() using our selected classes.
# First we obtain the resulting modisco files per class
matched_files = crested.tl.modisco.match_h5_files_to_classes(
contribution_dir="modisco_results_ft_2000", classes=list(adata.obs_names)
)
2026-02-19T11:48:01.680547+0100 INFO Lazily importing module crested.tl. This could take a second...
A quick pairwise comparison#
Before we do any pattern clustering, we can check for each independent pattern how similar it is to all the other patterns using memesuite-lite.
We use crested.tl.modisco.calculate_tomtom_similarity_per_pattern(). This function returns a pairwise similarity matrix of every unique pattern, together with a list of ids and a dictionary containing additional information per pattern.
sim_matrix, pattern_ids, pattern_dict = crested.tl.modisco.calculate_tomtom_similarity_per_pattern(
matched_files=matched_files, trim_ic_threshold=0.025, verbose=True
)
Reading file modisco_results_ft_2000/Astro_modisco_results.h5
Reading file modisco_results_ft_2000/Endo_modisco_results.h5
Reading file modisco_results_ft_2000/L2_3IT_modisco_results.h5
Reading file modisco_results_ft_2000/L5ET_modisco_results.h5
Reading file modisco_results_ft_2000/L5IT_modisco_results.h5
Reading file modisco_results_ft_2000/L5_6NP_modisco_results.h5
Reading file modisco_results_ft_2000/L6CT_modisco_results.h5
Reading file modisco_results_ft_2000/L6IT_modisco_results.h5
Reading file modisco_results_ft_2000/L6b_modisco_results.h5
Reading file modisco_results_ft_2000/Lamp5_modisco_results.h5
Reading file modisco_results_ft_2000/Micro_PVM_modisco_results.h5
Reading file modisco_results_ft_2000/OPC_modisco_results.h5
Reading file modisco_results_ft_2000/Oligo_modisco_results.h5
Reading file modisco_results_ft_2000/Pvalb_modisco_results.h5
Reading file modisco_results_ft_2000/Sncg_modisco_results.h5
Reading file modisco_results_ft_2000/Sst_modisco_results.h5
Reading file modisco_results_ft_2000/SstChodl_modisco_results.h5
Reading file modisco_results_ft_2000/VLMC_modisco_results.h5
Reading file modisco_results_ft_2000/Vip_modisco_results.h5
Total patterns: 548
We can now visualize this similarities using crested.pl.modisco.clustermap_tomtom_similarities(). We will subset the often large similarity matrix to a relevant subset. First, we will look at a pattern and find other patterns similar to it in other cell types. We add some additional group labels for visualization purposes.
nn = {"Astro", "Endo", "Micro_PVM", "OPC", "Oligo", "VLMC"}
exc = {"L2_3IT", "L5ET", "L5IT", "L5_6NP", "L6CT", "L6IT", "L6b"}
inh = {"Lamp5", "Pvalb", "Sncg", "Sst", "SstChodl", "Vip"}
groups, groups_2 = [], []
for id in pattern_ids:
ct = "_".join(id.split("_")[:-3])
groups.append(
"Non-neuronal"
if ct in nn
else "Excitatory"
if ct in exc
else "Inhibitory"
if ct in inh
else (_ for _ in ()).throw(ValueError(f"Unknown class: {ct}"))
)
groups_2.append(ct)
unique_cats = pd.unique(groups_2)
group_colors_2 = {cat: mcolors.to_hex(plt.get_cmap("tab20", len(unique_cats))(i)) for i, cat in enumerate(unique_cats)}
group_colors = {"Non-neuronal": "skyblue", "Excitatory": "salmon", "Inhibitory": "green"}
%matplotlib inline
crested.pl.modisco.clustermap_tomtom_similarities(
sim_matrix=sim_matrix,
ids=pattern_ids,
pattern_dict=pattern_dict,
group_info=[(groups, group_colors), (groups_2, group_colors_2)], # Grouping labels
query_id="Vip_pos_patterns_1", # Find patterns similar to this one
threshold=3, # TOMTOM similarity threshold, we take the -log10(pval)
min_seqlets=100, # Add a minimum amount of seqlets to take the most relevant patterns
)
2026-02-19T11:49:10.404031+0100 INFO Lazily importing module crested.pl. This could take a second...
We can also use this to look at all the different patterns in a subset of cell types, and see if we can find interesting groups of similar motifs.
crested.pl.modisco.clustermap_tomtom_similarities(
sim_matrix=sim_matrix,
ids=pattern_ids,
pattern_dict=pattern_dict,
group_info=[(groups, group_colors), (groups_2, group_colors_2)], # Grouping labels
class_names=["Lamp5", "Pvalb", "Sncg", "Sst", "SstChodl", "Vip"], # Subset of classes to use patterns from
min_seqlets=300, # Add a minimum amount of seqlets to take the most relevant patterns
)
<seaborn.matrix.ClusterGrid at 0x145cc3bbcd90>
Pattern clustering across cell types#
Since many patterns are similar and can be matched across cell types, we can also cluster them into groups and compare the importance of that matched motif over all cell types. We cluster all separate patterns using crested.tl.modisco.process_patterns() and create a pattern matrix with crested.tl.modisco.create_pattern_matrix().
# Then we cluster matching patterns, and define a pattern matrix [#classes, #patterns] describing their importance
all_patterns = crested.tl.modisco.process_patterns(
matched_files,
sim_threshold=4.25, # The similarity threshold used for matching patterns. We take the -log10(pval), pval obtained through TOMTOM matching from memesuite-lite
trim_ic_threshold=0.05, # Information content (IC) threshold on which to trim patterns
discard_ic_threshold=0.2, # IC threshold used for discarding single instance patterns
verbose=True, # Useful for doing sanity checks on matching patterns
)
pattern_matrix = crested.tl.modisco.create_pattern_matrix(
classes=list(adata.obs_names),
all_patterns=all_patterns,
normalize=False,
pattern_parameter="seqlet_count_log",
)
pattern_matrix.shape
# Optional: save the matched patterns if you want to re-use them later
with open("modisco_results_ft_2000/all_patterns.pkl", 'wb') as f:
pickle.dump(all_patterns, f)
Now we can plot a clustermap of cell types/classes and patterns, where the classes are clustered
purely on pattern importance with crested.tl.modisco.generate_nucleotide_sequences() and crested.pl.modisco.clustermap()
pat_seqs = crested.tl.modisco.generate_nucleotide_sequences(all_patterns)
crested.pl.modisco.clustermap(
pattern_matrix,
list(adata.obs_names),
width=25,
height=4.2,
pat_seqs=pat_seqs,
grid=True,
dendrogram_ratio=(0.03, 0.15),
importance_threshold=5,
)
<seaborn.matrix.ClusterGrid at 0x145cbfbf0110>
If you have the horizontal space for it, you can also add the PWM/contribution logos to the x-axis.
crested.pl.modisco.clustermap_with_pwm_logos(
pattern_matrix,
list(adata.obs_names),
pattern_dict=all_patterns,
width=50,
height=4.2,
grid=True,
dendrogram_ratio=(0.03, 0.15),
importance_threshold=5,
logo_x_multiplier=1,
logo_height_fraction=0.35,
logo_y_padding=0.25,
)
<seaborn.matrix.ClusterGrid at 0x145cb7f73890>
We can also subset to classes we are interested in and want to compare in more detail.
crested.pl.modisco.clustermap_with_pwm_logos(
pattern_matrix,
classes=list(adata.obs_names),
pattern_dict=all_patterns,
subset=["Astro", "OPC", "Oligo"],
width=10,
height=3,
grid=True,
logo_height_fraction=0.35,
logo_y_padding=0.3,
)
2026-02-19T13:59:06.619497+0100 WARNING Argument `figsize` is deprecated since version 2.0.0; please use width and height instead.
<seaborn.matrix.ClusterGrid at 0x145caf7822d0>
crested.pl.modisco.clustermap_with_pwm_logos(
pattern_matrix,
classes=list(adata.obs_names),
subset=["L2_3IT", "L5ET", "L5IT", "L5_6NP", "L6CT", "L6IT", "L6b"],
pattern_dict=all_patterns,
width=15,
height=3,
grid=True,
logo_height_fraction=0.35,
logo_y_padding=0.3,
importance_threshold=4,
)
2026-02-19T13:59:25.718974+0100 WARNING Argument `figsize` is deprecated since version 2.0.0; please use width and height instead.
<seaborn.matrix.ClusterGrid at 0x145cad750350>
Additional pattern insights#
It’s always interesting to investigate specific patterns that show in the clustermap above. Here there are some example functions with which to do that.
Plotting patterns based on their indices can be done with crested.pl.modisco.selected_instances():
pattern_indices = [53]
crested.pl.modisco.selected_instances(
all_patterns, pattern_indices
) # The pattern that is shown is the most representative pattern of the cluster with the highest average information content (IC)
We can also do a check of pattern similarity:
idx1 = 1
idx2 = 5
sim = crested.tl.modisco.pattern_similarity(all_patterns, idx1, idx2)
print("Pattern similarity is " + str(sim))
crested.pl.modisco.selected_instances(all_patterns, [idx1, idx2])
Pattern similarity is 0.7527711866474396
We can plot all the instances of patterns in the same cluster with crested.pl.modisco.class_instances():
crested.pl.modisco.class_instances(all_patterns, 0)
If you want to find out in which pattern cluster a certain pattern is from your modisco results, you can use the crested.tl.modisco.find_pattern() function.
idx = crested.tl.modisco.find_pattern("OPC_neg_patterns_0", all_patterns)
if idx is not None:
print("Pattern index is " + str(idx))
crested.pl.modisco.class_instances(all_patterns, idx, class_representative=True)
Pattern index is 0
Finally, we can also plot the similarity between all patterns with crested.tl.modisco.calculate_similarity_matrix() and crested.pl.modisco.similarity_heatmap():
sim_matrix, indices = crested.tl.modisco.calculate_similarity_matrix(all_patterns)
crested.pl.modisco.similarity_heatmap(sim_matrix, indices, fig_size=(42, 17))
2026-02-19T14:39:15.500681+0100 WARNING `fig_size` is deprecated since version 2.0.0; please use arguments `width` and `height` instead.
Matching patterns to TF candidates from scRNA-seq data [Optional]#
To understand the actual transcription factor (TF) candidates binding to the characteristic patterns/potential binding sites per cell type, we can propose potential candidates through scRNA-seq data and a TF-motif collection file.
This analysis requires that you ran tfmodisco-lite with the report function such that each pattern has potential MEME database hits and that you have multiome data. The names in the motif database should match those in the TF-motif collection file.
meme_db, motif_to_tf_file = crested.get_motif_db()
If you haven’t run this yet and using crested.tl.modisco.tfmodisco did not work due to lack of access to TOMTOM, versions of modiscolite above v2.4.0 also support using memelite with argument ttl=True, meaning you can generate the reports this way:
# If you don't have the patterns loaded, load here
with open('modisco_results_ft_2000/all_patterns.pkl', 'rb') as f:
all_patterns = pickle.load(f)
Load scRNA-seq data#
Load scRNA seq data and calculate mean expression per cell type using crested.tl.modisco.calculate_mean_expression_per_cell_type().
file_path = "crested/Mouse_rna.h5ad" # Locate h5 file containing scRNAseq data
cell_type_column = "subclass_Bakken_2022"
mean_expression_df = crested.tl.modisco.calculate_mean_expression_per_cell_type(
file_path, cell_type_column, cpm_normalize=True
)
2026-02-20T13:22:02.547632+0100 INFO Lazily importing module crested.tl. This could take a second...
Please make sure that the classes in the RNA file match those used in CREsted, and rename mean_expression_df’s index if not:
# Rename classes that don't match exactly (due to / being cleaned up to _, etc)
# Here they're in the same (alphabetical) order anyway so we can just zip them
class_mapping = dict(zip(mean_expression_df.index, adata.obs_names, strict=True))
print(class_mapping)
mean_expression_df.index = mean_expression_df.index.map(class_mapping)
{'Astro': 'Astro', 'Endo': 'Endo', 'L2/3 IT': 'L2_3IT', 'L5 ET': 'L5ET', 'L5 IT': 'L5IT', 'L5/6 NP': 'L5_6NP', 'L6 CT': 'L6CT', 'L6 IT': 'L6IT', 'L6b': 'L6b', 'Lamp5': 'Lamp5', 'Micro-PVM': 'Micro_PVM', 'OPC': 'OPC', 'Oligo': 'Oligo', 'Pvalb': 'Pvalb', 'Sncg': 'Sncg', 'Sst': 'Sst', 'Sst Chodl': 'SstChodl', 'VLMC': 'VLMC', 'Vip': 'Vip'}
crested.pl.modisco.tf_expression_per_cell_type(mean_expression_df, ["Nfia", "Spi1", "Mef2c"])
2026-02-20T13:22:12.004209+0100 INFO Lazily importing module crested.pl. This could take a second...
Generating pattern to database motif dictionary#
classes = list(adata.obs_names)
contribution_dir = "modisco_results_ft_2000"
html_paths = crested.tl.modisco.generate_html_paths(all_patterns, classes, contribution_dir)
# p_val threshold to only select significant matches
pattern_match_dict = crested.tl.modisco.find_pattern_matches(
all_patterns, html_paths, p_val_thr=0.05
)
Loading TF-motif database#
motif_to_tf_df = crested.tl.modisco.read_motif_to_tf_file(motif_to_tf_file)
motif_to_tf_df
| logo | Motif_name | Cluster | Human_Direct_annot | Human_Orthology_annot | Mouse_Direct_annot | Mouse_Orthology_annot | Fly_Direct_annot | Fly_Orthology_annot | Cluster_Human_Direct_annot | Cluster_Human_Orthology_annot | Cluster_Mouse_Direct_annot | Cluster_Mouse_Orthology_annot | Cluster_Fly_Direct_annot | Cluster_Fly_Orthology_annot | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <img src="https://motifcollections.aertslab.org/v10/logos/bergman__Adf1.png" height="52" alt="bergman__Adf1"></img> | bergman__Adf1 | NaN | NaN | NaN | NaN | NaN | Adf1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | <img src="https://motifcollections.aertslab.org/v10/logos/bergman__Aef1.png" height="52" alt="bergman__Aef1"></img> | bergman__Aef1 | NaN | NaN | NaN | NaN | NaN | Aef1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | <img src="https://motifcollections.aertslab.org/v10/logos/bergman__ap.png" height="52" alt="bergman__ap"></img> | bergman__ap | NaN | NaN | NaN | NaN | NaN | ap | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | <img src="https://motifcollections.aertslab.org/v10/logos/elemento__ACCTTCA.png" height="52" alt="elemento__ACCTTCA"></img> | elemento__ACCTTCA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | <img src="https://motifcollections.aertslab.org/v10/logos/bergman__bcd.png" height="52" alt="bergman__bcd"></img> | bergman__bcd | NaN | NaN | NaN | NaN | NaN | bcd | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17990 | <img src="https://motifcollections.aertslab.org/v10/logos/elemento__CAAGGAG.png" height="52" alt="elemento__CAAGGAG"></img> | elemento__CAAGGAG | 98.3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 17991 | <img src="https://motifcollections.aertslab.org/v10/logos/elemento__TCCTTGC.png" height="52" alt="elemento__TCCTTGC"></img> | elemento__TCCTTGC | 98.3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 17992 | <img src="https://motifcollections.aertslab.org/v10/logos/swissregulon__hs__ZNF274.png" height="52" alt="swissregulon__hs__ZNF274"></img> | swissregulon__hs__ZNF274 | 99.1 | ZNF274 | NaN | NaN | Zfp369, Zfp110 | NaN | NaN | ZNF274 | NaN | NaN | Zfp369, Zfp110 | NaN | NaN |
| 17993 | <img src="https://motifcollections.aertslab.org/v10/logos/swissregulon__sacCer__THI2.png" height="52" alt="swissregulon__sacCer__THI2"></img> | swissregulon__sacCer__THI2 | 99.2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 17994 | <img src="https://motifcollections.aertslab.org/v10/logos/jaspar__MA0407.1.png" height="52" alt="jaspar__MA0407.1"></img> | jaspar__MA0407.1 | 99.2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
17995 rows × 15 columns
Matching patterns to TF candidates#
We calculate a pattern-tf by cell type matrix which contains the imporatance of each pattern linked to a TF per cell type using crested.tl.modisco.create_pattern_tf_dict() and crested.tl.modisco.create_tf_ct_matrix()
cols = [
"Mouse_Direct_annot",
"Mouse_Orthology_annot",
"Cluster_Mouse_Direct_annot",
"Cluster_Mouse_Orthology_annot",
]
pattern_tf_dict, all_tfs = crested.tl.modisco.create_pattern_tf_dict(
pattern_match_dict, motif_to_tf_df, all_patterns, cols
)
tf_ct_matrix, tf_pattern_annots = crested.tl.modisco.create_tf_ct_matrix(
pattern_tf_dict,
all_patterns,
mean_expression_df,
classes,
log_transform=True,
normalize_pattern_importances=False,
normalize_gex=True,
min_tf_gex=0.95,
importance_threshold=5.5,
pattern_parameter="seqlet_count_log",
filter_correlation=True,
verbose=True,
zscore_threshold=1.5,
correlation_threshold=0.35,
)
Total columns before threshold filtering: 2845
Total columns after threshold filtering: 300
Total columns removed: 2545
Total columns before correlation filtering: 300
Total columns after correlation filtering: 169
Total columns removed: 131
Finally, we can plot a clustermap of potential pattern-TF matches and their importance per cell type with crested.pl.modisco.clustermap_tf_motif()
crested.pl.modisco.clustermap_tf_motif(
tf_ct_matrix,
heatmap_dim="contrib",
dot_dim="gex",
class_labels=classes,
pattern_labels=tf_pattern_annots,
width=35,
height=6,
cluster_rows=True,
cluster_columns=False,
xtick_rotation=90,
)
crested.pl.modisco.clustermap_tf_motif(
tf_ct_matrix,
heatmap_dim="contrib",
dot_dim="gex",
class_labels=classes,
subset_classes=["Lamp5", "Sncg", "Vip", "Pvalb", "Sst", "SstChodl"],
pattern_labels=tf_pattern_annots,
width=15,
height=6,
cluster_rows=False,
cluster_columns=False,
xtick_rotation=90,
)