crested.tl.modisco.calculate_tomtom_similarity_per_pattern

crested.tl.modisco.calculate_tomtom_similarity_per_pattern(matched_files, trim_ic_threshold=0.05, use_ppm=False, background_freqs=None, verbose=False)

Compute pairwise similarity between all trimmed patterns across matched HDF5 files using TOMTOM.

This function reads in motif patterns from HDF5 files (e.g., from a TF-MoDISco pipeline), trims them based on information content, converts them to PPMs, and computes a full pairwise similarity matrix using TOMTOM. It also returns pattern metadata, including the contribution scores and the number of seqlets per pattern.

Parameters:

matched_files (dict[str, str | list[str] | None]) – Dictionary mapping cell type names (or class names) to HDF5 file paths or list of paths containing TF-MoDISco results. A value of None indicates no data for that cell type.
trim_ic_threshold (float (default: 0.05)) – Information-content threshold used to trim low-information positions from the ends of each pattern.
verbose (bool (default: False)) – If True, prints progress messages.
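The accepted shapes of matched_files can be illustrated with a small sketch. The helper flatten_matched_files, the class names, and the file paths below are hypothetical and only show how a str, a list of str, or None might be normalized per class; this is not necessarily how the function handles them internally.

```python
def flatten_matched_files(matched_files):
    """Return (class_name, path) pairs, skipping classes mapped to None.

    Hypothetical helper for illustration; not part of crested.
    """
    pairs = []
    for class_name, value in matched_files.items():
        if value is None:
            # None means there is no TF-MoDISco result for this class.
            continue
        # A single path and a list of paths are treated uniformly.
        paths = [value] if isinstance(value, str) else list(value)
        pairs.extend((class_name, path) for path in paths)
    return pairs


# Hypothetical class names and paths.
matched_files = {
    "Astro": "modisco/Astro.h5",
    "Oligo": ["modisco/Oligo_pos.h5", "modisco/Oligo_neg.h5"],
    "Micro": None,
}
pairs = flatten_matched_files(matched_files)

# With crested installed and real files in place, the call itself would be:
# import crested
# sim, ids, pattern_dict = crested.tl.modisco.calculate_tomtom_similarity_per_pattern(
#     matched_files, trim_ic_threshold=0.05, verbose=True
# )
```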

Return type:

tuple[ndarray, list[str], dict[str, dict]]

Returns:

similarity_matrix

A 2D square NumPy array of shape (N, N), where N is the number of trimmed patterns across all cell types. Each entry [i, j] contains the TOMTOM similarity score (-log10 p-value) between pattern i and pattern j.

all_pattern_ids

A list of unique pattern identifiers, corresponding to the rows and columns in similarity_matrix.

pattern_dict

A dictionary mapping each pattern ID to a dictionary containing:

'contrib_scores': the contribution score matrix (for visualization),
'n_seqlets': the number of seqlets contributing to the pattern.
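As a rough sketch of how the returned values might be used, the example below finds the most similar other pattern for each pattern. The matrix values and pattern IDs are made up for illustration; real IDs and scores come from the function's output.

```python
import numpy as np

# Toy stand-ins for the function's first two return values (hypothetical data).
all_pattern_ids = ["Astro_pos_0", "Astro_pos_1", "Oligo_pos_0"]
similarity_matrix = np.array(
    [
        [9.0, 1.2, 6.5],
        [1.2, 8.0, 0.3],
        [6.5, 0.3, 9.5],
    ]
)


def best_matches(sim, ids):
    """Map each pattern ID to (most similar other ID, score)."""
    masked = np.array(sim, dtype=float)  # copy so the input is untouched
    np.fill_diagonal(masked, -np.inf)  # ignore self-matches
    best = masked.argmax(axis=1)
    return {ids[i]: (ids[j], float(masked[i, j])) for i, j in enumerate(best)}


matches = best_matches(similarity_matrix, all_pattern_ids)

# Scores are -log10 p-values, so the p-value can be recovered as 10 ** (-score).
p_values = {pid: 10.0 ** (-score) for pid, (_, score) in matches.items()}
```

A higher score means a smaller p-value, so argmax picks the closest match in each row.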

Notes

Patterns are first trimmed using _read_and_trim_patterns.
PPMs are computed using _pattern_to_ppm and inserted into each pattern dictionary.
Similarity is computed using match_score_patterns, which uses TOMTOM under the hood.
The function assumes the presence of external dependencies like _read_and_trim_patterns, _pattern_to_ppm, and match_score_patterns, typically from a motif analysis library.
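The trimming step can be pictured with a minimal NumPy sketch. It assumes a uniform background, information content measured in bits, and an absolute threshold applied to the flanks; the internal _read_and_trim_patterns helper may differ in any of these respects, and the functions below are illustrative, not crested's code.

```python
import numpy as np


def information_content(ppm, pseudocount=1e-6):
    """Per-position information content in bits, assuming a uniform background."""
    ppm = np.asarray(ppm, dtype=float)
    background = 1.0 / ppm.shape[1]
    p = ppm + pseudocount  # avoid log2(0)
    p = p / p.sum(axis=1, keepdims=True)
    return (p * np.log2(p / background)).sum(axis=1)


def trim_by_ic(ppm, threshold=0.05):
    """Return (start, end) bounds that drop low-information flanking positions."""
    ic = information_content(ppm)
    keep = np.flatnonzero(ic >= threshold)
    if keep.size == 0:
        return 0, 0  # nothing passes the threshold
    return int(keep[0]), int(keep[-1]) + 1


# Toy PPM (positions x A/C/G/T): uninformative flanks around a two-base core.
ppm = np.array(
    [
        [0.25, 0.25, 0.25, 0.25],
        [0.97, 0.01, 0.01, 0.01],
        [0.01, 0.01, 0.97, 0.01],
        [0.25, 0.25, 0.25, 0.25],
    ]
)
start, end = trim_by_ic(ppm, threshold=0.05)
trimmed = ppm[start:end]
```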
