crested.tl.modisco.calculate_tomtom_similarity_per_pattern
- crested.tl.modisco.calculate_tomtom_similarity_per_pattern(matched_files, trim_ic_threshold=0.05, use_ppm=False, background_freqs=None, verbose=False)
Compute pairwise similarity between all trimmed patterns across matched HDF5 files using TOMTOM.
This function reads in motif patterns from HDF5 files (e.g., from a TF-MoDISco pipeline), trims them based on information content, converts them to PPMs, and computes a full pairwise similarity matrix using TOMTOM. It also returns pattern metadata, including the contribution scores and the number of seqlets per pattern.
- Parameters:
matched_files (dict[str, str | list[str] | None]) – Dictionary mapping cell type names (or class names) to HDF5 file paths or list of paths containing TF-MoDISco results. A value of None indicates no data for that cell type.
trim_ic_threshold (float, default: 0.05) – Threshold for trimming low-information-content ends of patterns. Defaults to 0.05.
verbose (bool, default: False) – If True, prints progress messages.
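The accepted shapes of `matched_files` values (a single path, a list of paths, or None) can be illustrated with a small sketch. The cell type names, the file paths, and the helper `paths_per_class` are hypothetical and only serve to show the three cases; they are not part of the library.

```python
# Hypothetical cell type names and HDF5 paths, for illustration only.
matched_files = {
    "Astro": "modisco_results/Astro.h5",  # single path
    "Oligo": [  # several paths for one class
        "modisco_results/Oligo_rep1.h5",
        "modisco_results/Oligo_rep2.h5",
    ],
    "Micro": None,  # no TF-MoDISco results for this cell type
}


def paths_per_class(files):
    """Normalize each value to a list of paths (empty list for None)."""
    normalized = {}
    for name, value in files.items():
        if value is None:
            normalized[name] = []
        elif isinstance(value, str):
            normalized[name] = [value]
        else:
            normalized[name] = list(value)
    return normalized


normalized = paths_per_class(matched_files)
for name, paths in normalized.items():
    print(f"{name}: {len(paths)} file(s)")
```

A dictionary of this shape would then be passed as the first argument of the function.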
- Return type:
- Returns:
- similarity_matrix
A 2D square NumPy array of shape (N, N), where N is the number of trimmed patterns across all cell types. Each entry [i, j] contains the TOMTOM similarity score (-log10 p-value) between pattern i and pattern j.
- all_pattern_ids
A list of unique pattern identifiers, corresponding to the rows and columns in similarity_matrix.
- pattern_dict
A dictionary mapping each pattern ID to a dictionary containing:
'contrib_scores': the contribution score matrix (for visualization),
'n_seqlets': the number of seqlets contributing to the pattern.
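As a rough sketch of how the three return values fit together, the example below uses small synthetic stand-ins instead of calling the function, so it runs without any HDF5 files. The pattern ID format, the scores, and the helper `similar_pairs` are assumptions made for illustration. Because the scores are -log10 p-values, a threshold of 3.0 corresponds to p < 0.001. The helper only scans the upper triangle, which assumes the matrix is symmetric; if it is not, both [i, j] and [j, i] would need checking.

```python
import numpy as np

# Synthetic stand-ins for the returned values. In practice they would come from:
# similarity_matrix, all_pattern_ids, pattern_dict = (
#     crested.tl.modisco.calculate_tomtom_similarity_per_pattern(matched_files)
# )
all_pattern_ids = ["Astro_pos_0", "Astro_neg_0", "Oligo_pos_0"]  # hypothetical IDs
similarity_matrix = np.array(
    [
        [9.0, 0.4, 5.2],
        [0.4, 9.0, 0.1],
        [5.2, 0.1, 9.0],
    ]
)
pattern_dict = {
    pid: {"contrib_scores": np.zeros((10, 4)), "n_seqlets": n}
    for pid, n in zip(all_pattern_ids, [120, 35, 80])
}


def similar_pairs(sim, ids, threshold):
    """Return (id_i, id_j, score) for off-diagonal pairs scoring above threshold."""
    rows, cols = np.triu_indices(len(ids), k=1)  # upper triangle, no diagonal
    keep = sim[rows, cols] > threshold
    return [(ids[i], ids[j], float(sim[i, j])) for i, j in zip(rows[keep], cols[keep])]


pairs = similar_pairs(similarity_matrix, all_pattern_ids, threshold=3.0)
for id_i, id_j, score in pairs:
    n_i = pattern_dict[id_i]["n_seqlets"]
    n_j = pattern_dict[id_j]["n_seqlets"]
    print(f"{id_i} ({n_i} seqlets) ~ {id_j} ({n_j} seqlets): score {score:.1f}")
```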
Notes
Patterns are first trimmed using _read_and_trim_patterns.
PPMs are computed using _pattern_to_ppm and inserted into each pattern dictionary.
Similarity is computed using match_score_patterns, which uses TOMTOM under the hood.
The function assumes the presence of external dependencies like _read_and_trim_patterns, _pattern_to_ppm, and match_score_patterns, typically from a motif analysis library.
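To give an intuition for trimming by information content, here is a generic sketch that removes low-information positions from both ends of a PPM. It is not the implementation of _read_and_trim_patterns; the helper names, the uniform background, and the interpretation of the threshold as an absolute value in bits are all assumptions, and the library may apply trim_ic_threshold differently (for example to a normalized score).

```python
import numpy as np


def information_content(ppm, background=None, eps=1e-9):
    """Per-position information content (bits) of a (length, 4) PPM."""
    ppm = np.asarray(ppm, dtype=float)
    if background is None:
        # Assumed uniform background over the alphabet.
        background = np.full(ppm.shape[1], 1.0 / ppm.shape[1])
    return (ppm * np.log2((ppm + eps) / background)).sum(axis=1)


def trim_by_ic(ppm, threshold=0.05):
    """Drop leading and trailing positions whose IC is at or below threshold."""
    ppm = np.asarray(ppm, dtype=float)
    keep = np.flatnonzero(information_content(ppm) > threshold)
    if keep.size == 0:
        return ppm[:0]  # nothing informative remains
    return ppm[keep[0] : keep[-1] + 1]


# Toy PPM: uninformative flanks around a three-position core.
ppm = np.array(
    [
        [0.25, 0.25, 0.25, 0.25],
        [0.97, 0.01, 0.01, 0.01],
        [0.25, 0.25, 0.25, 0.25],  # internal positions are kept
        [0.01, 0.01, 0.97, 0.01],
        [0.25, 0.25, 0.25, 0.25],
    ]
)
trimmed = trim_by_ic(ppm, threshold=0.05)
print(trimmed.shape)
```

Only the flanks are removed; a low-information position inside the core stays in place so the spacing of the motif is preserved.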