crested.tl.modisco.calculate_tomtom_similarity_per_pattern

crested.tl.modisco.calculate_tomtom_similarity_per_pattern(matched_files, trim_ic_threshold=0.05, use_ppm=False, background_freqs=None, verbose=False)

Compute pairwise similarity between all trimmed patterns across matched HDF5 files using TOMTOM.

This function reads in motif patterns from HDF5 files (e.g., from a TF-MoDISco pipeline), trims them based on information content, converts them to PPMs, and computes a full pairwise similarity matrix using TOMTOM. It also returns pattern metadata, including the contribution scores and the number of seqlets per pattern.

Parameters:
  • matched_files (dict[str, str | list[str] | None]) – Dictionary mapping cell type names (or class names) to an HDF5 file path, or a list of paths, containing TF-MoDISco results. A value of None indicates no data for that cell type.

  • trim_ic_threshold (float (default: 0.05)) – Information-content threshold used to trim the low-information ends of each pattern.

  • verbose (bool (default: False)) – If True, prints progress messages.
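
As a minimal sketch of the matched_files shape described above, the dictionary below mixes the three allowed value types. The cell type names and file paths are made up for illustration, and count_files is a hypothetical helper that is not part of CREsted.

```python
# Hypothetical cell type names and paths, for illustration only.
matched_files = {
    "Astro": "modisco_results/Astro_modisco_results.h5",  # single path
    "L5ET": [  # several paths for one class
        "modisco_results/L5ET_run1.h5",
        "modisco_results/L5ET_run2.h5",
    ],
    "Micro": None,  # no data for this class
}


def count_files(files):
    """Count HDF5 paths per class, treating None as zero files (hypothetical helper)."""
    counts = {}
    for name, value in files.items():
        if value is None:
            counts[name] = 0
        elif isinstance(value, str):
            counts[name] = 1
        else:
            counts[name] = len(value)
    return counts


print(count_files(matched_files))

# With CREsted installed and real files in place, the call would look like:
# import crested
# sim, ids, patterns = crested.tl.modisco.calculate_tomtom_similarity_per_pattern(
#     matched_files, trim_ic_threshold=0.05, verbose=True
# )
```

Checking the dictionary this way before the call can help spot classes that will contribute no patterns.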

Return type:

tuple[ndarray, list[str], dict[str, dict]]

Returns:

similarity_matrix

A 2D square NumPy array of shape (N, N), where N is the number of trimmed patterns across all cell types. Each entry [i, j] contains the TOMTOM similarity score (-log10 p-value) between pattern i and pattern j.

all_pattern_ids

A list of unique pattern identifiers, corresponding to the rows and columns in similarity_matrix.

pattern_dict

A dictionary mapping each pattern ID to a dictionary containing:
  • 'contrib_scores': the contribution score matrix (for visualization),

  • 'n_seqlets': the number of seqlets contributing to the pattern.
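
One way to work with the returned values is to list the most similar pattern pairs. The sketch below uses toy data in place of a real result; the pattern IDs are invented, top_similar_pairs is a hypothetical helper, and the cutoff of 1.3 (roughly p = 0.05 on the -log10 scale) is an arbitrary choice. It reads only entries above the diagonal, so if the matrix is not symmetric the [j, i] scores are ignored.

```python
import numpy as np


def top_similar_pairs(similarity_matrix, all_pattern_ids, min_score=1.3):
    """Return (id_i, id_j, score) for off-diagonal pairs scoring at least min_score."""
    sim = np.asarray(similarity_matrix, dtype=float)
    # Upper triangle only (k=1 skips the diagonal), so each pair appears once.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    pairs = [
        (all_pattern_ids[i], all_pattern_ids[j], float(sim[i, j]))
        for i, j in zip(rows, cols)
        if sim[i, j] >= min_score
    ]
    # Highest score (smallest p-value) first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy stand-ins for the function's first two return values.
all_pattern_ids = ["Astro_pos_0", "Astro_pos_1", "L5ET_pos_0"]
similarity_matrix = np.array(
    [
        [10.0, 0.2, 3.5],
        [0.2, 10.0, 1.0],
        [3.5, 1.0, 10.0],
    ]
)

for id_i, id_j, score in top_similar_pairs(similarity_matrix, all_pattern_ids):
    print(f"{id_i} ~ {id_j}: -log10(p) = {score:.2f}")
```

The same IDs can then be used as keys into pattern_dict to look up 'contrib_scores' and 'n_seqlets' for the matched patterns.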

Notes

  • Patterns are first trimmed using _read_and_trim_patterns.

  • PPMs are computed using _pattern_to_ppm and inserted into each pattern dictionary.

  • Similarity is computed using match_score_patterns, which uses TOMTOM under the hood.

  • The function assumes the presence of external dependencies like _read_and_trim_patterns, _pattern_to_ppm, and match_score_patterns, typically from a motif analysis library.
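
To illustrate what trimming on information content can look like, here is a minimal sketch. It is not CREsted's _read_and_trim_patterns; the uniform background of 0.25, the use of bits, and the way the threshold is applied are all assumptions, and information_content and trim_by_ic are hypothetical names.

```python
import numpy as np


def information_content(ppm, background=0.25, eps=1e-9):
    """Per-position information content in bits for an (L, 4) probability matrix."""
    ppm = np.asarray(ppm, dtype=float)
    # eps avoids log2(0) for positions with zero-probability bases.
    return np.sum(ppm * np.log2((ppm + eps) / background), axis=1)


def trim_by_ic(ppm, threshold=0.05):
    """Drop leading and trailing positions whose IC is below threshold."""
    ppm = np.asarray(ppm, dtype=float)
    keep = np.where(information_content(ppm) >= threshold)[0]
    if keep.size == 0:
        return ppm[:0]  # nothing informative: return an empty (0, 4) matrix
    # Keep everything between the first and last informative position,
    # including any low-IC positions in the interior.
    return ppm[keep[0] : keep[-1] + 1]


# Toy PPM: uninformative flanks around two informative positions.
uniform = [0.25, 0.25, 0.25, 0.25]
ppm = np.array(
    [
        uniform,
        [0.97, 0.01, 0.01, 0.01],
        uniform,
        [0.01, 0.01, 0.97, 0.01],
        uniform,
    ]
)

trimmed = trim_by_ic(ppm, threshold=0.05)
print(trimmed.shape)
```

In this sketch only the flanks are removed, so the uninformative position between the two informative ones is retained.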