crested.utils.calculate_nucleotide_distribution
- crested.utils.calculate_nucleotide_distribution(input, genome=None, per_position=False, n_regions=None)
Calculate the nucleotide distribution of a genome in a set of regions or sequences.
- Parameters:
input (str | list[str] | ndarray | AnnData) – Input data to calculate the ACGT distribution of. Can be a (list of) sequence(s), a (list of) region name(s), a matrix of one hot encodings (N, L, 4), or an AnnData object with region names as its var_names.
genome (Genome | str | PathLike | None, default: None) – The genome object or path to the genome fasta file. Required if input is a region or AnnData.
per_position (bool, default: False) – If True, calculate the nucleotide distribution per position in the sequence instead of over the whole sequence.
n_regions (int | None, default: None) – Randomly sample n_regions from the input. If None, all inputs are used. This is useful for large datasets to speed up the calculation.
- Return type:
- Returns:
If per_position is False, an array of floats of shape (4,) holding the nucleotide distribution in the order A, C, G, T. Otherwise, an array of shape (L, 4) holding the nucleotide distribution at each of the L positions.
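
To illustrate the difference between the two return shapes, the following is a minimal standard-library sketch of the kind of computation described above. It is not CREsted's implementation: the helper name `nucleotide_distribution_sketch` is hypothetical, it accepts only plain sequence strings (no region names, one-hot matrices, or AnnData), it returns nested lists instead of arrays, and skipping characters outside A, C, G, T is an assumption made for this example.

```python
from collections import Counter

ALPHABET = "ACGT"  # column order matches the documented return order


def nucleotide_distribution_sketch(sequences, per_position=False):
    """Hypothetical stand-in for illustration; not the library function."""
    sequences = [seq.upper() for seq in sequences]

    if not per_position:
        # Pool all bases from all sequences into one frequency vector of length 4.
        counts = Counter(b for seq in sequences for b in seq if b in ALPHABET)
        total = sum(counts.values())
        return [counts[b] / total if total else 0.0 for b in ALPHABET]

    # Per-position mode: one frequency vector per position, giving L rows of 4.
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("per-position mode needs sequences of equal length")
    rows = []
    for pos in range(length):
        # Assumption: bases outside ACGT (e.g. N) are ignored at this position.
        counts = Counter(seq[pos] for seq in sequences if seq[pos] in ALPHABET)
        total = sum(counts.values())
        rows.append([counts[b] / total if total else 0.0 for b in ALPHABET])
    return rows


seqs = ["ACGT", "AAGT", "ACGG", "ACTT"]
overall = nucleotide_distribution_sketch(seqs)                     # 4 values: A, C, G, T
per_pos = nucleotide_distribution_sketch(seqs, per_position=True)  # L rows of 4 values
```

Here `overall` is a single frequency vector over all 16 bases, while `per_pos` has one row per sequence position, mirroring the (4,) versus (L, 4) shapes described under Returns.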