crested.pp.train_val_test_split
crested.pp.train_val_test_split(adata, strategy='region', val_size=0.1, test_size=0.1, val_chroms=None, test_chroms=None, shuffle=True, random_state=None, inplace=True)
Add ‘train/val/test’ split column to AnnData object.
Adds a new ‘split’ column to the .var DataFrame of the AnnData object, indicating whether each sample should be part of the training, validation, or test set based on the chosen splitting strategy.
Note
Model training always requires a ‘split’ column in the .var DataFrame.
- Parameters:
adata (AnnData) – AnnData object to which the ‘train/val/test’ split column will be added.
strategy (Literal['region', 'chr', 'chr_auto'] (default: 'region')) – Splitting strategy. Either ‘region’, ‘chr’ or ‘chr_auto’. If ‘chr’ or ‘chr_auto’, the AnnData’s var_names should start with the chromosome name, followed by a colon (e.g. I:2000-2500 or chr3:10-20:+).
    region: Split randomly on region indices.
    chr: Split based on provided chromosomes.
    chr_auto: Automatically select chromosomes for the val and test sets based on val_size and test_size.
    If strategy is ‘chr’, the same chromosome(s) can also be provided to both val_chroms and test_chroms. In this case, the regions on those chromosomes are divided evenly between the two sets.
val_size (float (default: 0.1)) – Proportion of the training dataset to include in the validation split.
test_size (float (default: 0.1)) – Proportion of the dataset to include in the test split.
val_chroms (str | list[str] (default: None)) – Single chromosome or list of chromosomes to include in the validation set. Required if strategy='chr'.
test_chroms (str | list[str] (default: None)) – Single chromosome or list of chromosomes to include in the test set. Required if strategy='chr'.
shuffle (bool (default: True)) – Whether to shuffle the data before splitting (when strategy='region').
random_state (None | int (default: None)) – Seed that affects the ordering of the indices when shuffling regions or when automatically splitting on chromosomes.
inplace (bool (default: True)) – Whether to modify adata in place or return a modified copy of adata instead.
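The chromosome-based rules described above can be pictured with a small standard-library sketch. It is not CREsted's implementation: the helper names chrom_of and split_by_chrom are hypothetical, and alternating the regions of a shared chromosome between val and test is only one assumed way of dividing them evenly.

```python
from collections import defaultdict


def chrom_of(var_name: str) -> str:
    """Return the chromosome prefix of a region name such as 'chr3:10-20:+'."""
    chrom, sep, _ = var_name.partition(":")
    if not sep or not chrom:
        raise ValueError(f"Region name {var_name!r} has no 'chrom:' prefix")
    return chrom


def split_by_chrom(var_names, val_chroms, test_chroms):
    """Label each region 'train', 'val', or 'test' from its chromosome.

    Assumption: regions on chromosomes listed in both val_chroms and
    test_chroms are alternated between the two sets.
    """
    val_chroms, test_chroms = set(val_chroms), set(test_chroms)
    shared = val_chroms & test_chroms
    seen = defaultdict(int)  # regions seen so far per shared chromosome
    labels = []
    for name in var_names:
        chrom = chrom_of(name)
        if chrom in shared:
            labels.append("val" if seen[chrom] % 2 == 0 else "test")
            seen[chrom] += 1
        elif chrom in val_chroms:
            labels.append("val")
        elif chrom in test_chroms:
            labels.append("test")
        else:
            labels.append("train")
    return labels


regions = ["chr1:0-500", "chr2:0-500", "chr3:10-20:+", "chr3:30-40:-", "I:2000-2500"]
labels = split_by_chrom(regions, val_chroms=["chr2", "chr3"], test_chroms=["chr3"])
print(dict(zip(regions, labels)))
```

Here chr3 appears in both lists, so its two regions end up in different sets, while regions on unlisted chromosomes stay in the training set.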
- Return type:
AnnData | None
- Returns:
If inplace=True (default), modifies the AnnData object in place and returns nothing. If inplace=False, returns the AnnData object with the ‘split’ column added to .var.
Examples
>>> crested.pp.train_val_test_split(
...     adata,
...     strategy="region",
...     val_size=0.1,
...     test_size=0.1,
...     shuffle=True,
...     random_state=42,
... )

>>> crested.pp.train_val_test_split(
...     adata,
...     strategy="chr",
...     val_chroms=["chr1", "chr2"],
...     test_chroms=["chr3", "chr4"],
... )
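To give an intuition for what a region-based split with a fixed random_state does, here is a minimal standard-library sketch. It is an illustration rather than the library's code: the function split_by_region is hypothetical, and it assumes that val_size and test_size are both fractions of the full set of regions, which may differ from how CREsted computes the validation fraction.

```python
import random


def split_by_region(n_regions, val_size=0.1, test_size=0.1, shuffle=True, random_state=None):
    """Return one 'train'/'val'/'test' label per region index."""
    if not 0 <= val_size + test_size < 1:
        raise ValueError("val_size + test_size must be in [0, 1)")
    indices = list(range(n_regions))
    if shuffle:
        # A seeded generator makes the shuffle, and so the split, reproducible.
        random.Random(random_state).shuffle(indices)
    n_test = round(n_regions * test_size)
    n_val = round(n_regions * val_size)
    labels = ["train"] * n_regions
    for i in indices[:n_test]:
        labels[i] = "test"
    for i in indices[n_test:n_test + n_val]:
        labels[i] = "val"
    return labels


labels = split_by_region(1000, val_size=0.1, test_size=0.1, random_state=42)
counts = {name: labels.count(name) for name in ("train", "val", "test")}
print(counts)  # {'train': 800, 'val': 100, 'test': 100}
```

Calling the sketch twice with the same random_state yields identical labels, which is the practical reason to set a seed when the split needs to be reproducible.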