Topic classification#
We can use the outputs of pycisTopic to train a model to predict topic probabilities for a given sequence.
Since we plan on adding detailed use cases describing topic classification later on, we will only provide a brief overview of the workflow here. Refer to the introductory notebook for a more detailed explanation of the CREsted workflow.
Import data#
For this tutorial, we will use the mouse BICCN dataset. We will use the preprocessed, binarized outputs of pycisTopic as input data for the topic classification model.
To train a topic classification model, we need the following data:
A folder containing BED files per topic (output of pycisTopic).
A genome FASTA file and, optionally, a chromosome sizes file.
import crested
# Set the genome
genome = crested.Genome("mm10/genome.fa", "mm10/genome.chrom.sizes")
crested.register_genome(genome) # Register the genome so that it's automatically used in every function
2026-02-16T15:04:11.542386+0100 INFO Genome genome registered.
# Download the tutorial data
beds_folder, regions_file = crested.get_dataset("mouse_cortex_bed")
We can import a folder of BED files using the crested.import_beds() function.
This will return an AnnData object with the regions as .var and the bed file names as .obs (here: our topics).
In this case, the adata.X values are binary, representing whether that region is associated with a topic or not.
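To make the layout concrete, here is a minimal numpy sketch of what such a binary topic-by-region matrix looks like (a hypothetical miniature, not the real data), including the removal of regions that are open in no topic, which is the behaviour behind the warning `import_beds` emits below:

```python
import numpy as np

# Hypothetical miniature of adata.X: topics as rows, regions as columns.
# A 1 means pycisTopic's binarization assigned the region to that topic.
X = np.array([
    [1, 0, 1, 0],  # Topic_1
    [0, 0, 1, 0],  # Topic_2
    [1, 1, 0, 0],  # Topic_3
], dtype=np.int8)

# Regions open in no topic carry no training signal; import_beds drops
# them by default (disable with remove_empty_regions=False).
open_in_any = X.sum(axis=0) > 0
X_filtered = X[:, open_in_any]
print(X_filtered.shape)  # (3, 3): the fourth region was removed
```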
# Import the beds into an AnnData object - the regions file is optional for import_beds
adata = crested.import_beds(beds_folder=beds_folder, regions_file=regions_file)
adata
2026-02-16T15:03:14.757905+0100 WARNING Chromsizes file not provided. Will not check if regions are within chromosomes
2026-02-16T15:03:15.642825+0100 INFO Reading bed files from /staging/leuven/stg_00002/lcb/cblaauw/data/mouse_biccn/beds.tar.gz.untar and using /staging/leuven/stg_00002/lcb/cblaauw/data/mouse_biccn/consensus_peaks_biccn.bed as var_names...
2026-02-16T15:03:29.218412+0100 WARNING 107610 consensus regions are not open in any class. Removing them from the AnnData object. Disable this behavior by setting 'remove_empty_regions=False'
AnnData object with n_obs × n_vars = 80 × 439383
obs: 'file_path', 'n_open_regions'
var: 'n_classes', 'chr', 'start', 'end'
We have 80 classes (topics) and 439383 regions in the dataset.
Preprocessing#
Topic classification requires little preprocessing compared to peak regression.
The data does not need to be normalized since the values are binary, and we don’t filter regions on specificity: by the nature of topic modelling, the selected regions should already be ‘meaningful’ regions.
You could change the width of the regions, but we tend to keep the regions at 500bp for topic classification.
The only preprocessing step we need to perform is to split the data into training and testing sets.
# Standard train/val/test split
crested.pp.train_val_test_split(adata, strategy="chr", val_chroms=["chr8", "chr10"], test_chroms=["chr9", "chr18"])
print(adata.var["split"].value_counts())
2026-02-16T15:03:29.634609+0100 INFO Lazily importing module crested.pp. This could take a second...
split
train 354013
val 45113
test 40257
Name: count, dtype: int64
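The chromosome-holdout split above can be sketched in plain Python (with a hypothetical region list; the real function operates on the AnnData `.var` table): every region on a held-out chromosome is assigned to the validation or test set, so no chromosome is shared between splits.

```python
# Minimal sketch of a chromosome-based split, mirroring strategy="chr":
# regions on held-out chromosomes go to val/test, everything else to train.
regions = ["chr1:100-600", "chr8:200-700", "chr9:50-550", "chr18:10-510"]

val_chroms = {"chr8", "chr10"}
test_chroms = {"chr9", "chr18"}

def assign_split(region: str) -> str:
    chrom = region.split(":")[0]
    if chrom in val_chroms:
        return "val"
    if chrom in test_chroms:
        return "test"
    return "train"

splits = {r: assign_split(r) for r in regions}
print(splits)
# {'chr1:100-600': 'train', 'chr8:200-700': 'val',
#  'chr9:50-550': 'test', 'chr18:10-510': 'test'}
```

Holding out whole chromosomes (rather than splitting randomly) prevents nearly identical overlapping regions from leaking between the training and test sets.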
Model training#
Model training has the same workflow as peak regression. The only differences are:
We select a different model architecture. Since we’re training on 500bp regions, we don’t need the dilated convolutions of the dilated CNN architecture used for peak regression.
We select a different config, since we’re monitoring different metrics and using a different loss suited to classification.
# Datamodule
datamodule = crested.tl.data.AnnDataModule(
adata,
batch_size=128, # lower this if you encounter OOM errors
max_stochastic_shift=3, # optional augmentation
always_reverse_complement=True, # default True. Will double the effective size of the training dataset.
)
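The two augmentations configured above can be illustrated with a short standalone sketch (hypothetical helper functions, not CREsted internals): a small stochastic shift jitters the input window by a few base pairs, and reverse complementing lets the model train on both strands of every region.

```python
# Sketch of the datamodule augmentations (illustrative helpers only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    # Complement each base, then reverse: models both DNA strands.
    return seq.translate(COMPLEMENT)[::-1]

def shifted_window(chrom_seq: str, start: int, width: int, shift: int) -> str:
    # A shift in e.g. -3..+3 bp jitters the window without changing its width.
    return chrom_seq[start + shift : start + shift + width]

seq = "ACGTACGTAA"
print(reverse_complement(seq))  # TTACGTACGT
print(shifted_window("N" * 5 + seq + "N" * 5, 5, 10, 2))  # GTACGTAANN
```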
# Architecture: we will use the DeepTopic CNN model
model_architecture = crested.tl.zoo.deeptopic_cnn(seq_len=500, num_classes=80)
# Config: we will use the default topic classification config (binary cross entropy loss and AUC/ROC metrics)
config = crested.tl.default_configs("topic_classification")
print(config)
2026-02-16T15:04:21.343827+0100 INFO Lazily importing module crested.tl. This could take a second...
TaskConfig(optimizer=<keras.src.optimizers.adam.Adam object at 0x14ba081b86e0>, loss=<LossFunctionWrapper(<function binary_crossentropy at 0x14ba02aa5080>, kwargs={'from_logits': False, 'label_smoothing': 0.0, 'axis': -1})>, metrics=[<AUC name=auROC>, <AUC name=auPR>, <CategoricalAccuracy name=categorical_accuracy>])
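To illustrate the loss in the config above, here is a numpy sketch (not the actual Keras call) of binary cross-entropy: each of the 80 topics is treated as an independent binary label, unlike the continuous targets of peak regression.

```python
import numpy as np

def binary_crossentropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Mean per-class binary cross-entropy; clip to avoid log(0).
    eps = 1e-7
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
print(binary_crossentropy(y_true, np.array([0.9, 0.1, 0.8, 0.2])))  # confident & correct: ~0.16
print(binary_crossentropy(y_true, np.array([0.1, 0.9, 0.2, 0.8])))  # confident & wrong: ~1.96
```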
Set up the trainer object and train the model:
trainer = crested.tl.Crested(
data=datamodule,
model=model_architecture,
config=config,
project_name="mouse_biccn", # change to your liking
run_name="topic_classification",
logger='wandb', # or 'tensorboard', None
)
trainer.fit(epochs=100)
Evaluation and prediction#
Evaluation and prediction follow the same workflow as peak regression.
The next steps you could take are to:
Evaluate the model on the test set.
Predict topic probabilities for a given sequence or region.
Run tfmodisco to find motifs associated with each topic.
Generate synthetic sequences for each topic using in silico evolution.
Plot contribution scores per topic for interesting regions or sequences.
Refer to the introduction notebook for more details.