Architecture

Layer separation

SeqChain enforces strict layer separation. Each layer has one clear responsibility, and imports that break the layer ordering are forbidden (enforced by tests/test_architecture.py).

| Layer | Package | Role |
| --- | --- | --- |
| REST API | api/ | Thin HTTP handlers, one library call each |
| Recipes | recipes/ | Domain workflows: CRISPR, Tn-seq, barcodes |
| Operations | operations/ | Protocols + implementations (Map, Annotate, Filter, Score, Discover, Quantify) |
| Transform | transform/ | Track → Track pure functions |
| Compare | compare/ | Track × Track → Track, protocol-typed |
| Primitives | primitives/ | Pure functions, zero deps, no I/O |
| I/O | io/ | Genome, track, reads loading + export |

Layers form a strict linear hierarchy. Each layer may only import from layers below it in rank. The ordering is defined in tests/layer_hierarchy.yaml and enforced generically by tests/test_architecture.py.

Import rules

| Rank | Layer | Can import from |
| --- | --- | --- |
| 8 | API | recipes, operations, compare, transform, track, I/O |
| 7 | Recipes | operations, compare, transform, track, primitives, I/O |
| 6 | Compare | operations, transform, track, primitives |
| 5 | Operations | transform, track, primitives, I/O |
| 4 | Transform | track, primitives |
| 3 | Track | region, primitives |
| 2 | Primitives | region |
| 1 | Region | (nothing) |

Skip-layer rules add further constraints (e.g. API must not import Primitives). These are also declared in the YAML and enforced by the same test.
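To make the rule concrete, here is a minimal sketch of how such a check could work. It is not the contents of tests/test_architecture.py: the package name `seqchain`, the `RANKS` and `FORBIDDEN` tables, and both helper functions are assumptions made for illustration, and the schema of tests/layer_hierarchy.yaml is not shown in this document. I/O has no rank in the table above, so the sketch ignores it.

```python
import ast

# Hypothetical rank table copied from the import-rules table above.
# The real ordering lives in tests/layer_hierarchy.yaml.
RANKS = {
    "region": 1, "primitives": 2, "track": 3, "transform": 4,
    "operations": 5, "compare": 6, "recipes": 7, "api": 8,
}

# Hypothetical encoding of the one skip-layer rule named in the text.
FORBIDDEN = {("api", "primitives")}


def imported_layers(source: str, package: str = "seqchain") -> set[str]:
    """Return the ranked layers that `source` imports from `package`.

    Only absolute imports are inspected; relative imports are ignored.
    """
    layers = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            if node.module == package:
                # from seqchain import track
                names = [f"{package}.{alias.name}" for alias in node.names]
            else:
                # from seqchain.track import IntervalTrack
                names = [node.module]
        elif isinstance(node, ast.Import):
            # import seqchain.compare.fold
            names = [alias.name for alias in node.names]
        else:
            continue
        for name in names:
            parts = name.split(".")
            if parts[0] == package and len(parts) > 1 and parts[1] in RANKS:
                layers.add(parts[1])
    return layers


def violations(layer: str, source: str) -> list[str]:
    """List the rule violations for one module belonging to `layer`."""
    bad = []
    for target in sorted(imported_layers(source)):
        if RANKS[target] > RANKS[layer]:
            bad.append(f"{layer} imports {target} (higher-ranked layer)")
        elif (layer, target) in FORBIDDEN:
            bad.append(f"{layer} imports {target} (skip-layer rule)")
    return bad
```

A test in this style would walk every module in a layer's package, call `violations` on its source, and fail if any list is non-empty.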

Universal types

Region

Every pipeline step takes Regions and returns Regions. A Region is a frozen dataclass:

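from dataclasses import dataclass, field

@dataclass(frozen=True, slots=True)
class Region:
    chrom: str
    start: int
    end: int
    strand: str = "."
    score: float = 0.0
    name: str = ""
    # Extension point: operations add keys here instead of subclassing
    tags: dict = field(default_factory=dict)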

The tags dict is the extension point — operations enrich it with spacer, gene, fold_change, chromatin_states, etc. No separate Hit/AnnotatedHit/ScoredHit types.

Track

Track is a Protocol (structural typing, not inheritance). Any object with the right methods satisfies it:

  • signal_at(chrom, start, end) — summary value for a region
  • regions(chrom, start, end) — iterate overlapping Regions
  • map_scores(fn) — apply a function to every score
  • filter_entries(fn) — keep entries matching a predicate
  • scores() — iterate all scores
  • with_scores(scores) — replace scores

Three implementations:

| Type | Use case | signal_at | regions |
| --- | --- | --- | --- |
| IntervalTrack | Peaks, genes, guides | Overlap-weighted average | Segment-tree query |
| SignalTrack | BigWig per-base data | Mean signal | Always empty |
| TableTrack | Barcode counts | Always 0.0 | Always empty |

The three processing layers

Transform: Track → Track

Thirteen pure functions organized by family:

  • missing: fill_missing, replace_zeros, drop_missing, drop_zeros
  • scale: add_pseudocount, log_transform, clamp, normalize, standardize, rank
  • threshold: binarize, filter_by_score
  • bin: bin_summarize (SignalTrack → IntervalTrack)

No protocols, no isinstance. Every function calls Track primitives (map_scores, filter_entries, with_scores) and returns a new Track.

Compare: Track × Track → Track

Differential analysis via module-level functions. interval_fold_change() aligns two tracks by region name or binned overlap and computes raw fold change.

The alignment primitive (align_tracks) pairs regions across two tracks and reports which track each region appears in. Comparison functions consume aligned pairs.
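The align-then-compare flow can be sketched as below, for the by-name case only. This is not the signature of `align_tracks` or `interval_fold_change()`: the sketch works on plain name-to-score dicts instead of Tracks, `AlignedPair`, `align_by_name`, and `fold_change` are hypothetical names, and the ratio direction (second track over first) is an assumption.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class AlignedPair(NamedTuple):
    name: str
    a: Optional[float]  # score in the first track, None if absent there
    b: Optional[float]  # score in the second track, None if absent there


def align_by_name(
    scores_a: dict[str, float], scores_b: dict[str, float]
) -> Iterator[AlignedPair]:
    """Pair entries across two tracks by region name.

    A None on either side records that the region appears in only one track.
    """
    for name in sorted(scores_a.keys() | scores_b.keys()):
        yield AlignedPair(name, scores_a.get(name), scores_b.get(name))


def fold_change(pairs: Iterable[AlignedPair]) -> dict[str, float]:
    """Raw fold change b / a for regions present in both tracks.

    Pairs with a missing side or a zero denominator are skipped.
    """
    return {
        p.name: p.b / p.a
        for p in pairs
        if p.a is not None and p.b is not None and p.a != 0
    }


control = {"g1": 2.0, "g2": 4.0}
treated = {"g2": 8.0, "g3": 1.0}
pairs = list(align_by_name(control, treated))
changes = fold_change(pairs)
```

Keeping alignment separate from the arithmetic means other comparison functions can reuse the same pairs and decide for themselves how to treat regions found in only one track.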

Operations: raw data → Tracks

Module-level functions and protocols:

| Category | Functions | What it does |
| --- | --- | --- |
| Map | regex_map(), bowtie_map() | Find patterns in sequences |
| Annotate | annotate_features(), annotate_chromatin() | Enrich Regions with context |
| Quantify | count_barcodes(), summarize_features() | Count barcodes or summarize features |
| Discover | heuristic_discover(), tnseq_discover() | Auto-detect read structure |
| Score | score_mismatches(), score_off_targets() | Score variant effects |
| Filter | filter_by_overlap() | Filter by geometric overlap (recipe) |
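To illustrate the Map category, here is a rough sketch of a regex-based mapper that turns pattern hits in a sequence string into Regions. It is not the real `regex_map()`: the name `regex_map_sketch`, its signature, the `match` tag, forward-strand-only search, and 0-based half-open coordinates are all assumptions.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int
    end: int
    strand: str = "."
    score: float = 0.0
    name: str = ""
    tags: dict = field(default_factory=dict)


def regex_map_sketch(chrom: str, sequence: str, pattern: str) -> list[Region]:
    """Return one Region per forward-strand match of `pattern`.

    The pattern is wrapped in a lookahead so that overlapping matches are
    all reported; a plain finditer would skip past each consumed match.
    """
    lookahead = re.compile(f"(?=({pattern}))")
    hits = []
    for match in lookahead.finditer(sequence):
        hits.append(
            Region(
                chrom,
                match.start(1),
                match.end(1),
                strand="+",
                tags={"match": match.group(1)},
            )
        )
    return hits


# Any base followed by GG, e.g. an NGG-style motif.
hits = regex_map_sketch("chr1", "ACGGTGGA", "[ACGT]GG")
```

The output is a list of ordinary Regions, so downstream Annotate, Score, and Filter steps can consume it without a dedicated hit type.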

Design principles

  • Prefer functions over classes — Most operations are module-level functions. No remaining protocols in operations.
  • No inheritance — Protocols (structural typing) only.
  • Immutability — Frozen dataclasses, dataclasses.replace() for updates.
  • Sequences are strings — No Bio.Seq. Biopython is I/O only.
  • Chromatin states are mark combinations — No gene names. Gene features are a separate layer.
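As a small illustration of the "sequences are strings" and "prefer functions" principles, a primitives-style helper might look like the sketch below. Both function names are hypothetical and not taken from SeqChain's primitives package. They use only built-in string methods, perform no I/O, and return new values.

```python
# Translation table built once at import time; covers upper and lower case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C bases, or 0.0 for an empty string."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)
```

Plain `str` inputs and outputs keep such helpers free of third-party dependencies, consistent with Biopython being confined to the I/O layer.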