Architecture¶
Layer separation¶
SeqChain enforces strict layer separation. Each layer has a clear
responsibility, and imports that break the layer ordering are forbidden
(enforced by tests/test_architecture.py).
| Layer | Package | Role |
|---|---|---|
| REST API | api/ | Thin HTTP handlers, one library call each |
| Recipes | recipes/ | Domain workflows: CRISPR, Tn-seq, barcodes |
| Operations | operations/ | Protocols + implementations (Map, Annotate, Filter, Score, Discover, Quantify) |
| Transform | transform/ | Track → Track pure functions |
| Compare | compare/ | Track × Track → Track, protocol-typed |
| Primitives | primitives/ | Pure functions, zero deps, no I/O |
| I/O | io/ | Genome, track, reads loading + export |
Layers form a strict linear hierarchy. Each layer may only import from
layers below it in rank. The ordering is defined in
tests/layer_hierarchy.yaml and enforced generically by
tests/test_architecture.py.
Import rules¶
| Rank | Layer | Can import from |
|---|---|---|
| 8 | API | recipes, operations, compare, transform, track, I/O |
| 7 | Recipes | operations, compare, transform, track, primitives, I/O |
| 6 | Compare | operations, transform, track, primitives |
| 5 | Operations | transform, track, primitives, I/O |
| 4 | Transform | track, primitives |
| 3 | Track | region, primitives |
| 2 | Primitives | region |
| 1 | Region | (nothing) |
Skip-layer rules add further constraints (e.g. API must not import Primitives). These are also declared in the YAML and enforced by the same test.
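As an illustration of how such a rule can be checked mechanically, here is a minimal sketch that parses a module's source with `ast` and reports imports of layers it is not allowed to use. It is not the project's test: the `ALLOWED` mapping is an abbreviated, hand-copied subset of the table above, and the package root `seqchain`, the function `forbidden_imports`, and the imported name `gc_content` are assumptions made for the example.

```python
import ast

# Hypothetical, abbreviated copy of the import rules. The real project
# declares its ordering in tests/layer_hierarchy.yaml.
ALLOWED = {
    "api": {"recipes", "operations", "compare", "transform", "track", "io"},
    "transform": {"track", "primitives"},
    "primitives": {"region"},
    "region": set(),
}


def forbidden_imports(layer: str, source: str, root: str = "seqchain") -> list[str]:
    """Return the layers imported by `source` that `layer` may not import."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]
        else:
            continue
        for module in modules:
            parts = module.split(".")
            # Only absolute imports of the form `<root>.<layer>...` are checked.
            if len(parts) < 2 or parts[0] != root:
                continue
            target = parts[1]
            if target != layer and target not in ALLOWED[layer]:
                violations.append(target)
    return violations


# API importing Primitives is a skip-layer violation in the table above.
print(forbidden_imports("api", "from seqchain.primitives import gc_content"))
# Transform importing Track is allowed.
print(forbidden_imports("transform", "import seqchain.track.interval"))
```

A real test would presumably walk every file in each package and load the rules from the YAML file rather than hard-coding them.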
Universal types¶
Region¶
Every pipeline step takes Regions and returns Regions. A Region is a frozen
dataclass:

    from dataclasses import dataclass, field

    @dataclass(frozen=True, slots=True)
    class Region:
        chrom: str
        start: int
        end: int
        strand: str = "."
        score: float = 0.0
        name: str = ""
        tags: dict = field(default_factory=dict)

The tags dict is the extension point: operations enrich it with
spacer, gene, fold_change, chromatin_states, etc. There are no separate
Hit/AnnotatedHit/ScoredHit types.
Track¶
Track is a Protocol (structural typing, not inheritance). Any object
with the right methods satisfies it:
- signal_at(chrom, start, end): summary value for a region
- regions(chrom, start, end): iterate overlapping Regions
- map_scores(fn): apply a function to every score
- filter_entries(fn): keep entries matching a predicate
- scores(): iterate all scores
- with_scores(scores): replace scores
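The method list above could be written as a `typing.Protocol` roughly as follows. This is a sketch under assumptions: the parameter and return annotations are guesses, and `ListTrack` is a hypothetical toy class, not one of the library's implementations. The `isinstance` check is there only to demonstrate structural matching; with `runtime_checkable` it verifies that the methods exist, not their signatures.

```python
from typing import Callable, Iterable, Iterator, Protocol, runtime_checkable


@runtime_checkable
class Track(Protocol):
    def signal_at(self, chrom: str, start: int, end: int) -> float: ...
    def regions(self, chrom: str, start: int, end: int) -> Iterator: ...
    def map_scores(self, fn: Callable[[float], float]) -> "Track": ...
    def filter_entries(self, fn: Callable) -> "Track": ...
    def scores(self) -> Iterator[float]: ...
    def with_scores(self, scores: Iterable[float]) -> "Track": ...


class ListTrack:
    """Toy track (hypothetical): no base class, yet it satisfies Track."""

    def __init__(self, values):
        self._values = list(values)

    def signal_at(self, chrom, start, end):
        # Ignores coordinates and returns the mean of all values.
        return sum(self._values) / len(self._values) if self._values else 0.0

    def regions(self, chrom, start, end):
        return iter(())  # this toy track has no interval entries

    def map_scores(self, fn):
        return ListTrack(fn(v) for v in self._values)

    def filter_entries(self, fn):
        return ListTrack(v for v in self._values if fn(v))

    def scores(self):
        return iter(self._values)

    def with_scores(self, scores):
        return ListTrack(scores)


track = ListTrack([1.0, 2.0, 3.0])
print(isinstance(track, Track))  # structural match, no inheritance involved
doubled = track.map_scores(lambda s: s * 2)
```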
Three implementations:
| Type | Use case | signal_at | regions |
|---|---|---|---|
| IntervalTrack | Peaks, genes, guides | Overlap-weighted average | Segment-tree query |
| SignalTrack | BigWig per-base data | Mean signal | Always empty |
| TableTrack | Barcode counts | Always 0.0 | Always empty |
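To make "overlap-weighted average" concrete, the sketch below weights each interval's score by the number of bases it shares with the query window. The exact weighting used by IntervalTrack is not specified here, so treat this as one plausible reading. Half-open coordinates, the 0.0 result for an empty overlap, and the linear scan (in place of a segment-tree query) are all assumptions.

```python
def overlap_weighted_average(intervals, start, end):
    """Average the scores of (start, end, score) intervals, weighting each
    by the number of bases it shares with the half-open window [start, end)."""
    weighted_sum = 0.0
    total_overlap = 0
    for i_start, i_end, score in intervals:
        overlap = min(end, i_end) - max(start, i_start)
        if overlap > 0:
            weighted_sum += score * overlap
            total_overlap += overlap
    # Assumption: a window that touches no interval yields 0.0.
    return weighted_sum / total_overlap if total_overlap else 0.0


peaks = [(0, 10, 1.0), (10, 30, 4.0), (50, 60, 9.0)]
# Window [5, 20) covers 5 bases scored 1.0 and 10 bases scored 4.0.
print(overlap_weighted_average(peaks, 5, 20))  # (5*1.0 + 10*4.0) / 15 = 3.0
```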
The three processing layers¶
Transform: Track → Track¶
Thirteen pure functions organized by family:
- missing: fill_missing, replace_zeros, drop_missing, drop_zeros
- scale: add_pseudocount, log_transform, clamp, normalize, standardize, rank
- threshold: binarize, filter_by_score
- bin: bin_summarize (SignalTrack → IntervalTrack)
No protocols, no isinstance. Every function calls Track primitives
(map_scores, filter_entries, with_scores) and returns a new Track.
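A minimal sketch of this pattern, assuming a toy track that stores only scores: each transform is a one-line call to a Track primitive and returns a new track. `ScoreTrack` is hypothetical, and the parameter names and defaults (`pseudocount`, `threshold`, base-2 logarithm) are illustrative guesses rather than the library's signatures.

```python
import math


class ScoreTrack:
    """Toy immutable track (hypothetical) exposing the Track primitives."""

    def __init__(self, scores):
        self._scores = tuple(scores)

    def scores(self):
        return iter(self._scores)

    def with_scores(self, scores):
        return ScoreTrack(scores)

    def map_scores(self, fn):
        return self.with_scores(fn(s) for s in self._scores)

    def filter_entries(self, fn):
        # In this toy track an "entry" is just a score.
        return ScoreTrack(s for s in self._scores if fn(s))


def add_pseudocount(track, pseudocount=1.0):
    return track.map_scores(lambda s: s + pseudocount)


def log_transform(track):
    return track.map_scores(math.log2)


def binarize(track, threshold):
    return track.map_scores(lambda s: 1.0 if s >= threshold else 0.0)


def drop_zeros(track):
    return track.filter_entries(lambda s: s != 0.0)


raw = ScoreTrack([0.0, 1.0, 3.0, 7.0])
logged = log_transform(add_pseudocount(raw))  # log2(score + 1)
```

None of the functions inspects the concrete track type, so the same code would work for any object that provides the three primitives.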
Compare: Track × Track → Track¶
Differential analysis via module-level functions. interval_fold_change()
aligns two tracks by region name or binned overlap and computes raw fold
change.
The alignment primitive (align_tracks) pairs regions across two tracks
and reports which track each region appears in. Comparison functions
consume aligned pairs.
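The align-then-compare flow can be sketched with plain dictionaries keyed by region name. This is not the library's `align_tracks` or `interval_fold_change()`: the function names, the `"both"`/`"a"`/`"b"` labels, the direction of the ratio (second track over first), and the choice to skip zero denominators are all assumptions, and binned-overlap alignment is not covered.

```python
def align_by_name(track_a, track_b):
    """Pair scores across two {name: score} mappings.

    Yields (name, score_a, score_b, source), where source is "both", "a",
    or "b", and a missing score is None.
    """
    for name in sorted(track_a.keys() | track_b.keys()):
        in_a, in_b = name in track_a, name in track_b
        source = "both" if in_a and in_b else ("a" if in_a else "b")
        yield name, track_a.get(name), track_b.get(name), source


def fold_change(track_a, track_b):
    """Raw fold change (b / a) for names present in both tracks."""
    return {
        name: b / a
        for name, a, b, source in align_by_name(track_a, track_b)
        if source == "both" and a != 0  # assumption: skip undefined ratios
    }


control = {"geneA": 10.0, "geneB": 4.0, "geneC": 2.0}
treated = {"geneA": 20.0, "geneB": 1.0, "geneD": 5.0}
print(fold_change(control, treated))  # {'geneA': 2.0, 'geneB': 0.25}
```

Keeping alignment separate from the arithmetic means other comparisons can reuse the same aligned pairs.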
Operations: raw data → Tracks¶
Module-level functions and protocols:
| Category | Functions | What it does |
|---|---|---|
| Map | regex_map(), bowtie_map() | Find patterns in sequences |
| Annotate | annotate_features(), annotate_chromatin() | Enrich Regions with context |
| Quantify | count_barcodes(), summarize_features() | Count barcodes or summarize features |
| Discover | heuristic_discover(), tnseq_discover() | Auto-detect read structure |
| Score | score_mismatches(), score_off_targets() | Score variant effects |
| Filter | filter_by_overlap() | Filter by geometric overlap (recipe) |
Design principles¶
- Prefer functions over classes: Most operations are module-level functions. No remaining protocols in operations.
- No inheritance: Protocols (structural typing) only.
- Immutability: Frozen dataclasses, dataclasses.replace() for updates.
- Sequences are strings: No Bio.Seq. Biopython is I/O only.
- Chromatin states are mark combinations: No gene names. Gene features are a separate layer.