Data Flow: FASTQ to Gene-Level Differential¶

This document maps the complete data transformation pipeline from raw sequencing reads to gene-level chromatin annotations. Every type boundary, every intermediate representation, and every function that bridges them is documented here.

Overview¶

FASTQ files (disk)
    │
    ▼
ProcessPipeline.run()           recipes/process.py
    │
    ├──► BAM (disk)              samtools-sorted, deduplicated
    ├──► BigWig (disk)           RPKM-normalized signal
    ├──► narrowPeak (disk)       called peaks
    └──► counts.txt (disk)       gene-level read counts
    │
    ▼
I/O loaders                     io/tracks.py
    │
    ├──► SignalTrack             from BigWig — per-base continuous signal
    ├──► IntervalTrack           from narrowPeak — discrete peak regions
    └──► TableTrack              from counts — gene-name → count mapping
    │
    ▼
Transform layer                 transform/
    │
    ├──► bin_summarize()         SignalTrack → IntervalTrack (200bp bins)
    ├──► binarize()              scores → 0/1 (mark present/absent)
    ├──► normalize()             CPM, fraction, median-ratio
    └──► log_transform()         log2 for fold-change interpretation
    │
    ▼
Compare layer                   compare/
    │
    └──► interval_fold_change()   Track × Track → IntervalTrack
             per-window fold_change, log2fc tags
    │
    ▼
Chromatin annotation            operations/annotate/chromatin.py
    │
    └──► annotate_chromatin()    multiple binned tracks → state-labeled IntervalTrack
             e.g. "open_active", "closed"
    │
    ▼
Gene-level aggregation          operations/quantify/summarize.py
    │
    └──► summarize_features()    Track × gene features → IntervalTrack
             gene Regions enriched with signal as score
    │
    ▼
Gene-level comparison           compare/fold_change.py
    │
    └──► interval_fold_change()   IntervalTrack × IntervalTrack → IntervalTrack
             (mode="exact")      gene Regions tagged with fold_change, log2fc
    │
    ▼
Chromatin transition comparison operations/annotate/chromatin.py
    │
    └──► annotate_chromatin()    two binned IntervalTracks as "marks"
             (2-condition)       → transition-state-labeled IntervalTrack
                                 stable_occupied, nucleosome_lost, etc.
    │
    ▼
Track export                    io/tracks.py
    │
    ├──► write_json()            IntervalTrack → JSON (browser tracks)
    ├──► write_bed()             IntervalTrack → BED6
    └──► write_wig()             IntervalTrack → variableStep WIG

Stage 1: Processing (FASTQ → disk artifacts)¶

Module: seqchain.recipes.process.ProcessPipeline

Input types:

Parameter	Type	Example
`reads1`	`str` (file path)	`SRR6495880.fastq.gz`
`reads2`	`str \\| None`	`None` (single-end)
`reference`	`str` (file path)	`sacCer3.fa`
`genome_size`	`str`	`"12e6"` (yeast)
`preset`	`ProcessPreset`	loaded from `chipseq.yaml`

Output type: ProcessResult (frozen dataclass)

Field	Type	Content
`bam`	`Path`	Deduplicated, coordinate-sorted BAM
`flagstat`	`dict[str, int]`	`{"total": 11228795, "mapped": 11228795, ...}`
`signal_track`	`SignalTrack \\| None`	Loaded from BigWig if signal enabled
`peak_track`	`IntervalTrack \\| None`	Loaded from narrowPeak if peaks enabled
`count_track`	`TableTrack \\| None`	Loaded from featureCounts if counts enabled
`bigwig_path`	`Path \\| None`	Disk path to BigWig
`peaks_path`	`Path \\| None`	Disk path to narrowPeak
`counts_path`	`Path \\| None`	Disk path to counts

Pipeline steps (controlled by YAML preset):

reads.fastq.gz
    │
    ▼  bowtie2 --very-sensitive -U reads.fq.gz | samtools sort
sample.sorted.bam
    │
    ▼  samtools sort -n → fixmate → sort → markdup → index
sample.dedup.bam + sample.dedup.bam.bai
    │
    ▼  samtools flagstat
sample.flagstat                     dict[str, int]
    │
    ├──► (optional) samtools view -h | awk (fragment size filter)
    │    sample.filtered.bam        only for MNase-Seq (120-180bp)
    │
    ├──► macs3 callpeak --nomodel --extsize 147
    │    sample_peaks.narrowPeak    BED6+4 format
    │
    ├──► bamCoverage --normalizeUsing RPKM --binSize 10
    │    sample.bw                  BigWig binary format
    │
    └──► featureCounts -t exon      only for RNA-Seq
         sample.counts.txt          TSV: gene_id → count

Presets (YAML, organism-agnostic):

Preset	Peaks	Signal	Counts	Fragment filter	Notes
`chipseq`	yes (`--nomodel`)	yes (RPKM)	no	no	SE histone/TF
`mnase`	no	yes (RPKM, `--MNase`)	no	yes (120-180bp)	Mono-nucleosomal
`rnaseq`	no	yes (RPKM)	yes (exon)	no	Requires GTF

Resumability: Every step checks if its output file exists before running. Re-running the pipeline skips completed steps.

Stage 2: Loading (disk → Track objects)¶

Module: seqchain.io.tracks

Three loaders, three Track types:

BigWig → SignalTrack¶

from seqchain.io.tracks import load_bigwig

signal = load_bigwig("sample.bw")
# Returns: SignalTrack backed by pyBigWig file handle

SignalTrack API:

Method	Signature	Returns
`signal_at`	`(chrom, start, end) → float`	Mean signal in window
`raw_signal`	`(chrom, start, end) → list[float]`	Per-base values
`has_chrom`	`(chrom) → bool`	Chromosome presence check
`regions`	`(chrom, start, end) → Iterator[Region]`	Always empty (not interval-based)

SignalTrack is read-only continuous signal. It does not support map_scores(), filter_entries(), scores(), or with_scores() — these raise NotImplementedError with the hint: "Use bin_summarize() to convert to IntervalTrack first."

narrowPeak → IntervalTrack¶

from seqchain.io.tracks import load_narrowpeak

peaks = load_narrowpeak("sample_peaks.narrowPeak")
# Returns: IntervalTrack of Region objects (segment-tree indexed)

IntervalTrack API:

Method	Signature	Returns
`signal_at`	`(chrom, start, end) → float`	Overlap-weighted average of region scores
`regions`	`(chrom, start, end) → Iterator[Region]`	All overlapping Regions
`map_scores`	`(fn) → IntervalTrack`	New track with transformed scores
`filter_entries`	`(fn) → IntervalTrack`	New track with filtered regions
`scores`	`() → Iterator[float]`	All region scores
`with_scores`	`(scores) → IntervalTrack`	Replace all scores
`__len__`	`() → int`	Number of regions

Each Region in the IntervalTrack has:

Region(
    chrom="NC_001133.9",    # chromosome
    start=1000,             # 0-based inclusive
    end=1500,               # 0-based exclusive
    score=45.2,             # peak score (from narrowPeak column 7)
    name="SRR6495880_peak_1",
    tags={"signal_value": 12.5, "p_value": 8.3, "q_value": 6.1},
)

featureCounts → TableTrack¶

# Loaded internally by ProcessPipeline._load_featurecounts()
count_track = TableTrack("counts", {"YAL001C": 1523.0, "YAL002W": 87.0, ...})

TableTrack API:

Method	Signature	Returns
`signal_at`	`(chrom, start, end) → float`	Always 0.0 (non-geographic)
`regions`	`(chrom, start, end) → Iterator[Region]`	Always empty
`map_scores`	`(fn) → TableTrack`	New track with transformed values
`filter_entries`	`(fn) → TableTrack`	New track with filtered entries
`scores`	`() → Iterator[float]`	All count values
`with_scores`	`(scores) → TableTrack`	Replace all values
`get`	`(key) → float`	Lookup by gene name
`keys`	`() → Iterator[str]`	All gene names

Stage 3: Transform (Track → Track)¶

Module: seqchain.transform

All functions are pure: they take a Track, return a new Track. The original is never modified.

Critical path: BigWig → binned intervals¶

from seqchain.transform import bin_summarize

# chrom_sizes can be extracted from the BigWig itself:
chrom_sizes = signal_track.chrom_sizes()
# {"NC_001133.9": 230218, "NC_001134.8": 813184, ...}

binned = bin_summarize(signal_track, bin_size=200, chrom_sizes=chrom_sizes)
# Returns: IntervalTrack with one Region per 200bp window
# ~60,794 Regions for sacCer3 (12.16 Mb / 200 bp)

bin_summarize is the bridge between continuous signal and discrete intervals. After this conversion, all IntervalTrack operations work.

Method	Works on	Description
`"mean"`	Any Track	`signal_at()` per bin (default)
`"max"`	SignalTrack only	Max of `raw_signal()` per bin
`"min"`	SignalTrack only	Min of `raw_signal()` per bin
`"sum"`	SignalTrack only	Sum of `raw_signal()` per bin

Normalization and scaling¶

from seqchain.transform import (
    add_pseudocount,
    log_transform,
    normalize,
    clamp,
    standardize,
    rank,
)

# Add pseudocount before fold-change (avoid 0/0)
track = add_pseudocount(track, value=1.0)

# Log2 transform
track = log_transform(track, base=2)

# CPM normalization
track = normalize(track, method="cpm")

# Clip outliers
track = clamp(track, floor=0.0, ceiling=100.0)

# Z-score
track = standardize(track)

# Rank (ties averaged)
track = rank(track)

Thresholding¶

from seqchain.transform import binarize, filter_by_score

# Binary: present (>= cutoff) / absent
binary = binarize(track, cutoff=2.0)
# All scores become 1.0 or 0.0

# Filter: keep only high-scoring regions
filtered = filter_by_score(track, min_score=5.0)

Missing data¶

from seqchain.transform import fill_missing, drop_zeros

# Replace NaN with 0
track = fill_missing(track, value=0.0)

# Remove zero-signal bins
track = drop_zeros(track)

Stage 4: Compare (Track x Track → Track)¶

Module: seqchain.compare

Alignment¶

Before comparing, two tracks must be aligned — their regions paired up so each window has a value from both tracks. Three alignment modes are auto-detected by track type:

Input types	Alignment method
IntervalTrack x IntervalTrack	Match by exact coordinates or overlap
TableTrack x TableTrack	Key join on gene names
SignalTrack x SignalTrack	Fixed-size window walk across genome

from seqchain.compare.align import align_tracks, AlignedPair

for pair in align_tracks(ctrl_track, treat_track,
                          bin_size=200, chrom_sizes=chrom_sizes):
    # pair.region  — Region with window coordinates
    # pair.value_a — signal from control
    # pair.value_b — signal from treatment
    # pair.in_a    — True if window had data in control
    # pair.in_b    — True if window had data in treatment

Fold-change (coordinate-aligned)¶

from seqchain.compare.fold_change import interval_fold_change

fc_track = interval_fold_change(
    ctrl_signal,            # SignalTrack (Vehicle)
    treat_signal,           # SignalTrack (Rapamycin)
    label_a="Vehicle",
    label_b="Rapamycin",
    bin_size=200,
    chrom_sizes=chrom_sizes,
)

Output: IntervalTrack where each Region has these tags:

Tag	Type	Description
`fold_change`	`float`	`treatment / control`
`log2fc`	`float`	`log2(fold_change)`, NaN if undefined
`mean_a`	`float`	Control signal in window
`mean_b`	`float`	Treatment signal in window
`label_a`	`str`	`"Vehicle"`
`label_b`	`str`	`"Rapamycin"`
`in_a`	`bool`	Data present in control
`in_b`	`bool`	Data present in treatment

Edge cases: 0/0 → NaN, x/0 → inf, 0/x → 0.0.

Fold-change (key-aligned)¶

For TableTrack comparison (gene-level, barcode-level), use table_fold_change() — same arithmetic, different I/O shape:

from seqchain.compare.table_fold_change import table_fold_change

fc_table = table_fold_change(
    ctrl_gene_signal,       # TableTrack (per-gene signal, Vehicle)
    treat_gene_signal,      # TableTrack (per-gene signal, Rapamycin)
    label_a="Vehicle",
    label_b="Rapamycin",
)
# Returns: TableTrack mapping gene name → fold-change value
fc_table.get("YAL001C")    # → 0.85 (gene lost H3 on Esa1 depletion)

interval_fold_change() and table_fold_change share the same _fold_change / _log2fc math, but differ in alignment: coordinate-aligned vs key-aligned. Both use align_tracks() for pairing.

Real-data validation¶

From examples/process_chipseq.py (Esa1AA Vehicle vs Rapamycin, H3):

60,794 windows (200 bp bins across sacCer3)
Median fold change: 0.951
Median log2FC: -0.071
Windows with |log2FC| > 1 up:    1,079
Windows with |log2FC| > 1 down:  7,271

Global H3 loss on Esa1 depletion (expected: Esa1 is a histone acetyltransferase whose depletion destabilizes nucleosomes).

Stage 5: Chromatin annotation (multiple tracks → state labels)¶

Module: seqchain.operations.annotate.chromatin

annotate_chromatin() takes multiple binned tracks (one per ChIP target or accessibility assay) and assigns a chromatin state label to each genomic window.

Configuration (YAML)¶

# presets/yeast_chromatin.yaml
type: chromatin
name: "Yeast 7-state chromatin"
bin_size: 200
marks:
  ATAC:
    threshold: 2.0
  H3K4me3:
    threshold: 1.5
  H3K36me3:
    threshold: 1.5
states:
  - name: open_active
    requires: {ATAC: true, H3K4me3: true, H3K36me3: false}
  - name: open_full
    requires: {ATAC: true, H3K4me3: true, H3K36me3: true}
  - name: accessible_only
    requires: {ATAC: true, H3K4me3: false, H3K36me3: false}
  - name: k4_k36
    requires: {ATAC: false, H3K4me3: true, H3K36me3: true}
  - name: k36_only
    requires: {ATAC: false, H3K4me3: false, H3K36me3: true}
  - name: k4_only
    requires: {ATAC: false, H3K4me3: true, H3K36me3: false}
  - name: closed
    requires: {ATAC: false, H3K4me3: false, H3K36me3: false}

Data flow¶

from seqchain.operations.annotate.chromatin import annotate_chromatin
from seqchain.recipes import load_chromatin_config

config = load_chromatin_config("yeast_chromatin")

chromatin_track = annotate_chromatin(
    config,
    tracks={
        "ATAC": binned_atac,        # IntervalTrack (from bin_summarize)
        "H3K4me3": binned_h3k4me3,  # IntervalTrack
        "H3K36me3": binned_h3k36me3,# IntervalTrack
    },
    chroms=chrom_sizes,
)

Process per 200bp window:

Query signal_at() on each track for the window
Threshold: signal >= mark threshold → present, else absent
Match presence pattern against states in priority order (first match wins)
Merge adjacent same-state windows on the same chromosome

Output: IntervalTrack of state-labeled Regions:

Region("NC_001133.9", 0, 2000, name="closed")
Region("NC_001133.9", 2000, 4400, name="open_active")
Region("NC_001133.9", 4400, 5000, name="k4_only")

Adapting to PRJNA430229¶

The demo config uses ATAC/H3K4me3/H3K36me3. PRJNA430229 has different marks. A custom config would use:

PRJNA430229 target	Chromatin meaning
H3	Total histone occupancy (nucleosome positioning)
H3K9ac	Active promoter / enhancer mark
H4ac	Active chromatin mark
RNAPII	Transcription (proxy for accessibility)

Stage 6: Gene-level aggregation¶

Module: seqchain.operations.quantify.summarize

summarize_features()¶

Pure function: Track × Iterable[Region] → IntervalTrack. For each feature Region, calls track.signal_at(chrom, start, end) and returns a copy of the Region with its score set to the signal value. All other fields (chrom, start, end, strand, name, tags) are preserved.

Data stays in the Region world. The output IntervalTrack is composable with every other operation: filter, annotate, score, compare.

from seqchain.operations.quantify.summarize import summarize_features

# gene features from GenBank
genome = load_genbank("sacCer3.gbff")
genes = genome.genes()

gene_track = summarize_features(binned_track, genes)
# Returns: IntervalTrack of gene Regions with signal as score
# gene_track._regions[0].name → "YAL001C"
# gene_track._regions[0].score → 512.3 (overlap-weighted avg RPKM)
# gene_track._regions[0].chrom → "NC_001133.9"  (coordinates preserved)

Behavior:

All features included (unnamed features get name="")
Duplicate names with different coordinates: both kept (Regions are distinguished by coordinates, not just name)
Signal value is IntervalTrack.signal_at() — overlap-weighted average of region scores within the feature's coordinates
Features on chromosomes with no data → score 0.0
Output is composable: map_scores(), filter_entries(), signal_at(), interval_fold_change() all work

Note on chromosome names: GenBank files use unversioned accessions (NC_001133) while FASTA/BigWig may use versioned ones (NC_001133.9). Features must be remapped to match the track's chromosome names before calling summarize_features(). The workbench (examples/process_chipseq.py) demonstrates this with a simple prefix-matching dict.

Stage 7: Gene-level comparison¶

Module: seqchain.compare.fold_change

After summarizing per-gene signal for two conditions, compute per-gene fold change using the same interval_fold_change() that handles window-level comparison. With mode="exact", it pairs gene Regions by identical coordinates:

from seqchain.compare.fold_change import interval_fold_change

fc_track = interval_fold_change(
    ctrl_gene_track,        # IntervalTrack (Vehicle, gene Regions)
    treat_gene_track,       # IntervalTrack (Rapamycin, gene Regions)
    label_a="Vehicle",
    label_b="Rapamycin",
    mode="exact",
)

# Output: IntervalTrack of gene Regions with fold_change/log2fc tags
# fc_track._regions[0].name → "YAL001C"
# fc_track._regions[0].tags["fold_change"] → 0.85
# fc_track._regions[0].chrom → "NC_001133.9"  (coordinates preserved)

Output: IntervalTrack of gene Regions tagged with fold_change, log2fc, mean_a, mean_b, label_a, label_b, in_a, in_b. Gene coordinates and names are preserved — data never leaves the Region world.

Edge cases: Same as window-level: 0/0 → NaN, x/0 → inf, 0/x → 0.0.

Note: table_fold_change() remains available for key-aligned comparison on TableTracks (e.g. barcode counting). For gene-level comparison, prefer interval_fold_change() since genes are Regions with coordinates.

Stage 8: Chromatin transition comparison¶

Module: seqchain.operations.annotate.chromatin (same annotate_chromatin())

annotate_chromatin() is generic — it classifies binary presence/absence patterns across any set of "marks." Conditions and marks are structurally identical: both are named IntervalTracks fed as a dict. Treating two conditions as two marks produces transition labels instead of state labels.

from seqchain.operations.annotate.chromatin import (
    annotate_chromatin, ChromatinConfig, TrackThreshold, StateDefinition,
)

config = ChromatinConfig(
    organism="S. cerevisiae",
    bin_size=200,
    thresholds=[
        TrackThreshold("Esa1AA_Veh_H3", threshold_veh),
        TrackThreshold("Esa1AA_Rapa_H3", threshold_rapa),
    ],
    states=[
        StateDefinition("stable_occupied",
            {"Esa1AA_Veh_H3": "present", "Esa1AA_Rapa_H3": "present"}),
        StateDefinition("nucleosome_lost",
            {"Esa1AA_Veh_H3": "present", "Esa1AA_Rapa_H3": "absent"}),
        StateDefinition("nucleosome_gained",
            {"Esa1AA_Veh_H3": "absent", "Esa1AA_Rapa_H3": "present"}),
        StateDefinition("stable_depleted",
            {"Esa1AA_Veh_H3": "absent", "Esa1AA_Rapa_H3": "absent"}),
    ],
    default_state="unclassified",
)

transitions = annotate_chromatin(
    config,
    tracks={"Esa1AA_Veh_H3": ctrl_binned, "Esa1AA_Rapa_H3": treat_binned},
    chroms=chrom_sizes,
)

Output: IntervalTrack of transition-state-labeled Regions:

Region("NC_001133.9", 0, 4400, name="stable_occupied")
Region("NC_001133.9", 4400, 4600, name="nucleosome_lost")
Region("NC_001133.9", 4600, 4800, name="stable_occupied")

Key insight: Zero new code. The same annotate_chromatin() function, the same binarize-then-classify logic, the same merge step. The only difference is the config: marks are condition labels, states are transition labels. The function doesn't know or care.

Stage 9: Track export¶

Module: seqchain.io.tracks

write_json()¶

from seqchain.io.tracks import write_json

write_json(binned_track, "signal.json")

Serializes any IntervalTrack to a JSON array of Region dicts. All tracks use the same schema:

{
  "chrom": "NC_001133.9",
  "start": 1806,
  "end": 2169,
  "strand": "-",
  "score": 1.826,
  "name": "YAL068C",
  "tags": {"gene": "PAU8", "fold_change": 1.826, "log2fc": 0.869}
}

Regions sorted by (chrom, start)
NaN and Inf values → null (JSON safety)
Tags sanitized recursively (floats and lists of floats)
Coordinates: 0-based, half-open [start, end)

write_bed() / write_wig()¶

from seqchain.io.tracks import write_bed, write_wig

write_bed(track, "output.bed")    # BED6: chrom, start, end, name, score, strand
write_wig(track, "output.wig")    # variableStep WIG: 1-based, integer scores

MANIFEST.md¶

The workbench (examples/process_chipseq.py) generates a machine-readable provenance manifest alongside the track files. Documents: sample processing stats, track inventory with region counts and resolution, full per-track transformation chains, track schema, and reproduction instructions.

Type transformation summary¶

From	To	Function	Module
FASTQ (disk)	BAM + BigWig + narrowPeak (disk)	`ProcessPipeline.run()`	`recipes/process`
BigWig (disk)	`SignalTrack`	`load_bigwig()`	`io/tracks`
narrowPeak (disk)	`IntervalTrack`	`load_narrowpeak()`	`io/tracks`
counts.txt (disk)	`TableTrack`	`_load_featurecounts()`	`recipes/process`
`SignalTrack`	`IntervalTrack` (binned)	`bin_summarize()`	`transform/bin`
`IntervalTrack`	`IntervalTrack` (scored)	`map_scores()`	`track.py`
`IntervalTrack`	`IntervalTrack` (filtered)	`filter_entries()`	`track.py`
`IntervalTrack`	`IntervalTrack` (binary)	`binarize()`	`transform/threshold`
`IntervalTrack`	`IntervalTrack` (scaled)	`normalize()`, `log_transform()`	`transform/scale`
`Track` x `Track`	`IntervalTrack` (fold-change)	`interval_fold_change()`	`compare/fold_change`
Multiple `IntervalTrack`	`IntervalTrack` (state-labeled)	`annotate_chromatin()`	`operations/annotate/chromatin`
`Track` + gene features	`IntervalTrack` (per-gene signal)	`summarize_features()`	`operations/quantify/summarize`
`IntervalTrack` x `IntervalTrack`	`IntervalTrack` (gene-level FC)	`interval_fold_change(mode="exact")`	`compare/fold_change`
`TableTrack` x `TableTrack`	`TableTrack` (key-aligned FC)	`table_fold_change()`	`compare/table_fold_change`
Two `IntervalTrack` (conditions)	`IntervalTrack` (transition-labeled)	`annotate_chromatin()`	`operations/annotate/chromatin`
`IntervalTrack`	JSON (disk)	`write_json()`	`io/tracks`
`IntervalTrack`	BED6 (disk)	`write_bed()`	`io/tracks`
`IntervalTrack`	WIG (disk)	`write_wig()`	`io/tracks`

Validated end-to-end paths¶

These paths have been run on real data from PRJNA430229:

Path 1: Process → Load (single sample)¶

SRR6495880.fastq.gz
    → ProcessPipeline(chipseq, genome_size="12e6")
    → SRR6495880.bw (5.7 MB)
    → load_bigwig() → SignalTrack

Validated: 11.2M reads, 100% mapping, 1001 peaks.

Path 2: Process → Load → Compare (two conditions)¶

SRR6495880 (Esa1AA_Veh_H3)     → BigWig → SignalTrack ─┐
                                                         ├─► interval_fold_change()
SRR6495881 (Esa1AA_Rapa_H3)    → BigWig → SignalTrack ─┘
                                                         │
                                              IntervalTrack (60,794 windows)
                                              tagged: fold_change, log2fc

Validated: median log2FC = -0.07, 7271 windows down, 1079 up.

Path 3: Process → Load → bin_summarize (SignalTrack → IntervalTrack)¶

SRR6495880.bw → load_bigwig() → SignalTrack
                                     │
                              bin_summarize(200bp)
                                     │
                              IntervalTrack (60,794 bins)
                              99.3% non-zero (H3 = core histone)
                              median RPKM: 505 (Veh), 459 (Rapa)

Validated. This is the bridge between continuous signal and the discrete interval world. All downstream operations (binarize, annotate_chromatin(), gene-level aggregation) consume IntervalTrack.

Path 4a: bin_summarize → binarize (score → 0/1)¶

IntervalTrack (60,794 bins, median RPKM ~505)
         │
   binarize(cutoff=median/4)
         │
IntervalTrack (60,794 bins, scores 0.0 or 1.0)
    Vehicle:   98.6% present, 1.4% absent (874 bins)
    Rapamycin: 96.7% present, 3.3% absent (2,023 bins)

Validated. Data-driven cutoff (median/4 ≈ 126 RPKM for Vehicle, 115 for Rapamycin). H3 is a core histone — nearly everything is present. Esa1 depletion doubles the number of absent bins (874 → 2,023), consistent with nucleosome destabilization.

Path 4b: bin_summarize → annotate_chromatin() (state labeling)¶

IntervalTrack (60,794 bins)
         │
   annotate_chromatin(
       config=2-state: nucleosome_occupied / nucleosome_depleted,
       threshold=median/4 RPKM
   )
         │
IntervalTrack (state-labeled, merged adjacent same-state)
    Vehicle:   403 regions (204 occupied, 199 depleted)
    Rapamycin: 1,286 regions (648 occupied, 638 depleted)

Validated. Key findings:

annotate_chromatin() binarizes tracks internally using binarize() from Transform, then delegates classification to classify_truth_table(). Thresholds can be explicit (per-track config) or computed at runtime via a threshold function. Transform sits below Operations in the layer hierarchy, so this import is architecturally legal.
Merging is the core value: 60,794 bins → 403 merged regions (Vehicle). Adjacent same-state windows collapse into contiguous domains.
Biological signal: Esa1 depletion creates 3x more regions (403 → 1,286) — the genome fragments into smaller domains as nucleosomes are lost in scattered locations.
Depleted bases: 173,612 bp (Vehicle) → 403,891 bp (Rapamycin). 2.3x more nucleosome-depleted genomic territory.

Path 5: bin_summarize → summarize_features → IntervalTrack (per-gene signal)¶

IntervalTrack (60,794 bins, RPKM scores)
         │
   summarize_features(track, genome.genes())
         │
IntervalTrack (6,477 gene Regions with signal as score)
    Coordinates, names, strand, tags all preserved
    Vehicle:   median 513 RPKM, 20 genes zero-signal
    Rapamycin: median 507 RPKM, 29 genes zero-signal

Validated. Key findings:

6,477 genes from sacCer3 GenBank (GCF_000146045.2)
Region in, Region out: gene Regions retain their coordinates, names, strand, and tags — only score is updated with signal
Chromosome name remapping: GenBank uses NC_001133, BigWig uses NC_001133.9. Simple prefix-match dict resolves this.
Top signal genes: rDNA-adjacent non-coding loci on chrXII (YNCL0026C, YLR154C-G) — expected H3 enrichment
Zero-signal genes: mitochondrial tRNAs (tH(GUG)Q, tA(UGC)Q) — minimal nucleosome occupancy on mtDNA
Composable output: result is a standard IntervalTrack — works with map_scores(), filter_entries(), interval_fold_change(), annotate_chromatin(), and all other operations

Path 6: summarize_features → interval_fold_change() (gene-level FC)¶

IntervalTrack (Vehicle, 6,477 gene Regions) ─┐
                                              ├─► interval_fold_change()
IntervalTrack (Rapamycin, 6,477 gene Regions)┘    (mode="exact")
                                              │
                              IntervalTrack (6,477 gene Regions)
                              tagged: fold_change, log2fc, mean_a, mean_b
                              Median FC: 1.006
                              871 genes up (FC > 1.5)
                              1,158 genes down (FC < 0.67)

Validated. Key findings:

Same function, different granularity: interval_fold_change() handles both 200bp window comparison (Stage 4) and gene-level comparison (Stage 7). mode="exact" pairs genes by identical coordinates instead of overlap.
Median FC = 1.006: global H3 near-constant (core histone)
Asymmetric: more genes lose H3 than gain it (1,158 down vs 871 up). Expected — Esa1 is a HAT, depletion destabilizes nucleosomes.
Top increased: mitochondrial genes (Q-prefix, tRNAs) — low baseline signal creates noisy ratios
Top decreased: mitochondrial tRNAs at FC=0.000 — had trace signal in Vehicle, none in Rapamycin
Data stays typed: gene Regions carry coordinates, names, and fold-change tags through the entire pipeline. No conversion to dict keys or text.

Path 7: bin_summarize × 2 → annotate_chromatin() (transition comparison)¶

IntervalTrack (Vehicle, 60,794 bins) ─┐
                                      ├─► annotate_chromatin()
IntervalTrack (Rapamycin, 60,794 bins)┘   (conditions as "marks")
                                      │
                              IntervalTrack (1,558 transition domains)
                              names: stable_occupied, nucleosome_lost,
                                     nucleosome_gained, stable_depleted

Validated. Key findings:

Zero new code: same annotate_chromatin() that does single-condition state calling. Config swaps mark names for condition names, states for transition labels.
271 kb nucleosome loss vs 41 kb gain (6.5:1 ratio). Esa1 depletion causes asymmetric nucleosome destabilization.
1,558 domains merged from 60,794 bins. 96.3% of the genome is stable_occupied (nucleosome present in both conditions).
Distributed across all 16 nuclear chromosomes (1.2-4.9% loss per chromosome), absent from mitochondrial DNA.

Path 8: All tracks → write_json (browser export)¶

8 IntervalTracks (signal, chromatin, gene_signal, fold_change, transitions)
         │
   write_json() per track
         │
8 JSON files + MANIFEST.md
    signal:              9.2 MB each (60,794 regions)
    chromatin:           64-206 KB (403-1,286 domains)
    gene_signal:         1.5 MB each (6,477 genes)
    gene_fold_change:    3.1 MB (6,477 genes)
    transitions:         244 KB (1,558 domains)

Validated. All 8 files use the same Region JSON schema. A track viewer that can render one can render all eight. NaN/Inf safely serialized as null.