Skip to content

Data Flow: FASTQ to Gene-Level Differential

This document maps the complete data transformation pipeline from raw sequencing reads to gene-level chromatin annotations. Every type boundary, every intermediate representation, and every function that bridges them is documented here.

Overview

FASTQ files (disk)
ProcessPipeline.run()           recipes/process.py
    ├──► BAM (disk)              samtools-sorted, deduplicated
    ├──► BigWig (disk)           RPKM-normalized signal
    ├──► narrowPeak (disk)       called peaks
    └──► counts.txt (disk)       gene-level read counts
I/O loaders                     io/tracks.py
    ├──► SignalTrack             from BigWig — per-base continuous signal
    ├──► IntervalTrack           from narrowPeak — discrete peak regions
    └──► TableTrack              from counts — gene-name → count mapping
Transform layer                 transform/
    ├──► bin_summarize()         SignalTrack → IntervalTrack (200bp bins)
    ├──► binarize()              scores → 0/1 (mark present/absent)
    ├──► normalize()             CPM, fraction, median-ratio
    └──► log_transform()         log2 for fold-change interpretation
Compare layer                   compare/
    └──► interval_fold_change()   Track × Track → IntervalTrack
             per-window fold_change, log2fc tags
Chromatin annotation            operations/annotate/chromatin.py
    └──► annotate_chromatin()    multiple binned tracks → state-labeled IntervalTrack
             e.g. "open_active", "closed"
Gene-level aggregation          operations/quantify/summarize.py
    └──► summarize_features()    Track × gene features → IntervalTrack
             gene Regions enriched with signal as score
Gene-level comparison           compare/fold_change.py
    └──► interval_fold_change()   IntervalTrack × IntervalTrack → IntervalTrack
             (mode="exact")      gene Regions tagged with fold_change, log2fc
Chromatin transition comparison operations/annotate/chromatin.py
    └──► annotate_chromatin()    two binned IntervalTracks as "marks"
             (2-condition)       → transition-state-labeled IntervalTrack
                                 stable_occupied, nucleosome_lost, etc.
Track export                    io/tracks.py
    ├──► write_json()            IntervalTrack → JSON (browser tracks)
    ├──► write_bed()             IntervalTrack → BED6
    └──► write_wig()             IntervalTrack → variableStep WIG

Stage 1: Processing (FASTQ → disk artifacts)

Module: seqchain.recipes.process.ProcessPipeline

Input types:

Parameter Type Example
reads1 str (file path) SRR6495880.fastq.gz
reads2 str \| None None (single-end)
reference str (file path) sacCer3.fa
genome_size str "12e6" (yeast)
preset ProcessPreset loaded from chipseq.yaml

Output type: ProcessResult (frozen dataclass)

Field Type Content
bam Path Deduplicated, coordinate-sorted BAM
flagstat dict[str, int] {"total": 11228795, "mapped": 11228795, ...}
signal_track SignalTrack \| None Loaded from BigWig if signal enabled
peak_track IntervalTrack \| None Loaded from narrowPeak if peaks enabled
count_track TableTrack \| None Loaded from featureCounts if counts enabled
bigwig_path Path \| None Disk path to BigWig
peaks_path Path \| None Disk path to narrowPeak
counts_path Path \| None Disk path to counts

Pipeline steps (controlled by YAML preset):

reads.fastq.gz
    ▼  bowtie2 --very-sensitive -U reads.fq.gz | samtools sort
sample.sorted.bam
    ▼  samtools sort -n → fixmate → sort → markdup → index
sample.dedup.bam + sample.dedup.bam.bai
    ▼  samtools flagstat
sample.flagstat                     dict[str, int]
    ├──► (optional) samtools view -h | awk (fragment size filter)
    │    sample.filtered.bam        only for MNase-Seq (120-180bp)
    ├──► macs3 callpeak --nomodel --extsize 147
    │    sample_peaks.narrowPeak    BED6+4 format
    ├──► bamCoverage --normalizeUsing RPKM --binSize 10
    │    sample.bw                  BigWig binary format
    └──► featureCounts -t exon      only for RNA-Seq
         sample.counts.txt          TSV: gene_id → count

Presets (YAML, organism-agnostic):

Preset Peaks Signal Counts Fragment filter Notes
chipseq yes (--nomodel) yes (RPKM) no no SE histone/TF
mnase no yes (RPKM, --MNase) no yes (120-180bp) Mono-nucleosomal
rnaseq no yes (RPKM) yes (exon) no Requires GTF

Resumability: Every step checks if its output file exists before running. Re-running the pipeline skips completed steps.


Stage 2: Loading (disk → Track objects)

Module: seqchain.io.tracks

Three loaders, three Track types:

BigWig → SignalTrack

from seqchain.io.tracks import load_bigwig

signal = load_bigwig("sample.bw")
# Returns: SignalTrack backed by pyBigWig file handle

SignalTrack API:

Method Signature Returns
signal_at (chrom, start, end) → float Mean signal in window
raw_signal (chrom, start, end) → list[float] Per-base values
has_chrom (chrom) → bool Chromosome presence check
regions (chrom, start, end) → Iterator[Region] Always empty (not interval-based)

SignalTrack is read-only continuous signal. It does not support map_scores(), filter_entries(), scores(), or with_scores() — these raise NotImplementedError with the hint: "Use bin_summarize() to convert to IntervalTrack first."

narrowPeak → IntervalTrack

from seqchain.io.tracks import load_narrowpeak

peaks = load_narrowpeak("sample_peaks.narrowPeak")
# Returns: IntervalTrack of Region objects (segment-tree indexed)

IntervalTrack API:

Method Signature Returns
signal_at (chrom, start, end) → float Overlap-weighted average of region scores
regions (chrom, start, end) → Iterator[Region] All overlapping Regions
map_scores (fn) → IntervalTrack New track with transformed scores
filter_entries (fn) → IntervalTrack New track with filtered regions
scores () → Iterator[float] All region scores
with_scores (scores) → IntervalTrack Replace all scores
__len__ () → int Number of regions

Each Region in the IntervalTrack has:

Region(
    chrom="NC_001133.9",    # chromosome
    start=1000,             # 0-based inclusive
    end=1500,               # 0-based exclusive
    score=45.2,             # peak score (from narrowPeak column 7)
    name="SRR6495880_peak_1",
    tags={"signal_value": 12.5, "p_value": 8.3, "q_value": 6.1},
)

featureCounts → TableTrack

# Loaded internally by ProcessPipeline._load_featurecounts()
count_track = TableTrack("counts", {"YAL001C": 1523.0, "YAL002W": 87.0, ...})

TableTrack API:

Method Signature Returns
signal_at (chrom, start, end) → float Always 0.0 (non-geographic)
regions (chrom, start, end) → Iterator[Region] Always empty
map_scores (fn) → TableTrack New track with transformed values
filter_entries (fn) → TableTrack New track with filtered entries
scores () → Iterator[float] All count values
with_scores (scores) → TableTrack Replace all values
get (key) → float Lookup by gene name
keys () → Iterator[str] All gene names

Stage 3: Transform (Track → Track)

Module: seqchain.transform

All functions are pure: they take a Track, return a new Track. The original is never modified.

Critical path: BigWig → binned intervals

from seqchain.transform import bin_summarize

# chrom_sizes can be extracted from the BigWig itself:
chrom_sizes = signal_track.chrom_sizes()
# {"NC_001133.9": 230218, "NC_001134.8": 813184, ...}

binned = bin_summarize(signal_track, bin_size=200, chrom_sizes=chrom_sizes)
# Returns: IntervalTrack with one Region per 200bp window
# ~60,794 Regions for sacCer3 (12.16 Mb / 200 bp)

bin_summarize is the bridge between continuous signal and discrete intervals. After this conversion, all IntervalTrack operations work.

Method Works on Description
"mean" Any Track signal_at() per bin (default)
"max" SignalTrack only Max of raw_signal() per bin
"min" SignalTrack only Min of raw_signal() per bin
"sum" SignalTrack only Sum of raw_signal() per bin

Normalization and scaling

from seqchain.transform import (
    add_pseudocount,
    log_transform,
    normalize,
    clamp,
    standardize,
    rank,
)

# Add pseudocount before fold-change (avoid 0/0)
track = add_pseudocount(track, value=1.0)

# Log2 transform
track = log_transform(track, base=2)

# CPM normalization
track = normalize(track, method="cpm")

# Clip outliers
track = clamp(track, floor=0.0, ceiling=100.0)

# Z-score
track = standardize(track)

# Rank (ties averaged)
track = rank(track)

Thresholding

from seqchain.transform import binarize, filter_by_score

# Binary: present (>= cutoff) / absent
binary = binarize(track, cutoff=2.0)
# All scores become 1.0 or 0.0

# Filter: keep only high-scoring regions
filtered = filter_by_score(track, min_score=5.0)

Missing data

from seqchain.transform import fill_missing, drop_zeros

# Replace NaN with 0
track = fill_missing(track, value=0.0)

# Remove zero-signal bins
track = drop_zeros(track)

Stage 4: Compare (Track x Track → Track)

Module: seqchain.compare

Alignment

Before comparing, two tracks must be aligned — their regions paired up so each window has a value from both tracks. Three alignment modes are auto-detected by track type:

Input types Alignment method
IntervalTrack x IntervalTrack Match by exact coordinates or overlap
TableTrack x TableTrack Key join on gene names
SignalTrack x SignalTrack Fixed-size window walk across genome
from seqchain.compare.align import align_tracks, AlignedPair

for pair in align_tracks(ctrl_track, treat_track,
                          bin_size=200, chrom_sizes=chrom_sizes):
    # pair.region  — Region with window coordinates
    # pair.value_a — signal from control
    # pair.value_b — signal from treatment
    # pair.in_a    — True if window had data in control
    # pair.in_b    — True if window had data in treatment

Fold-change (coordinate-aligned)

from seqchain.compare.fold_change import interval_fold_change

fc_track = interval_fold_change(
    ctrl_signal,            # SignalTrack (Vehicle)
    treat_signal,           # SignalTrack (Rapamycin)
    label_a="Vehicle",
    label_b="Rapamycin",
    bin_size=200,
    chrom_sizes=chrom_sizes,
)

Output: IntervalTrack where each Region has these tags:

Tag Type Description
fold_change float treatment / control
log2fc float log2(fold_change), NaN if undefined
mean_a float Control signal in window
mean_b float Treatment signal in window
label_a str "Vehicle"
label_b str "Rapamycin"
in_a bool Data present in control
in_b bool Data present in treatment

Edge cases: 0/0 → NaN, x/0 → inf, 0/x → 0.0.

Fold-change (key-aligned)

For TableTrack comparison (gene-level, barcode-level), use table_fold_change() — same arithmetic, different I/O shape:

from seqchain.compare.table_fold_change import table_fold_change

fc_table = table_fold_change(
    ctrl_gene_signal,       # TableTrack (per-gene signal, Vehicle)
    treat_gene_signal,      # TableTrack (per-gene signal, Rapamycin)
    label_a="Vehicle",
    label_b="Rapamycin",
)
# Returns: TableTrack mapping gene name → fold-change value
fc_table.get("YAL001C")    # → 0.85 (gene lost H3 on Esa1 depletion)

interval_fold_change() and table_fold_change share the same _fold_change / _log2fc math, but differ in alignment: coordinate-aligned vs key-aligned. Both use align_tracks() for pairing.

Real-data validation

From examples/process_chipseq.py (Esa1AA Vehicle vs Rapamycin, H3):

60,794 windows (200 bp bins across sacCer3)
Median fold change: 0.951
Median log2FC: -0.071
Windows with |log2FC| > 1 up:    1,079
Windows with |log2FC| > 1 down:  7,271

Global H3 loss on Esa1 depletion (expected: Esa1 is a histone acetyltransferase whose depletion destabilizes nucleosomes).


Stage 5: Chromatin annotation (multiple tracks → state labels)

Module: seqchain.operations.annotate.chromatin

annotate_chromatin() takes multiple binned tracks (one per ChIP target or accessibility assay) and assigns a chromatin state label to each genomic window.

Configuration (YAML)

# presets/yeast_chromatin.yaml
type: chromatin
name: "Yeast 7-state chromatin"
bin_size: 200
marks:
  ATAC:
    threshold: 2.0
  H3K4me3:
    threshold: 1.5
  H3K36me3:
    threshold: 1.5
states:
  - name: open_active
    requires: {ATAC: true, H3K4me3: true, H3K36me3: false}
  - name: open_full
    requires: {ATAC: true, H3K4me3: true, H3K36me3: true}
  - name: accessible_only
    requires: {ATAC: true, H3K4me3: false, H3K36me3: false}
  - name: k4_k36
    requires: {ATAC: false, H3K4me3: true, H3K36me3: true}
  - name: k36_only
    requires: {ATAC: false, H3K4me3: false, H3K36me3: true}
  - name: k4_only
    requires: {ATAC: false, H3K4me3: true, H3K36me3: false}
  - name: closed
    requires: {ATAC: false, H3K4me3: false, H3K36me3: false}

Data flow

from seqchain.operations.annotate.chromatin import annotate_chromatin
from seqchain.recipes import load_chromatin_config

config = load_chromatin_config("yeast_chromatin")

chromatin_track = annotate_chromatin(
    config,
    tracks={
        "ATAC": binned_atac,        # IntervalTrack (from bin_summarize)
        "H3K4me3": binned_h3k4me3,  # IntervalTrack
        "H3K36me3": binned_h3k36me3,# IntervalTrack
    },
    chroms=chrom_sizes,
)

Process per 200bp window:

  1. Query signal_at() on each track for the window
  2. Threshold: signal >= mark threshold → present, else absent
  3. Match presence pattern against states in priority order (first match wins)
  4. Merge adjacent same-state windows on the same chromosome

Output: IntervalTrack of state-labeled Regions:

Region("NC_001133.9", 0, 2000, name="closed")
Region("NC_001133.9", 2000, 4400, name="open_active")
Region("NC_001133.9", 4400, 5000, name="k4_only")

Adapting to PRJNA430229

The demo config uses ATAC/H3K4me3/H3K36me3. PRJNA430229 has different marks. A custom config would use:

PRJNA430229 target Chromatin meaning
H3 Total histone occupancy (nucleosome positioning)
H3K9ac Active promoter / enhancer mark
H4ac Active chromatin mark
RNAPII Transcription (proxy for accessibility)

Stage 6: Gene-level aggregation

Module: seqchain.operations.quantify.summarize

summarize_features()

Pure function: Track × Iterable[Region] → IntervalTrack. For each feature Region, calls track.signal_at(chrom, start, end) and returns a copy of the Region with its score set to the signal value. All other fields (chrom, start, end, strand, name, tags) are preserved.

Data stays in the Region world. The output IntervalTrack is composable with every other operation: filter, annotate, score, compare.

from seqchain.operations.quantify.summarize import summarize_features

# gene features from GenBank
genome = load_genbank("sacCer3.gbff")
genes = genome.genes()

gene_track = summarize_features(binned_track, genes)
# Returns: IntervalTrack of gene Regions with signal as score
# gene_track._regions[0].name → "YAL001C"
# gene_track._regions[0].score → 512.3 (overlap-weighted avg RPKM)
# gene_track._regions[0].chrom → "NC_001133.9"  (coordinates preserved)

Behavior:

  • All features included (unnamed features get name="")
  • Duplicate names with different coordinates: both kept (Regions are distinguished by coordinates, not just name)
  • Signal value is IntervalTrack.signal_at() — overlap-weighted average of region scores within the feature's coordinates
  • Features on chromosomes with no data → score 0.0
  • Output is composable: map_scores(), filter_entries(), signal_at(), interval_fold_change() all work

Note on chromosome names: GenBank files use unversioned accessions (NC_001133) while FASTA/BigWig may use versioned ones (NC_001133.9). Features must be remapped to match the track's chromosome names before calling summarize_features(). The workbench (examples/process_chipseq.py) demonstrates this with a simple prefix-matching dict.


Stage 7: Gene-level comparison

Module: seqchain.compare.fold_change

After summarizing per-gene signal for two conditions, compute per-gene fold change using the same interval_fold_change() that handles window-level comparison. With mode="exact", it pairs gene Regions by identical coordinates:

from seqchain.compare.fold_change import interval_fold_change

fc_track = interval_fold_change(
    ctrl_gene_track,        # IntervalTrack (Vehicle, gene Regions)
    treat_gene_track,       # IntervalTrack (Rapamycin, gene Regions)
    label_a="Vehicle",
    label_b="Rapamycin",
    mode="exact",
)

# Output: IntervalTrack of gene Regions with fold_change/log2fc tags
# fc_track._regions[0].name → "YAL001C"
# fc_track._regions[0].tags["fold_change"] → 0.85
# fc_track._regions[0].chrom → "NC_001133.9"  (coordinates preserved)

Output: IntervalTrack of gene Regions tagged with fold_change, log2fc, mean_a, mean_b, label_a, label_b, in_a, in_b. Gene coordinates and names are preserved — data never leaves the Region world.

Edge cases: Same as window-level: 0/0 → NaN, x/0 → inf, 0/x → 0.0.

Note: table_fold_change() remains available for key-aligned comparison on TableTracks (e.g. barcode counting). For gene-level comparison, prefer interval_fold_change() since genes are Regions with coordinates.


Stage 8: Chromatin transition comparison

Module: seqchain.operations.annotate.chromatin (same annotate_chromatin())

annotate_chromatin() is generic — it classifies binary presence/absence patterns across any set of "marks." Conditions and marks are structurally identical: both are named IntervalTracks fed as a dict. Treating two conditions as two marks produces transition labels instead of state labels.

from seqchain.operations.annotate.chromatin import (
    annotate_chromatin, ChromatinConfig, TrackThreshold, StateDefinition,
)

config = ChromatinConfig(
    organism="S. cerevisiae",
    bin_size=200,
    thresholds=[
        TrackThreshold("Esa1AA_Veh_H3", threshold_veh),
        TrackThreshold("Esa1AA_Rapa_H3", threshold_rapa),
    ],
    states=[
        StateDefinition("stable_occupied",
            {"Esa1AA_Veh_H3": "present", "Esa1AA_Rapa_H3": "present"}),
        StateDefinition("nucleosome_lost",
            {"Esa1AA_Veh_H3": "present", "Esa1AA_Rapa_H3": "absent"}),
        StateDefinition("nucleosome_gained",
            {"Esa1AA_Veh_H3": "absent", "Esa1AA_Rapa_H3": "present"}),
        StateDefinition("stable_depleted",
            {"Esa1AA_Veh_H3": "absent", "Esa1AA_Rapa_H3": "absent"}),
    ],
    default_state="unclassified",
)

transitions = annotate_chromatin(
    config,
    tracks={"Esa1AA_Veh_H3": ctrl_binned, "Esa1AA_Rapa_H3": treat_binned},
    chroms=chrom_sizes,
)

Output: IntervalTrack of transition-state-labeled Regions:

Region("NC_001133.9", 0, 4400, name="stable_occupied")
Region("NC_001133.9", 4400, 4600, name="nucleosome_lost")
Region("NC_001133.9", 4600, 4800, name="stable_occupied")

Key insight: Zero new code. The same annotate_chromatin() function, the same binarize-then-classify logic, the same merge step. The only difference is the config: marks are condition labels, states are transition labels. The function doesn't know or care.


Stage 9: Track export

Module: seqchain.io.tracks

write_json()

from seqchain.io.tracks import write_json

write_json(binned_track, "signal.json")

Serializes any IntervalTrack to a JSON array of Region dicts. All tracks use the same schema:

{
  "chrom": "NC_001133.9",
  "start": 1806,
  "end": 2169,
  "strand": "-",
  "score": 1.826,
  "name": "YAL068C",
  "tags": {"gene": "PAU8", "fold_change": 1.826, "log2fc": 0.869}
}
  • Regions sorted by (chrom, start)
  • NaN and Inf values → null (JSON safety)
  • Tags sanitized recursively (floats and lists of floats)
  • Coordinates: 0-based, half-open [start, end)

write_bed() / write_wig()

from seqchain.io.tracks import write_bed, write_wig

write_bed(track, "output.bed")    # BED6: chrom, start, end, name, score, strand
write_wig(track, "output.wig")    # variableStep WIG: 1-based, integer scores

MANIFEST.md

The workbench (examples/process_chipseq.py) generates a machine-readable provenance manifest alongside the track files. Documents: sample processing stats, track inventory with region counts and resolution, full per-track transformation chains, track schema, and reproduction instructions.


Type transformation summary

From To Function Module
FASTQ (disk) BAM + BigWig + narrowPeak (disk) ProcessPipeline.run() recipes/process
BigWig (disk) SignalTrack load_bigwig() io/tracks
narrowPeak (disk) IntervalTrack load_narrowpeak() io/tracks
counts.txt (disk) TableTrack _load_featurecounts() recipes/process
SignalTrack IntervalTrack (binned) bin_summarize() transform/bin
IntervalTrack IntervalTrack (scored) map_scores() track.py
IntervalTrack IntervalTrack (filtered) filter_entries() track.py
IntervalTrack IntervalTrack (binary) binarize() transform/threshold
IntervalTrack IntervalTrack (scaled) normalize(), log_transform() transform/scale
Track x Track IntervalTrack (fold-change) interval_fold_change() compare/fold_change
Multiple IntervalTrack IntervalTrack (state-labeled) annotate_chromatin() operations/annotate/chromatin
Track + gene features IntervalTrack (per-gene signal) summarize_features() operations/quantify/summarize
IntervalTrack x IntervalTrack IntervalTrack (gene-level FC) interval_fold_change(mode="exact") compare/fold_change
TableTrack x TableTrack TableTrack (key-aligned FC) table_fold_change() compare/table_fold_change
Two IntervalTrack (conditions) IntervalTrack (transition-labeled) annotate_chromatin() operations/annotate/chromatin
IntervalTrack JSON (disk) write_json() io/tracks
IntervalTrack BED6 (disk) write_bed() io/tracks
IntervalTrack WIG (disk) write_wig() io/tracks

Validated end-to-end paths

These paths have been run on real data from PRJNA430229:

Path 1: Process → Load (single sample)

SRR6495880.fastq.gz
    → ProcessPipeline(chipseq, genome_size="12e6")
    → SRR6495880.bw (5.7 MB)
    → load_bigwig() → SignalTrack

Validated: 11.2M reads, 100% mapping, 1001 peaks.

Path 2: Process → Load → Compare (two conditions)

SRR6495880 (Esa1AA_Veh_H3)     → BigWig → SignalTrack ─┐
                                                         ├─► interval_fold_change()
SRR6495881 (Esa1AA_Rapa_H3)    → BigWig → SignalTrack ─┘
                                              IntervalTrack (60,794 windows)
                                              tagged: fold_change, log2fc

Validated: median log2FC = -0.07, 7271 windows down, 1079 up.

Path 3: Process → Load → bin_summarize (SignalTrack → IntervalTrack)

SRR6495880.bw → load_bigwig() → SignalTrack
                              bin_summarize(200bp)
                              IntervalTrack (60,794 bins)
                              99.3% non-zero (H3 = core histone)
                              median RPKM: 505 (Veh), 459 (Rapa)

Validated. This is the bridge between continuous signal and the discrete interval world. All downstream operations (binarize, annotate_chromatin(), gene-level aggregation) consume IntervalTrack.

Path 4a: bin_summarize → binarize (score → 0/1)

IntervalTrack (60,794 bins, median RPKM ~505)
   binarize(cutoff=median/4)
IntervalTrack (60,794 bins, scores 0.0 or 1.0)
    Vehicle:   98.6% present, 1.4% absent (874 bins)
    Rapamycin: 96.7% present, 3.3% absent (2,023 bins)

Validated. Data-driven cutoff (median/4 ≈ 126 RPKM for Vehicle, 115 for Rapamycin). H3 is a core histone — nearly everything is present. Esa1 depletion doubles the number of absent bins (874 → 2,023), consistent with nucleosome destabilization.

Path 4b: bin_summarize → annotate_chromatin() (state labeling)

IntervalTrack (60,794 bins)
   annotate_chromatin(
       config=2-state: nucleosome_occupied / nucleosome_depleted,
       threshold=median/4 RPKM
   )
IntervalTrack (state-labeled, merged adjacent same-state)
    Vehicle:   403 regions (204 occupied, 199 depleted)
    Rapamycin: 1,286 regions (648 occupied, 638 depleted)

Validated. Key findings:

  • annotate_chromatin() binarizes tracks internally using binarize() from Transform, then delegates classification to classify_truth_table(). Thresholds can be explicit (per-track config) or computed at runtime via a threshold function. Transform sits below Operations in the layer hierarchy, so this import is architecturally legal.
  • Merging is the core value: 60,794 bins → 403 merged regions (Vehicle). Adjacent same-state windows collapse into contiguous domains.
  • Biological signal: Esa1 depletion creates 3x more regions (403 → 1,286) — the genome fragments into smaller domains as nucleosomes are lost in scattered locations.
  • Depleted bases: 173,612 bp (Vehicle) → 403,891 bp (Rapamycin). 2.3x more nucleosome-depleted genomic territory.

Path 5: bin_summarize → summarize_features → IntervalTrack (per-gene signal)

IntervalTrack (60,794 bins, RPKM scores)
   summarize_features(track, genome.genes())
IntervalTrack (6,477 gene Regions with signal as score)
    Coordinates, names, strand, tags all preserved
    Vehicle:   median 513 RPKM, 20 genes zero-signal
    Rapamycin: median 507 RPKM, 29 genes zero-signal

Validated. Key findings:

  • 6,477 genes from sacCer3 GenBank (GCF_000146045.2)
  • Region in, Region out: gene Regions retain their coordinates, names, strand, and tags — only score is updated with signal
  • Chromosome name remapping: GenBank uses NC_001133, BigWig uses NC_001133.9. Simple prefix-match dict resolves this.
  • Top signal genes: rDNA-adjacent non-coding loci on chrXII (YNCL0026C, YLR154C-G) — expected H3 enrichment
  • Zero-signal genes: mitochondrial tRNAs (tH(GUG)Q, tA(UGC)Q) — minimal nucleosome occupancy on mtDNA
  • Composable output: result is a standard IntervalTrack — works with map_scores(), filter_entries(), interval_fold_change(), annotate_chromatin(), and all other operations

Path 6: summarize_features → interval_fold_change() (gene-level FC)

IntervalTrack (Vehicle, 6,477 gene Regions) ─┐
                                              ├─► interval_fold_change()
IntervalTrack (Rapamycin, 6,477 gene Regions)┘    (mode="exact")
                              IntervalTrack (6,477 gene Regions)
                              tagged: fold_change, log2fc, mean_a, mean_b
                              Median FC: 1.006
                              871 genes up (FC > 1.5)
                              1,158 genes down (FC < 0.67)

Validated. Key findings:

  • Same function, different granularity: interval_fold_change() handles both 200bp window comparison (Stage 4) and gene-level comparison (Stage 7). mode="exact" pairs genes by identical coordinates instead of overlap.
  • Median FC = 1.006: global H3 near-constant (core histone)
  • Asymmetric: more genes lose H3 than gain it (1,158 down vs 871 up). Expected — Esa1 is a HAT, depletion destabilizes nucleosomes.
  • Top increased: mitochondrial genes (Q-prefix, tRNAs) — low baseline signal creates noisy ratios
  • Top decreased: mitochondrial tRNAs at FC=0.000 — had trace signal in Vehicle, none in Rapamycin
  • Data stays typed: gene Regions carry coordinates, names, and fold-change tags through the entire pipeline. No conversion to dict keys or text.

Path 7: bin_summarize × 2 → annotate_chromatin() (transition comparison)

IntervalTrack (Vehicle, 60,794 bins) ─┐
                                      ├─► annotate_chromatin()
IntervalTrack (Rapamycin, 60,794 bins)┘   (conditions as "marks")
                              IntervalTrack (1,558 transition domains)
                              names: stable_occupied, nucleosome_lost,
                                     nucleosome_gained, stable_depleted

Validated. Key findings:

  • Zero new code: same annotate_chromatin() that does single-condition state calling. Config swaps mark names for condition names, states for transition labels.
  • 271 kb nucleosome loss vs 41 kb gain (6.5:1 ratio). Esa1 depletion causes asymmetric nucleosome destabilization.
  • 1,558 domains merged from 60,794 bins. 96.3% of the genome is stable_occupied (nucleosome present in both conditions).
  • Distributed across all 16 nuclear chromosomes (1.2-4.9% loss per chromosome), absent from mitochondrial DNA.

Path 8: All tracks → write_json (browser export)

8 IntervalTracks (signal, chromatin, gene_signal, fold_change, transitions)
   write_json() per track
8 JSON files + MANIFEST.md
    signal:              9.2 MB each (60,794 regions)
    chromatin:           64-206 KB (403-1,286 domains)
    gene_signal:         1.5 MB each (6,477 genes)
    gene_fold_change:    3.1 MB (6,477 genes)
    transitions:         244 KB (1,558 domains)

Validated. All 8 files use the same Region JSON schema. A track viewer that can render one can render all eight. NaN/Inf safely serialized as null.