Data Flow: FASTQ to Gene-Level Differential¶
This document maps the complete data transformation pipeline from raw sequencing reads to gene-level chromatin annotations. Every type boundary, every intermediate representation, and every function that bridges them is documented here.
Overview¶
FASTQ files (disk)
│
▼
ProcessPipeline.run() recipes/process.py
│
├──► BAM (disk) samtools-sorted, deduplicated
├──► BigWig (disk) RPKM-normalized signal
├──► narrowPeak (disk) called peaks
└──► counts.txt (disk) gene-level read counts
│
▼
I/O loaders io/tracks.py
│
├──► SignalTrack from BigWig — per-base continuous signal
├──► IntervalTrack from narrowPeak — discrete peak regions
└──► TableTrack from counts — gene-name → count mapping
│
▼
Transform layer transform/
│
├──► bin_summarize() SignalTrack → IntervalTrack (200bp bins)
├──► binarize() scores → 0/1 (mark present/absent)
├──► normalize() CPM, fraction, median-ratio
└──► log_transform() log2 for fold-change interpretation
│
▼
Compare layer compare/
│
└──► interval_fold_change() Track × Track → IntervalTrack
per-window fold_change, log2fc tags
│
▼
Chromatin annotation operations/annotate/chromatin.py
│
└──► annotate_chromatin() multiple binned tracks → state-labeled IntervalTrack
e.g. "open_active", "closed"
│
▼
Gene-level aggregation operations/quantify/summarize.py
│
└──► summarize_features() Track × gene features → IntervalTrack
gene Regions enriched with signal as score
│
▼
Gene-level comparison compare/fold_change.py
│
└──► interval_fold_change() IntervalTrack × IntervalTrack → IntervalTrack
(mode="exact") gene Regions tagged with fold_change, log2fc
│
▼
Chromatin transition comparison operations/annotate/chromatin.py
│
└──► annotate_chromatin() two binned IntervalTracks as "marks"
(2-condition) → transition-state-labeled IntervalTrack
stable_occupied, nucleosome_lost, etc.
│
▼
Track export io/tracks.py
│
├──► write_json() IntervalTrack → JSON (browser tracks)
├──► write_bed() IntervalTrack → BED6
└──► write_wig() IntervalTrack → variableStep WIG
Stage 1: Processing (FASTQ → disk artifacts)¶
Module: seqchain.recipes.process.ProcessPipeline
Input types:
| Parameter | Type | Example |
|---|---|---|
reads1 |
str (file path) |
SRR6495880.fastq.gz |
reads2 |
str \| None |
None (single-end) |
reference |
str (file path) |
sacCer3.fa |
genome_size |
str |
"12e6" (yeast) |
preset |
ProcessPreset |
loaded from chipseq.yaml |
Output type: ProcessResult (frozen dataclass)
| Field | Type | Content |
|---|---|---|
bam |
Path |
Deduplicated, coordinate-sorted BAM |
flagstat |
dict[str, int] |
{"total": 11228795, "mapped": 11228795, ...} |
signal_track |
SignalTrack \| None |
Loaded from BigWig if signal enabled |
peak_track |
IntervalTrack \| None |
Loaded from narrowPeak if peaks enabled |
count_track |
TableTrack \| None |
Loaded from featureCounts if counts enabled |
bigwig_path |
Path \| None |
Disk path to BigWig |
peaks_path |
Path \| None |
Disk path to narrowPeak |
counts_path |
Path \| None |
Disk path to counts |
Pipeline steps (controlled by YAML preset):
reads.fastq.gz
│
▼ bowtie2 --very-sensitive -U reads.fq.gz | samtools sort
sample.sorted.bam
│
▼ samtools sort -n → fixmate → sort → markdup → index
sample.dedup.bam + sample.dedup.bam.bai
│
▼ samtools flagstat
sample.flagstat dict[str, int]
│
├──► (optional) samtools view -h | awk (fragment size filter)
│ sample.filtered.bam only for MNase-Seq (120-180bp)
│
├──► macs3 callpeak --nomodel --extsize 147
│ sample_peaks.narrowPeak BED6+4 format
│
├──► bamCoverage --normalizeUsing RPKM --binSize 10
│ sample.bw BigWig binary format
│
└──► featureCounts -t exon only for RNA-Seq
sample.counts.txt TSV: gene_id → count
Presets (YAML, organism-agnostic):
| Preset | Peaks | Signal | Counts | Fragment filter | Notes |
|---|---|---|---|---|---|
chipseq |
yes (--nomodel) |
yes (RPKM) | no | no | SE histone/TF |
mnase |
no | yes (RPKM, --MNase) |
no | yes (120-180bp) | Mono-nucleosomal |
rnaseq |
no | yes (RPKM) | yes (exon) | no | Requires GTF |
Resumability: Every step checks if its output file exists before running. Re-running the pipeline skips completed steps.
Stage 2: Loading (disk → Track objects)¶
Module: seqchain.io.tracks
Three loaders, three Track types:
BigWig → SignalTrack¶
from seqchain.io.tracks import load_bigwig
signal = load_bigwig("sample.bw")
# Returns: SignalTrack backed by pyBigWig file handle
SignalTrack API:
| Method | Signature | Returns |
|---|---|---|
signal_at |
(chrom, start, end) → float |
Mean signal in window |
raw_signal |
(chrom, start, end) → list[float] |
Per-base values |
has_chrom |
(chrom) → bool |
Chromosome presence check |
regions |
(chrom, start, end) → Iterator[Region] |
Always empty (not interval-based) |
SignalTrack is read-only continuous signal. It does not support
map_scores(), filter_entries(), scores(), or with_scores() —
these raise NotImplementedError with the hint: "Use bin_summarize()
to convert to IntervalTrack first."
narrowPeak → IntervalTrack¶
from seqchain.io.tracks import load_narrowpeak
peaks = load_narrowpeak("sample_peaks.narrowPeak")
# Returns: IntervalTrack of Region objects (segment-tree indexed)
IntervalTrack API:
| Method | Signature | Returns |
|---|---|---|
signal_at |
(chrom, start, end) → float |
Overlap-weighted average of region scores |
regions |
(chrom, start, end) → Iterator[Region] |
All overlapping Regions |
map_scores |
(fn) → IntervalTrack |
New track with transformed scores |
filter_entries |
(fn) → IntervalTrack |
New track with filtered regions |
scores |
() → Iterator[float] |
All region scores |
with_scores |
(scores) → IntervalTrack |
Replace all scores |
__len__ |
() → int |
Number of regions |
Each Region in the IntervalTrack has:
Region(
chrom="NC_001133.9", # chromosome
start=1000, # 0-based inclusive
end=1500, # 0-based exclusive
score=45.2, # peak score (from narrowPeak column 7)
name="SRR6495880_peak_1",
tags={"signal_value": 12.5, "p_value": 8.3, "q_value": 6.1},
)
featureCounts → TableTrack¶
# Loaded internally by ProcessPipeline._load_featurecounts()
count_track = TableTrack("counts", {"YAL001C": 1523.0, "YAL002W": 87.0, ...})
TableTrack API:
| Method | Signature | Returns |
|---|---|---|
signal_at |
(chrom, start, end) → float |
Always 0.0 (non-geographic) |
regions |
(chrom, start, end) → Iterator[Region] |
Always empty |
map_scores |
(fn) → TableTrack |
New track with transformed values |
filter_entries |
(fn) → TableTrack |
New track with filtered entries |
scores |
() → Iterator[float] |
All count values |
with_scores |
(scores) → TableTrack |
Replace all values |
get |
(key) → float |
Lookup by gene name |
keys |
() → Iterator[str] |
All gene names |
Stage 3: Transform (Track → Track)¶
Module: seqchain.transform
All functions are pure: they take a Track, return a new Track. The original is never modified.
Critical path: BigWig → binned intervals¶
from seqchain.transform import bin_summarize
# chrom_sizes can be extracted from the BigWig itself:
chrom_sizes = signal_track.chrom_sizes()
# {"NC_001133.9": 230218, "NC_001134.8": 813184, ...}
binned = bin_summarize(signal_track, bin_size=200, chrom_sizes=chrom_sizes)
# Returns: IntervalTrack with one Region per 200bp window
# ~60,794 Regions for sacCer3 (12.16 Mb / 200 bp)
bin_summarize is the bridge between continuous signal and discrete intervals. After this conversion, all IntervalTrack operations work.
| Method | Works on | Description |
|---|---|---|
"mean" |
Any Track | signal_at() per bin (default) |
"max" |
SignalTrack only | Max of raw_signal() per bin |
"min" |
SignalTrack only | Min of raw_signal() per bin |
"sum" |
SignalTrack only | Sum of raw_signal() per bin |
Normalization and scaling¶
from seqchain.transform import (
add_pseudocount,
log_transform,
normalize,
clamp,
standardize,
rank,
)
# Add pseudocount before fold-change (avoid 0/0)
track = add_pseudocount(track, value=1.0)
# Log2 transform
track = log_transform(track, base=2)
# CPM normalization
track = normalize(track, method="cpm")
# Clip outliers
track = clamp(track, floor=0.0, ceiling=100.0)
# Z-score
track = standardize(track)
# Rank (ties averaged)
track = rank(track)
Thresholding¶
from seqchain.transform import binarize, filter_by_score
# Binary: present (>= cutoff) / absent
binary = binarize(track, cutoff=2.0)
# All scores become 1.0 or 0.0
# Filter: keep only high-scoring regions
filtered = filter_by_score(track, min_score=5.0)
Missing data¶
from seqchain.transform import fill_missing, drop_zeros
# Replace NaN with 0
track = fill_missing(track, value=0.0)
# Remove zero-signal bins
track = drop_zeros(track)
Stage 4: Compare (Track x Track → Track)¶
Module: seqchain.compare
Alignment¶
Before comparing, two tracks must be aligned — their regions paired up so each window has a value from both tracks. Three alignment modes are auto-detected by track type:
| Input types | Alignment method |
|---|---|
| IntervalTrack x IntervalTrack | Match by exact coordinates or overlap |
| TableTrack x TableTrack | Key join on gene names |
| SignalTrack x SignalTrack | Fixed-size window walk across genome |
from seqchain.compare.align import align_tracks, AlignedPair
for pair in align_tracks(ctrl_track, treat_track,
bin_size=200, chrom_sizes=chrom_sizes):
# pair.region — Region with window coordinates
# pair.value_a — signal from control
# pair.value_b — signal from treatment
# pair.in_a — True if window had data in control
# pair.in_b — True if window had data in treatment
Fold-change (coordinate-aligned)¶
from seqchain.compare.fold_change import interval_fold_change
fc_track = interval_fold_change(
ctrl_signal, # SignalTrack (Vehicle)
treat_signal, # SignalTrack (Rapamycin)
label_a="Vehicle",
label_b="Rapamycin",
bin_size=200,
chrom_sizes=chrom_sizes,
)
Output: IntervalTrack where each Region has these tags:
| Tag | Type | Description |
|---|---|---|
fold_change |
float |
treatment / control |
log2fc |
float |
log2(fold_change), NaN if undefined |
mean_a |
float |
Control signal in window |
mean_b |
float |
Treatment signal in window |
label_a |
str |
"Vehicle" |
label_b |
str |
"Rapamycin" |
in_a |
bool |
Data present in control |
in_b |
bool |
Data present in treatment |
Edge cases: 0/0 → NaN, x/0 → inf, 0/x → 0.0.
Fold-change (key-aligned)¶
For TableTrack comparison (gene-level, barcode-level), use
table_fold_change() — same arithmetic, different I/O shape:
from seqchain.compare.table_fold_change import table_fold_change
fc_table = table_fold_change(
ctrl_gene_signal, # TableTrack (per-gene signal, Vehicle)
treat_gene_signal, # TableTrack (per-gene signal, Rapamycin)
label_a="Vehicle",
label_b="Rapamycin",
)
# Returns: TableTrack mapping gene name → fold-change value
fc_table.get("YAL001C") # → 0.85 (gene lost H3 on Esa1 depletion)
interval_fold_change() and table_fold_change share the same
_fold_change / _log2fc math, but differ in alignment: coordinate-aligned
vs key-aligned. Both use align_tracks() for pairing.
Real-data validation¶
From examples/process_chipseq.py (Esa1AA Vehicle vs Rapamycin, H3):
60,794 windows (200 bp bins across sacCer3)
Median fold change: 0.951
Median log2FC: -0.071
Windows with |log2FC| > 1 up: 1,079
Windows with |log2FC| > 1 down: 7,271
Global H3 loss on Esa1 depletion (expected: Esa1 is a histone acetyltransferase whose depletion destabilizes nucleosomes).
Stage 5: Chromatin annotation (multiple tracks → state labels)¶
Module: seqchain.operations.annotate.chromatin
annotate_chromatin() takes multiple binned tracks (one per ChIP
target or accessibility assay) and assigns a chromatin state label
to each genomic window.
Configuration (YAML)¶
# presets/yeast_chromatin.yaml
type: chromatin
name: "Yeast 7-state chromatin"
bin_size: 200
marks:
ATAC:
threshold: 2.0
H3K4me3:
threshold: 1.5
H3K36me3:
threshold: 1.5
states:
- name: open_active
requires: {ATAC: true, H3K4me3: true, H3K36me3: false}
- name: open_full
requires: {ATAC: true, H3K4me3: true, H3K36me3: true}
- name: accessible_only
requires: {ATAC: true, H3K4me3: false, H3K36me3: false}
- name: k4_k36
requires: {ATAC: false, H3K4me3: true, H3K36me3: true}
- name: k36_only
requires: {ATAC: false, H3K4me3: false, H3K36me3: true}
- name: k4_only
requires: {ATAC: false, H3K4me3: true, H3K36me3: false}
- name: closed
requires: {ATAC: false, H3K4me3: false, H3K36me3: false}
Data flow¶
from seqchain.operations.annotate.chromatin import annotate_chromatin
from seqchain.recipes import load_chromatin_config
config = load_chromatin_config("yeast_chromatin")
chromatin_track = annotate_chromatin(
config,
tracks={
"ATAC": binned_atac, # IntervalTrack (from bin_summarize)
"H3K4me3": binned_h3k4me3, # IntervalTrack
"H3K36me3": binned_h3k36me3,# IntervalTrack
},
chroms=chrom_sizes,
)
Process per 200bp window:
- Query
signal_at()on each track for the window - Threshold: signal >= mark threshold → present, else absent
- Match presence pattern against states in priority order (first match wins)
- Merge adjacent same-state windows on the same chromosome
Output: IntervalTrack of state-labeled Regions:
Region("NC_001133.9", 0, 2000, name="closed")
Region("NC_001133.9", 2000, 4400, name="open_active")
Region("NC_001133.9", 4400, 5000, name="k4_only")
Adapting to PRJNA430229¶
The demo config uses ATAC/H3K4me3/H3K36me3. PRJNA430229 has different marks. A custom config would use:
| PRJNA430229 target | Chromatin meaning |
|---|---|
| H3 | Total histone occupancy (nucleosome positioning) |
| H3K9ac | Active promoter / enhancer mark |
| H4ac | Active chromatin mark |
| RNAPII | Transcription (proxy for accessibility) |
Stage 6: Gene-level aggregation¶
Module: seqchain.operations.quantify.summarize
summarize_features()¶
Pure function: Track × Iterable[Region] → IntervalTrack. For each
feature Region, calls track.signal_at(chrom, start, end) and returns
a copy of the Region with its score set to the signal value. All other
fields (chrom, start, end, strand, name, tags) are preserved.
Data stays in the Region world. The output IntervalTrack is composable with every other operation: filter, annotate, score, compare.
from seqchain.operations.quantify.summarize import summarize_features
# gene features from GenBank
genome = load_genbank("sacCer3.gbff")
genes = genome.genes()
gene_track = summarize_features(binned_track, genes)
# Returns: IntervalTrack of gene Regions with signal as score
# gene_track._regions[0].name → "YAL001C"
# gene_track._regions[0].score → 512.3 (overlap-weighted avg RPKM)
# gene_track._regions[0].chrom → "NC_001133.9" (coordinates preserved)
Behavior:
- All features included (unnamed features get
name="") - Duplicate names with different coordinates: both kept (Regions are distinguished by coordinates, not just name)
- Signal value is
IntervalTrack.signal_at()— overlap-weighted average of region scores within the feature's coordinates - Features on chromosomes with no data → score 0.0
- Output is composable:
map_scores(),filter_entries(),signal_at(),interval_fold_change()all work
Note on chromosome names: GenBank files use unversioned accessions
(NC_001133) while FASTA/BigWig may use versioned ones (NC_001133.9).
Features must be remapped to match the track's chromosome names before
calling summarize_features(). The workbench (examples/process_chipseq.py)
demonstrates this with a simple prefix-matching dict.
Stage 7: Gene-level comparison¶
Module: seqchain.compare.fold_change
After summarizing per-gene signal for two conditions, compute per-gene
fold change using the same interval_fold_change() that handles
window-level comparison. With mode="exact", it pairs gene Regions by
identical coordinates:
from seqchain.compare.fold_change import interval_fold_change
fc_track = interval_fold_change(
ctrl_gene_track, # IntervalTrack (Vehicle, gene Regions)
treat_gene_track, # IntervalTrack (Rapamycin, gene Regions)
label_a="Vehicle",
label_b="Rapamycin",
mode="exact",
)
# Output: IntervalTrack of gene Regions with fold_change/log2fc tags
# fc_track._regions[0].name → "YAL001C"
# fc_track._regions[0].tags["fold_change"] → 0.85
# fc_track._regions[0].chrom → "NC_001133.9" (coordinates preserved)
Output: IntervalTrack of gene Regions tagged with fold_change,
log2fc, mean_a, mean_b, label_a, label_b, in_a, in_b.
Gene coordinates and names are preserved — data never leaves the
Region world.
Edge cases: Same as window-level: 0/0 → NaN, x/0 → inf,
0/x → 0.0.
Note: table_fold_change() remains available for key-aligned
comparison on TableTracks (e.g. barcode counting). For gene-level
comparison, prefer interval_fold_change() since genes are Regions
with coordinates.
Stage 8: Chromatin transition comparison¶
Module: seqchain.operations.annotate.chromatin (same annotate_chromatin())
annotate_chromatin() is generic — it classifies binary presence/absence
patterns across any set of "marks." Conditions and marks are structurally
identical: both are named IntervalTracks fed as a dict. Treating two
conditions as two marks produces transition labels instead of state labels.
from seqchain.operations.annotate.chromatin import (
annotate_chromatin, ChromatinConfig, TrackThreshold, StateDefinition,
)
config = ChromatinConfig(
organism="S. cerevisiae",
bin_size=200,
thresholds=[
TrackThreshold("Esa1AA_Veh_H3", threshold_veh),
TrackThreshold("Esa1AA_Rapa_H3", threshold_rapa),
],
states=[
StateDefinition("stable_occupied",
{"Esa1AA_Veh_H3": "present", "Esa1AA_Rapa_H3": "present"}),
StateDefinition("nucleosome_lost",
{"Esa1AA_Veh_H3": "present", "Esa1AA_Rapa_H3": "absent"}),
StateDefinition("nucleosome_gained",
{"Esa1AA_Veh_H3": "absent", "Esa1AA_Rapa_H3": "present"}),
StateDefinition("stable_depleted",
{"Esa1AA_Veh_H3": "absent", "Esa1AA_Rapa_H3": "absent"}),
],
default_state="unclassified",
)
transitions = annotate_chromatin(
config,
tracks={"Esa1AA_Veh_H3": ctrl_binned, "Esa1AA_Rapa_H3": treat_binned},
chroms=chrom_sizes,
)
Output: IntervalTrack of transition-state-labeled Regions:
Region("NC_001133.9", 0, 4400, name="stable_occupied")
Region("NC_001133.9", 4400, 4600, name="nucleosome_lost")
Region("NC_001133.9", 4600, 4800, name="stable_occupied")
Key insight: Zero new code. The same annotate_chromatin() function,
the same binarize-then-classify logic, the same merge step. The only
difference is the config: marks are condition labels, states are
transition labels. The function doesn't know or care.
Stage 9: Track export¶
Module: seqchain.io.tracks
write_json()¶
Serializes any IntervalTrack to a JSON array of Region dicts. All tracks use the same schema:
{
"chrom": "NC_001133.9",
"start": 1806,
"end": 2169,
"strand": "-",
"score": 1.826,
"name": "YAL068C",
"tags": {"gene": "PAU8", "fold_change": 1.826, "log2fc": 0.869}
}
- Regions sorted by
(chrom, start) - NaN and Inf values →
null(JSON safety) - Tags sanitized recursively (floats and lists of floats)
- Coordinates: 0-based, half-open
[start, end)
write_bed() / write_wig()¶
from seqchain.io.tracks import write_bed, write_wig
write_bed(track, "output.bed") # BED6: chrom, start, end, name, score, strand
write_wig(track, "output.wig") # variableStep WIG: 1-based, integer scores
MANIFEST.md¶
The workbench (examples/process_chipseq.py) generates a machine-readable
provenance manifest alongside the track files. Documents: sample processing
stats, track inventory with region counts and resolution, full per-track
transformation chains, track schema, and reproduction instructions.
Type transformation summary¶
| From | To | Function | Module |
|---|---|---|---|
| FASTQ (disk) | BAM + BigWig + narrowPeak (disk) | ProcessPipeline.run() |
recipes/process |
| BigWig (disk) | SignalTrack |
load_bigwig() |
io/tracks |
| narrowPeak (disk) | IntervalTrack |
load_narrowpeak() |
io/tracks |
| counts.txt (disk) | TableTrack |
_load_featurecounts() |
recipes/process |
SignalTrack |
IntervalTrack (binned) |
bin_summarize() |
transform/bin |
IntervalTrack |
IntervalTrack (scored) |
map_scores() |
track.py |
IntervalTrack |
IntervalTrack (filtered) |
filter_entries() |
track.py |
IntervalTrack |
IntervalTrack (binary) |
binarize() |
transform/threshold |
IntervalTrack |
IntervalTrack (scaled) |
normalize(), log_transform() |
transform/scale |
Track x Track |
IntervalTrack (fold-change) |
interval_fold_change() |
compare/fold_change |
Multiple IntervalTrack |
IntervalTrack (state-labeled) |
annotate_chromatin() |
operations/annotate/chromatin |
Track + gene features |
IntervalTrack (per-gene signal) |
summarize_features() |
operations/quantify/summarize |
IntervalTrack x IntervalTrack |
IntervalTrack (gene-level FC) |
interval_fold_change(mode="exact") |
compare/fold_change |
TableTrack x TableTrack |
TableTrack (key-aligned FC) |
table_fold_change() |
compare/table_fold_change |
Two IntervalTrack (conditions) |
IntervalTrack (transition-labeled) |
annotate_chromatin() |
operations/annotate/chromatin |
IntervalTrack |
JSON (disk) | write_json() |
io/tracks |
IntervalTrack |
BED6 (disk) | write_bed() |
io/tracks |
IntervalTrack |
WIG (disk) | write_wig() |
io/tracks |
Validated end-to-end paths¶
These paths have been run on real data from PRJNA430229:
Path 1: Process → Load (single sample)¶
SRR6495880.fastq.gz
→ ProcessPipeline(chipseq, genome_size="12e6")
→ SRR6495880.bw (5.7 MB)
→ load_bigwig() → SignalTrack
Validated: 11.2M reads, 100% mapping, 1001 peaks.
Path 2: Process → Load → Compare (two conditions)¶
SRR6495880 (Esa1AA_Veh_H3) → BigWig → SignalTrack ─┐
├─► interval_fold_change()
SRR6495881 (Esa1AA_Rapa_H3) → BigWig → SignalTrack ─┘
│
IntervalTrack (60,794 windows)
tagged: fold_change, log2fc
Validated: median log2FC = -0.07, 7271 windows down, 1079 up.
Path 3: Process → Load → bin_summarize (SignalTrack → IntervalTrack)¶
SRR6495880.bw → load_bigwig() → SignalTrack
│
bin_summarize(200bp)
│
IntervalTrack (60,794 bins)
99.3% non-zero (H3 = core histone)
median RPKM: 505 (Veh), 459 (Rapa)
Validated. This is the bridge between continuous signal and the discrete interval world. All downstream operations (binarize, annotate_chromatin(), gene-level aggregation) consume IntervalTrack.
Path 4a: bin_summarize → binarize (score → 0/1)¶
IntervalTrack (60,794 bins, median RPKM ~505)
│
binarize(cutoff=median/4)
│
IntervalTrack (60,794 bins, scores 0.0 or 1.0)
Vehicle: 98.6% present, 1.4% absent (874 bins)
Rapamycin: 96.7% present, 3.3% absent (2,023 bins)
Validated. Data-driven cutoff (median/4 ≈ 126 RPKM for Vehicle, 115 for Rapamycin). H3 is a core histone — nearly everything is present. Esa1 depletion doubles the number of absent bins (874 → 2,023), consistent with nucleosome destabilization.
Path 4b: bin_summarize → annotate_chromatin() (state labeling)¶
IntervalTrack (60,794 bins)
│
annotate_chromatin(
config=2-state: nucleosome_occupied / nucleosome_depleted,
threshold=median/4 RPKM
)
│
IntervalTrack (state-labeled, merged adjacent same-state)
Vehicle: 403 regions (204 occupied, 199 depleted)
Rapamycin: 1,286 regions (648 occupied, 638 depleted)
Validated. Key findings:
annotate_chromatin()binarizes tracks internally usingbinarize()from Transform, then delegates classification toclassify_truth_table(). Thresholds can be explicit (per-track config) or computed at runtime via a threshold function. Transform sits below Operations in the layer hierarchy, so this import is architecturally legal.- Merging is the core value: 60,794 bins → 403 merged regions (Vehicle). Adjacent same-state windows collapse into contiguous domains.
- Biological signal: Esa1 depletion creates 3x more regions (403 → 1,286) — the genome fragments into smaller domains as nucleosomes are lost in scattered locations.
- Depleted bases: 173,612 bp (Vehicle) → 403,891 bp (Rapamycin). 2.3x more nucleosome-depleted genomic territory.
Path 5: bin_summarize → summarize_features → IntervalTrack (per-gene signal)¶
IntervalTrack (60,794 bins, RPKM scores)
│
summarize_features(track, genome.genes())
│
IntervalTrack (6,477 gene Regions with signal as score)
Coordinates, names, strand, tags all preserved
Vehicle: median 513 RPKM, 20 genes zero-signal
Rapamycin: median 507 RPKM, 29 genes zero-signal
Validated. Key findings:
- 6,477 genes from sacCer3 GenBank (GCF_000146045.2)
- Region in, Region out: gene Regions retain their coordinates,
names, strand, and tags — only
scoreis updated with signal - Chromosome name remapping: GenBank uses
NC_001133, BigWig usesNC_001133.9. Simple prefix-match dict resolves this. - Top signal genes: rDNA-adjacent non-coding loci on chrXII (YNCL0026C, YLR154C-G) — expected H3 enrichment
- Zero-signal genes: mitochondrial tRNAs (tH(GUG)Q, tA(UGC)Q) — minimal nucleosome occupancy on mtDNA
- Composable output: result is a standard IntervalTrack — works
with
map_scores(),filter_entries(),interval_fold_change(),annotate_chromatin(), and all other operations
Path 6: summarize_features → interval_fold_change() (gene-level FC)¶
IntervalTrack (Vehicle, 6,477 gene Regions) ─┐
├─► interval_fold_change()
IntervalTrack (Rapamycin, 6,477 gene Regions)┘ (mode="exact")
│
IntervalTrack (6,477 gene Regions)
tagged: fold_change, log2fc, mean_a, mean_b
Median FC: 1.006
871 genes up (FC > 1.5)
1,158 genes down (FC < 0.67)
Validated. Key findings:
- Same function, different granularity:
interval_fold_change()handles both 200bp window comparison (Stage 4) and gene-level comparison (Stage 7).mode="exact"pairs genes by identical coordinates instead of overlap. - Median FC = 1.006: global H3 near-constant (core histone)
- Asymmetric: more genes lose H3 than gain it (1,158 down vs 871 up). Expected — Esa1 is a HAT, depletion destabilizes nucleosomes.
- Top increased: mitochondrial genes (Q-prefix, tRNAs) — low baseline signal creates noisy ratios
- Top decreased: mitochondrial tRNAs at FC=0.000 — had trace signal in Vehicle, none in Rapamycin
- Data stays typed: gene Regions carry coordinates, names, and fold-change tags through the entire pipeline. No conversion to dict keys or text.
Path 7: bin_summarize × 2 → annotate_chromatin() (transition comparison)¶
IntervalTrack (Vehicle, 60,794 bins) ─┐
├─► annotate_chromatin()
IntervalTrack (Rapamycin, 60,794 bins)┘ (conditions as "marks")
│
IntervalTrack (1,558 transition domains)
names: stable_occupied, nucleosome_lost,
nucleosome_gained, stable_depleted
Validated. Key findings:
- Zero new code: same
annotate_chromatin()that does single-condition state calling. Config swaps mark names for condition names, states for transition labels. - 271 kb nucleosome loss vs 41 kb gain (6.5:1 ratio). Esa1 depletion causes asymmetric nucleosome destabilization.
- 1,558 domains merged from 60,794 bins. 96.3% of the genome is stable_occupied (nucleosome present in both conditions).
- Distributed across all 16 nuclear chromosomes (1.2-4.9% loss per chromosome), absent from mitochondrial DNA.
Path 8: All tracks → write_json (browser export)¶
8 IntervalTracks (signal, chromatin, gene_signal, fold_change, transitions)
│
write_json() per track
│
8 JSON files + MANIFEST.md
signal: 9.2 MB each (60,794 regions)
chromatin: 64-206 KB (403-1,286 domains)
gene_signal: 1.5 MB each (6,477 genes)
gene_fold_change: 3.1 MB (6,477 genes)
transitions: 244 KB (1,558 domains)
Validated. All 8 files use the same Region JSON schema. A track viewer
that can render one can render all eight. NaN/Inf safely serialized
as null.