Heuristic Discovery
Auto-detect barcode read structure from raw FASTQ: sample reads, build position-frequency tables, and identify fixed flanks and variable barcode regions using signal-boundary detection.
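As a rough illustration of the position-frequency idea described above, the sketch below counts bases per read position and marks positions where a single base dominates. It is not the library's implementation: the helper names `position_frequencies` and `fixed_positions` and the `min_fraction` cutoff are hypothetical, chosen only for this example.

```python
from collections import Counter


def position_frequencies(reads: list[str]) -> list[Counter]:
    """Count bases at each position, truncating to the shortest read."""
    length = min(len(read) for read in reads)
    return [Counter(read[i] for read in reads) for i in range(length)]


def fixed_positions(reads: list[str], min_fraction: float = 0.9) -> list[bool]:
    """Mark positions where one base covers at least min_fraction of reads.

    min_fraction is a hypothetical cutoff for this sketch only.
    """
    tables = position_frequencies(reads)
    total = len(reads)
    return [table.most_common(1)[0][1] / total >= min_fraction for table in tables]


# Toy reads: a fixed 4-base flank, a variable 4-base barcode, a fixed 4-base flank.
reads = [
    "ACGTAAAATTGC",
    "ACGTCCGATTGC",
    "ACGTGTCCTTGC",
    "ACGTTGATTTGC",
]
mask = "".join("F" if fixed else "v" for fixed in fixed_positions(reads))
print(mask)  # FFFFvvvvFFFF
```

Runs of `F` correspond to candidate flanks, and the `v` run between them to the candidate barcode region.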
heuristic
Adaptive heuristic read-structure discovery.
Provides heuristic_discover(), which samples reads to learn barcode
position, orientation, and flanking sequences.
Assumes symmetric paired-end barcode libraries where both R1 and R2 contain the same barcode in opposite orientations. Asymmetric libraries do not fit this assumption: in Tn-seq (Sassetti protocol), for example, R1 contains a transposon-genome junction while R2 contains a deduplication barcode.
SamplingResult
dataclass
SamplingResult(read1_offset: int | None, read2_offset: int | None, read1_orient: str | None, read2_orient: str | None, valid_reads1: set[str], valid_reads2: set[str], need_swap: bool, num_chunks: int, diversity_satisfied: bool)
Output of the adaptive sampling phase.
Replaces the raw tuple returned by the old _sample_reads()
with named, documented fields.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `read1_offset` | `int \| None` | Dominant barcode offset in read 1. | required |
| `read2_offset` | `int \| None` | Dominant barcode offset in read 2, or | required |
| `read1_orient` | `str \| None` | Dominant orientation ( | required |
| `read2_orient` | `str \| None` | Dominant orientation in read 2, or | required |
| `valid_reads1` | `set[str]` | Set of read 1 sequences that contained a library barcode at the dominant offset. | required |
| `valid_reads2` | `set[str]` | Set of read 2 sequences, or empty set. | required |
| `need_swap` | `bool` | Whether read files need swapping (R1 file actually contains reverse reads). | required |
| `num_chunks` | `int` | Number of chunks consumed before stopping. | required |
| `diversity_satisfied` | `bool` | Whether diversity criteria were met before the file ended. | required |
Examples:
>>> from seqchain.operations.discover.heuristic import SamplingResult
>>> result = SamplingResult(
...     read1_offset=6, read2_offset=None,
...     read1_orient="forward", read2_orient=None,
...     valid_reads1=set(), valid_reads2=set(),
...     need_swap=False, num_chunks=3, diversity_satisfied=True,
... )
heuristic_discover
heuristic_discover(reads1: str, library: set[str], reads2: str | None = None, *, max_flank: int = 10, diversity_factor: float = 5.0, offset_dominance: float = 2.0, boundary_margin: float = 0.75) -> ReadStructure
Sample reads to learn barcode position, orientation, and flanks.
Reads library-sized chunks, slides k-mers to find barcode offsets and orientations, tracks diversity until saturation, then identifies flanking sequences.
Assumes symmetric paired-end barcode libraries where both R1 and R2 contain the same barcode in opposite orientations.
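The k-mer sliding step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in heuristic.py: the helpers `reverse_complement`, `find_barcode_hits`, and `tally_hits` are hypothetical, the `"reverse"` label is assumed by analogy with the `"forward"` value shown in the SamplingResult example, and all barcodes are assumed to share one length.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_barcode_hits(read: str, library: set[str], k: int) -> list[tuple[int, str]]:
    """Return (offset, orientation) for every k-mer of read found in the library."""
    hits = []
    for offset in range(len(read) - k + 1):
        kmer = read[offset:offset + k]
        if kmer in library:
            hits.append((offset, "forward"))
        elif reverse_complement(kmer) in library:
            hits.append((offset, "reverse"))
    return hits


def tally_hits(reads: list[str], library: set[str]) -> Counter:
    """Count (offset, orientation) pairs across all reads."""
    k = len(next(iter(library)))  # assumes every barcode has the same length
    return Counter(hit for read in reads for hit in find_barcode_hits(read, library, k))


library = {"AACCGG", "TTGACA"}
reads = [
    "GATAACCGGTC",  # AACCGG at offset 3, forward
    "GATTTGACATC",  # TTGACA at offset 3, forward
    "GATAACCGGTC",
    "GACCGGTTATC",  # CCGGTT (reverse complement of AACCGG) at offset 2
]
tally = tally_hits(reads, library)
print(tally.most_common(2))  # [((3, 'forward'), 3), ((2, 'reverse'), 1)]
```

The most common pair gives the candidate dominant offset and orientation, and the runner-up count is what a dominance check would compare against.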
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `reads1` | `str` | Path to primary reads file (FASTQ or .reads). | required |
| `library` | `set[str]` | Set of known barcode sequences. | required |
| `reads2` | `str \| None` | Optional path to paired-end reads file. | `None` |
| `max_flank` | `int` | Maximum flank length to search for. Defaults to 10. | `10` |
| `diversity_factor` | `float` | Required oversampling ratio before declaring diversity saturation. Defaults to 5.0 (5x library size). | `5.0` |
| `offset_dominance` | `float` | Required ratio of top offset count to runner-up before accepting convergence. Defaults to 2.0. | `2.0` |
| `boundary_margin` | `float` | Safety margin for flank boundary detection. Multiplied by alphabet_size (4 for DNA) to get the threshold ratio. Defaults to 0.75 (effective 3x for DNA). | `0.75` |
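The arithmetic behind the three tuning parameters can be sketched as below. The function names are hypothetical, and the exact quantities the library compares (for instance, which reads count toward the oversampling ratio, or whether comparisons are strict) are assumptions made for this illustration; only the defaults and the `boundary_margin * alphabet_size` product come from the parameter descriptions.

```python
def diversity_saturated(reads_sampled: int, library_size: int,
                        diversity_factor: float = 5.0) -> bool:
    """Assumed check: enough reads sampled relative to the library size."""
    return reads_sampled >= diversity_factor * library_size


def offset_converged(top_count: int, runner_up_count: int,
                     offset_dominance: float = 2.0) -> bool:
    """Assumed check: the top offset outnumbers the runner-up by the required ratio."""
    return top_count >= offset_dominance * runner_up_count


def boundary_threshold(boundary_margin: float = 0.75, alphabet_size: int = 4) -> float:
    """Threshold ratio for flank boundary detection, as described for boundary_margin."""
    return boundary_margin * alphabet_size


print(diversity_saturated(5000, 1000))  # True
print(offset_converged(150, 100))       # False
print(boundary_threshold())             # 3.0
```

With the defaults, a library of 1,000 barcodes would need roughly 5,000 sampled reads under this reading, and the DNA boundary threshold works out to 3x.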
Returns:
| Type | Description |
|---|---|
| `ReadStructure` | A |
Source code in src/seqchain/operations/discover/heuristic.py
validate_flank_complementarity
validate_flank_complementarity(left_fwd: str | None, right_fwd: str | None, left_rev: str | None, right_rev: str | None) -> None
Validate that paired-end flanks are reverse complements.
In paired-end data, the forward left flank should be the reverse complement of the reverse right flank (and vice versa).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `left_fwd` | `str \| None` | Forward left flank. | required |
| `right_fwd` | `str \| None` | Forward right flank. | required |
| `left_rev` | `str \| None` | Reverse left flank. | required |
| `right_rev` | `str \| None` | Reverse right flank. | required |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If complementarity check fails. |