Skip to content

Heuristic Discovery

Auto-detect barcode read structure from raw FASTQ. Samples reads, builds position-frequency tables, and identifies fixed flanks and variable barcode regions using signal-boundary detection.

heuristic

Adaptive heuristic read-structure discovery.

Provides heuristic_discover() which samples reads to learn barcode position, orientation, and flanking sequences.

Assumes symmetric paired-end barcode libraries where both R1 and R2 contain the same barcode in opposite orientations. For asymmetric libraries like Tn-seq (Sassetti protocol), R1 contains a transposon-genome junction while R2 contains a deduplication barcode.

SamplingResult dataclass

SamplingResult(read1_offset: int | None, read2_offset: int | None, read1_orient: str | None, read2_orient: str | None, valid_reads1: set[str], valid_reads2: set[str], need_swap: bool, num_chunks: int, diversity_satisfied: bool)

Output of the adaptive sampling phase.

Replaces the raw tuple returned by the old _sample_reads() with named, documented fields.

Parameters:

Name Type Description Default
read1_offset int | None

Dominant barcode offset in read 1.

required
read2_offset int | None

Dominant barcode offset in read 2, or None for single-end data.

required
read1_orient str | None

Dominant orientation ("forward"/"reverse") in read 1, or None if no barcodes found.

required
read2_orient str | None

Dominant orientation in read 2, or None.

required
valid_reads1 set[str]

Set of read 1 sequences that contained a library barcode at the dominant offset.

required
valid_reads2 set[str]

Set of read 2 sequences, or empty set.

required
need_swap bool

Whether read files need swapping (R1 file actually contains reverse reads).

required
num_chunks int

Number of chunks consumed before stopping.

required
diversity_satisfied bool

Whether diversity criteria were met before the file ended.

required

Examples:

>>> result = SamplingResult(
...     read1_offset=6, read2_offset=None,
...     read1_orient="forward", read2_orient=None,
...     valid_reads1=set(), valid_reads2=set(),
...     need_swap=False, num_chunks=3, diversity_satisfied=True,
... )

heuristic_discover

heuristic_discover(reads1: str, library: set[str], reads2: str | None = None, *, max_flank: int = 10, diversity_factor: float = 5.0, offset_dominance: float = 2.0, boundary_margin: float = 0.75) -> ReadStructure

Sample reads to learn barcode position, orientation, and flanks.

Reads library-sized chunks, slides k-mers to find barcode offsets and orientations, tracks diversity until saturation, then identifies flanking sequences.

Assumes symmetric paired-end barcode libraries where both R1 and R2 contain the same barcode in opposite orientations.

Parameters:

Name Type Description Default
reads1 str

Path to primary reads file (FASTQ or .reads).

required
library set[str]

Set of known barcode sequences.

required
reads2 str | None

Optional path to paired-end reads file.

None
max_flank int

Maximum flank length to search for. Defaults to 10.

10
diversity_factor float

Required oversampling ratio before declaring diversity saturation. Defaults to 5.0 (5x library size).

5.0
offset_dominance float

Required ratio of top offset count to runner-up before accepting convergence. Defaults to 2.0.

2.0
boundary_margin float

Safety margin for flank boundary detection. Multiplied by alphabet_size (4 for DNA) to get the threshold ratio. Defaults to 0.75 (effective 3x for DNA).

0.75

Returns:

Type Description
ReadStructure

A ReadStructure with learned layout.

Examples:

>>> structure = heuristic_discover("reads.fastq", barcodes)
Source code in src/seqchain/operations/discover/heuristic.py
def heuristic_discover(
    reads1: str,
    library: set[str],
    reads2: str | None = None,
    *,
    max_flank: int = 10,
    diversity_factor: float = 5.0,
    offset_dominance: float = 2.0,
    boundary_margin: float = 0.75,
) -> ReadStructure:
    """Sample reads to learn barcode position, orientation, and flanks.

    Reads library-sized chunks, slides k-mers to find barcode offsets
    and orientations, tracks diversity until saturation, then identifies
    flanking sequences.

    Assumes symmetric paired-end barcode libraries where both R1 and R2
    contain the same barcode in opposite orientations.

    Args:
        reads1: Path to primary reads file (FASTQ or .reads).
        library: Set of known barcode sequences.
        reads2: Optional path to paired-end reads file.
        max_flank: Maximum flank length to search for. Defaults to 10.
        diversity_factor: Required oversampling ratio before declaring
            diversity saturation. Defaults to 5.0 (5x library size).
        offset_dominance: Required ratio of top offset count to
            runner-up before accepting convergence. Defaults to 2.0.
        boundary_margin: Safety margin for flank boundary detection.
            Multiplied by alphabet_size (4 for DNA) to get the threshold
            ratio. Defaults to 0.75 (effective 3x for DNA).

    Returns:
        A `ReadStructure` with learned layout.

    Examples:
        >>> structure = heuristic_discover("reads.fastq", barcodes)  # doctest: +SKIP
    """
    bc_len = len(next(iter(library)))
    rev_library: set[str] = set()
    for bc in library:
        rev_library.add(reverse_complement(bc))
    is_paired = reads2 is not None

    # Phase 1: Sample to find offset, orientation, valid reads
    log.info("Sampling reads...")
    result = _sample_reads(
        reads1, reads2, library, rev_library, bc_len, is_paired,
        diversity_factor, offset_dominance,
    )
    orientation = (
        result.read1_orient if not result.need_swap
        else result.read2_orient
    )
    log.info(
        "Sampling: offset=%d, orientation=%s, %d chunks",
        result.read1_offset, orientation, result.num_chunks,
    )

    # Phase 2: Find flanking sequences from validated reads
    left_flank: str | None = None
    right_flank: str | None = None
    rev_left_flank: str | None = None
    rev_right_flank: str | None = None

    if result.valid_reads1:
        left_flank, right_flank = _find_flanks(
            result.valid_reads1, result.read1_offset, bc_len,
            max_flank, boundary_margin,
        )

    if result.valid_reads2:
        rev_left_flank, rev_right_flank = _find_flanks(
            result.valid_reads2, result.read2_offset, bc_len,
            max_flank, boundary_margin,
        )

    log.info("Flanks: %s | %s", left_flank, right_flank)

    # Phase 3: Validate flank complementarity (paired-end)
    if is_paired:
        validate_flank_complementarity(
            left_flank, right_flank, rev_left_flank, rev_right_flank,
        )

    return ReadStructure(
        barcode_offset=result.read1_offset,
        orientation=(
            result.read1_orient if not result.need_swap
            else result.read2_orient
        ),
        left_flank=left_flank,
        right_flank=right_flank,
        need_swap=result.need_swap,
        barcode_length=bc_len,
        chunks_sampled=result.num_chunks,
        diversity_satisfied=result.diversity_satisfied,
        reverse_offset=result.read2_offset,
        reverse_left_flank=rev_left_flank,
        reverse_right_flank=rev_right_flank,
    )

validate_flank_complementarity

validate_flank_complementarity(left_fwd: str | None, right_fwd: str | None, left_rev: str | None, right_rev: str | None) -> None

Validate that paired-end flanks are reverse complements.

In paired-end data, the forward left flank should be the reverse complement of the reverse right flank (and vice versa).

Parameters:

Name Type Description Default
left_fwd str | None

Forward left flank.

required
right_fwd str | None

Forward right flank.

required
left_rev str | None

Reverse left flank.

required
right_rev str | None

Reverse right flank.

required

Raises:

Type Description
ValueError

If complementarity check fails.

Source code in src/seqchain/operations/discover/heuristic.py
def validate_flank_complementarity(
    left_fwd: str | None,
    right_fwd: str | None,
    left_rev: str | None,
    right_rev: str | None,
) -> None:
    """Validate that paired-end flanks are reverse complements.

    In paired-end data, the forward left flank should be the reverse
    complement of the reverse right flank (and vice versa).

    Args:
        left_fwd: Forward left flank.
        right_fwd: Forward right flank.
        left_rev: Reverse left flank.
        right_rev: Reverse right flank.

    Raises:
        ValueError: If complementarity check fails.
    """
    errors: list[str] = []

    if left_fwd and right_rev:
        rc_right_rev = reverse_complement(right_rev)
        min_len = min(len(left_fwd), len(rc_right_rev))
        if left_fwd[-min_len:] != rc_right_rev[:min_len]:
            errors.append(
                f"Left flank '{left_fwd}' != revcomp(right_rev) '{rc_right_rev}'"
            )

    if right_fwd and left_rev:
        rc_left_rev = reverse_complement(left_rev)
        min_len = min(len(right_fwd), len(rc_left_rev))
        if right_fwd[:min_len] != rc_left_rev[:min_len]:
            errors.append(
                f"Right flank '{right_fwd}' != revcomp(left_rev) '{rc_left_rev}'"
            )

    if errors:
        raise ValueError(
            "Flank complementarity violation: " + "; ".join(errors)
        )