Tn-seq Discovery¶

Auto-detect transposon junction structure in Tn-seq reads. Identifies the transposon terminus sequence, determines genomic insert offset, and detects staggered primer libraries.

TransposonPreset `dataclass` ¶

TransposonPreset(name: str, terminus: str, insertion_motif: str | None = None)

Transposon system definition loaded from YAML.

Parameters:

Name	Type	Description	Default
`name`	`str`	Human-readable transposon name.	required
`terminus`	`str`	Inverted-repeat terminus sequence (DNA string).	required
`insertion_motif`	`str \| None`	Target-site duplication motif (e.g. `"TA"` for Himar1), or `None` for transposons without a fixed target motif (e.g. Tn5).	`None`

Examples:

>>> preset = TransposonPreset("Himar1", "TGTTA", "TA")
>>> preset.insertion_motif
'TA'

TnSeqReadStructure `dataclass` ¶

TnSeqReadStructure(junction_read: str, transposon_terminus: str, terminus_offsets: dict[int, int], dominant_offset: int | None, trim_position: int, other_read_prefix: str | None, other_read_role: str, need_swap: bool, chunks_sampled: int, diversity_satisfied: bool, insertion_motif: str | None = None)

What discovery learns about Tn-seq reads.

Unlike barcode ReadStructure where R1 and R2 are symmetric (both carry the same barcode in opposite orientations), Tn-seq reads are asymmetric. One read contains the transposon-genome junction. The other may contain adapters, barcodes for deduplication, or purely genomic sequence.

Parameters:

Name	Type	Description	Default
`junction_read`	`str`	`"read1"` or `"read2"` — which positional argument to `tnseq_discover()` contains the transposon-genome junctions.	required
`transposon_terminus`	`str`	The terminus sequence used for detection.	required
`terminus_offsets`	`dict[int, int]`	Mapping of offset position to observation count. May contain a single offset (fixed primer length) or multiple offsets (staggered primers for Illumina diversity). The stagger comes from library prep, not the transposon system.	required
`dominant_offset`	`int \| None`	Most common offset when it dominates the runner-up (fixed-offset library), or `None` when no single offset dominates (staggered library) or no terminus was found.	required
`trim_position`	`int`	Where to trim to get the genomic portion. For fixed-offset libraries: `dominant_offset + len(terminus)`. For staggered libraries: `max(common_offsets) + len(terminus)` where common offsets are those above a 1% noise floor.	required
`other_read_prefix`	`str \| None`	Constant prefix in the non-junction read, or `None` if none detected.	required
`other_read_role`	`str`	`"barcode"`, `"genomic"`, or `"unknown"`.	required
`need_swap`	`bool`	Whether the file passed as `reads1` actually contains the non-junction reads (i.e. the junction read was passed in the `reads2` position).	required
`chunks_sampled`	`int`	Number of chunks consumed during discovery.	required
`diversity_satisfied`	`bool`	Whether sampling saw enough reads to be confident in the result.	required
`insertion_motif`	`str \| None`	Target-site duplication motif from the transposon preset (e.g. `"TA"` for Himar1), or `None`.	`None`

Examples:

>>> result.junction_read
'read1'
>>> result.transposon_terminus
'TGTTA'

tnseq_discover ¶

tnseq_discover(reads1: str, reads2: str | None = None, *, transposon: str | TransposonPreset = 'himar1', max_flank: int = 30, alphabet_size: int = 4, safety_margin: float = 0.75, diversity_factor: float = 5.0, offset_dominance: float = 2.0) -> TnSeqReadStructure

Sample reads to learn Tn-seq read structure.

The transposon argument can be a preset name ("himar1", "tn5"), a TransposonPreset object, or a literal terminus DNA sequence. If the string matches a preset name (case-insensitive), the preset is loaded from YAML. Otherwise the string is treated as a raw terminus sequence with no insertion motif.

Parameters:

Name	Type	Description	Default
`reads1`	`str`	Path to primary reads file (FASTQ).	required
`reads2`	`str \| None`	Optional path to paired-end reads file.	`None`
`transposon`	`str \| TransposonPreset`	Preset name, `TransposonPreset`, or literal terminus sequence. Defaults to `"himar1"`.	`'himar1'`
`max_flank`	`int`	Maximum prefix length to search for in the non-junction read. Defaults to 30.	`30`
`alphabet_size`	`int`	Size of the sequence alphabet. Defaults to 4 (DNA).	`4`
`safety_margin`	`float`	Safety margin for signal boundary detection. Defaults to 0.75.	`0.75`
`diversity_factor`	`float`	Required oversampling ratio before declaring saturation. Defaults to 5.0.	`5.0`
`offset_dominance`	`float`	Required ratio of top offset count to runner-up. Defaults to 2.0.	`2.0`

Returns:

Type	Description
`TnSeqReadStructure`	A `TnSeqReadStructure` describing the learned layout.

Examples:

>>> result = tnseq_discover("R1.fastq", "R2.fastq")

Source code in src/seqchain/operations/discover/tnseq.py

def tnseq_discover(
    reads1: str,
    reads2: str | None = None,
    *,
    transposon: str | TransposonPreset = "himar1",
    max_flank: int = 30,
    alphabet_size: int = 4,
    safety_margin: float = 0.75,
    diversity_factor: float = 5.0,
    offset_dominance: float = 2.0,
) -> TnSeqReadStructure:
    """Sample reads to learn Tn-seq read structure.

    The *transposon* argument can be a preset name (``"himar1"``,
    ``"tn5"``), a `TransposonPreset` object, or a literal
    terminus DNA sequence. If the string matches a preset name
    (case-insensitive), the preset is loaded from YAML. Otherwise
    the string is treated as a raw terminus sequence with no
    insertion motif.

    Args:
        reads1: Path to primary reads file (FASTQ).
        reads2: Optional path to paired-end reads file.
        transposon: Preset name, `TransposonPreset`, or
            literal terminus sequence. Defaults to ``"himar1"``.
        max_flank: Maximum prefix length to search for in the
            non-junction read. Defaults to 30.
        alphabet_size: Size of the sequence alphabet. Defaults to 4
            (DNA).
        safety_margin: Safety margin for signal boundary detection.
            Defaults to 0.75.
        diversity_factor: Required oversampling ratio before declaring
            saturation. Defaults to 5.0.
        offset_dominance: Required ratio of top offset count to
            runner-up. Defaults to 2.0.

    Returns:
        A `TnSeqReadStructure` describing the learned layout.

    Examples:
        >>> result = tnseq_discover("R1.fastq", "R2.fastq")  # doctest: +SKIP
    """
    preset = _resolve_preset(transposon)
    terminus = preset.terminus
    rc_terminus = reverse_complement(terminus)
    is_paired = reads2 is not None

    # Phase 1: Classify reads — which file has junctions?
    log.info("Phase 1: Classifying reads for terminus %s...", terminus)
    r1_offsets, r2_offsets, chunks = _classify_reads(
        reads1, reads2, terminus, rc_terminus, is_paired,
        diversity_factor, offset_dominance,
    )

    r1_count = sum(r1_offsets.values())
    r2_count = sum(r2_offsets.values()) if is_paired else 0

    # Determine which file is the junction read
    if is_paired and r2_count > r1_count and is_dominant(
        r2_count, r1_count, dominance=offset_dominance,
    ):
        junction_read = "read2"
        need_swap = True
        junction_offsets = r2_offsets
        other_file = reads1
    else:
        junction_read = "read1"
        need_swap = False
        junction_offsets = r1_offsets
        other_file = reads2

    log.info(
        "Junction read: %s (%d hits vs %d), need_swap=%s",
        junction_read, max(r1_count, r2_count),
        min(r1_count, r2_count), need_swap,
    )

    # Phase 2: Learn junction structure
    log.info("Phase 2: Learning junction structure...")
    dominant_offset = _find_dominant_offset(junction_offsets, offset_dominance)
    if dominant_offset is not None:
        trim_position = dominant_offset + len(terminus)
    elif junction_offsets:
        total_junctions = sum(junction_offsets.values())
        noise_floor = total_junctions * 0.01
        best_offset = max(
            (o for o, c in junction_offsets.items() if c >= noise_floor),
            default=None,
        )
        trim_position = best_offset + len(terminus) if best_offset is not None else 0
    else:
        trim_position = 0

    # Diversity: did we see enough reads?
    total_hits = r1_count + r2_count
    diversity_satisfied = total_hits >= diversity_factor * 100

    # Phase 3: Characterize other read
    other_prefix: str | None = None
    other_role = "unknown"
    if is_paired and other_file is not None:
        log.info("Phase 3: Characterizing non-junction read...")
        other_prefix = _find_constant_prefix(
            other_file, alphabet_size=alphabet_size,
            safety_margin=safety_margin,
        )
        if other_prefix is not None:
            if len(other_prefix) < 15:
                other_role = "barcode"
            else:
                other_role = "unknown"
        else:
            other_role = "genomic"
    elif not is_paired:
        other_role = "genomic"

    log.info(
        "Result: offset=%s, trim=%d, other_prefix=%s (%s)",
        dominant_offset, trim_position,
        repr(other_prefix[:20]) if other_prefix else None, other_role,
    )

    return TnSeqReadStructure(
        junction_read=junction_read,
        transposon_terminus=terminus,
        terminus_offsets=dict(junction_offsets),
        dominant_offset=dominant_offset,
        trim_position=trim_position,
        other_read_prefix=other_prefix,
        other_read_role=other_role,
        need_swap=need_swap,
        chunks_sampled=chunks,
        diversity_satisfied=diversity_satisfied,
        insertion_motif=preset.insertion_motif,
    )

Tn-seq Discovery¶

TransposonPreset dataclass ¶

TnSeqReadStructure dataclass ¶

tnseq_discover ¶

TransposonPreset `dataclass` ¶

TnSeqReadStructure `dataclass` ¶