Skip to content

Tn-seq Discovery

Auto-detect transposon junction structure in Tn-seq reads. Identifies the transposon terminus sequence, determines genomic insert offset, and detects staggered primer libraries.

TransposonPreset dataclass

TransposonPreset(name: str, terminus: str, insertion_motif: str | None = None)

Transposon system definition loaded from YAML.

Parameters:

Name Type Description Default
name str

Human-readable transposon name.

required
terminus str

Inverted-repeat terminus sequence (DNA string).

required
insertion_motif str | None

Target-site duplication motif (e.g. "TA" for Himar1), or None for transposons without a fixed target motif (e.g. Tn5).

None

Examples:

>>> preset = TransposonPreset("Himar1", "TGTTA", "TA")
>>> preset.insertion_motif
'TA'

TnSeqReadStructure dataclass

TnSeqReadStructure(junction_read: str, transposon_terminus: str, terminus_offsets: dict[int, int], dominant_offset: int | None, trim_position: int, other_read_prefix: str | None, other_read_role: str, need_swap: bool, chunks_sampled: int, diversity_satisfied: bool, insertion_motif: str | None = None)

What discovery learns about Tn-seq reads.

Unlike barcode ReadStructure where R1 and R2 are symmetric (both carry the same barcode in opposite orientations), Tn-seq reads are asymmetric. One read contains the transposon-genome junction. The other may contain adapters, barcodes for deduplication, or purely genomic sequence.

Parameters:

Name Type Description Default
junction_read str

"read1" or "read2" — which positional argument to tnseq_discover() contains the transposon-genome junctions.

required
transposon_terminus str

The terminus sequence used for detection.

required
terminus_offsets dict[int, int]

Mapping of offset position to observation count. May contain a single offset (fixed primer length) or multiple offsets (staggered primers for Illumina diversity). The stagger comes from library prep, not the transposon system.

required
dominant_offset int | None

Most common offset when it dominates the runner-up (fixed-offset library), or None when no single offset dominates (staggered library) or no terminus was found.

required
trim_position int

Where to trim to get the genomic portion. For fixed-offset libraries: dominant_offset + len(terminus). For staggered libraries: max(common_offsets) + len(terminus) where common offsets are those above a 1% noise floor.

required
other_read_prefix str | None

Constant prefix in the non-junction read, or None if none detected.

required
other_read_role str

"barcode", "genomic", or "unknown".

required
need_swap bool

Whether the file passed as reads1 actually contains the non-junction reads (i.e. the junction read was passed in the reads2 position).

required
chunks_sampled int

Number of chunks consumed during discovery.

required
diversity_satisfied bool

Whether sampling saw enough reads to be confident in the result.

required
insertion_motif str | None

Target-site duplication motif from the transposon preset (e.g. "TA" for Himar1), or None.

None

Examples:

>>> result.junction_read
'read1'
>>> result.transposon_terminus
'TGTTA'

tnseq_discover

tnseq_discover(reads1: str, reads2: str | None = None, *, transposon: str | TransposonPreset = 'himar1', max_flank: int = 30, alphabet_size: int = 4, safety_margin: float = 0.75, diversity_factor: float = 5.0, offset_dominance: float = 2.0) -> TnSeqReadStructure

Sample reads to learn Tn-seq read structure.

The transposon argument can be a preset name ("himar1", "tn5"), a TransposonPreset object, or a literal terminus DNA sequence. If the string matches a preset name (case-insensitive), the preset is loaded from YAML. Otherwise the string is treated as a raw terminus sequence with no insertion motif.

Parameters:

Name Type Description Default
reads1 str

Path to primary reads file (FASTQ).

required
reads2 str | None

Optional path to paired-end reads file.

None
transposon str | TransposonPreset

Preset name, TransposonPreset, or literal terminus sequence. Defaults to "himar1".

'himar1'
max_flank int

Maximum prefix length to search for in the non-junction read. Defaults to 30.

30
alphabet_size int

Size of the sequence alphabet. Defaults to 4 (DNA).

4
safety_margin float

Safety margin for signal boundary detection. Defaults to 0.75.

0.75
diversity_factor float

Required oversampling ratio before declaring saturation. Defaults to 5.0.

5.0
offset_dominance float

Required ratio of top offset count to runner-up. Defaults to 2.0.

2.0

Returns:

Type Description
TnSeqReadStructure

A TnSeqReadStructure describing the learned layout.

Examples:

>>> result = tnseq_discover("R1.fastq", "R2.fastq")
Source code in src/seqchain/operations/discover/tnseq.py
def tnseq_discover(
    reads1: str,
    reads2: str | None = None,
    *,
    transposon: str | TransposonPreset = "himar1",
    max_flank: int = 30,
    alphabet_size: int = 4,
    safety_margin: float = 0.75,
    diversity_factor: float = 5.0,
    offset_dominance: float = 2.0,
) -> TnSeqReadStructure:
    """Sample reads to learn Tn-seq read structure.

    The *transposon* argument can be a preset name (``"himar1"``,
    ``"tn5"``), a `TransposonPreset` object, or a literal
    terminus DNA sequence. If the string matches a preset name
    (case-insensitive), the preset is loaded from YAML. Otherwise
    the string is treated as a raw terminus sequence with no
    insertion motif.

    Args:
        reads1: Path to primary reads file (FASTQ).
        reads2: Optional path to paired-end reads file.
        transposon: Preset name, `TransposonPreset`, or
            literal terminus sequence. Defaults to ``"himar1"``.
        max_flank: Maximum prefix length to search for in the
            non-junction read. Defaults to 30.
        alphabet_size: Size of the sequence alphabet. Defaults to 4
            (DNA).
        safety_margin: Safety margin for signal boundary detection.
            Defaults to 0.75.
        diversity_factor: Required oversampling ratio before declaring
            saturation. Defaults to 5.0.
        offset_dominance: Required ratio of top offset count to
            runner-up. Defaults to 2.0.

    Returns:
        A `TnSeqReadStructure` describing the learned layout.

    Examples:
        >>> result = tnseq_discover("R1.fastq", "R2.fastq")  # doctest: +SKIP
    """
    preset = _resolve_preset(transposon)
    terminus = preset.terminus
    rc_terminus = reverse_complement(terminus)
    is_paired = reads2 is not None

    # Phase 1: Classify reads — which file has junctions?
    log.info("Phase 1: Classifying reads for terminus %s...", terminus)
    r1_offsets, r2_offsets, chunks = _classify_reads(
        reads1, reads2, terminus, rc_terminus, is_paired,
        diversity_factor, offset_dominance,
    )

    r1_count = sum(r1_offsets.values())
    r2_count = sum(r2_offsets.values()) if is_paired else 0

    # Determine which file is the junction read
    if is_paired and r2_count > r1_count and is_dominant(
        r2_count, r1_count, dominance=offset_dominance,
    ):
        junction_read = "read2"
        need_swap = True
        junction_offsets = r2_offsets
        other_file = reads1
    else:
        junction_read = "read1"
        need_swap = False
        junction_offsets = r1_offsets
        other_file = reads2

    log.info(
        "Junction read: %s (%d hits vs %d), need_swap=%s",
        junction_read, max(r1_count, r2_count),
        min(r1_count, r2_count), need_swap,
    )

    # Phase 2: Learn junction structure
    log.info("Phase 2: Learning junction structure...")
    dominant_offset = _find_dominant_offset(junction_offsets, offset_dominance)
    if dominant_offset is not None:
        trim_position = dominant_offset + len(terminus)
    elif junction_offsets:
        total_junctions = sum(junction_offsets.values())
        noise_floor = total_junctions * 0.01
        best_offset = max(
            (o for o, c in junction_offsets.items() if c >= noise_floor),
            default=None,
        )
        trim_position = best_offset + len(terminus) if best_offset is not None else 0
    else:
        trim_position = 0

    # Diversity: did we see enough reads?
    total_hits = r1_count + r2_count
    diversity_satisfied = total_hits >= diversity_factor * 100

    # Phase 3: Characterize other read
    other_prefix: str | None = None
    other_role = "unknown"
    if is_paired and other_file is not None:
        log.info("Phase 3: Characterizing non-junction read...")
        other_prefix = _find_constant_prefix(
            other_file, alphabet_size=alphabet_size,
            safety_margin=safety_margin,
        )
        if other_prefix is not None:
            if len(other_prefix) < 15:
                other_role = "barcode"
            else:
                other_role = "unknown"
        else:
            other_role = "genomic"
    elif not is_paired:
        other_role = "genomic"

    log.info(
        "Result: offset=%s, trim=%d, other_prefix=%s (%s)",
        dominant_offset, trim_position,
        repr(other_prefix[:20]) if other_prefix else None, other_role,
    )

    return TnSeqReadStructure(
        junction_read=junction_read,
        transposon_terminus=terminus,
        terminus_offsets=dict(junction_offsets),
        dominant_offset=dominant_offset,
        trim_position=trim_position,
        other_read_prefix=other_prefix,
        other_read_role=other_role,
        need_swap=need_swap,
        chunks_sampled=chunks,
        diversity_satisfied=diversity_satisfied,
        insertion_motif=preset.insertion_motif,
    )