Tn-seq Pipeline¶

Transposon insertion profiling: Discover → Trim → Map → Count. Raw FASTQ in, IntervalTrack of insertion counts out. Supports Himar1 (TA sites) and Tn5 (any position), Bowtie or BWA alignment, and staggered primer libraries.

Validated against TRANSIT's TPP on Sassetti lab Himar1 data (Pearson r = 0.998).

TnSeqResult `dataclass` ¶

TnSeqResult(track: IntervalTrack, structure: TnSeqReadStructure, total_reads: int, junction_reads: int, mapped_reads: int, insertion_sites_hit: int, total_possible_sites: int, saturation: float)

Result of a Tn-seq processing pipeline run.

Contains an insertion-count Track plus summary statistics. The Track holds one Region per possible insertion site (e.g. every TA dinucleotide for Himar1), each with a read count in its score field and tags["count"].

Parameters:

Name	Type	Description	Default
`track`	`IntervalTrack`	IntervalTrack of insertion site Regions.	required
`structure`	`TnSeqReadStructure`	Read structure learned during discovery.	required
`total_reads`	`int`	Total reads in the junction file.	required
`junction_reads`	`int`	Reads containing the transposon terminus.	required
`mapped_reads`	`int`	Junction reads that aligned to the reference.	required
`insertion_sites_hit`	`int`	Sites with at least one read.	required
`total_possible_sites`	`int`	Total enumerable insertion sites.	required
`saturation`	`float`	Fraction of sites hit (sites_hit / total_possible).	required

Examples:

>>> result.saturation
0.42

TnSeqPipeline `dataclass` ¶

TnSeqPipeline(transposon: str = 'himar1', mapper: str = 'bowtie', mismatches: int = 1, threads: int = 1, min_trimmed_length: int = 20)

Tn-seq processing pipeline: Discover -> Trim -> Map -> Count.

Processes raw Tn-seq FASTQ files into an insertion-count Track. The Track contains one Region per possible insertion site with read counts. Export to .wig or .bed is handled by the I/O layer (write_wig(), write_bed()).

Parameters:

Name	Type	Description	Default
`transposon`	`str`	Preset name or literal terminus sequence. Defaults to `"himar1"`.	`'himar1'`
`mapper`	`str`	Alignment tool — `"bowtie"` or `"bwa"`. BWA uses `bwa aln` + `bwa samse` (same algorithm as TRANSIT's TPP). Defaults to `"bowtie"`.	`'bowtie'`
`mismatches`	`int`	Allowed mismatches for alignment. Defaults to `1`.	`1`
`threads`	`int`	Alignment threads. Defaults to `1`.	`1`
`min_trimmed_length`	`int`	Minimum length for a trimmed read to be mapped. Defaults to `20`.	`20`

Examples:

>>> pipeline = TnSeqPipeline(transposon="himar1")
>>> result = pipeline.run("R1.fq.gz", "R2.fq.gz", "ref.fna")

run ¶

run(reads1: str, reads2: str | None = None, reference: str = '') -> TnSeqResult

Run the full Tn-seq pipeline.

Parameters:

Name	Type	Description	Default
`reads1`	`str`	Path to R1 FASTQ file.	required
`reads2`	`str \| None`	Optional path to R2 FASTQ file.	`None`
`reference`	`str`	Path to reference genome FASTA.	`''`

Returns:

Type	Description
`TnSeqResult`	`TnSeqResult` with insertion-count Track and stats.

Raises:

Type	Description
`RuntimeError`	If the selected mapper is not on `$PATH`.
`FileNotFoundError`	If input files don't exist.
`ValueError`	If mapper is not `"bowtie"` or `"bwa"`.

Examples:

>>> result = TnSeqPipeline().run("R1.fq.gz", "R2.fq.gz", "H37Rv.fna")

Source code in src/seqchain/recipes/_draft/tnseq.py

def run(
    self,
    reads1: str,
    reads2: str | None = None,
    reference: str = "",
) -> TnSeqResult:
    """Run the full Tn-seq pipeline.

    Args:
        reads1: Path to R1 FASTQ file.
        reads2: Optional path to R2 FASTQ file.
        reference: Path to reference genome FASTA.

    Returns:
        `TnSeqResult` with insertion-count Track and stats.

    Raises:
        RuntimeError: If the selected mapper is not on ``$PATH``.
        FileNotFoundError: If input files don't exist.
        ValueError: If *mapper* is not ``"bowtie"`` or ``"bwa"``.

    Examples:
        >>> result = TnSeqPipeline().run("R1.fq.gz", "R2.fq.gz", "H37Rv.fna")
    """
    if self.mapper == "bowtie":
        for tool in ("bowtie", "bowtie-build"):
            if not shutil.which(tool):
                raise RuntimeError(
                    f"{tool} not found on $PATH. "
                    "Install via: mamba install bowtie"
                )
    elif self.mapper == "bwa":
        if not shutil.which("bwa"):
            raise RuntimeError(
                "bwa not found on $PATH. Install via: mamba install bwa"
            )
    else:
        raise ValueError(
            f"Unknown mapper: {self.mapper!r}. Use 'bowtie' or 'bwa'."
        )

    # Step 1: Discover read structure
    log.info("Step 1: Discovering read structure...")
    structure = tnseq_discover(reads1, reads2, transposon=self.transposon)
    log.info(
        "Discovery: junction=%s, terminus=%s, trim=%d, swap=%s",
        structure.junction_read,
        structure.transposon_terminus,
        structure.trim_position,
        structure.need_swap,
    )

    # Determine junction file path
    junction_path = reads2 if structure.need_swap else reads1
    terminus = structure.transposon_terminus
    rc_terminus = reverse_complement(terminus)
    insertion_motif = structure.insertion_motif

    # Read reference genome
    log.info("Loading reference genome from %s...", reference)
    ref_sequences = load_fasta(reference).sequences
    total_bp = sum(len(s) for s in ref_sequences.values())
    log.info(
        "Reference: %d sequence(s), %d bp", len(ref_sequences), total_bp,
    )

    with tempfile.TemporaryDirectory(prefix="seqchain_tnseq_") as tmpdir:
        # Step 2: Trim junction reads (per-read terminus search)
        log.info("Step 2: Trimming junction reads...")
        trimmed_fasta = os.path.join(tmpdir, "trimmed.fa")
        total_reads, junction_reads = self._trim_reads(
            junction_path, terminus, rc_terminus, trimmed_fasta,
        )
        log.info(
            "Trimming: %d total, %d junction (%.1f%%)",
            total_reads,
            junction_reads,
            100 * junction_reads / total_reads if total_reads > 0 else 0,
        )

        if junction_reads == 0:
            log.warning("No junction reads found — returning empty track")
            track, total_sites, _ = self._build_track(
                ref_sequences, Counter(), insertion_motif,
            )
            return TnSeqResult(
                track=track,
                structure=structure,
                total_reads=total_reads,
                junction_reads=0,
                mapped_reads=0,
                insertion_sites_hit=0,
                total_possible_sites=total_sites,
                saturation=0.0,
            )

        # Step 3: Map trimmed reads to reference
        log.info("Step 3: Mapping trimmed reads to reference (%s)...",
                 self.mapper)
        sam_file = os.path.join(tmpdir, "aligned.sam")
        ref_for_align = prepare_reference(reference, tmpdir)
        if self.mapper == "bwa":
            indexed_ref = self._bwa_index(ref_for_align, tmpdir)
            self._bwa_align(indexed_ref, trimmed_fasta, sam_file)
        else:
            index_prefix = os.path.join(tmpdir, "ref")
            build_bowtie_index(ref_for_align, index_prefix)
            run_bowtie_alignment(
                index_prefix,
                trimmed_fasta,
                sam_file,
                mismatches=self.mismatches,
                threads=self.threads,
                query_is_fasta=True,
            )

        # Step 4: Count insertions per site
        log.info("Step 4: Counting insertions...")
        motif_len = len(insertion_motif) if insertion_motif else 0
        insertion_counts, mapped_reads = self._count_insertions(
            sam_file, motif_len,
        )
        log.info("Mapping: %d reads mapped", mapped_reads)

    # Build Track with all possible insertion sites
    track, total_sites, sites_hit = self._build_track(
        ref_sequences, insertion_counts, insertion_motif,
    )

    saturation = sites_hit / total_sites if total_sites > 0 else 0.0
    log.info(
        "Result: %d/%d sites hit (%.1f%% saturation), %d mapped reads",
        sites_hit, total_sites, 100 * saturation, mapped_reads,
    )

    return TnSeqResult(
        track=track,
        structure=structure,
        total_reads=total_reads,
        junction_reads=junction_reads,
        mapped_reads=mapped_reads,
        insertion_sites_hit=sites_hit,
        total_possible_sites=total_sites,
        saturation=saturation,
    )

Tn-seq Pipeline¶

TnSeqResult dataclass ¶

TnSeqPipeline dataclass ¶

run ¶

TnSeqResult `dataclass` ¶

TnSeqPipeline `dataclass` ¶