CRISPR Guide Design

Complete guide-design workflow driven by nuclease presets. Four steps: scan (find PAM sites), interpret (extract spacers), score (off-target alignment), and annotate (gene overlap).

from seqchain.io.genome import load_genbank
from seqchain.recipes import load_preset
from seqchain.recipes.crispr import design_guides

genome = load_genbank("yeast.gb")
guides = design_guides(genome, load_preset("spcas9"))

Each step is individually callable — design_guides() is the default composition.
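
To make the streaming composition concrete, here is a minimal, self-contained sketch of four chained generators. The stage functions (`scan`, `interpret`, `score`, `annotate`) and `design` are toy stand-ins written for this illustration, not seqchain's implementations; only the chaining pattern and its laziness are the point.

```python
from typing import Iterable, Iterator

calls: list[str] = []  # records scan activity so laziness is observable


def scan(seq: str) -> Iterator[int]:
    """Toy scan: yield the index of each 'GG' (the GG of an NGG PAM)."""
    for i in range(1, len(seq) - 1):
        if seq[i:i + 2] == "GG":
            calls.append(f"scan@{i}")
            yield i


def interpret(hits: Iterable[int], seq: str, spacer_len: int) -> Iterator[dict]:
    """Toy interpret: take the spacer_len bases just before the PAM's N."""
    for gg in hits:
        start = gg - 1 - spacer_len
        if start < 0:  # too close to the sequence start: skip silently
            continue
        yield {"start": start, "end": gg - 1, "spacer": seq[start:gg - 1]}


def score(guides: Iterable[dict], seq: str) -> Iterator[dict]:
    """Toy score: count other exact (non-overlapping) spacer occurrences."""
    for g in guides:
        yield {**g, "off_targets": seq.count(g["spacer"]) - 1}


def annotate(guides: Iterable[dict], genes: list[tuple[str, int, int]]) -> Iterator[dict]:
    """Toy annotate: attach names of genes overlapping the guide interval."""
    for g in guides:
        names = [n for n, s, e in genes if s < g["end"] and g["start"] < e]
        yield {**g, "genes": names}


def design(seq: str, spacer_len: int, genes: list[tuple[str, int, int]]) -> Iterator[dict]:
    """Default composition: Scan -> Interpret -> Score -> Annotate."""
    return annotate(score(interpret(scan(seq), seq, spacer_len), seq), genes)


SEQ = "ACGTACGTAGGTTTTCGG"
pipeline = design(SEQ, 4, [("geneA", 0, 10)])

first = next(pipeline)          # pulls exactly one guide through all stages
scanned_so_far = list(calls)    # only the first PAM has been scanned
rest = list(pipeline)           # drain the remaining guides
```

Pulling one item with `next()` advances every stage by a single guide, which is why the second PAM is not scanned until the pipeline is drained.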

CRISPRPreset dataclass

CRISPRPreset(name: str, pam: str, spacer_len: int, pam_direction: str, description: str = '', align_pam_with_spacer: bool = False, seed_alignment: bool = False, seed_length: int | None = None, report_all_hits: bool = False, max_reported_hits: int | None = None, mismatches: int = 0)

Configuration for a CRISPR nuclease system.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | str | Nuclease name (e.g. "SpCas9"). | required |
| pam | str | PAM pattern in IUPAC notation (e.g. "NGG"). | required |
| spacer_len | int | Guide spacer length in bp. | required |
| pam_direction | str | "downstream" (Cas9-style) or "upstream" (Cas12a-style). | required |
| description | str | Human-readable description. | '' |
| align_pam_with_spacer | bool | If True, include the PAM in the bowtie query sequence (spy alignment strategy). | False |
| seed_alignment | bool | If True, use seed-based (-n) alignment instead of sequence-space (-v) alignment. | False |
| seed_length | int \| None | Seed length for -n mode. Defaults to None. | None |
| report_all_hits | bool | If True, report all alignments (-a) instead of just the best (-k 1). | False |
| max_reported_hits | int \| None | Maximum alignments to report per query (-m flag). Defaults to None (no limit). | None |
| mismatches | int | Number of allowed mismatches. Defaults to 0. | 0 |

Examples:

>>> CRISPRPreset("SpCas9", "NGG", 20, "downstream", "Cas9")
CRISPRPreset(name='SpCas9', pam='NGG', spacer_len=20, pam_direction='downstream', description='Cas9', align_pam_with_spacer=False, seed_alignment=False, seed_length=None, report_all_hits=False, max_reported_hits=None, mismatches=0)
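
The alignment fields of CRISPRPreset are described in terms of bowtie flags. As a rough sketch, assuming each field is translated into a command-line option independently, the mapping could look like the following. `AlignOptions` and `bowtie_flags` are hypothetical names for illustration, not seqchain APIs. `-l` is bowtie's seed-length option and `-p` its thread count; how seqchain actually assembles its bowtie call is not shown on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlignOptions:
    """Hypothetical subset of CRISPRPreset's alignment fields."""
    mismatches: int = 0
    seed_alignment: bool = False
    seed_length: Optional[int] = None
    report_all_hits: bool = False
    max_reported_hits: Optional[int] = None


def bowtie_flags(opts: AlignOptions, threads: int = 1) -> list[str]:
    """Translate alignment options into bowtie command-line flags (assumed mapping)."""
    flags: list[str] = []
    if opts.seed_alignment:
        # Seed-based mode: mismatches are counted within the seed.
        flags += ["-n", str(opts.mismatches)]
        if opts.seed_length is not None:
            flags += ["-l", str(opts.seed_length)]
    else:
        # End-to-end mode: mismatches are counted over the whole query.
        flags += ["-v", str(opts.mismatches)]
    # Report every alignment, or only one per query.
    flags += ["-a"] if opts.report_all_hits else ["-k", "1"]
    if opts.max_reported_hits is not None:
        flags += ["-m", str(opts.max_reported_hits)]
    flags += ["-p", str(threads)]
    return flags


default_flags = bowtie_flags(AlignOptions())
seeded_flags = bowtie_flags(
    AlignOptions(mismatches=2, seed_alignment=True, seed_length=12,
                 report_all_hits=True, max_reported_hits=5),
    threads=4,
)
```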

design_guides

design_guides(genome: Genome, preset: CRISPRPreset, *, chrom: str | None = None, threads: int = 1) -> Iterator[Region]

Run the full CRISPR guide design pipeline.

Pure streaming pipe: Scan → Interpret → Score → Annotate. Batching inside score_off_targets is the only materialization, and it happens inside that operation — not here.

The annotation step is a dual-sorted sweep-line via annotate_with_genes — O(genes_in_window) memory, never O(genome).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| genome | Genome | A loaded Genome. | required |
| preset | CRISPRPreset | Nuclease preset defining PAM, spacer length, etc. | required |
| chrom | str \| None | Chromosome to scan. Defaults to all chromosomes. | None |
| threads | int | Number of bowtie threads for off-target scoring. | 1 |

Yields:

| Type | Description |
| --- | --- |
| Region | Fully annotated guide Regions. |

Examples:

>>> from seqchain.io.genome import load_genbank
>>> from seqchain.recipes import load_preset
>>> genome = load_genbank("yeast.gb")
>>> guides = design_guides(genome, load_preset("spcas9"))
Source code in src/seqchain/recipes/crispr.py
def design_guides(
    genome: Genome,
    preset: CRISPRPreset,
    *,
    chrom: str | None = None,
    threads: int = 1,
) -> Iterator[Region]:
    """Run the full CRISPR guide design pipeline.

    Pure streaming pipe: **Scan → Interpret → Score → Annotate**.
    Batching inside ``score_off_targets`` is the only materialization,
    and it happens inside that operation — not here.

    The annotation step is a dual-sorted sweep-line via
    ``annotate_with_genes`` — O(genes_in_window) memory, never O(genome).

    Args:
        genome: A loaded `Genome`.
        preset: Nuclease preset defining PAM, spacer length, etc.
        chrom: Chromosome to scan.  Defaults to all chromosomes.
        threads: Number of bowtie threads for off-target scoring.

    Yields:
        Fully annotated guide Regions.

    Examples:
        >>> from seqchain.io.genome import load_genbank
        >>> from seqchain.recipes import load_preset
        >>> genome = load_genbank("yeast.gb")  # doctest: +SKIP
        >>> guides = design_guides(genome, load_preset("spcas9"))  # doctest: +SKIP
    """
    from seqchain.operations.score.off_target import score_off_targets
    from seqchain.recipes.annotate import annotate_with_genes

    sequences = {chrom: genome.sequences[chrom]} if chrom else genome.sequences

    # Scan → Interpret → Score (all generators)
    hits_stream = regex_map(sequences, preset.pam, topologies=genome.topologies)
    guides_stream = interpret_guides(hits_stream, genome.sequences, preset, topologies=genome.topologies)
    scored_stream = score_off_targets(
        guides_stream, genome.sequences, preset.pam,
        spacer_len=preset.spacer_len,
        pam_direction=preset.pam_direction,
        mismatches=preset.mismatches,
        topologies=genome.topologies,
        threads=threads,
    )

    # Annotate via dual-sorted sweep-line — no IntervalIndex.build() needed
    yield from annotate_with_genes(scored_stream, genome)
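
The dual-sorted sweep-line idea mentioned in the docstring can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of `annotate_with_genes` (which is not shown here): both inputs cover a single chromosome, both are sorted by start, and intervals are half-open `(start, end, name)` tuples. The function name `sweep_annotate` is hypothetical.

```python
from typing import Iterable, Iterator

Interval = tuple[int, int, str]  # (start, end, name), half-open


def sweep_annotate(
    guides: Iterable[Interval], genes: Iterable[Interval]
) -> Iterator[tuple[str, list[str]]]:
    """Yield (guide_name, overlapping_gene_names) using one pass over each input."""
    gene_iter = iter(genes)
    pending = next(gene_iter, None)   # next gene not yet admitted to the window
    active: list[Interval] = []       # genes that may still overlap upcoming guides

    for g_start, g_end, g_name in guides:
        # Admit every gene that starts before this guide ends.
        while pending is not None and pending[0] < g_end:
            active.append(pending)
            pending = next(gene_iter, None)
        # Evict genes ending at or before this guide's start; since guides are
        # sorted by start, they cannot overlap any later guide either.
        active = [gene for gene in active if gene[1] > g_start]
        # Remaining genes overlap if they also start before the guide ends.
        yield g_name, [gene[2] for gene in active if gene[0] < g_end]


genes = [(0, 10, "A"), (5, 20, "B"), (30, 40, "C")]
guides = [(2, 6, "g1"), (12, 16, "g2"), (22, 26, "g3"), (35, 39, "g4")]
annotated = list(sweep_annotate(guides, genes))
```

Only the `active` list is held in memory, so the footprint tracks the number of genes near the current guide rather than the size of the genome.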

interpret_guides

interpret_guides(regions: Iterable[Region], sequences: dict[str, str], preset: CRISPRPreset, *, topologies: dict[str, str] | None = None) -> Iterator[Region]

Transform PAM hits into domain-meaningful guide Regions.

Yields guide Regions with full-footprint coordinates, a coordinate hash as name, and tags including guide_id, pam_start, pam_end, spacer, and guide_seq. PAM hits at chromosome boundaries (those for which interpret_pam_hit returns None) are silently skipped.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| regions | Iterable[Region] | PAM hit Regions (output of a regex_map). | required |
| sequences | dict[str, str] | Mapping of chromosome name to DNA string. | required |
| preset | CRISPRPreset | Nuclease preset defining spacer length and PAM direction. | required |
| topologies | dict[str, str] \| None | Mapping of chromosome name to "circular" or "linear". Defaults to linear for all chromosomes. | None |

Yields:

| Type | Description |
| --- | --- |
| Region | Guide Regions. |

Examples:

>>> from seqchain.recipes import load_preset
>>> preset = load_preset("spcas9")
>>> interpreted = list(interpret_guides(hits, {"chr": seq}, preset))
Source code in src/seqchain/recipes/crispr.py
def interpret_guides(
    regions: Iterable[Region],
    sequences: dict[str, str],
    preset: CRISPRPreset,
    *,
    topologies: dict[str, str] | None = None,
) -> Iterator[Region]:
    """Transform PAM hits into domain-meaningful guide Regions.

    Yields guide Regions with full footprint coords, coordinate hash
    as ``name``, and tags including ``guide_id``, ``pam_start``,
    ``pam_end``, ``spacer``, ``guide_seq``. PAM hits at chromosome
    boundaries are silently skipped.

    Args:
        regions: PAM hit Regions (output of a `regex_map`).
        sequences: Mapping of chromosome name to DNA string.
        preset: Nuclease preset defining spacer length and PAM direction.
        topologies: Mapping of chromosome name to ``"circular"`` or
            ``"linear"``. Defaults to linear for all chromosomes.

    Yields:
        Guide Regions.

    Examples:
        >>> from seqchain.recipes import load_preset
        >>> preset = load_preset("spcas9")
        >>> interpreted = list(interpret_guides(hits, {"chr": seq}, preset))  # doctest: +SKIP
    """
    spacer_len = preset.spacer_len
    pam_dir = preset.pam_direction
    topo = topologies or {}

    for region in regions:
        sequence = sequences[region.chrom]
        seq_len = len(sequence)
        circular = topo.get(region.chrom, "linear") == "circular"

        result = interpret_pam_hit(
            region, sequence, spacer_len, pam_dir, seq_len, circular,
        )
        if result is not None:
            yield result
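
The geometry behind the per-hit interpretation can be illustrated with a small sketch. `interpret_pam_hit` itself is not shown on this page, so the function below (`spacer_for_pam`, a hypothetical name) is only an assumption about the general idea: it handles the forward strand, takes the spacer immediately 5' of the PAM for "downstream" presets and immediately 3' of it for "upstream" presets, returns None when a linear sequence is too short, and wraps around the origin when the sequence is circular.

```python
from typing import Optional


def spacer_for_pam(
    seq: str,
    pam_start: int,
    pam_end: int,
    spacer_len: int,
    pam_direction: str,
    circular: bool = False,
) -> Optional[str]:
    """Return the forward-strand spacer adjacent to a PAM at [pam_start, pam_end)."""
    n = len(seq)
    if pam_direction == "downstream":      # PAM lies 3' of the spacer (Cas9-style)
        start, end = pam_start - spacer_len, pam_start
    elif pam_direction == "upstream":      # PAM lies 5' of the spacer (Cas12a-style)
        start, end = pam_end, pam_end + spacer_len
    else:
        raise ValueError(f"unknown pam_direction: {pam_direction!r}")

    if 0 <= start and end <= n:
        return seq[start:end]
    if not circular or spacer_len > n:
        return None                        # boundary hit on a linear sequence: skip
    # Circular sequence: wrap indices around the origin.
    return "".join(seq[i % n] for i in range(start, end))


cas9_seq = "ACGTACGTAGG"                   # PAM "AGG" occupies [8, 11)
inner = spacer_for_pam(cas9_seq, 8, 11, 4, "downstream")
skipped = spacer_for_pam(cas9_seq, 8, 11, 10, "downstream")
wrapped = spacer_for_pam(cas9_seq, 8, 11, 10, "downstream", circular=True)
cas12a = spacer_for_pam("TTTAACGT", 0, 4, 4, "upstream")
```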

configure_mapper

configure_mapper(preset: CRISPRPreset, *, topologies: dict[str, str] | None = None) -> Callable[..., Iterator[Region]]

Create a mapper for the preset's PAM pattern.

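Returns a functools.partial(bowtie_map, ...) when any alignment field is set on the preset (align_pam_with_spacer, seed_alignment, report_all_hits, a non-None max_reported_hits, or mismatches > 0); otherwise returns a functools.partial(regex_map, ...) bound to the preset's PAM pattern.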

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| preset | CRISPRPreset | Nuclease preset defining PAM and alignment parameters. | required |
| topologies | dict[str, str] \| None | Mapping of chromosome name to "circular" or "linear". Passed through to the mapper for origin-spanning PAM detection. | None |

Returns:

| Type | Description |
| --- | --- |
| Callable[..., Iterator[Region]] | A callable that takes sequences: dict[str, str] and returns an iterator of Regions. |

Examples:

>>> from seqchain.recipes import load_preset
>>> mapper = configure_mapper(load_preset("spcas9"))
>>> callable(mapper)
True
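
Since the returned mapper is a `functools.partial`, calling it only requires the sequences mapping. The sketch below shows that binding pattern with a toy stand-in; `toy_regex_map` is hypothetical and does a literal substring search, which is not how seqchain's `regex_map` is documented to behave.

```python
from functools import partial
from typing import Iterator


def toy_regex_map(
    sequences: dict[str, str], *, pattern: str, topologies: dict[str, str]
) -> Iterator[tuple[str, int]]:
    """Toy mapper: yield (chrom, start) for each literal occurrence of pattern."""
    for chrom, seq in sequences.items():
        start = seq.find(pattern)
        while start != -1:
            yield chrom, start
            start = seq.find(pattern, start + 1)  # allow overlapping hits


# Bind the configuration once; the result takes only `sequences`.
mapper = partial(toy_regex_map, pattern="GG", topologies={})
hits = list(mapper({"chrI": "AGGGT"}))
```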
Source code in src/seqchain/recipes/crispr.py
def configure_mapper(
    preset: CRISPRPreset,
    *,
    topologies: dict[str, str] | None = None,
) -> Callable[..., Iterator[Region]]:
    """Create a mapper for the preset's PAM pattern.

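    Returns a ``functools.partial(bowtie_map, ...)`` when any alignment
    field is set on the preset (``align_pam_with_spacer``,
    ``seed_alignment``, ``report_all_hits``, a non-None
    ``max_reported_hits``, or ``mismatches > 0``); otherwise returns
    ``functools.partial(regex_map, ...)`` bound to the preset's PAM.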

    Args:
        preset: Nuclease preset defining PAM and alignment parameters.
        topologies: Mapping of chromosome name to ``"circular"`` or
            ``"linear"``. Passed through to the mapper for
            origin-spanning PAM detection.

    Returns:
        A callable that takes ``sequences: dict[str, str]`` and
        returns an iterator of Regions.

    Examples:
        >>> from seqchain.recipes import load_preset
        >>> mapper = configure_mapper(load_preset("spcas9"))
        >>> callable(mapper)
        True
    """
    topo = topologies or {}
    needs_bowtie = (
        preset.align_pam_with_spacer
        or preset.seed_alignment
        or preset.report_all_hits
        or preset.max_reported_hits is not None
        or preset.mismatches > 0
    )
    if needs_bowtie:
        from seqchain.operations.map.bowtie import bowtie_map

        return partial(
            bowtie_map,
            pam=preset.pam,
            spacer_len=preset.spacer_len,
            pam_direction=preset.pam_direction,
            mismatches=preset.mismatches,
            align_pam_with_spacer=preset.align_pam_with_spacer,
            seed_alignment=preset.seed_alignment,
            seed_length=preset.seed_length,
            report_all_hits=preset.report_all_hits,
            max_reported_hits=preset.max_reported_hits,
            topologies=topo,
        )
    return partial(regex_map, pattern=preset.pam, topologies=topo)
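
The regex path receives the preset's PAM in IUPAC notation. How `regex_map` interprets that pattern is not shown on this page; one plausible approach, sketched here with hypothetical helpers `iupac_to_regex` and `find_pam_sites`, is to expand each IUPAC code into a character class and scan with a lookahead so overlapping sites are all reported. The sketch covers the forward strand only.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(pam: str) -> str:
    """Expand an IUPAC pattern such as 'NGG' into a regular expression."""
    return "".join(IUPAC[base] for base in pam.upper())


def find_pam_sites(seq: str, pam: str) -> list[int]:
    """Return forward-strand start positions of every (possibly overlapping) PAM."""
    # A zero-width lookahead lets finditer report overlapping matches.
    pattern = re.compile(f"(?=({iupac_to_regex(pam)}))")
    return [m.start() for m in pattern.finditer(seq.upper())]


ngg_regex = iupac_to_regex("NGG")
ngg_sites = find_pam_sites("AGGGT", "NGG")
tttv_sites = find_pam_sites("TTTTA", "TTTV")
```

A full scan would also need to search the reverse strand, for example by matching the reverse complement of the pattern.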