Mismatch Scoring

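Score CRISPR guide mismatch variants with a linear model that adds position, substitution, and GC-content weights to an intercept. score_mismatches() generates the single-nucleotide variants of each guide, predicts their activity from a dictionary of mismatch weights, and emits a selection of them alongside the original guide.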

mismatch

Mismatch variant scorer for CRISPR guide spacers.

Predicts relative activity of single-nucleotide mismatch variants using a linear model. Matches the scoring logic from legacy/crispr_experiment/assets/mismatch.py.

score_mismatches

score_mismatches(regions: Iterable[Region], weights: dict[str, float], *, num_variants: int = 10) -> Iterator[Region]

Score regions that have a spacer tag.

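For each Region with a spacer tag, yields the original unchanged, then yields up to num_variants variant Regions. Each variant copies the original's tags and adds variant_spacer, change_description, y_pred, and mismatches (always 1, because every variant is a single-nucleotide change). Fewer than num_variants variants are yielded when the spacer has fewer candidate variants than requested.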

Regions without a spacer tag pass through unchanged.
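Because originals and variants arrive in one stream, downstream code has to tell them apart. The following is a minimal sketch, not part of the library: it assumes a stand-in Region dataclass with a tags dict (the real Region class is not shown on this page) and uses a hand-built stream with made-up values in place of a real score_mismatches() call. It treats any region carrying a variant_spacer tag as a variant.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for the real Region; only `tags` matters here."""
    name: str
    tags: dict = field(default_factory=dict)


def split_stream(scored):
    """Separate original/pass-through regions from variant regions."""
    originals, variants = [], []
    for region in scored:
        # Variant regions carry the extra `variant_spacer` tag.
        if "variant_spacer" in region.tags:
            variants.append(region)
        else:
            originals.append(region)
    return originals, variants


# Hand-built stream shaped like the output described above (values are made up).
stream = [
    Region("g1", {"spacer": "ATGC"}),
    Region("g1", {"spacer": "ATGC", "variant_spacer": "ATGG",
                  "change_description": "C4G", "y_pred": 0.5, "mismatches": 1}),
    Region("no_spacer"),
]

originals, variants = split_stream(stream)
print(len(originals), len(variants))  # 2 1
```

This check only works if the input regions do not already carry a variant_spacer tag of their own.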

Parameters:

Name	Type	Description	Default
`regions`	`Iterable[Region]`	Input regions to score.	required
`weights`	`dict[str, float]`	Dict from `load_mismatch_weights()`. Keys: `"intercept"`, `"0"`-`"N"` (zero-based positions), `"AC"`/`"AG"`/... (substitutions, original base followed by variant base), `"GC_content"`.	required
`num_variants`	`int`	Variants to select per guide. Defaults to 10.	`10`
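To make the key layout of weights concrete, here is a sketch that builds a dictionary with the same shape. All values are placeholders and the 20-nt spacer length is an assumption; real weights come from load_mismatch_weights().

```python
from itertools import permutations

SPACER_LENGTH = 20  # assumed guide length; use whatever your weights cover

# Placeholder values only; real values come from load_mismatch_weights().
weights = {"intercept": 1.0, "GC_content": 0.0}

# One weight per zero-based position, keyed by the position as a string.
weights.update({str(pos): 0.0 for pos in range(SPACER_LENGTH)})

# One weight per ordered substitution: original base followed by variant base.
weights.update({a + b: 0.0 for a, b in permutations("ACGT", 2)})

print(len(weights))  # 2 + 20 + 12 = 34
```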

Yields:

Type	Description
`Region`	Original region followed by variant regions.

Examples:

>>> scored = list(score_mismatches(regions, weights))

Source code in src/seqchain/operations/score/mismatch.py

def score_mismatches(
    regions: Iterable[Region],
    weights: dict[str, float],
    *,
    num_variants: int = 10,
) -> Iterator[Region]:
    """Score regions that have a ``spacer`` tag.

    For each Region with a ``spacer`` tag, yields the original
    unchanged, then yields ``num_variants`` variant Regions with
    tags: ``variant_spacer``, ``change_description``, ``y_pred``,
    ``mismatches``.

    Regions without a ``spacer`` tag pass through unchanged.

    Args:
        regions: Input regions to score.
        weights: Dict from `load_mismatch_weights()`. Keys:
            ``"intercept"``, ``"0"``-``"N"`` (positions),
            ``"AC"``/``"AG"``/... (substitutions), ``"GC_content"``.
        num_variants: Variants to select per guide. Defaults to 10.

    Yields:
        Original region followed by variant regions.

    Examples:
        >>> scored = list(score_mismatches(regions, weights))
    """
    for region in regions:
        spacer = region.tags.get("spacer")
        if spacer is None:
            yield region
            continue

        # Yield original unchanged
        yield region

        # Generate and yield variant regions
        variants = generate_variants(spacer, weights, num_variants=num_variants)
        for v in variants:
            yield replace(
                region,
                tags={
                    **region.tags,
                    "variant_spacer": v["variant"],
                    "change_description": v["change_description"],
                    "y_pred": v["y_pred"],
                    "mismatches": 1,
                },
            )

score_variant

score_variant(original: str, variant: str, weights: dict[str, float]) -> float | None

Score one variant against the original spacer.

Pure function matching legacy calculate_y_pred().

Parameters:

Name	Type	Description	Default
`original`	`str`	Original spacer sequence.	required
`variant`	`str`	Variant spacer sequence.	required
`weights`	`dict[str, float]`	Mismatch weight parameters.	required

Returns:

Type	Description
`float \| None`	Predicted relative activity, or `None` if sequences are identical, different length, or either is `None`.
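The prediction is the intercept, plus a position weight and a substitution weight for each mismatched base, plus the GC-content weight times the GC content of the original spacer. A worked sketch of that arithmetic follows. The weights are invented for this example, and gc_fraction is an assumed stand-in for the library's gc_content(), taken here to return the fraction of G/C bases between 0 and 1.

```python
def gc_fraction(seq: str) -> float:
    """Assumed stand-in for gc_content(): fraction of G/C bases, 0..1."""
    return sum(base in "GC" for base in seq) / len(seq)


# Toy weights, invented for this example.
toy_weights = {
    "intercept": 1.0,
    "3": -0.25,        # zero-based position 3
    "CG": -0.125,      # original C replaced by G
    "GC_content": 0.5,
}

original, variant = "ATGC", "ATGG"

y_pred = toy_weights["intercept"]
for pos, (orig_base, var_base) in enumerate(zip(original, variant)):
    if orig_base != var_base:
        y_pred += toy_weights[str(pos)]              # position term
        y_pred += toy_weights[orig_base + var_base]  # substitution term
y_pred += toy_weights["GC_content"] * gc_fraction(original)

print(y_pred)  # 1.0 - 0.25 - 0.125 + 0.5 * 0.5 = 0.875
```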

Examples:

>>> score_variant("ATGC", "ATGG", weights)
0.42

Source code in src/seqchain/operations/score/mismatch.py

def score_variant(
    original: str,
    variant: str,
    weights: dict[str, float],
) -> float | None:
    """Score one variant against the original spacer.

    Pure function matching legacy ``calculate_y_pred()``.

    Args:
        original: Original spacer sequence.
        variant: Variant spacer sequence.
        weights: Mismatch weight parameters.

    Returns:
        Predicted relative activity, or ``None`` if sequences are
        identical, different length, or either is ``None``.

    Examples:
        >>> score_variant("ATGC", "ATGG", weights)  # doctest: +SKIP
        0.42
    """
    if original is None or variant is None:
        return None
    if original == variant:
        return None
    if len(original) != len(variant):
        return None

    y_pred = weights["intercept"]

    for pos, (orig_base, var_base) in enumerate(zip(original, variant)):
        if orig_base != var_base:
            y_pred += weights[f"{pos}"]
            y_pred += weights[f"{orig_base}{var_base}"]

    y_pred += weights["GC_content"] * gc_content(original)
    return y_pred

generate_variants

generate_variants(spacer: str, weights: dict[str, float], *, num_variants: int = 10) -> list[dict]

Generate mismatch variants and select a spread across the score range.

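Creates all possible single-nucleotide variants, scores each, then selects up to num_variants of them by greedy closest-to-desired matching without replacement (matching legacy). The desired scores are evenly spaced on [0, 1], not on the observed score range, and each one in turn takes the unused variant whose predicted score is nearest to it. Fewer than num_variants variants are returned if the candidates run out.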
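The selection step can be sketched on its own with bare scores instead of variant dicts. This is an illustration of the greedy idea under that simplification, and pick_spread is a hypothetical name, not a library function.

```python
def pick_spread(scores: list[float], num_targets: int) -> list[float]:
    """Greedily pick the unused score closest to each target on [0, 1]."""
    if num_targets > 1:
        targets = [i / (num_targets - 1) for i in range(num_targets)]
    else:
        targets = [0.0] * num_targets
    remaining = sorted(scores)
    picked = []
    for target in targets:
        if not remaining:
            break  # fewer candidates than targets
        closest = min(remaining, key=lambda s: abs(s - target))
        picked.append(closest)
        remaining.remove(closest)  # without replacement
    return picked


print(pick_spread([0.1, 0.2, 0.4, 0.55, 0.9, 0.95], 3))  # [0.1, 0.55, 0.95]
print(pick_spread([0.3], 3))                             # [0.3]
```

Because the targets are fixed on [0, 1], scores outside that interval can only be picked by the end targets or once nearer candidates are used up.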

Parameters:

Name	Type	Description	Default
`spacer`	`str`	Original spacer sequence.	required
`weights`	`dict[str, float]`	Mismatch weight parameters.	required
`num_variants`	`int`	Number of variants to select. Defaults to 10.	`10`

Returns:

Type	Description
`list[dict]`	List of dicts with keys `variant`, `change_description` (original base, 1-based position, new base, e.g. `A1C`), and `y_pred`.

Examples:

>>> variants = generate_variants("ATGCATGCATGCATGCATGC", weights)
>>> len(variants) == 10
True

Source code in src/seqchain/operations/score/mismatch.py

def generate_variants(
    spacer: str,
    weights: dict[str, float],
    *,
    num_variants: int = 10,
) -> list[dict]:
    """Generate mismatch variants and select a spread across the score range.

    Creates all possible single-nucleotide variants, scores each, then
    selects up to ``num_variants`` whose scores are closest to targets
    evenly spaced on [0, 1], using greedy closest-to-desired without
    replacement (matching legacy).

    Args:
        spacer: Original spacer sequence.
        weights: Mismatch weight parameters.
        num_variants: Number of variants to select. Defaults to 10.

    Returns:
        List of dicts with keys ``variant``, ``change_description``,
        ``y_pred``.

    Examples:
        >>> variants = generate_variants("ATGCATGCATGCATGCATGC", weights)
        >>> len(variants) == 10
        True
    """
    spacer_upper = spacer.upper()

    # Generate all possible 1-mismatch variants and score them
    possible: list[dict] = []
    for pos in range(len(spacer_upper)):
        for nt in _NUCLEOTIDES:
            if nt == spacer_upper[pos]:
                continue
            variant_seq = spacer_upper[:pos] + nt + spacer_upper[pos + 1:]
            y_pred = score_variant(spacer_upper, variant_seq, weights)
            if y_pred is not None:
                possible.append({
                    "variant": variant_seq,
                    "position": pos,
                    "nucleotide": nt,
                    "y_pred": y_pred,
                })

    # Sort by score for selection
    possible.sort(key=lambda x: x["y_pred"])

    # Select num_variants evenly spread across [0, 1]
    step = 1.0 / (num_variants - 1) if num_variants > 1 else 1.0
    desired_scores: list[float] = []
    for i in range(num_variants):
        desired_scores.append(i * step)
    # Ensure we include 1.0
    if num_variants > 1:
        desired_scores[-1] = 1.0

    selected: list[dict] = []
    remaining = possible[:]

    for desired in desired_scores:
        if not remaining:
            continue
        closest = min(remaining, key=lambda x: abs(x["y_pred"] - desired))
        selected.append(closest)
        remaining.remove(closest)

    # Format output
    result = []
    for v in selected:
        change_desc = (
            f"{spacer_upper[v['position']]}{v['position'] + 1}{v['nucleotide']}"
        )
        result.append({
            "variant": v["variant"],
            "change_description": change_desc,
            "y_pred": v["y_pred"],
        })

    return result