Mismatch Scoring

Score CRISPR guide mismatch variants with a linear model that combines position, substitution, and GC-content weights. score_mismatches() generates single-nucleotide variants of each guide's spacer and predicts the relative activity of each one from a dict of mismatch weights.

mismatch

Mismatch variant scorer for CRISPR guide spacers.

Predicts relative activity of single-nucleotide mismatch variants using a linear model. Matches the scoring logic from legacy/crispr_experiment/assets/mismatch.py.
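Reading the score_variant source further down this page, the linear model can be written as the formula below. The symbols are notation introduced here for illustration, not names used by the package: s is the original spacer, s' the variant, i a 0-based position, and M(s, s') the set of mismatched positions.

```latex
\hat{y}(s, s') = w_{\text{intercept}}
  + \sum_{i \in M(s, s')} \left( w_{i} + w_{s_i s'_i} \right)
  + w_{\text{GC}} \cdot \mathrm{GC}(s),
\qquad
M(s, s') = \{\, i : s_i \neq s'_i \,\}
```

Here w_i corresponds to weights[str(i)], w_{s_i s'_i} to the substitution key such as weights["AC"], and GC(s) to gc_content(original). For the single-nucleotide variants produced by generate_variants, M contains exactly one position.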

score_mismatches

score_mismatches(regions: Iterable[Region], weights: dict[str, float], *, num_variants: int = 10) -> Iterator[Region]

Score regions that have a spacer tag.

For each Region with a spacer tag, yields the original unchanged, then yields up to num_variants variant Regions. Each variant copies the original's tags and adds variant_spacer, change_description, y_pred, and mismatches (always 1).

Regions without a spacer tag pass through unchanged.
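Because originals and variants arrive in one stream, a consumer usually needs to tell them apart. The sketch below shows one way, assuming only what the source listing shows: variants copy the original's tags (so they still carry spacer) and add variant_spacer. FakeRegion and is_variant are hypothetical names for illustration, and the tag values are made up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FakeRegion:
    """Hypothetical stand-in for seqchain's Region; only `tags` matters here."""
    name: str
    tags: dict = field(default_factory=dict)


def is_variant(region) -> bool:
    """Variant regions are the only ones that carry a `variant_spacer` tag."""
    return "variant_spacer" in region.tags


# A hand-written stream shaped like the output described above (values made up).
scored = [
    FakeRegion("g1", {"spacer": "ATGC"}),
    FakeRegion("g1", {"spacer": "ATGC", "variant_spacer": "CTGC",
                      "change_description": "A1C", "y_pred": 0.5,
                      "mismatches": 1}),
    FakeRegion("no_guide", {}),  # no spacer tag: passed through unchanged
]

originals = [r for r in scored if not is_variant(r)]
variants = [r for r in scored if is_variant(r)]
print(len(originals), len(variants))
```

Filtering on the spacer tag alone would not work here, since variants inherit it from the original.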

Parameters:

Name | Type | Description | Default
--- | --- | --- | ---
regions | Iterable[Region] | Input regions to score. | required
weights | dict[str, float] | Dict from load_mismatch_weights(). Keys: "intercept", "0"-"N" (positions), "AC"/"AG"/... (substitutions), "GC_content". | required
num_variants | int | Variants to select per guide. Defaults to 10. | 10
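To make the key layout of weights concrete, here is a minimal sketch that builds a placeholder dict by hand. It assumes a 20-nt spacer and an A/C/G/T alphabet; make_toy_weights is a hypothetical helper and every value is made up. Real weights come from load_mismatch_weights().

```python
from itertools import permutations

SPACER_LENGTH = 20  # assumed guide length, for illustration only


def make_toy_weights(length: int = SPACER_LENGTH) -> dict[str, float]:
    """Build a placeholder weights dict with the documented key layout."""
    weights = {"intercept": 1.0, "GC_content": 0.0}
    # One weight per 0-based spacer position: "0", "1", ..., str(length - 1).
    for pos in range(length):
        weights[str(pos)] = -0.01 * pos
    # One weight per ordered substitution: "AC", "AG", ..., "TG" (12 keys).
    for orig, new in permutations("ACGT", 2):
        weights[orig + new] = 0.0
    return weights


weights = make_toy_weights()
print(len(weights))  # 2 + 20 positions + 12 substitutions
```

Position keys are strings, not integers, which matches the weights[f"{pos}"] lookup in score_variant.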

Yields:

Type | Description
--- | ---
Region | Original region followed by variant regions.

Examples:

>>> scored = list(score_mismatches(regions, weights))
Source code in src/seqchain/operations/score/mismatch.py
def score_mismatches(
    regions: Iterable[Region],
    weights: dict[str, float],
    *,
    num_variants: int = 10,
) -> Iterator[Region]:
    """Score regions that have a ``spacer`` tag.

    For each Region with a ``spacer`` tag, yields the original
    unchanged, then yields up to ``num_variants`` variant Regions
    that copy its tags and add ``variant_spacer``,
    ``change_description``, ``y_pred``, ``mismatches`` (always 1).

    Regions without a ``spacer`` tag pass through unchanged.

    Args:
        regions: Input regions to score.
        weights: Dict from `load_mismatch_weights()`. Keys:
            ``"intercept"``, ``"0"``-``"N"`` (positions),
            ``"AC"``/``"AG"``/... (substitutions), ``"GC_content"``.
        num_variants: Variants to select per guide. Defaults to 10.

    Yields:
        Original region followed by variant regions.

    Examples:
        >>> scored = list(score_mismatches(regions, weights))
    """
    for region in regions:
        # Every input region is yielded unchanged, with or without a spacer
        yield region

        spacer = region.tags.get("spacer")
        if spacer is None:
            continue

        # Generate and yield variant regions
        variants = generate_variants(spacer, weights, num_variants=num_variants)
        for v in variants:
            yield replace(
                region,
                tags={
                    **region.tags,
                    "variant_spacer": v["variant"],
                    "change_description": v["change_description"],
                    "y_pred": v["y_pred"],
                    "mismatches": 1,
                },
            )

score_variant

score_variant(original: str, variant: str, weights: dict[str, float]) -> float | None

Score one variant against the original spacer.

Pure function matching legacy calculate_y_pred().
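As a worked illustration of the arithmetic, the sketch below re-implements the sum with made-up weights. It assumes gc_content() returns the fraction of G/C bases in the spacer; that helper is not shown on this page, so gc_fraction is a stand-in. toy_score is likewise a hypothetical name, not part of the package.

```python
def gc_fraction(seq: str) -> float:
    """Assumed stand-in for gc_content(): fraction of G/C bases in seq."""
    return sum(base in "GC" for base in seq) / len(seq)


def toy_score(original: str, variant: str, weights: dict[str, float]) -> float:
    """Intercept + position and substitution terms per mismatch + GC term."""
    y = weights["intercept"]
    for pos, (orig_base, var_base) in enumerate(zip(original, variant)):
        if orig_base != var_base:
            y += weights[str(pos)] + weights[orig_base + var_base]
    return y + weights["GC_content"] * gc_fraction(original)


# Made-up weights, chosen so the arithmetic is easy to follow.
weights = {"intercept": 1.0, "3": -0.25, "CG": -0.125, "GC_content": 0.5}

# "ATGC" -> "ATGG": one mismatch at 0-based position 3, substitution C->G.
# 1.0 + (-0.25) + (-0.125) + 0.5 * gc_fraction("ATGC") = 0.625 + 0.5 * 0.5
print(toy_score("ATGC", "ATGG", weights))  # 0.875
```

Note that the GC term uses the original spacer, not the variant, so it is the same for every variant of a given guide.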

Parameters:

Name | Type | Description | Default
--- | --- | --- | ---
original | str | Original spacer sequence. | required
variant | str | Variant spacer sequence. | required
weights | dict[str, float] | Mismatch weight parameters. | required

Returns:

Type | Description
--- | ---
float \| None | Predicted relative activity, or None if the sequences are identical, differ in length, or either is None.

Examples:

>>> score_variant("ATGC", "ATGG", weights)  # illustrative; the value depends on weights
0.42
Source code in src/seqchain/operations/score/mismatch.py
def score_variant(
    original: str,
    variant: str,
    weights: dict[str, float],
) -> float | None:
    """Score one variant against the original spacer.

    Pure function matching legacy ``calculate_y_pred()``.

    Args:
        original: Original spacer sequence.
        variant: Variant spacer sequence.
        weights: Mismatch weight parameters.

    Returns:
        Predicted relative activity, or ``None`` if sequences are
        identical, different length, or either is ``None``.

    Examples:
        >>> score_variant("ATGC", "ATGG", weights)  # doctest: +SKIP
        0.42
    """
    if original is None or variant is None:
        return None
    if original == variant:
        return None
    if len(original) != len(variant):
        return None

    y_pred = weights["intercept"]

    for pos, (orig_base, var_base) in enumerate(zip(original, variant)):
        if orig_base != var_base:
            y_pred += weights[f"{pos}"]
            y_pred += weights[f"{orig_base}{var_base}"]

    y_pred += weights["GC_content"] * gc_content(original)
    return y_pred

generate_variants

generate_variants(spacer: str, weights: dict[str, float], *, num_variants: int = 10) -> list[dict]

Generate mismatch variants and select a spread across the score range.

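Creates every possible single-nucleotide variant of the spacer and scores each with score_variant(). It then selects num_variants of them against targets evenly spaced on [0, 1]: for each target in ascending order, the not-yet-selected variant whose y_pred is closest to the target is taken (greedy, without replacement, matching legacy). If fewer than num_variants variants exist, all of them are returned.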
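The selection step can be sketched on bare scores, without spacers or weights. This is a simplified illustration of the greedy closest-to-target idea, not the package's code; pick_spread is a hypothetical name and the scores are made up.

```python
def pick_spread(scores: list[float], n: int) -> list[float]:
    """Greedily pick up to n scores closest to n evenly spaced targets in [0, 1]."""
    if n <= 0:
        return []
    targets = [i / (n - 1) for i in range(n)] if n > 1 else [0.0]
    remaining = sorted(scores)
    picked = []
    for target in targets:
        if not remaining:
            break  # fewer candidates than targets
        closest = min(remaining, key=lambda s: abs(s - target))
        picked.append(closest)
        remaining.remove(closest)  # without replacement
    return picked


scores = [0.1, 0.15, 0.5, 0.55, 0.9]
print(pick_spread(scores, 3))      # targets 0.0, 0.5, 1.0 -> [0.1, 0.5, 0.9]
print(pick_spread([0.2, 0.4], 5))  # only two candidates -> [0.2, 0.4]
```

Because the choice is greedy, results come back in target order, and an early target can claim a score that a later target would have matched more closely.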

Parameters:

Name Type Description Default
spacer str

Original spacer sequence.

required
weights dict[str, float]

Mismatch weight parameters.

required
num_variants int

Number of variants to select. Defaults to 10.

10

Returns:

Type | Description
--- | ---
list[dict] | List of dicts with keys variant, change_description, y_pred.
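In the source listing, change_description is built as the original base, the 1-based position, and the new base, for example "C4G". The sketch below decodes that format back into a 0-based position and re-applies it. parse_change and apply_change are hypothetical helpers, and the A/C/G/T alphabet is an assumption, since the module's _NUCLEOTIDES constant is not shown on this page.

```python
import re

# Assumes an A/C/G/T alphabet; the module's _NUCLEOTIDES is not shown here.
_CHANGE = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def parse_change(description: str) -> tuple[str, int, str]:
    """Split e.g. "C4G" into (original base, 0-based position, new base)."""
    match = _CHANGE.match(description)
    if match is None:
        raise ValueError(f"unrecognised change description: {description!r}")
    orig, pos, new = match.groups()
    return orig, int(pos) - 1, new  # descriptions use 1-based positions


def apply_change(spacer: str, description: str) -> str:
    """Rebuild the variant spacer from the original and its description."""
    orig, pos, new = parse_change(description)
    if spacer[pos] != orig:
        raise ValueError(f"{description!r} does not match spacer {spacer!r}")
    return spacer[:pos] + new + spacer[pos + 1:]


print(parse_change("C4G"))          # ('C', 3, 'G')
print(apply_change("ATGC", "C4G"))  # ATGG
```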

Examples:

>>> variants = generate_variants("ATGCATGCATGCATGCATGC", weights)
>>> len(variants) == 10
True
Source code in src/seqchain/operations/score/mismatch.py
def generate_variants(
    spacer: str,
    weights: dict[str, float],
    *,
    num_variants: int = 10,
) -> list[dict]:
    """Generate mismatch variants and select a spread across the score range.

    Creates all possible single-nucleotide variants, scores each, then
    selects up to ``num_variants`` against targets evenly spaced on
    [0, 1], using greedy closest-to-desired without replacement
    (matching legacy).

    Args:
        spacer: Original spacer sequence.
        weights: Mismatch weight parameters.
        num_variants: Number of variants to select. Defaults to 10.

    Returns:
        List of dicts with keys ``variant``, ``change_description``,
        ``y_pred``.

    Examples:
        >>> variants = generate_variants("ATGCATGCATGCATGCATGC", weights)
        >>> len(variants) == 10
        True
    """
    spacer_upper = spacer.upper()

    # Generate all possible 1-mismatch variants and score them
    possible: list[dict] = []
    for pos in range(len(spacer_upper)):
        for nt in _NUCLEOTIDES:
            if nt == spacer_upper[pos]:
                continue
            variant_seq = spacer_upper[:pos] + nt + spacer_upper[pos + 1:]
            y_pred = score_variant(spacer_upper, variant_seq, weights)
            if y_pred is not None:
                possible.append({
                    "variant": variant_seq,
                    "position": pos,
                    "nucleotide": nt,
                    "y_pred": y_pred,
                })

    # Sort by score for selection
    possible.sort(key=lambda x: x["y_pred"])

    # Targets: num_variants values evenly spaced on [0, 1]
    step = 1.0 / (num_variants - 1) if num_variants > 1 else 1.0
    desired_scores: list[float] = [i * step for i in range(num_variants)]
    # Pin the last target to exactly 1.0 (i * step can drift in floating point)
    if num_variants > 1:
        desired_scores[-1] = 1.0

    selected: list[dict] = []
    remaining = possible[:]

    for desired in desired_scores:
        if not remaining:
            break  # every candidate has already been selected
        closest = min(remaining, key=lambda x: abs(x["y_pred"] - desired))
        selected.append(closest)
        remaining.remove(closest)

    # Format output
    result = []
    for v in selected:
        change_desc = (
            f"{spacer_upper[v['position']]}{v['position'] + 1}{v['nucleotide']}"
        )
        result.append({
            "variant": v["variant"],
            "change_description": change_desc,
            "y_pred": v["y_pred"],
        })

    return result