regex_map

Find IUPAC nucleotide patterns in genome sequences. Expands patterns like NGG into regex ([ATGC]GG) and scans both strands. Supports circular genome topologies with origin-spanning matches.

Use this for PAM scanning in CRISPR guide design, or any motif search where you need all occurrences of a degenerate pattern.
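
The pattern expansion described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `IUPAC` and `iupac_to_regex` are hypothetical names introduced here, and the character class comes out as `[ACGT]`, which is equivalent to the `[ATGC]` shown above.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def iupac_to_regex(pattern: str) -> str:
    """Hypothetical helper: expand an IUPAC pattern into a regex string."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]  # raises KeyError on non-IUPAC characters
        # Unambiguous codes stay literal; degenerate codes become a class.
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


print(iupac_to_regex("NGG"))  # [ACGT]GG
print(bool(re.fullmatch(iupac_to_regex("NGG"), "TGG")))  # True
```
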

regex_map

regex_map(sequences: dict[str, str], pattern: str, *, topologies: dict[str, str] | None = None) -> Iterator[Region]

Scan genome sequences for an IUPAC motif pattern.

Scans both forward and reverse strands of every chromosome for occurrences of pattern, yielding a Region per match.
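
One way to scan both strands while reporting forward-strand coordinates is to search the sequence and its reverse complement, then map reverse-strand offsets back. The sketch below assumes this approach; `scan_both_strands`, `reverse_complement`, and the `"+"`/`"-"` strand labels are hypothetical and may differ from what the library's `find_pattern` does internally.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def scan_both_strands(seq: str, regex: str):
    """Yield (start, matched, strand) with 0-based forward-strand starts."""
    n = len(seq)
    # A lookahead around a capture group also reports overlapping matches.
    finder = re.compile(f"(?=({regex}))")
    for m in finder.finditer(seq):
        yield m.start(), m.group(1), "+"
    for m in finder.finditer(reverse_complement(seq)):
        matched = m.group(1)
        # A match at index i on the reverse complement covers forward
        # positions [n - i - len(matched), n - i).
        yield n - m.start() - len(matched), matched, "-"


hits = list(scan_both_strands("AATGGCCATGGTTT", "[ACGT]GG"))
print(hits)  # [(2, 'TGG', '+'), (8, 'TGG', '+'), (5, 'TGG', '-')]
```

The reverse-strand hit at start 5 corresponds to `CCA` on the forward strand, which reads `TGG` on the reverse strand. In this sketch the reverse-strand match is reported as read on that strand; that choice is an assumption.
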

For circular chromosomes, also scans a junction window around the origin to find motif sites that straddle position 0.
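
The junction scan can be illustrated with a small standalone sketch. It follows the coordinate arithmetic in the source listing on this page, but it covers the forward strand only and uses a plain regex rather than `find_pattern`; `junction_hits` is a hypothetical name.

```python
import re


def junction_hits(seq: str, regex: str, pat_len: int):
    """Yield (start, end, matched) for forward-strand matches spanning the origin."""
    n = len(seq)
    if pat_len < 2 or n < pat_len:
        return
    ov = pat_len - 1
    # Last ov bases followed by the first ov bases: the window is too short
    # for a full-length match to sit entirely on one side of the origin.
    junction = seq[-ov:] + seq[:ov]
    for m in re.finditer(f"(?=({regex}))", junction):
        start = n - ov + m.start()
        end = (start + pat_len) % n  # wraps past the origin
        yield start, end, m.group(1)


# "T" at position 7 plus "GG" at positions 0-1 forms TGG across the origin.
print(list(junction_hits("GGAAAAAT", "[ACGT]GG", 3)))  # [(7, 2, 'TGG')]
```

A region that wraps the origin comes back with `end` less than or equal to `start`, so downstream code that assumes `start < end` needs to handle this case.
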

Each Region carries:

  • name: the matched sequence string
  • tags["pattern"]: the original IUPAC pattern
  • tags["matched"]: the concrete matched sequence

Parameters:

Name        Type                   Description                                                                    Default
sequences   dict[str, str]         Mapping of chromosome name to uppercase DNA string.                            required
pattern     str                    IUPAC nucleotide pattern (e.g. "NGG").                                         required
topologies  dict[str, str] | None  Mapping of chromosome name to "circular" or "linear". Defaults to all linear.  None

Yields:

Type    Description
Region  Region for each match, with 0-based forward-strand coordinates.

Examples:

>>> hits = list(regex_map({"chr1": "AATGGCCATGGTTT"}, "NGG"))
>>> len(hits)
3
Source code in src/seqchain/transform/regex.py
def regex_map(
    sequences: dict[str, str],
    pattern: str,
    *,
    topologies: dict[str, str] | None = None,
) -> Iterator[Region]:
    """Scan genome sequences for an IUPAC motif pattern.

    Scans both forward and reverse strands of every chromosome for
    occurrences of *pattern*, yielding a `Region` per match.

    For circular chromosomes, also scans a junction window around the
    origin to find motif sites that straddle position 0.

    Each Region carries:

    - ``name``: the matched sequence string
    - ``tags["pattern"]``: the original IUPAC pattern
    - ``tags["matched"]``: the concrete matched sequence

    Args:
        sequences: Mapping of chromosome name to uppercase DNA string.
        pattern: IUPAC nucleotide pattern (e.g. ``"NGG"``).
        topologies: Mapping of chromosome name to ``"circular"`` or
            ``"linear"``. Defaults to all linear.

    Yields:
        `Region` for each match, with 0-based forward-strand
        coordinates.

    Examples:
        >>> hits = list(regex_map({"chr1": "AATGGCCATGGTTT"}, "NGG"))
        >>> len(hits)
        3
    """
    pat = pattern.upper()
    topo = topologies or {}
    pat_len = len(pat)

    for chrom, seq in sequences.items():
        seq_len = len(seq)
        for pos, matched, strand in find_pattern(seq, pat):
            yield Region(
                chrom=chrom,
                start=pos,
                end=pos + pat_len,
                strand=strand,
                name=matched,
                tags={
                    "pattern": pat,
                    "matched": matched,
                },
            )

        # Junction scan for circular chromosomes
        is_circular = topo.get(chrom, "linear") == "circular"
        if is_circular and pat_len > 1 and seq_len >= pat_len:
            ov = pat_len - 1
            junction = seq[-ov:] + seq[:ov]
            for pos, matched, strand in find_pattern(junction, pat):
                # Only keep matches that actually straddle the boundary
                if pos < ov and pos + pat_len > ov:
                    orig_start = seq_len - ov + pos
                    orig_end = (orig_start + pat_len) % seq_len
                    yield Region(
                        chrom=chrom,
                        start=orig_start,
                        end=orig_end,
                        strand=strand,
                        name=matched,
                        tags={
                            "pattern": pat,
                            "matched": matched,
                        },
                    )