Sequence¶

Pure functions for DNA sequence math. No dependencies beyond stdlib.

Basics¶

reverse_complement ¶

reverse_complement(seq: str) -> str

Return the reverse complement of a DNA sequence.

Complements A<->T and C<->G while preserving N. Case is preserved (lowercase input yields lowercase output).

Parameters:

Name	Type	Description	Default
`seq`	`str`	DNA sequence containing characters in `ATCGNatcgn`.	required

Returns:

Type	Description
`str`	The reverse-complemented string.

Examples:

>>> reverse_complement("ATCG")
'CGAT'

Source code in src/seqchain/primitives/sequence.py

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence.

    Complements A<->T and C<->G while preserving N. Case is preserved
    (lowercase input yields lowercase output).

    Args:
        seq: DNA sequence containing characters in ``ATCGNatcgn``.

    Returns:
        The reverse-complemented string.

    Examples:
        >>> reverse_complement("ATCG")
        'CGAT'
    """
    return seq[::-1].translate(_COMPLEMENT)

gc_content ¶

gc_content(seq: str) -> float

Compute the GC fraction of a DNA sequence.

Counts G and C bases (case-insensitive) and divides by the total sequence length.

Parameters:

Name	Type	Description	Default
`seq`	`str`	DNA sequence.	required

Returns:

Type	Description
`float`	GC fraction in the range `[0.0, 1.0]`, or `0.0` for an
`float`	empty string.

Examples:

>>> gc_content("ATGC")
0.5

Source code in src/seqchain/primitives/sequence.py

def gc_content(seq: str) -> float:
    """Compute the GC fraction of a DNA sequence.

    Counts G and C bases (case-insensitive) and divides by the total
    sequence length.

    Args:
        seq: DNA sequence.

    Returns:
        GC fraction in the range ``[0.0, 1.0]``, or ``0.0`` for an
        empty string.

    Examples:
        >>> gc_content("ATGC")
        0.5
    """
    if not seq:
        return 0.0
    upper = seq.upper()
    gc = upper.count("G") + upper.count("C")
    return gc / len(seq)

is_valid_dna ¶

is_valid_dna(seq: str) -> bool

Check whether a sequence contains only unambiguous DNA bases.

Accepts A, T, G, C in either case. Rejects IUPAC ambiguity codes (N, R, Y, etc.) and empty strings.

Parameters:

Name	Type	Description	Default
`seq`	`str`	Candidate DNA string.	required

Returns:

Type	Description
`bool`	`True` if every character is in `ATGCatgc` and the string
`bool`	is non-empty; `False` otherwise.

Examples:

>>> is_valid_dna("GATC")
True

Source code in src/seqchain/primitives/sequence.py

def is_valid_dna(seq: str) -> bool:
    """Check whether a sequence contains only unambiguous DNA bases.

    Accepts ``A``, ``T``, ``G``, ``C`` in either case. Rejects IUPAC
    ambiguity codes (``N``, ``R``, ``Y``, etc.) and empty strings.

    Args:
        seq: Candidate DNA string.

    Returns:
        ``True`` if every character is in ``ATGCatgc`` and the string
        is non-empty; ``False`` otherwise.

    Examples:
        >>> is_valid_dna("GATC")
        True
    """
    if not seq:
        return False
    return all(base in "GATCgatc" for base in seq)

melting_temp ¶

melting_temp(seq: str, oligo_conc: float = 2.5e-07, na_conc: float = 0.05) -> float

Estimate melting temperature using the nearest-neighbor method.

Implements the SantaLucia (1998) unified parameters with a salt correction for monovalent cations.

Parameters:

Name	Type	Description	Default
`seq`	`str`	DNA oligonucleotide sequence (case-insensitive). Must be at least 2 nt for a meaningful result.	required
`oligo_conc`	`float`	Total oligonucleotide concentration in mol/L. Defaults to 250 nM.	`2.5e-07`
`na_conc`	`float`	Monovalent cation (Na+) concentration in mol/L. Defaults to 50 mM.	`0.05`

Returns:

Type	Description
`float`	Predicted melting temperature in degrees Celsius, rounded to one
`float`	decimal place. Returns `0.0` for sequences shorter than 2 nt.

Examples:

>>> melting_temp("ATCGATCGATCGATCGATCG")
55.9

Source code in src/seqchain/primitives/sequence.py

def melting_temp(seq: str, oligo_conc: float = 250e-9, na_conc: float = 50e-3) -> float:
    """Estimate melting temperature using the nearest-neighbor method.

    Implements the SantaLucia (1998) unified parameters with a salt
    correction for monovalent cations.

    Args:
        seq: DNA oligonucleotide sequence (case-insensitive). Must be
            at least 2 nt for a meaningful result.
        oligo_conc: Total oligonucleotide concentration in mol/L.
            Defaults to 250 nM.
        na_conc: Monovalent cation (Na+) concentration in mol/L.
            Defaults to 50 mM.

    Returns:
        Predicted melting temperature in degrees Celsius, rounded to one
        decimal place. Returns ``0.0`` for sequences shorter than 2 nt.

    Examples:
        >>> melting_temp("ATCGATCGATCGATCGATCG")
        55.9
    """
    import math

    seq = seq.upper()
    if len(seq) < 2:
        return 0.0

    dh = _INIT_DH  # kcal/mol
    ds = _INIT_DS  # cal/(mol*K)

    for i in range(len(seq) - 1):
        dinuc = seq[i:i + 2]
        dh += _NN_DH.get(dinuc, 0.0)
        ds += _NN_DS.get(dinuc, 0.0)

    # Salt correction (SantaLucia 1998)
    ds += 0.368 * (len(seq) - 1) * math.log(na_conc)

    # Tm = dH / (dS + R * ln(Ct/4)) - 273.15
    R = 1.987  # cal/(mol*K)
    tm = (dh * 1000) / (ds + R * math.log(oligo_conc / 4)) - 273.15
    return round(tm, 1)

Distance¶

hamming_distance ¶

hamming_distance(a: str, b: str) -> int

Count the number of positions where two sequences differ.

Both sequences must be the same length.

Parameters:

Name	Type	Description	Default
`a`	`str`	First sequence.	required
`b`	`str`	Second sequence (same length as a).	required

Returns:

Type	Description
`int`	Number of mismatched positions.

Raises:

Type	Description
`ValueError`	If a and b have different lengths.

Examples:

>>> hamming_distance("ATCG", "ATGG")
1

Source code in src/seqchain/primitives/sequence.py

def hamming_distance(a: str, b: str) -> int:
    """Count the number of positions where two sequences differ.

    Both sequences must be the same length.

    Args:
        a: First sequence.
        b: Second sequence (same length as *a*).

    Returns:
        Number of mismatched positions.

    Raises:
        ValueError: If *a* and *b* have different lengths.

    Examples:
        >>> hamming_distance("ATCG", "ATGG")
        1
    """
    if len(a) != len(b):
        raise ValueError(f"Sequences must be equal length: {len(a)} != {len(b)}")
    return sum(ca != cb for ca, cb in zip(a, b))

edit_distance ¶

edit_distance(a: str, b: str) -> int

Compute the Levenshtein edit distance between two sequences.

Uses a single-row dynamic-programming approach for O(min(m, n)) space.

Parameters:

Name	Type	Description	Default
`a`	`str`	First sequence.	required
`b`	`str`	Second sequence.	required

Returns:

Type	Description
`int`	Minimum number of single-character insertions, deletions, or
`int`	substitutions to transform a into b.

Examples:

>>> edit_distance("ATCG", "ATC")
1

Source code in src/seqchain/primitives/sequence.py

def edit_distance(a: str, b: str) -> int:
    """Compute the Levenshtein edit distance between two sequences.

    Uses a single-row dynamic-programming approach for O(min(m, n))
    space.

    Args:
        a: First sequence.
        b: Second sequence.

    Returns:
        Minimum number of single-character insertions, deletions, or
        substitutions to transform *a* into *b*.

    Examples:
        >>> edit_distance("ATCG", "ATC")
        1
    """
    m, n = len(a), len(b)
    # Use single-row DP for space efficiency
    prev = []
    for i in range(n + 1):
        prev.append(i)
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(
                curr[j - 1] + 1,      # insertion
                prev[j] + 1,          # deletion
                prev[j - 1] + cost,   # substitution
            )
        prev = curr
    return prev[n]

diff ¶

diff(a: str, b: str) -> str

Produce a character-level diff between two equal-length sequences.

Matching positions are represented by '.' and mismatches show the character from b.

Parameters:

Name	Type	Description	Default
`a`	`str`	Reference sequence.	required
`b`	`str`	Query sequence (same length as a).	required

Returns:

Type	Description
`str`	Diff string of the same length as a and b.

Raises:

Type	Description
`ValueError`	If a and b have different lengths.

Examples:

>>> diff("ATCG", "ATGG")
'..G.'

Source code in src/seqchain/primitives/sequence.py

def diff(a: str, b: str) -> str:
    """Produce a character-level diff between two equal-length sequences.

    Matching positions are represented by ``'.'`` and mismatches show
    the character from *b*.

    Args:
        a: Reference sequence.
        b: Query sequence (same length as *a*).

    Returns:
        Diff string of the same length as *a* and *b*.

    Raises:
        ValueError: If *a* and *b* have different lengths.

    Examples:
        >>> diff("ATCG", "ATGG")
        '..G.'
    """
    if len(a) != len(b):
        raise ValueError(f"Sequences must be equal length: {len(a)} != {len(b)}")
    return "".join("." if ca == cb else cb for ca, cb in zip(a, b))

IUPAC & pattern matching¶

expand_iupac ¶

expand_iupac(pattern: str) -> str

Convert an IUPAC degenerate nucleotide pattern to a regex string.

Each IUPAC ambiguity code is replaced by the corresponding regex character class (e.g. N -> [ATGC], V -> [ACG]). Concrete bases and unrecognised characters pass through unchanged. Input is uppercased before conversion.

Parameters:

Name	Type	Description	Default
`pattern`	`str`	IUPAC nucleotide pattern (e.g. `"NGG"`, `"TTTV"`).	required

Returns:

Type	Description
`str`	Regex-compatible string.

Examples:

>>> expand_iupac("NGG")
'[ATGC]GG'

Source code in src/seqchain/primitives/sequence.py

def expand_iupac(pattern: str) -> str:
    """Convert an IUPAC degenerate nucleotide pattern to a regex string.

    Each IUPAC ambiguity code is replaced by the corresponding regex
    character class (e.g. ``N`` -> ``[ATGC]``, ``V`` -> ``[ACG]``).
    Concrete bases and unrecognised characters pass through unchanged.
    Input is uppercased before conversion.

    Args:
        pattern: IUPAC nucleotide pattern (e.g. ``"NGG"``, ``"TTTV"``).

    Returns:
        Regex-compatible string.

    Examples:
        >>> expand_iupac("NGG")
        '[ATGC]GG'
    """
    result = ""
    for char in pattern.upper():
        result += _IUPAC_MAP.get(char, char)
    return result

pattern_matches ¶

pattern_matches(pattern: str, seq: str) -> bool

Test whether an IUPAC pattern fully matches a concrete sequence.

The pattern must cover the entire sequence (anchored full-match). Both pattern and seq are compared case-insensitively.

Parameters:

Name	Type	Description	Default
`pattern`	`str`	IUPAC nucleotide pattern (e.g. `"NGG"`).	required
`seq`	`str`	Concrete DNA sequence to test.	required

Returns:

Type	Description
`bool`	`True` if seq matches pattern over its full length.

Examples:

>>> pattern_matches("NGG", "AGG")
True

Source code in src/seqchain/primitives/sequence.py

def pattern_matches(pattern: str, seq: str) -> bool:
    """Test whether an IUPAC pattern fully matches a concrete sequence.

    The pattern must cover the entire sequence (anchored full-match).
    Both *pattern* and *seq* are compared case-insensitively.

    Args:
        pattern: IUPAC nucleotide pattern (e.g. ``"NGG"``).
        seq: Concrete DNA sequence to test.

    Returns:
        ``True`` if *seq* matches *pattern* over its full length.

    Examples:
        >>> pattern_matches("NGG", "AGG")
        True
    """
    regex = expand_iupac(pattern)
    return bool(re.fullmatch(regex, seq.upper()))

find_pattern ¶

find_pattern(seq: str, pattern: str) -> Iterator[tuple[int, str, str]]

Find all occurrences of an IUPAC pattern on both strands.

Scans the forward strand and then the reverse complement. All positions are reported in forward-strand coordinates. Overlapping matches are included (e.g. "AGGG" yields NGG at positions 0 and 1).

Parameters:

Name	Type	Description	Default
`seq`	`str`	Reference DNA sequence to search.	required
`pattern`	`str`	IUPAC nucleotide pattern (e.g. `"NGG"`).	required

Yields:

Type	Description
`int`	`(position, matched_sequence, strand)` tuples where strand
`str`	is `"+"` or `"-"` and position is a 0-based forward-strand
`str`	coordinate.

Examples:

>>> list(find_pattern("AAGGA", "NGG"))
[(1, 'AGG', '+')]

Source code in src/seqchain/primitives/sequence.py

def find_pattern(seq: str, pattern: str) -> Iterator[tuple[int, str, str]]:
    """Find all occurrences of an IUPAC pattern on both strands.

    Scans the forward strand and then the reverse complement. All
    positions are reported in forward-strand coordinates. Overlapping
    matches are included (e.g. ``"AGGG"`` yields NGG at positions 0
    and 1).

    Args:
        seq: Reference DNA sequence to search.
        pattern: IUPAC nucleotide pattern (e.g. ``"NGG"``).

    Yields:
        ``(position, matched_sequence, strand)`` tuples where *strand*
        is ``"+"`` or ``"-"`` and *position* is a 0-based forward-strand
        coordinate.

    Examples:
        >>> list(find_pattern("AAGGA", "NGG"))
        [(1, 'AGG', '+')]
    """
    # Lookahead (?=...) finds overlapping matches (zero-width assertion)
    regex = re.compile(f"(?=({expand_iupac(pattern)}))")
    pat_len = len(pattern)

    # Forward strand
    for m in regex.finditer(seq.upper()):
        start = m.start()
        yield (start, seq[start:start + pat_len], "+")

    # Reverse strand
    rc = reverse_complement(seq)
    seq_len = len(seq)
    for m in regex.finditer(rc.upper()):
        # Convert reverse-strand position to forward-strand coordinate
        rc_start = m.start()
        fwd_pos = seq_len - rc_start - pat_len
        yield (fwd_pos, rc[rc_start:rc_start + pat_len], "-")