Signal

Signal detection primitives for count-based structure discovery.

Pure functions for deciding whether observed counts represent real signal or noise. Used by heuristic_discover() and tnseq_discover() to auto-detect read structure (barcode offsets, transposon flanks) from raw sequencing data.

is_signal_boundary

is_signal_boundary(count_at_n: int, count_at_n_plus_1: int, *, alphabet_size: int = 4, safety_margin: float = 0.75) -> bool

Test whether extending a sequence by one base crosses a signal boundary.

A real flanking sequence at length N will have roughly alphabet_size times the count of any extension to length N+1, because the next base is random and splits the counts alphabet_size ways. The test therefore requires a ratio above alphabet_size * safety_margin, which sits below the theoretical ratio to tolerate sampling noise.
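The threshold arithmetic can be checked with a small worked example. This is a sketch only: the function body is copied from the source listing on this page so the block runs on its own, and the counts are hypothetical.

```python
def is_signal_boundary(count_at_n, count_at_n_plus_1, *,
                       alphabet_size=4, safety_margin=0.75):
    # Local copy of the function documented on this page.
    if count_at_n_plus_1 <= 0:
        return True
    threshold = alphabet_size * safety_margin
    return count_at_n > threshold * count_at_n_plus_1


# For DNA with the default margin, the required ratio is 4 * 0.75 = 3.0.
dna_threshold = 4 * 0.75

# Hypothetical counts: a flank seen 1000 times whose best one-base
# extension is seen 280 times. 1000 > 3.0 * 280 = 840, so the next
# base looks random and N is a boundary.
random_next_base = is_signal_boundary(1000, 280)

# If the best extension is seen 600 times, 1000 > 1800 fails: the next
# base is still conserved, so the flank probably continues past N.
conserved_next_base = is_signal_boundary(1000, 600)
```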

Parameters:

count_at_n (int, required): Observation count for the candidate at length N.
count_at_n_plus_1 (int, required): Observation count for the best candidate at length N+1.
alphabet_size (int, default 4): Size of the sequence alphabet. 4 for DNA, 20 for protein.
safety_margin (float, default 0.75): Fraction of the theoretical ratio to require. The default of 0.75 requires 3x for DNA instead of the theoretical 4x.

Returns:

bool: True if count_at_n > alphabet_size * safety_margin * count_at_n_plus_1, or if count_at_n_plus_1 <= 0.

Note

At very low counts (below ~30), Poisson sampling noise may exceed the safety margin. Callers processing sparse data should enforce a minimum observation count before relying on this test.
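One way a caller might enforce such a minimum is a thin wrapper around the test. The sketch below is illustrative only: is_confident_boundary and MIN_COUNT are hypothetical names that do not appear in the library, the cutoff of 30 is taken from the rule of thumb in the note above, and applying it to count_at_n (rather than to both counts) is an assumption.

```python
def is_signal_boundary(count_at_n, count_at_n_plus_1, *,
                       alphabet_size=4, safety_margin=0.75):
    # Local copy of the function documented on this page.
    if count_at_n_plus_1 <= 0:
        return True
    threshold = alphabet_size * safety_margin
    return count_at_n > threshold * count_at_n_plus_1


MIN_COUNT = 30  # assumption: the "~30" rule of thumb from the note


def is_confident_boundary(count_at_n, count_at_n_plus_1, *,
                          min_count=MIN_COUNT,
                          alphabet_size=4, safety_margin=0.75):
    """Hypothetical wrapper: refuse to call a boundary on sparse data."""
    if count_at_n < min_count:
        # Too few observations for the ratio test to be meaningful.
        return False
    return is_signal_boundary(
        count_at_n,
        count_at_n_plus_1,
        alphabet_size=alphabet_size,
        safety_margin=safety_margin,
    )
```

With these definitions, is_signal_boundary(12, 2) passes the bare ratio test (12 > 6), while is_confident_boundary(12, 2) rejects it because 12 is below the minimum count.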

Examples:

>>> is_signal_boundary(1000, 280, alphabet_size=4)
True
Source code in src/seqchain/primitives/signal.py
def is_signal_boundary(
    count_at_n: int,
    count_at_n_plus_1: int,
    *,
    alphabet_size: int = 4,
    safety_margin: float = 0.75,
) -> bool:
    """Test whether extending a sequence by one base crosses a signal boundary.

    A real flanking sequence at length N will have roughly ``alphabet_size``
    times the count of any extension to length N+1, because the next base
    is random and splits the counts ``alphabet_size`` ways. The threshold is
    ``alphabet_size * safety_margin`` to tolerate sampling noise.

    Args:
        count_at_n: Observation count for the candidate at length N.
        count_at_n_plus_1: Observation count for the best candidate at
            length N+1.
        alphabet_size: Size of the sequence alphabet. 4 for DNA, 20 for
            protein. Defaults to 4.
        safety_margin: Fraction of the theoretical ratio to require.
            Defaults to 0.75 (i.e., require 3x for DNA instead of the
            theoretical 4x).

    Returns:
        ``True`` if ``count_at_n > alphabet_size * safety_margin *
        count_at_n_plus_1``, or if ``count_at_n_plus_1 <= 0``.

    Note:
        At very low counts (below ~30), Poisson sampling noise may
        exceed the safety margin. Callers processing sparse data
        should enforce a minimum observation count before relying
        on this test.

    Examples:
        >>> is_signal_boundary(1000, 280, alphabet_size=4)
        True
    """
    if count_at_n_plus_1 <= 0:
        return True
    threshold = alphabet_size * safety_margin
    return count_at_n > threshold * count_at_n_plus_1

is_dominant

is_dominant(top_count: int, runner_up_count: int, *, dominance: float = 2.0) -> bool

Test whether the top candidate dominates the runner-up.

Used for offset and orientation convergence: the most common offset must be at least dominance times as frequent as the second most common before sampling can stop.
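A convergence check of this kind might be wired up as follows. This is a sketch under stated assumptions: offset_has_converged is a hypothetical helper, not part of the library, and the list of observed offsets is made up. is_dominant is copied from the source listing on this page so the block runs standalone.

```python
from collections import Counter


def is_dominant(top_count, runner_up_count, *, dominance=2.0):
    # Local copy of the function documented on this page.
    if runner_up_count <= 0:
        return True
    return top_count >= dominance * runner_up_count


def offset_has_converged(offsets, *, dominance=2.0):
    """Hypothetical helper: tally offsets and test the top two counts."""
    ranked = Counter(offsets).most_common(2)
    if not ranked:
        # Nothing observed yet, so there is nothing to converge on.
        return False
    top_count = ranked[0][1]
    # With a single distinct offset there is no runner-up.
    runner_up_count = ranked[1][1] if len(ranked) > 1 else 0
    return is_dominant(top_count, runner_up_count, dominance=dominance)


# Hypothetical sample: offset 16 seen 100 times, offset 17 seen 40 times.
converged = offset_has_converged([16] * 100 + [17] * 40)
```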

Parameters:

top_count (int, required): Observation count for the leading candidate.
runner_up_count (int, required): Observation count for the second candidate.
dominance (float, default 2.0): Required ratio of top to runner-up.

Returns:

bool: True if top_count >= dominance * runner_up_count, or if there is no runner-up (runner_up_count <= 0).

Examples:

>>> is_dominant(100, 40)
True
Source code in src/seqchain/primitives/signal.py
def is_dominant(
    top_count: int,
    runner_up_count: int,
    *,
    dominance: float = 2.0,
) -> bool:
    """Test whether the top candidate dominates the runner-up.

    Used for offset and orientation convergence: the most common offset
    must be at least ``dominance`` times as frequent as the second
    most common before sampling can stop.

    Args:
        top_count: Observation count for the leading candidate.
        runner_up_count: Observation count for the second candidate.
        dominance: Required ratio of top to runner-up. Defaults to 2.0.

    Returns:
        ``True`` if top_count >= dominance * runner_up_count, or if
        there is no runner-up (runner_up_count <= 0).

    Examples:
        >>> is_dominant(100, 40)
        True
    """
    if runner_up_count <= 0:
        return True
    return top_count >= dominance * runner_up_count

diversity_saturated

diversity_saturated(observed: int, expected: int, *, factor: float = 5.0) -> bool

Test whether sampling has seen enough diversity to be confident.

Sampling should continue until observed >= factor * expected. The default factor of 5 means we want 5x oversampling before declaring saturation.
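The stopping rule could drive a batch-wise sampling loop along these lines. This is a minimal sketch: observations_until_saturated is a hypothetical helper, and the batch sizes are invented. diversity_saturated is copied from the source listing on this page so the block is self-contained.

```python
def diversity_saturated(observed, expected, *, factor=5.0):
    # Local copy of the function documented on this page.
    return observed >= factor * expected


def observations_until_saturated(batch_sizes, expected, *, factor=5.0):
    """Hypothetical helper: accumulate batches until the test passes.

    Returns the running total at the first saturated batch, or None if
    the batches run out before saturation is reached.
    """
    observed = 0
    for size in batch_sizes:
        observed += size
        if diversity_saturated(observed, expected, factor=factor):
            return observed
    return None


# Hypothetical run: a library of 100 members sampled in batches of 200.
# Totals of 200 and 400 fall short of 5 * 100; the third batch reaches 600.
total = observations_until_saturated([200, 200, 200], expected=100)
```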

Parameters:

observed (int, required): Number of observations (diversity hits, novel reads, novel barcodes, etc.).
expected (int, required): Expected population size (typically library size).
factor (float, default 5.0): Required oversampling ratio.

Returns:

bool: True if observed >= factor * expected.

Examples:

>>> diversity_saturated(500, 100)
True
Source code in src/seqchain/primitives/signal.py
def diversity_saturated(
    observed: int,
    expected: int,
    *,
    factor: float = 5.0,
) -> bool:
    """Test whether sampling has seen enough diversity to be confident.

    Sampling should continue until ``observed >= factor * expected``.
    The default factor of 5 means we want 5x oversampling before
    declaring saturation.

    Args:
        observed: Number of observations (diversity hits, novel reads,
            novel barcodes, etc.).
        expected: Expected population size (typically library size).
        factor: Required oversampling ratio. Defaults to 5.0.

    Returns:
        ``True`` if observed >= factor * expected.

    Examples:
        >>> diversity_saturated(500, 100)
        True
    """
    return observed >= factor * expected