Signal

Signal detection primitives for count-based structure discovery.

Pure functions for deciding whether observed counts represent real signal or noise. Used by heuristic_discover() and tnseq_discover() to auto-detect read structure (barcode offsets, transposon flanks) from raw sequencing data.

is_signal_boundary

is_signal_boundary(count_at_n: int, count_at_n_plus_1: int, *, alphabet_size: int = 4, safety_margin: float = 0.75) -> bool

Test whether extending a sequence by one base crosses a signal boundary.

A real flanking sequence at length N will have roughly alphabet_size times the count of any extension to length N+1, because the next base is random and splits the counts alphabet_size ways. The required ratio is relaxed to alphabet_size * safety_margin to tolerate sampling noise.
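The arithmetic behind the threshold can be illustrated with a short sketch. The function body below is a local copy of the logic in the source listing so the snippet runs standalone; the counts are hypothetical values chosen for illustration, not measurements.

```python
def is_signal_boundary(count_at_n, count_at_n_plus_1, *,
                       alphabet_size=4, safety_margin=0.75):
    # Local copy of the documented logic, so this sketch has no imports.
    if count_at_n_plus_1 <= 0:
        return True
    threshold = alphabet_size * safety_margin  # 3.0 for DNA with defaults
    return count_at_n > threshold * count_at_n_plus_1


# Hypothetical counts for a candidate flank of length N.
flank_count = 1000

# Case 1: the next base is random, so the best extension keeps
# roughly a quarter of the reads.
random_extension = 260

# Case 2: the next base is still part of the fixed flank, so the
# best extension keeps nearly all reads.
fixed_extension = 980

# 1000 > 3.0 * 260 (= 780), so the flank ends at length N.
at_boundary = is_signal_boundary(flank_count, random_extension)

# 1000 > 3.0 * 980 (= 2940) is false, so the flank continues past N.
inside_flank = is_signal_boundary(flank_count, fixed_extension)

print(at_boundary, inside_flank)
```

With the default safety_margin, any best extension that retains more than a third of the parent count is treated as still inside the fixed sequence.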

Parameters:

Name	Type	Description	Default
`count_at_n`	`int`	Observation count for the candidate at length N.	required
`count_at_n_plus_1`	`int`	Observation count for the best candidate at length N+1.	required
`alphabet_size`	`int`	Size of the sequence alphabet. 4 for DNA, 20 for protein. Defaults to 4.	`4`
`safety_margin`	`float`	Fraction of the theoretical ratio to require. Defaults to 0.75 (i.e., require 3x for DNA instead of the theoretical 4x).	`0.75`

Returns:

Type	Description
`bool`	`True` if count_at_n > alphabet_size * safety_margin * count_at_n_plus_1, or if count_at_n_plus_1 <= 0 (no extension was observed).

Note

At very low counts (below ~30), Poisson sampling noise may exceed the safety margin. Callers processing sparse data should enforce a minimum observation count before relying on this test.
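One way a caller might enforce the minimum count is a thin guard around the test. This is a sketch only: guarded_boundary and MIN_COUNT are hypothetical names that are not part of the module, the cutoff of 30 is taken from the approximate figure in the note, and the function body is a local copy of the source listing so the snippet runs standalone.

```python
def is_signal_boundary(count_at_n, count_at_n_plus_1, *,
                       alphabet_size=4, safety_margin=0.75):
    # Local copy of the documented logic, so this sketch has no imports.
    if count_at_n_plus_1 <= 0:
        return True
    threshold = alphabet_size * safety_margin
    return count_at_n > threshold * count_at_n_plus_1


# Hypothetical cutoff, based on the "~30" in the note above.
MIN_COUNT = 30


def guarded_boundary(count_at_n, count_at_n_plus_1, *,
                     min_count=MIN_COUNT, **kwargs):
    """Return None when the data is too sparse to judge, else the test result."""
    if count_at_n < min_count:
        return None  # caller should keep sampling before deciding
    return is_signal_boundary(count_at_n, count_at_n_plus_1, **kwargs)


print(guarded_boundary(12, 3))      # too sparse: no verdict
print(guarded_boundary(1000, 280))  # enough data: the normal test applies
```

Returning None rather than False keeps "not enough data" distinct from "no boundary here", which matters if the caller decides whether to keep sampling.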

Examples:

>>> is_signal_boundary(1000, 280, alphabet_size=4)
True

Source code in src/seqchain/primitives/signal.py

def is_signal_boundary(
    count_at_n: int,
    count_at_n_plus_1: int,
    *,
    alphabet_size: int = 4,
    safety_margin: float = 0.75,
) -> bool:
    """Test whether extending a sequence by one base crosses a signal boundary.

    A real flanking sequence at length N will have roughly ``alphabet_size``
    times the count of any extension to length N+1, because the next base
    is random and splits the counts ``alphabet_size`` ways. The required
    ratio is relaxed to ``alphabet_size * safety_margin`` to tolerate
    sampling noise.

    Args:
        count_at_n: Observation count for the candidate at length N.
        count_at_n_plus_1: Observation count for the best candidate at
            length N+1.
        alphabet_size: Size of the sequence alphabet. 4 for DNA, 20 for
            protein. Defaults to 4.
        safety_margin: Fraction of the theoretical ratio to require.
            Defaults to 0.75 (i.e., require 3x for DNA instead of the
            theoretical 4x).

    Returns:
        ``True`` if ``count_at_n > alphabet_size * safety_margin *
        count_at_n_plus_1``, or if ``count_at_n_plus_1 <= 0`` (no
        extension was observed).

    Note:
        At very low counts (below ~30), Poisson sampling noise may
        exceed the safety margin. Callers processing sparse data
        should enforce a minimum observation count before relying
        on this test.

    Examples:
        >>> is_signal_boundary(1000, 280, alphabet_size=4)
        True
    """
    if count_at_n_plus_1 <= 0:
        return True
    threshold = alphabet_size * safety_margin
    return count_at_n > threshold * count_at_n_plus_1

is_dominant

is_dominant(top_count: int, runner_up_count: int, *, dominance: float = 2.0) -> bool

Test whether the top candidate dominates the runner-up.

Used for offset and orientation convergence: the most common offset must be at least dominance times more frequent than the second most common before sampling can stop.
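A convergence check of this kind might be wired up as follows. This is a minimal sketch, assuming offsets are tallied in a collections.Counter; offset_converged is a hypothetical helper, not part of the module, and the is_dominant body is a local copy of the source listing so the snippet runs standalone.

```python
from collections import Counter


def is_dominant(top_count, runner_up_count, *, dominance=2.0):
    # Local copy of the documented logic.
    if runner_up_count <= 0:
        return True
    return top_count >= dominance * runner_up_count


def offset_converged(offset_counts, *, dominance=2.0):
    """Hypothetical helper: has the most common offset pulled clear?"""
    ranked = offset_counts.most_common(2)
    if not ranked:
        return False  # nothing observed yet, so nothing can dominate
    top = ranked[0][1]
    # A single observed offset has no runner-up; pass 0 to signal that.
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return is_dominant(top, runner_up, dominance=dominance)


# Hypothetical tallies keyed by barcode offset.
clear_winner = Counter({12: 100, 13: 40, 11: 5})
close_race = Counter({12: 60, 13: 50})

converged = offset_converged(clear_winner)  # 100 >= 2.0 * 40
undecided = offset_converged(close_race)    # 60 < 2.0 * 50

print(converged, undecided)
```

Note that an empty tally is handled in the helper, because is_dominant itself reports True whenever there is no runner-up.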

Parameters:

Name	Type	Description	Default
`top_count`	`int`	Observation count for the leading candidate.	required
`runner_up_count`	`int`	Observation count for the second candidate.	required
`dominance`	`float`	Required ratio of top to runner-up. Defaults to 2.0.	`2.0`

Returns:

Type	Description
`bool`	`True` if top_count >= dominance * runner_up_count, or if there is no runner-up (runner_up_count <= 0).

Examples:

>>> is_dominant(100, 40)
True

Source code in src/seqchain/primitives/signal.py

def is_dominant(
    top_count: int,
    runner_up_count: int,
    *,
    dominance: float = 2.0,
) -> bool:
    """Test whether the top candidate dominates the runner-up.

    Used for offset and orientation convergence: the most common offset
    must be at least ``dominance`` times more frequent than the second
    most common before sampling can stop.

    Args:
        top_count: Observation count for the leading candidate.
        runner_up_count: Observation count for the second candidate.
        dominance: Required ratio of top to runner-up. Defaults to 2.0.

    Returns:
        ``True`` if top_count >= dominance * runner_up_count, or if
        there is no runner-up (runner_up_count <= 0).

    Examples:
        >>> is_dominant(100, 40)
        True
    """
    if runner_up_count <= 0:
        return True
    return top_count >= dominance * runner_up_count

diversity_saturated

diversity_saturated(observed: int, expected: int, *, factor: float = 5.0) -> bool

Test whether sampling has seen enough diversity to be confident.

Sampling should continue until observed >= factor * expected. The default factor of 5 means we want 5x oversampling before declaring saturation.
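The stopping rule can be sketched as a batch-wise sampling loop. batches_until_saturated is a hypothetical helper, not part of the module; the batch sizes are made-up numbers, and the diversity_saturated body is a local copy of the source listing so the snippet runs standalone.

```python
def diversity_saturated(observed, expected, *, factor=5.0):
    # Local copy of the documented logic.
    return observed >= factor * expected


def batches_until_saturated(batch_sizes, expected, *, factor=5.0):
    """Hypothetical helper: count batches consumed before saturation.

    Returns None if the input runs out before the rule is satisfied.
    """
    observed = 0
    for batch_number, size in enumerate(batch_sizes, start=1):
        observed += size
        if diversity_saturated(observed, expected, factor=factor):
            return batch_number
    return None


# With expected=100 and the default factor, sampling stops once the
# running total reaches 500: totals are 200, 400, then 600.
needed = batches_until_saturated([200, 200, 200], expected=100)

# A single batch of 100 never reaches 5 * 100.
exhausted = batches_until_saturated([100], expected=100)

print(needed, exhausted)
```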

Parameters:

Name	Type	Description	Default
`observed`	`int`	Number of observations (diversity hits, novel reads, novel barcodes, etc.).	required
`expected`	`int`	Expected population size (typically library size).	required
`factor`	`float`	Required oversampling ratio. Defaults to 5.0.	`5.0`

Returns:

Type	Description
`bool`	`True` if observed >= factor * expected.

Examples:

>>> diversity_saturated(500, 100)
True

Source code in src/seqchain/primitives/signal.py

def diversity_saturated(
    observed: int,
    expected: int,
    *,
    factor: float = 5.0,
) -> bool:
    """Test whether sampling has seen enough diversity to be confident.

    Sampling should continue until ``observed >= factor * expected``.
    The default factor of 5 means we want 5x oversampling before
    declaring saturation.

    Args:
        observed: Number of observations (diversity hits, novel reads,
            novel barcodes, etc.).
        expected: Expected population size (typically library size).
        factor: Required oversampling ratio. Defaults to 5.0.

    Returns:
        ``True`` if observed >= factor * expected.

    Examples:
        >>> diversity_saturated(500, 100)
        True
    """
    return observed >= factor * expected