Genome

Load genomes from GenBank and FASTA files. The `Genome` dataclass is frozen: it holds sequences, features, organism metadata, and per-chromosome topology (circular or linear).

from seqchain.io.genome import load_genbank

genome = load_genbank("yeast.gb")
print(genome.chrom_lengths)   # {'chrI': 230218, 'chrII': 813184, ...}
print(genome.genes()[:3])     # First 3 gene features

Genome `dataclass`

Genome(name: str, organisms: dict[str, str], sequences: dict[str, str], features: tuple[Region, ...], topologies: dict[str, str])

Immutable container for a parsed genome.

Holds sequences, features, organism names, and topology information for one or more chromosomes or contigs.

Parameters:

Name	Type	Description	Default
`name`	`str`	Genome name (typically the filename stem).	required
`organisms`	`dict[str, str]`	Mapping of chromosome name to organism string.	required
`sequences`	`dict[str, str]`	Mapping of chromosome name to uppercase DNA string.	required
`features`	`tuple[Region, ...]`	All extracted features across all chromosomes.	required
`topologies`	`dict[str, str]`	Mapping of chromosome name to `"circular"` or `"linear"`.	required

Examples:

>>> g = Genome("test", {}, {"chr1": "ATGC"}, (), {})
>>> g.chrom_lengths
{'chr1': 4}
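
To illustrate what "frozen" means in practice, here is a minimal sketch using a hypothetical stand-in class, `GenomeSketch`, that copies only the field names from the signature above. It is not the library's implementation; it only shows that attribute assignment on a frozen dataclass is rejected.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class GenomeSketch:
    """Hypothetical stand-in with the same field names as Genome."""

    name: str
    organisms: dict[str, str]
    sequences: dict[str, str]
    features: tuple
    topologies: dict[str, str]

    @property
    def chrom_lengths(self) -> dict[str, int]:
        # Same shape as the documented property: {chrom: len(sequence)}.
        return {chrom: len(seq) for chrom, seq in self.sequences.items()}


g = GenomeSketch("test", {}, {"chr1": "ATGC"}, (), {})

try:
    g.name = "renamed"  # attribute assignment is rejected on a frozen dataclass
except FrozenInstanceError:
    rebind_failed = True
else:
    rebind_failed = False

print(rebind_failed, g.chrom_lengths)
```

To change a field on a frozen dataclass, the usual route is `dataclasses.replace(g, name="renamed")`, which builds a new instance and leaves the original untouched.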

chrom_lengths `property`

chrom_lengths: dict[str, int]

Mapping of chromosome name to sequence length.

Returns:

Type	Description
`dict[str, int]`	Dict of `{chrom: len(sequence)}`.

Examples:

>>> Genome("g", {}, {"c1": "ATGC"}, (), {}).chrom_lengths
{'c1': 4}

chroms `property`

chroms: list[str]

Sorted list of chromosome names.

Returns:

Type	Description
`list[str]`	Chromosome names in sorted order.

Examples:

>>> Genome("g", {}, {"b": "AT", "a": "GC"}, (), {}).chroms
['a', 'b']

features_on

features_on(chrom: str) -> tuple[Region, ...]

Filter features to a single chromosome.

Parameters:

Name	Type	Description	Default
`chrom`	`str`	Chromosome name to filter by.	required

Returns:

Type	Description
`tuple[Region, ...]`	Tuple of regions on the given chromosome.

Examples:

>>> r = Region("chr1", 10, 20)
>>> Genome("g", {}, {}, (r,), {}).features_on("chr1")
(Region(chrom='chr1', start=10, end=20, strand='.', score=0.0, name='', tags={}),)

Source code in src/seqchain/io/genome.py

def features_on(self, chrom: str) -> tuple[Region, ...]:
    """Filter features to a single chromosome.

    Args:
        chrom: Chromosome name to filter by.

    Returns:
        Tuple of regions on the given chromosome.

    Examples:
        >>> r = Region("chr1", 10, 20)
        >>> Genome("g", {}, {}, (r,), {}).features_on("chr1")
        (Region(chrom='chr1', start=10, end=20, strand='.', score=0.0, name='', tags={}),)
    """
    return tuple(f for f in self.features if f.chrom == chrom)

genes

genes() -> tuple[Region, ...]

Filter features to gene-type only.

Returns:

Type	Description
`tuple[Region, ...]`	Tuple of regions where `tags["feature_type"] == "gene"`.

Examples:

>>> r = Region("c", 0, 10, tags={"feature_type": "gene"})
>>> Genome("g", {}, {}, (r,), {}).genes()
(Region(chrom='c', start=0, end=10, strand='.', score=0.0, name='', tags={'feature_type': 'gene'}),)
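
The same tag lookup can be applied to other feature types. The sketch below uses a hypothetical `RegionSketch` stand-in (reduced to the fields needed here) and a hypothetical helper, `features_of_type`; neither is part of the documented API. It assumes features carry their type under `tags["feature_type"]`, as `genes()` does.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RegionSketch:
    """Hypothetical stand-in for Region, reduced to the fields used here."""

    chrom: str
    start: int
    end: int
    tags: dict = field(default_factory=dict)


def features_of_type(features, feature_type):
    # tags.get() returns None for untagged features, so they are skipped
    # rather than raising KeyError.
    return tuple(f for f in features if f.tags.get("feature_type") == feature_type)


features = (
    RegionSketch("c", 0, 10, tags={"feature_type": "gene"}),
    RegionSketch("c", 0, 10, tags={"feature_type": "CDS"}),
    RegionSketch("c", 20, 30),  # no feature_type tag at all
)

genes = features_of_type(features, "gene")
cds = features_of_type(features, "CDS")
print(len(genes), len(cds))
```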

Source code in src/seqchain/io/genome.py

def genes(self) -> tuple[Region, ...]:
    """Filter features to gene-type only.

    Returns:
        Tuple of regions where ``tags["feature_type"] == "gene"``.

    Examples:
        >>> r = Region("c", 0, 10, tags={"feature_type": "gene"})
        >>> Genome("g", {}, {}, (r,), {}).genes()
        (Region(chrom='c', start=0, end=10, strand='.', score=0.0, name='', tags={'feature_type': 'gene'}),)
    """
    return tuple(f for f in self.features if f.tags.get("feature_type") == "gene")

feature_index

feature_index() -> IntervalTrack

Build an IntervalTrack from all features.

Returns:

Type	Description
`IntervalTrack`	An `IntervalTrack` indexing every feature in this genome.

Examples:

>>> Genome("g", {}, {}, (), {}).feature_index().name
'g_features'

Source code in src/seqchain/io/genome.py

def feature_index(self) -> IntervalTrack:
    """Build an IntervalTrack from all features.

    Returns:
        An `IntervalTrack` indexing every
        feature in this genome.

    Examples:
        >>> Genome("g", {}, {}, (), {}).feature_index().name
        'g_features'
    """
    return IntervalTrack(TrackLabel(f"{self.name}_features"), self.features)

is_circular

is_circular(chrom: str) -> bool

Check whether a chromosome has circular topology.

Parameters:

Name	Type	Description	Default
`chrom`	`str`	Chromosome name.	required

Returns:

Type	Description
`bool`	`True` if topology is `"circular"`, `False` otherwise.

Examples:

>>> Genome("g", {}, {}, (), {"c": "circular"}).is_circular("c")
True
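
One consequence of the lookup is worth spelling out: a chromosome name that is absent from `topologies` reads as linear instead of raising. The sketch below is a standalone function that mirrors the method's one-line lookup, so it runs without the library installed.

```python
def is_circular(topologies: dict[str, str], chrom: str) -> bool:
    # Standalone copy of the method's lookup: missing keys read as "linear".
    return topologies.get(chrom, "linear") == "circular"


topologies = {"plasmid": "circular", "chr1": "linear"}

print(is_circular(topologies, "plasmid"))  # listed as circular
print(is_circular(topologies, "chr1"))     # listed as linear
print(is_circular(topologies, "unknown"))  # not listed: treated as linear
```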

Source code in src/seqchain/io/genome.py

def is_circular(self, chrom: str) -> bool:
    """Check whether a chromosome has circular topology.

    Args:
        chrom: Chromosome name.

    Returns:
        ``True`` if topology is ``"circular"``, ``False`` otherwise.

    Examples:
        >>> Genome("g", {}, {}, (), {"c": "circular"}).is_circular("c")
        True
    """
    return self.topologies.get(chrom, "linear") == "circular"

topological_sequence

topological_sequence(chrom: str, overhang: int = 100000) -> str

Return sequence with overhang appended for circular chromosomes.

For circular chromosomes, appends the first `overhang` bases of the sequence to the end, so that matches spanning the origin can be found. For linear chromosomes, returns the sequence unchanged.

Parameters:

Name	Type	Description	Default
`chrom`	`str`	Chromosome name.	required
`overhang`	`int`	Number of bases to append from the start. Defaults to `100_000`.	`100000`

Returns:

Type	Description
`str`	The (possibly extended) DNA string.

Examples:

>>> g = Genome("g", {}, {"c": "ATGC"}, (), {"c": "circular"})
>>> g.topological_sequence("c", overhang=2)
'ATGCAT'
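
To show why the overhang matters, the sketch below searches for a motif that exists only across the origin. It uses a standalone function that takes the topology as a boolean flag (an assumption made so that the example runs without the library); the slicing logic is the same as in the method.

```python
def topological_sequence(seq: str, circular: bool, overhang: int = 100_000) -> str:
    # Standalone version of the method's logic, taking the topology as a flag.
    return seq + seq[:overhang] if circular else seq


seq = "ATGC"
motif = "CA"  # last base + first base: only present across the origin

# len(motif) - 1 is the smallest overhang that exposes every origin-spanning hit.
extended = topological_sequence(seq, circular=True, overhang=len(motif) - 1)

found_linear = motif in seq
found_circular = motif in extended

# Slicing past the end is safe: an overhang longer than the sequence
# simply appends the whole sequence once.
doubled = topological_sequence(seq, circular=True)

print(found_linear, found_circular, extended, doubled)
```

With an overhang larger than `len(motif) - 1`, hits that start at or beyond `len(seq)` in the extended string repeat hits near the start of the chromosome, so callers may want to discard them.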

Source code in src/seqchain/io/genome.py

def topological_sequence(self, chrom: str, overhang: int = 100_000) -> str:
    """Return sequence with overhang appended for circular chromosomes.

    For circular chromosomes, appends the first *overhang* bases of
    the sequence to the end, enabling alignment across the origin.
    For linear chromosomes, returns the sequence unchanged.

    Args:
        chrom: Chromosome name.
        overhang: Number of bases to append from the start.
            Defaults to ``100_000``.

    Returns:
        The (possibly extended) DNA string.

    Examples:
        >>> g = Genome("g", {}, {"c": "ATGC"}, (), {"c": "circular"})
        >>> g.topological_sequence("c", overhang=2)
        'ATGCAT'
    """
    seq = self.sequences[chrom]
    if not self.is_circular(chrom):
        return seq
    return seq + seq[:overhang]

load_genbank

load_genbank(path: str | Path) -> Genome

Parse a GenBank file into a Genome.

Supports plain `.gb` and gzip-compressed `.gb.gz` files. Extracts all features except `source` as `Region` objects.

BioPython is imported inside the function body (deferred import) so that the rest of the library can be used without it installed.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to a GenBank file (`.gb` or `.gb.gz`).	required

Returns:

Type	Description
`Genome`	A `Genome` populated with sequences, features, and metadata.

Examples:

>>> genome = load_genbank("tests/fixtures/synthetic_genome.gb")
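
The gzip handling and the derived genome name can be checked without BioPython. The sketch below isolates those two steps into hypothetical helpers, `open_text` and `genome_name` (names chosen for illustration, not part of the library), and exercises them on throwaway files. It assumes Python 3.9 or later for `str.removesuffix`.

```python
import gzip
import tempfile
from pathlib import Path


def open_text(path: Path):
    # Same suffix test as load_genbank: gzip.open for *.gz, builtin open otherwise.
    open_func = gzip.open if path.name.endswith(".gz") else open
    return open_func(path, "rt")


def genome_name(path: Path) -> str:
    # Path.stem drops only the last suffix, so "x.gb.gz" -> "x.gb" -> "x".
    return path.stem.removesuffix(".gb")


with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "demo.gb"
    zipped = Path(tmp) / "demo.gb.gz"
    plain.write_text("LOCUS demo\n")
    with gzip.open(zipped, "wt") as handle:
        handle.write("LOCUS demo\n")

    # Both files are read through the same helper, in text mode.
    with open_text(plain) as handle:
        plain_text = handle.read()
    with open_text(zipped) as handle:
        zipped_text = handle.read()

names = [genome_name(plain), genome_name(zipped)]
print(names, plain_text == zipped_text)
```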

Source code in src/seqchain/io/genome.py

def load_genbank(path: str | Path) -> Genome:
    """Parse a GenBank file into a Genome.

    Supports plain ``.gb`` and gzip-compressed ``.gb.gz`` files.
    Extracts all features except ``source`` as `Region` objects.

    BioPython is imported inside the function body (deferred import)
    so that the rest of the library can be used without it installed.

    Args:
        path: Path to a GenBank file (``.gb`` or ``.gb.gz``).

    Returns:
        A `Genome` populated with sequences, features, and metadata.

    Examples:
        >>> genome = load_genbank("tests/fixtures/synthetic_genome.gb")
    """
    from Bio import SeqIO
    from Bio.SeqFeature import CompoundLocation

    path = Path(path)
    open_func = gzip.open if path.name.endswith(".gz") else open

    organisms: dict[str, str] = {}
    sequences: dict[str, str] = {}
    topologies: dict[str, str] = {}
    features: list[Region] = []

    try:
        with open_func(path, "rt") as handle:
            for record in SeqIO.parse(handle, "genbank"):
                chrom = record.id
                sequences[chrom] = str(record.seq).upper()
                organisms[chrom] = record.annotations.get("organism", "")
                topologies[chrom] = record.annotations.get("topology", "linear")
                seq_len = len(record.seq)

                for feature in record.features:
                    if feature.type == "source":
                        continue

                    region = _extract_feature(
                        feature, chrom, seq_len, CompoundLocation,
                    )
                    features.append(region)
    except FileNotFoundError:
        raise FileNotFoundError(
            f"Genome file not found: {path}"
        ) from None
    except Exception as exc:
        if isinstance(exc, FileNotFoundError):
            raise
        raise ValueError(
            f"Failed to parse GenBank file {path}: {exc}"
        ) from exc

    return Genome(
        name=path.stem.removesuffix(".gb"),
        organisms=organisms,
        sequences=sequences,
        features=tuple(features),
        topologies=topologies,
    )

load_fasta

load_fasta(path: str | Path, circular: set[str] | None = None) -> Genome

Parse a FASTA file into a Genome.

Produces a Genome with sequences only: no features and no organism metadata. All chromosomes default to linear topology unless listed in `circular`.

Supports plain and gzip-compressed (`.gz`) files. BioPython is not required; the parser uses only the standard library.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to a FASTA file.	required
`circular`	`set[str] \| None`	Set of chromosome names to mark as circular. Defaults to `None` (all linear).	`None`

Returns:

Type	Description
`Genome`	A `Genome` with sequences and topologies only.

Examples:

>>> genome = load_fasta("tests/fixtures/synthetic_genome.fa")
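
The source of `iter_fasta` is not shown on this page, so the following is only a sketch of what a stdlib-only FASTA parser might look like, followed by the same topology assignment that `load_fasta` performs. The function name `iter_fasta_lines`, the rule that a record's name is the first whitespace-delimited token of its header, and the uppercasing of sequences are all assumptions.

```python
from typing import Iterable, Iterator


def iter_fasta_lines(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the name is the first token of the header line.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            # Assumption: sequences are normalised to uppercase.
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


fasta = """>chr1 test chromosome
ATGC
atgc
>pMini plasmid
GGCC
""".splitlines()

circular = {"pMini"}
sequences: dict[str, str] = {}
topologies: dict[str, str] = {}
for name, seq in iter_fasta_lines(fasta):
    sequences[name] = seq
    # Same rule as load_fasta: circular only if explicitly listed.
    topologies[name] = "circular" if name in circular else "linear"

print(sequences, topologies)
```

With the library itself, the equivalent call would be `load_fasta(path, circular={"pMini"})`; the names in `circular` have to match the record names exactly as the parser yields them.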

Source code in src/seqchain/io/genome.py

def load_fasta(
    path: str | Path,
    circular: set[str] | None = None,
) -> Genome:
    """Parse a FASTA file into a Genome.

    Produces a Genome with sequences only — no features, no organism
    metadata. All chromosomes default to linear topology unless
    listed in *circular*.

    Supports plain and gzip-compressed (``.gz``) files. No BioPython
    required — uses a stdlib-only parser.

    Args:
        path: Path to a FASTA file.
        circular: Set of chromosome names to mark as circular.
            Defaults to ``None`` (all linear).

    Returns:
        A `Genome` with sequences and topologies only.

    Examples:
        >>> genome = load_fasta("tests/fixtures/synthetic_genome.fa")
    """
    path = Path(path)
    circular = circular or set()

    sequences: dict[str, str] = {}
    organisms: dict[str, str] = {}
    topologies: dict[str, str] = {}

    from seqchain.io.fasta import iter_fasta

    for name, seq in iter_fasta(path):
        sequences[name] = seq
        organisms[name] = ""
        topologies[name] = "circular" if name in circular else "linear"

    return Genome(
        name=path.stem,
        organisms=organisms,
        sequences=sequences,
        features=(),
        topologies=topologies,
    )
