Genome

Load genomes from GenBank and FASTA files. Genome is a frozen dataclass that holds sequences, features, organism metadata, and per-chromosome topology (circular or linear).

from seqchain.io.genome import load_genbank

genome = load_genbank("yeast.gb")
print(genome.chrom_lengths)   # {'chrI': 230218, 'chrII': 813184, ...}
print(genome.genes()[:3])     # First 3 gene features

Genome dataclass

Genome(name: str, organisms: dict[str, str], sequences: dict[str, str], features: tuple[Region, ...], topologies: dict[str, str])

Immutable container for a parsed genome.

Holds sequences, features, organism names, and topology information for one or more chromosomes or contigs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | str | Genome name (typically the filename stem). | required |
| organisms | dict[str, str] | Mapping of chromosome name to organism string. | required |
| sequences | dict[str, str] | Mapping of chromosome name to uppercase DNA string. | required |
| features | tuple[Region, ...] | All extracted features across all chromosomes. | required |
| topologies | dict[str, str] | Mapping of chromosome name to "circular" or "linear". | required |

Examples:

>>> g = Genome("test", {}, {"chr1": "ATGC"}, (), {})
>>> g.chrom_lengths
{'chr1': 4}
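
To illustrate what "frozen" means in practice, here is a minimal sketch built on a stand-in class. `MiniGenome` is hypothetical and mirrors only two of the documented fields; it assumes the real `Genome` is declared with `@dataclass(frozen=True)`, which the description above suggests but the source shown on this page does not confirm.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class MiniGenome:
    """Hypothetical stand-in for Genome with only two of its fields."""

    name: str
    sequences: dict[str, str]

    @property
    def chrom_lengths(self) -> dict[str, int]:
        # Same shape as the documented property: {chrom: len(sequence)}.
        return {chrom: len(seq) for chrom, seq in self.sequences.items()}


g = MiniGenome("test", {"chr1": "ATGC"})

# Rebinding a field on a frozen dataclass raises FrozenInstanceError.
try:
    g.name = "other"
    rebind_failed = False
except FrozenInstanceError:
    rebind_failed = True

print(rebind_failed)      # True
print(g.chrom_lengths)    # {'chr1': 4}
```

Freezing only blocks attribute rebinding. In this stand-in the dict itself could still be mutated in place; whether the real class guards against that is not shown here.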

chrom_lengths property

chrom_lengths: dict[str, int]

Mapping of chromosome name to sequence length.

Returns:

| Type | Description |
| --- | --- |
| dict[str, int] | Dict of {chrom: len(sequence)}. |

Examples:

>>> Genome("g", {}, {"c1": "ATGC"}, (), {}).chrom_lengths
{'c1': 4}

chroms property

chroms: list[str]

Sorted list of chromosome names.

Returns:

| Type | Description |
| --- | --- |
| list[str] | Chromosome names in sorted order. |

Examples:

>>> Genome("g", {}, {"b": "AT", "a": "GC"}, (), {}).chroms
['a', 'b']

features_on

features_on(chrom: str) -> tuple[Region, ...]

Filter features to a single chromosome.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| chrom | str | Chromosome name to filter by. | required |

Returns:

| Type | Description |
| --- | --- |
| tuple[Region, ...] | Tuple of regions on the given chromosome. |

Examples:

>>> r = Region("chr1", 10, 20)
>>> Genome("g", {}, {}, (r,), {}).features_on("chr1")
(Region(chrom='chr1', start=10, end=20, strand='.', score=0.0, name='', tags={}),)
Source code in src/seqchain/io/genome.py
def features_on(self, chrom: str) -> tuple[Region, ...]:
    """Filter features to a single chromosome.

    Args:
        chrom: Chromosome name to filter by.

    Returns:
        Tuple of regions on the given chromosome.

    Examples:
        >>> r = Region("chr1", 10, 20)
        >>> Genome("g", {}, {}, (r,), {}).features_on("chr1")
        (Region(chrom='chr1', start=10, end=20, strand='.', score=0.0, name='', tags={}),)
    """
    return tuple(f for f in self.features if f.chrom == chrom)
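
As the source above shows, `features_on` scans every feature on each call. When many chromosomes are queried in a loop, grouping once may avoid the repeated scans. The sketch below uses a hypothetical `Feature` named tuple in place of `Region` and a hypothetical helper `group_by_chrom`; neither is part of seqchain.

```python
from collections import defaultdict
from typing import NamedTuple


class Feature(NamedTuple):
    """Hypothetical stand-in for Region (chrom, start, end only)."""

    chrom: str
    start: int
    end: int


def group_by_chrom(features: tuple[Feature, ...]) -> dict[str, tuple[Feature, ...]]:
    """Bucket features by chromosome in a single pass, preserving order."""
    buckets: dict[str, list[Feature]] = defaultdict(list)
    for f in features:
        buckets[f.chrom].append(f)
    return {chrom: tuple(fs) for chrom, fs in buckets.items()}


features = (Feature("chr1", 10, 20), Feature("chr2", 5, 9), Feature("chr1", 30, 40))
by_chrom = group_by_chrom(features)

# An unknown chromosome yields an empty tuple, like features_on does.
print(by_chrom.get("chr1", ()))
print(by_chrom.get("chrX", ()))   # ()
```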

genes

genes() -> tuple[Region, ...]

Filter features to gene-type only.

Returns:

| Type | Description |
| --- | --- |
| tuple[Region, ...] | Tuple of regions where tags["feature_type"] == "gene". |

Examples:

>>> r = Region("c", 0, 10, tags={"feature_type": "gene"})
>>> Genome("g", {}, {}, (r,), {}).genes()
(Region(chrom='c', start=0, end=10, strand='.', score=0.0, name='', tags={'feature_type': 'gene'}),)
Source code in src/seqchain/io/genome.py
def genes(self) -> tuple[Region, ...]:
    """Filter features to gene-type only.

    Returns:
        Tuple of regions where ``tags["feature_type"] == "gene"``.

    Examples:
        >>> r = Region("c", 0, 10, tags={"feature_type": "gene"})
        >>> Genome("g", {}, {}, (r,), {}).genes()
        (Region(chrom='c', start=0, end=10, strand='.', score=0.0, name='', tags={'feature_type': 'gene'}),)
    """
    return tuple(f for f in self.features if f.tags.get("feature_type") == "gene")

feature_index

feature_index() -> IntervalTrack

Build an IntervalTrack from all features.

Returns:

| Type | Description |
| --- | --- |
| IntervalTrack | An IntervalTrack indexing every feature in this genome. |

Examples:

>>> Genome("g", {}, {}, (), {}).feature_index().name
'g_features'
Source code in src/seqchain/io/genome.py
def feature_index(self) -> IntervalTrack:
    """Build an IntervalTrack from all features.

    Returns:
        An `IntervalTrack` indexing every
        feature in this genome.

    Examples:
        >>> Genome("g", {}, {}, (), {}).feature_index().name
        'g_features'
    """
    return IntervalTrack(TrackLabel(f"{self.name}_features"), self.features)

is_circular

is_circular(chrom: str) -> bool

Check whether a chromosome has circular topology.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| chrom | str | Chromosome name. | required |

Returns:

| Type | Description |
| --- | --- |
| bool | True if topology is "circular", False otherwise. |

Examples:

>>> Genome("g", {}, {}, (), {"c": "circular"}).is_circular("c")
True
Source code in src/seqchain/io/genome.py
def is_circular(self, chrom: str) -> bool:
    """Check whether a chromosome has circular topology.

    Args:
        chrom: Chromosome name.

    Returns:
        ``True`` if topology is ``"circular"``, ``False`` otherwise.

    Examples:
        >>> Genome("g", {}, {}, (), {"c": "circular"}).is_circular("c")
        True
    """
    return self.topologies.get(chrom, "linear") == "circular"

topological_sequence

topological_sequence(chrom: str, overhang: int = 100000) -> str

Return sequence with overhang appended for circular chromosomes.

For circular chromosomes, appends the first overhang bases of the sequence to the end, enabling alignment across the origin. For linear chromosomes, returns the sequence unchanged.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| chrom | str | Chromosome name. | required |
| overhang | int | Number of bases to append from the start. Defaults to 100_000. | 100000 |

Returns:

| Type | Description |
| --- | --- |
| str | The (possibly extended) DNA string. |

Examples:

>>> g = Genome("g", {}, {"c": "ATGC"}, (), {"c": "circular"})
>>> g.topological_sequence("c", overhang=2)
'ATGCAT'
Source code in src/seqchain/io/genome.py
def topological_sequence(self, chrom: str, overhang: int = 100_000) -> str:
    """Return sequence with overhang appended for circular chromosomes.

    For circular chromosomes, appends the first *overhang* bases of
    the sequence to the end, enabling alignment across the origin.
    For linear chromosomes, returns the sequence unchanged.

    Args:
        chrom: Chromosome name.
        overhang: Number of bases to append from the start.
            Defaults to ``100_000``.

    Returns:
        The (possibly extended) DNA string.

    Examples:
        >>> g = Genome("g", {}, {"c": "ATGC"}, (), {"c": "circular"})
        >>> g.topological_sequence("c", overhang=2)
        'ATGCAT'
    """
    seq = self.sequences[chrom]
    if not self.is_circular(chrom):
        return seq
    return seq + seq[:overhang]
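
To see why the overhang helps, consider a motif that straddles the origin of a circular sequence. The sketch below is standalone and does not use seqchain: `extend_circular` is a hypothetical helper that repeats the `seq + seq[:overhang]` expression from the source above, and plain `str.find` stands in for whatever aligner is used in practice.

```python
def extend_circular(seq: str, overhang: int) -> str:
    """Append the first `overhang` bases so matches can span the origin."""
    return seq + seq[:overhang]


seq = "ATGC"      # treated as circular: ...ATGCATGC...
motif = "GCAT"    # spans the origin (GC at the end, AT at the start)

print(seq.find(motif))                      # -1: not found in the linear string
extended = extend_circular(seq, overhang=2)
print(extended)                             # ATGCAT
print(extended.find(motif))                 # 2

# Slicing caps at the sequence length, so a large overhang appends
# at most one full extra copy.
print(extend_circular(seq, overhang=100))   # ATGCATGC
```

A match that starts at or beyond `len(seq)` in the extended string duplicates one near the start of the original, so callers would likely want to map start positions back with a modulo on the original length.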

load_genbank

load_genbank(path: str | Path) -> Genome

Parse a GenBank file into a Genome.

Supports plain .gb and gzip-compressed .gb.gz files. Extracts all features except source as Region objects.

BioPython is imported inside the function body (deferred import) so that the rest of the library can be used without it installed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str \| Path | Path to a GenBank file (.gb or .gb.gz). | required |

Returns:

| Type | Description |
| --- | --- |
| Genome | A Genome populated with sequences, features, and metadata. |

Examples:

>>> genome = load_genbank("tests/fixtures/synthetic_genome.gb")
Source code in src/seqchain/io/genome.py
def load_genbank(path: str | Path) -> Genome:
    """Parse a GenBank file into a Genome.

    Supports plain ``.gb`` and gzip-compressed ``.gb.gz`` files.
    Extracts all features except ``source`` as `Region` objects.

    BioPython is imported inside the function body (deferred import)
    so that the rest of the library can be used without it installed.

    Args:
        path: Path to a GenBank file (``.gb`` or ``.gb.gz``).

    Returns:
        A `Genome` populated with sequences, features, and metadata.

    Examples:
        >>> genome = load_genbank("tests/fixtures/synthetic_genome.gb")
    """
    from Bio import SeqIO
    from Bio.SeqFeature import CompoundLocation

    path = Path(path)
    open_func = gzip.open if path.name.endswith(".gz") else open

    organisms: dict[str, str] = {}
    sequences: dict[str, str] = {}
    topologies: dict[str, str] = {}
    features: list[Region] = []

    try:
        with open_func(path, "rt") as handle:
            for record in SeqIO.parse(handle, "genbank"):
                chrom = record.id
                sequences[chrom] = str(record.seq).upper()
                organisms[chrom] = record.annotations.get("organism", "")
                topologies[chrom] = record.annotations.get("topology", "linear")
                seq_len = len(record.seq)

                for feature in record.features:
                    if feature.type == "source":
                        continue

                    region = _extract_feature(
                        feature, chrom, seq_len, CompoundLocation,
                    )
                    features.append(region)
    except FileNotFoundError:
        raise FileNotFoundError(
            f"Genome file not found: {path}"
        ) from None
    except Exception as exc:
        # FileNotFoundError is handled by the clause above, so anything
        # reaching this point is reported as a parse failure.
        raise ValueError(
            f"Failed to parse GenBank file {path}: {exc}"
        ) from exc

    return Genome(
        name=path.stem.removesuffix(".gb"),
        organisms=organisms,
        sequences=sequences,
        features=tuple(features),
        topologies=topologies,
    )
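
The file handling in `load_genbank` can be tried in isolation, without BioPython. The sketch below reuses the `open_func` selection and the name derivation from the source above; `read_text` and `genome_name` are hypothetical helper names introduced only for this example, and `str.removesuffix` assumes Python 3.9 or later.

```python
import gzip
import tempfile
from pathlib import Path


def read_text(path: Path) -> str:
    """Read a plain or gzip-compressed text file, chosen by file name."""
    open_func = gzip.open if path.name.endswith(".gz") else open
    with open_func(path, "rt") as handle:
        return handle.read()


def genome_name(path: Path) -> str:
    """Derive the genome name with the same expression load_genbank uses."""
    return path.stem.removesuffix(".gb")


with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "demo.gb"
    packed = Path(tmp) / "demo.gb.gz"
    plain.write_text("LOCUS demo\n")
    with gzip.open(packed, "wt") as handle:
        handle.write("LOCUS demo\n")

    # Both variants should decode to the same text.
    same_text = read_text(plain) == read_text(packed)

print(same_text)                           # True
print(genome_name(Path("yeast.gb")))       # yeast
print(genome_name(Path("yeast.gb.gz")))    # yeast
```

The `removesuffix(".gb")` step matters only for the compressed case, where `Path.stem` strips `.gz` and leaves `yeast.gb`.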

load_fasta

load_fasta(path: str | Path, circular: set[str] | None = None) -> Genome

Parse a FASTA file into a Genome.

Produces a Genome with sequences only — no features, no organism metadata. All chromosomes default to linear topology unless listed in circular.

Supports plain and gzip-compressed (.gz) files. No BioPython required — uses a stdlib-only parser.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str \| Path | Path to a FASTA file. | required |
| circular | set[str] \| None | Set of chromosome names to mark as circular. Defaults to None (all linear). | None |

Returns:

| Type | Description |
| --- | --- |
| Genome | A Genome with sequences and topologies only. |

Examples:

>>> genome = load_fasta("tests/fixtures/synthetic_genome.fa")
Source code in src/seqchain/io/genome.py
def load_fasta(
    path: str | Path,
    circular: set[str] | None = None,
) -> Genome:
    """Parse a FASTA file into a Genome.

    Produces a Genome with sequences only — no features, no organism
    metadata. All chromosomes default to linear topology unless
    listed in *circular*.

    Supports plain and gzip-compressed (``.gz``) files. No BioPython
    required — uses a stdlib-only parser.

    Args:
        path: Path to a FASTA file.
        circular: Set of chromosome names to mark as circular.
            Defaults to ``None`` (all linear).

    Returns:
        A `Genome` with sequences and topologies only.

    Examples:
        >>> genome = load_fasta("tests/fixtures/synthetic_genome.fa")
    """
    path = Path(path)
    circular = circular or set()

    sequences: dict[str, str] = {}
    organisms: dict[str, str] = {}
    topologies: dict[str, str] = {}

    from seqchain.io.fasta import iter_fasta

    for name, seq in iter_fasta(path):
        sequences[name] = seq
        organisms[name] = ""
        topologies[name] = "circular" if name in circular else "linear"

    return Genome(
        name=path.stem,
        organisms=organisms,
        sequences=sequences,
        features=(),
        topologies=topologies,
    )
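
The implementation of `iter_fasta` is not shown on this page. As a rough idea of what a stdlib-only FASTA reader can look like, here is a hypothetical `iter_fasta_sketch`. It assumes that the record name is the first whitespace-delimited token of the header and that sequences are uppercased; the real function may behave differently on both points.

```python
import gzip
import tempfile
from collections.abc import Iterator
from pathlib import Path


def iter_fasta_sketch(path: Path) -> Iterator[tuple[str, str]]:
    """Yield (name, sequence) pairs from a plain or gzipped FASTA file.

    Hypothetical stand-in; the real iter_fasta may differ.
    """
    open_func = gzip.open if path.name.endswith(".gz") else open
    name = None
    chunks: list[str] = []
    with open_func(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Assumption: the name is the first token after ">".
                name = line[1:].split()[0]
                chunks = []
            else:
                # Assumption: sequences are normalized to uppercase.
                chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "demo.fa"
    fasta.write_text(">chr1 test contig\natgc\nATGC\n>plasmid\nggcc\n")
    sequences = dict(iter_fasta_sketch(fasta))

# Mirror load_fasta: everything is linear unless listed in `circular`.
circular = {"plasmid"}
topologies = {n: "circular" if n in circular else "linear" for n in sequences}

print(sequences)     # {'chr1': 'ATGCATGC', 'plasmid': 'GGCC'}
print(topologies)    # {'chr1': 'linear', 'plasmid': 'circular'}
```

With the real loader, the equivalent call would be `load_fasta(path, circular={"plasmid"})`, using the `circular` parameter documented above.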