autometa.taxonomy package

Submodules

autometa.taxonomy.database module

class autometa.taxonomy.database.TaxonomyDatabase

Bases: ABC

TaxonomyDatabase Abstract Base Class

Abstract methods

  1. parse_nodes(self)

  2. parse_names(self)

  3. parse_merged(self)

  4. parse_delnodes(self)

  5. convert_accessions_to_taxids(self)

e.g.

class GTDB(TaxonomyDatabase):
    def __init__(self, …):
        self.nodes = self.parse_nodes()
        self.names = self.parse_names()
        self.merged = self.parse_merged()
        self.delnodes = self.parse_delnodes()
        …

    def parse_nodes(self):
        …

    def parse_names(self):
        …

    def parse_merged(self):
        …

    def parse_delnodes(self):
        …

    def convert_accessions_to_taxids(self, accessions):
        …

Available methods (after aforementioned implementations):

  1. convert_taxid_dtype

  2. name

  3. rank

  4. parent

  5. lineage

  6. is_common_ancestor

  7. get_lineage_dataframe

Available attributes:

  1. CANONICAL_RANKS

  2. UNCLASSIFIED

CANONICAL_RANKS = ['species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom', 'root']
UNCLASSIFIED = 'unclassified'
abstract convert_accessions_to_taxids(accessions: Dict[str, Set[str]]) Tuple[Dict[str, Set[int]], pandas.DataFrame]

Translates subject sequence ids to taxids

Parameters:

accessions (dict) – {qseqid: {sseqid, …}, …}

Returns:

{qseqid: {taxid, taxid, …}, …}, index=range, cols=[qseqid, sseqid, raw_taxid, …, cleaned_taxid]

Return type:

Tuple[Dict[str, Set[int]], pd.DataFrame]

convert_taxid_dtype(taxid: int) int

  1. Converts the given taxid to an integer and checks whether it is positive.

  2. Checks whether taxid is present in both nodes.dmp and names.dmp.

  3a. If (2) is false, checks for a corresponding taxid in merged.dmp, converts to it, then redoes (2).

  3b. If (2) is true, returns the converted taxid.

  4. If (3a) is false, looks for taxid in delnodes.dmp. If present, converts to root (taxid=1).

Parameters:

taxid (int) – identifier for a taxon in NCBI taxonomy databases - nodes.dmp, names.dmp or merged.dmp

Returns:

taxid if the taxid is a positive integer and present in either nodes.dmp or names.dmp or taxid recovered from merged.dmp

Return type:

int

Raises:
  • ValueError – Provided taxid is not a positive integer

  • DatabaseOutOfSyncError – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other
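
The numbered steps above can be sketched with plain dictionaries and sets. This is a minimal illustration, not the library's implementation: the function name `resolve_taxid`, its table arguments, and the locally defined `DatabaseOutOfSyncError` stand-in are assumptions made so the example runs on its own. Raising when a taxid is found in none of the tables is also an assumption.

```python
from typing import Dict, Set


class DatabaseOutOfSyncError(Exception):
    """Local stand-in for the exception named in the docs (assumption)."""


def resolve_taxid(taxid, nodes: Dict[int, dict], names: Dict[int, str],
                  merged: Dict[int, int], delnodes: Set[int]) -> int:
    # Step 1: coerce to int and require a positive value.
    try:
        taxid = int(taxid)
    except (TypeError, ValueError) as err:
        raise ValueError(f"taxid must be an integer: {taxid!r}") from err
    if taxid <= 0:
        raise ValueError(f"taxid must be positive: {taxid}")
    # Steps 2 and 3b: present in both tables, so return it unchanged.
    if taxid in nodes and taxid in names:
        return taxid
    # Step 3a: follow merged.dmp, then repeat the check from step 2.
    if taxid in merged:
        new_taxid = merged[taxid]
        if new_taxid in nodes and new_taxid in names:
            return new_taxid
        raise DatabaseOutOfSyncError(f"{taxid} -> {new_taxid} missing from nodes/names")
    # Step 4: deleted taxids fall back to root.
    if taxid in delnodes:
        return 1
    # Assumption: a taxid found nowhere means the tables are out of sync.
    raise DatabaseOutOfSyncError(f"{taxid} not found in any table")
```

Passing the four tables explicitly keeps the sketch independent of any class; in a TaxonomyDatabase subclass they would correspond to the attributes populated by the parse_* methods.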

get_lineage_dataframe(taxids: Iterable, fillna: bool = True) pandas.DataFrame

Given an iterable of taxids generate a pandas DataFrame of their canonical lineages

Parameters:
  • taxids (iterable) – taxids whose lineage dataframe is being returned

  • fillna (bool, optional) – Whether to fill the empty cells with TaxonomyDatabase.UNCLASSIFIED or not, default True

Returns:

index = taxid columns = [superkingdom,phylum,class,order,family,genus,species]

Return type:

pd.DataFrame

Example

If you would like to merge the returned DataFrame ('this_df') with another DataFrame ('your_df'), for example the one your taxids were retrieved from:

merged_df = pd.merge(
    left=your_df,
    right=this_df,
    how='left',
    left_on=<taxid_column>,
    right_index=True)
is_common_ancestor(taxid_A: int, taxid_B: int) bool

Determines whether the provided taxids have a non-root common ancestor

Parameters:
  • taxid_A (int) – taxid in taxonomy database

  • taxid_B (int) – taxid in taxonomy database

Returns:

True if taxids share a non-root common ancestor else False

Return type:

boolean
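
One way to picture the non-root check is to collect each taxid's ancestors below root and test whether the two sets overlap. The sketch below assumes a simple `{child: parent}` mapping and hypothetical helper names (`ancestors_below_root`, `share_non_root_ancestor`); it is not taken from the library source.

```python
from typing import Dict, Set

ROOT = 1


def ancestors_below_root(taxid: int, parents: Dict[int, int]) -> Set[int]:
    """Return taxid plus every ancestor encountered before reaching root."""
    seen: Set[int] = set()
    while taxid != ROOT and taxid not in seen:
        seen.add(taxid)
        # Unknown taxids are treated as children of root (assumption).
        taxid = parents.get(taxid, ROOT)
    return seen


def share_non_root_ancestor(taxid_a: int, taxid_b: int, parents: Dict[int, int]) -> bool:
    """True when the two lineages overlap somewhere below root."""
    return bool(ancestors_below_root(taxid_a, parents) & ancestors_below_root(taxid_b, parents))


# Toy tree: 562 -> 561 -> 2 -> 1 and 28890 -> 2157 -> 1
toy_parents = {2: 1, 561: 2, 562: 561, 2157: 1, 28890: 2157}
```

With this toy tree, 562 and 561 overlap below root, whereas 562 and 28890 only meet at root.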

lineage(taxid: int, canonical: bool = True) List[Dict[str, Union[str, int]]]

Returns the lineage of taxids encountered when traversing to root

Parameters:
  • taxid (int) – taxid in nodes.dmp, whose lineage is being returned

  • canonical (bool, optional) – Lineage includes both canonical and non-canonical ranks when False, and only the canonical ranks when True. Canonical ranks include: species, genus, family, order, class, phylum, superkingdom, root

Returns:

[{'taxid':taxid, 'rank':rank, 'name':name}, …]

Return type:

ordered list of dicts
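
The traversal described above can be sketched as a walk over parent pointers. This is an illustration under stated assumptions: `walk_lineage` is a hypothetical name, the tables use the `{child_taxid: {'parent': ..., 'rank': ...}}` and `{taxid: name}` shapes documented for parse_nodes and parse_names, the toy tree omits most intermediate ranks, and stopping before root (rather than including it) is an assumption.

```python
from typing import Dict, List, Union

CANONICAL_RANKS = ["species", "genus", "family", "order",
                   "class", "phylum", "superkingdom", "root"]
UNCLASSIFIED = "unclassified"


def walk_lineage(taxid: int, nodes: Dict[int, dict], names: Dict[int, str],
                 canonical: bool = True) -> List[Dict[str, Union[str, int]]]:
    lineage = []
    # Climb parent pointers until root (taxid 1) is reached.
    while taxid != 1:
        node = nodes[taxid]
        # Keep every rank when canonical is False, else only canonical ones.
        if not canonical or node["rank"] in CANONICAL_RANKS:
            lineage.append({"taxid": taxid,
                            "rank": node["rank"],
                            "name": names.get(taxid, UNCLASSIFIED)})
        taxid = node["parent"]
    return lineage


# Toy tables; 131567 stands in for a non-canonical ('no rank') node.
toy_nodes = {
    1: {"parent": 1, "rank": "no rank"},
    131567: {"parent": 1, "rank": "no rank"},
    2: {"parent": 131567, "rank": "superkingdom"},
    561: {"parent": 2, "rank": "genus"},
    562: {"parent": 561, "rank": "species"},
}
toy_names = {1: "root", 131567: "cellular organisms", 2: "Bacteria",
             561: "Escherichia", 562: "Escherichia coli"}
```

Calling `walk_lineage(562, toy_nodes, toy_names)` skips the 'no rank' node, while `canonical=False` keeps it.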

name(taxid: int, rank: Optional[str] = None) str

Parses through the names.dmp in search of the given taxid and returns its name.

Parameters:
  • taxid (int) – taxid whose name is being returned

  • rank (str, optional) – If provided, will return taxid name at rank, by default None. Must be a canonical rank, choices: species, genus, family, order, class, phylum, superkingdom. E.g. self.name(562, 'genus') would return 'Escherichia', where 562 is the taxid for Escherichia coli

Returns:

Name of provided taxid if taxid is found in names.dmp else TaxonomyDatabase.UNCLASSIFIED

Return type:

str

parent(taxid: int) int

Retrieve the parent taxid of provided taxid.

Parameters:

taxid (int) – child taxid to retrieve parent

Returns:

Parent taxid if found in nodes otherwise 1

Return type:

int

abstract parse_delnodes() Set[int]

Parses delnodes.dmp so that deleted taxids can be recognized; convert_taxid_dtype converts a deleted taxid to root (taxid=1)

Returns:

{taxid, …}

Return type:

set

abstract parse_merged() Dict[int, int]

Parses merged.dmp such that merged taxids may be updated with their up-to-date taxids

Returns:

{old_taxid: new_taxid, …}

Return type:

dict

abstract parse_names() Dict[int, str]

Parses through the names.dmp database and loads taxids with their names

Returns:

{taxid:name, …}

Return type:

dict

abstract parse_nodes() Dict[int, Dict[str, Union[str, int]]]

Parse the nodes.dmp database and set to self.nodes.

Returns:

{child_taxid:{'parent':parent_taxid, 'rank':rank}, …}

Return type:

dict

rank(taxid: int) str

Return the respective rank of provided taxid.

Parameters:

taxid (int) – taxid to retrieve rank from nodes

Returns:

rank name if taxid is found in nodes else TaxonomyDatabase.UNCLASSIFIED

Return type:

str

autometa.taxonomy.gtdb module

License: GNU Affero General Public License v3 or later. A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.

File containing definition of the GTDB class and containing functions useful for handling GTDB taxonomy databases

class autometa.taxonomy.gtdb.GTDB(dbdir: str, verbose: bool = True)

Bases: TaxonomyDatabase

Taxonomy utilities for GTDB databases.

__repr__()

Operator overloading to return the string representation of the class object

Returns:

String representation of the class object

Return type:

str

__str__()

Operator overloading to return the directory path of the class object

Returns:

Directory path of the class object

Return type:

str

convert_accessions_to_taxids(accessions: Dict[str, Set[str]]) Tuple[Dict[str, Set[int]], pandas.DataFrame]

Translates subject sequence ids to taxids

Parameters:

accessions (dict) – {qseqid: {sseqid, …}, …}

Returns:

{qseqid: {taxid, taxid, …}, …}, index=range, cols=[qseqid, sseqid, raw_taxid, …, cleaned_taxid]

Return type:

Tuple[Dict[str, Set[int]], pd.DataFrame]

parse_delnodes() Set[int]

Parse the delnodes.dmp database

Returns:

{taxid, …}

Return type:

set

parse_merged() Dict[int, int]

Parse the merged.dmp database

Returns:

{old_taxid: new_taxid, …}

Return type:

dict

parse_names() Dict[int, str]

Parses through names.dmp database and loads taxids with scientific names

Returns:

{taxid:name, …}

Return type:

dict

parse_nodes() Dict[int, str]

Parse the nodes.dmp database. Note: This is performed when a new GTDB class instance is constructed

Returns:

{child_taxid:{'parent':parent_taxid, 'rank':rank}, …}

Return type:

dict

search_genome_accessions(accessions: set) Dict[str, int]

Search taxid.map file

Parameters:

accessions (set) – Set of subject sequence ids retrieved from diamond blastp search (sseqids)

Returns:

Dictionary containing sseqids converted to taxids

Return type:

Dict[str, int]

verify_databases()

Verify if the required databases are present.

Raises:

FileNotFoundError – One or more of the required databases were not found.

autometa.taxonomy.gtdb.create_gtdb_db(reps_faa: str, dbdir: str) str

Generate a combined faa file to create the GTDB-t database.

Parameters:
  • reps_faa (str) – Directory having faa file of all representative genomes. Can be tarballed.

  • dbdir (str) – Path to output directory.

Returns:

Path to combined faa file. This can be used to make a diamond database.

Return type:

str

autometa.taxonomy.gtdb.main()

autometa.taxonomy.lca module

This script contains the LCA class containing methods to determine the Lowest Common Ancestor given a tab-delimited BLAST table, fasta file, or iterable of SeqRecords.

Note: LCA will assume the BLAST results table is in output format 6.

class autometa.taxonomy.lca.LCA(taxonomy_db: TaxonomyDatabase, verbose: bool = False, cache: str = '')

Bases: object

LCA class containing methods to retrieve the Lowest Common Ancestor.

LCAs may be computed given taxids, a fasta or BLAST results.

Parameters:
  • taxonomy_db (TaxonomyDatabase) – Instance of an NCBI or GTDB subclass of autometa.taxonomy.database.TaxonomyDatabase

  • verbose (bool, optional) – Add verbosity to logging stream (the default is False).

  • cache (str, optional) – Directory path to serialize intermediate files to disk for later lookup (the default is '').

disable

Opposite of verbose. Used to disable tqdm module.

Type:

bool

tour_fp

</path/to/serialized/file/eulerian/tour.pkl.gz>

Type:

str

tour

Eulerian tour containing branches and leaves information from tree traversal.

Type:

list

level_fp

</path/to/serialized/file/level.pkl.gz>

Type:

str

level

Lengths from root corresponding to tour during tree traversal.

Type:

list

occurrence_fp

</path/to/serialized/file/occurrence.pkl.gz>

Type:

str

occurrence

Contains first occurrence of each taxid while traversing tree (index in tour). e.g. {taxid:index, taxid: index, …}

Type:

dict

sparse_fp

</path/to/serialized/file/sparse.pkl.gz>

Type:

str

sparse

Precomputed LCA values corresponding to tour, level and occurrence.

Type:

numpy.ndarray

lca_prepared

Whether LCA internals have been computed (e.g. tour, level, occurrence, sparse).

Type:

bool

blast2lca(blast: str, out: str, sseqid_to_taxid_output: str = '', lca_reduction_log: str = '', force: bool = False) str

Determine lowest common ancestor of provided amino-acid ORFs.

Parameters:
  • blast (str) – </path/to/diamond/outfmt6/blastp.tsv>.

  • out (str) – </path/to/output/lca.tsv>.

  • sseqid_to_taxid_output (str) – Path to write qseqids' sseqids and their taxid designations from NCBI databases

  • force (bool, optional) – Force overwrite of existing out.

Returns:

out </path/to/output/lca.tsv>.

Return type:

str

lca(node1, node2)

Performs Range Minimum Query between 2 taxids.

Parameters:
  • node1 (int) – taxid

  • node2 (int) – taxid

Returns:

LCA taxid

Return type:

int

Raises:

ValueError – Provided taxid is not in the nodes.dmp tree.

parse(lca_fpath: str, orfs_fpath: Optional[str] = None) Dict[str, Dict[str, Dict[int, int]]]

Retrieve and construct contig dictionary from provided lca_fpath.

Parameters:
  • lca_fpath (str) – </path/to/lcas.tsv> tab-delimited ordered columns: qseqid, name, rank, lca_taxid

  • orfs_fpath (str, optional (required if using prodigal version <2.6)) – </path/to/prodigal/called/orfs.fasta> Note: These ORFs should correspond to the ORFs provided in the BLAST table.

Returns:

{contig:{rank:{taxid:counts, …}, rank:{…}, …}, …}

Return type:

dict

Raises:
  • FileNotFoundError – lca_fpath does not exist.

  • FileNotFoundError – orfs_fpath does not exist.

  • ValueError – If prodigal version is under 2.6, orfs_fpath is a required input.
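
The nested return structure can be illustrated with a short counting routine. The sketch assumes the table has a header row with the columns listed above and that ORF ids follow the `<contig>_<orf number>` naming pattern, so the contig is recovered by dropping the final underscore-delimited field. `count_lcas_per_contig` is a hypothetical helper, not the method itself.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def count_lcas_per_contig(handle: TextIO) -> Dict[str, Dict[str, Dict[int, int]]]:
    """Build {contig: {rank: {taxid: count}}} from a tab-delimited LCA table."""
    contigs = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: ORF id is '<contig>_<orf number>'.
        contig = row["qseqid"].rsplit("_", 1)[0]
        contigs[contig][row["rank"]][int(row["lca_taxid"])] += 1
    # Convert the nested defaultdicts into plain dicts before returning.
    return {contig: {rank: dict(counts) for rank, counts in ranks.items()}
            for contig, ranks in contigs.items()}


toy_lca_table = io.StringIO(
    "qseqid\tname\trank\tlca_taxid\n"
    "k1_1\tEscherichia coli\tspecies\t562\n"
    "k1_2\tEscherichia\tgenus\t561\n"
    "k1_3\tEscherichia coli\tspecies\t562\n"
    "k2_1\tBacteria\tsuperkingdom\t2\n"
)
```

ORFs whose ids do not encode the contig name would need the ORF fasta to map each ORF back to its contig, which is the situation the orfs_fpath parameter covers.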

prepare_lca()

Prepare LCA internal data structures for lca().

e.g. self.tour, self.level, self.occurrence, self.sparse are all ready.

Returns:

Prepares all LCA internals and if successful sets self.lca_prepared to True.

Return type:

NoneType

prepare_tree()

Performs Eulerian tour of nodes.dmp taxids and constructs three data structures:

  1. tour : list of branches and leaves.

  2. level: list of distances from the root.

  3. occurrence: dict of occurrences of the taxid respective to the root.

Notes

For more information on why we construct these three data structures see references below:

Returns:

sets internals to be used for LCA lookup

Return type:

NoneType
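
To make the three structures concrete, here is a minimal sketch of an Eulerian tour over a toy tree. It assumes a `{child: parent}` mapping with root taxid 1 and uses a hypothetical function name (`euler_tour`); it illustrates the general technique rather than the method's actual code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def euler_tour(parents: Dict[int, int], root: int = 1) -> Tuple[List[int], List[int], Dict[int, int]]:
    # Invert the child -> parent mapping into parent -> children.
    children = defaultdict(list)
    for child, parent in parents.items():
        if child != root:
            children[parent].append(child)

    tour, level, occurrence = [root], [0], {root: 0}
    # Each stack entry: (node, depth, iterator over its unvisited children).
    stack = [(root, 0, iter(sorted(children[root])))]
    while stack:
        node, depth, kids = stack[-1]
        child = next(kids, None)
        if child is None:
            # Subtree finished: step back up and record the parent again.
            stack.pop()
            if stack:
                parent, parent_depth, _ = stack[-1]
                tour.append(parent)
                level.append(parent_depth)
        else:
            # First visit of child: remember its index in the tour.
            occurrence.setdefault(child, len(tour))
            tour.append(child)
            level.append(depth + 1)
            stack.append((child, depth + 1, iter(sorted(children[child]))))
    return tour, level, occurrence
```

An explicit stack is used instead of recursion because taxonomy trees can be large enough to make recursion depth and call overhead a concern.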

preprocess_minimums()

Preprocesses all possible LCAs.

This constructs a sparse table to be used for LCA/Range Minimum Query using the self.level array associated with its respective eulerian self.tour. For more information on these data structures see prepare_tree().

Sparse table size:

  • n = number of elements in level list

  • rows range = (0 to n)

  • columns range = (0 to logn)

Returns:

sets self.sparse internal to be used for LCA lookup.

Return type:

NoneType
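
A sparse table for Range Minimum Query can be sketched as follows. The helper names (`build_sparse`, `range_min_index`) are hypothetical, and the layout here is a list with one row per power of two rather than the n by log n array described above; the toy `tour`, `level` and `occurrence` values correspond to a four-node tree (1 is root, 2 and 3 are its children, 4 is a child of 2).

```python
from typing import List


def build_sparse(level: List[int]) -> List[List[int]]:
    """sparse[j][i] holds the index of the minimum of level[i : i + 2**j]."""
    n = len(level)
    sparse = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev, half = sparse[j - 1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            # Combine the two half-length windows that cover this window.
            left, right = prev[i], prev[i + half]
            row.append(left if level[left] <= level[right] else right)
        sparse.append(row)
        j += 1
    return sparse


def range_min_index(sparse: List[List[int]], level: List[int], lo: int, hi: int) -> int:
    """Index of the smallest level within the inclusive range [lo, hi]."""
    if lo > hi:
        lo, hi = hi, lo
    j = (hi - lo + 1).bit_length() - 1
    # Two overlapping windows of length 2**j cover the whole range.
    left, right = sparse[j][lo], sparse[j][hi - (1 << j) + 1]
    return left if level[left] <= level[right] else right


tour = [1, 2, 4, 2, 1, 3, 1]
level = [0, 1, 2, 1, 0, 1, 0]
occurrence = {1: 0, 2: 1, 4: 2, 3: 5}
sparse = build_sparse(level)
# LCA of taxids 4 and 3: shallowest node between their first occurrences.
lca_4_3 = tour[range_min_index(sparse, level, occurrence[4], occurrence[3])]
```

Building the table costs O(n log n) time and memory, after which each query is answered in constant time.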

read_sseqid_to_taxid_table(sseqid_to_taxid_filepath: str) Dict[str, Set[int]]

Retrieve each qseqid's set of taxids from sseqid_to_taxid_filepath for reduction by LCA

Parameters:

sseqid_to_taxid_filepath (str) – Path to sseqid to taxid table with columns: qseqid, sseqid, raw_taxid, merged_taxid, clean_taxid

Returns:

Dictionary keyed by qseqid containing sets of respective clean taxid

Return type:

Dict[str, Set[int]]
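
Reading such a table into the documented return shape might look like the sketch below. It assumes a tab-delimited file with a header row naming the columns listed above; `read_clean_taxids` is a hypothetical helper and the toy rows are made up for illustration.

```python
import csv
import io
from typing import Dict, Set, TextIO


def read_clean_taxids(handle: TextIO) -> Dict[str, Set[int]]:
    """Group the clean_taxid column by qseqid."""
    taxids: Dict[str, Set[int]] = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        taxids.setdefault(row["qseqid"], set()).add(int(row["clean_taxid"]))
    return taxids


toy_table = io.StringIO(
    "qseqid\tsseqid\traw_taxid\tmerged_taxid\tclean_taxid\n"
    "orf_1\tsubject_a\t562\t562\t562\n"
    "orf_1\tsubject_b\t561\t561\t561\n"
    "orf_2\tsubject_c\t2\t2\t2\n"
)
```

Using sets means a taxid hit by several subject sequences is kept once per query.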

reduce_taxids_to_lcas(taxids: Dict[str, Set[int]]) Tuple[Dict[str, int], pandas.DataFrame]

Retrieves the lowest common ancestor for each set of taxids in taxids

Parameters:

taxids (dict) – {qseqid: {taxid, …}, qseqid: {taxid, …}, …}

Returns:

{qseqid: lca, qseqid: lca, …}, pd.DataFrame(index=range, cols=[qseqid, taxids])

Return type:

Tuple[Dict[str, int], pd.DataFrame]
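
Reducing a set of taxids to a single LCA amounts to folding a pairwise LCA function over the set. The sketch below shows only the dictionary part of the return value and swaps in a naive parent-walking `naive_lca` in place of the Range Minimum Query lookup, so that it runs without any precomputed structures; both helper names and the toy tree are assumptions.

```python
from functools import reduce
from typing import Dict, Set


def naive_lca(a: int, b: int, parents: Dict[int, int], root: int = 1) -> int:
    """Pairwise LCA by walking parent pointers (stand-in for the RMQ lookup)."""
    ancestors = set()
    while True:
        ancestors.add(a)
        if a == root:
            break
        a = parents.get(a, root)
    # Climb from b until a node on a's path to root is reached.
    while b not in ancestors:
        b = parents.get(b, root)
    return b


def reduce_to_lcas(taxids: Dict[str, Set[int]], parents: Dict[int, int]) -> Dict[str, int]:
    """Fold the pairwise LCA over each query's set of taxids."""
    return {
        qseqid: reduce(lambda x, y: naive_lca(x, y, parents), hits)
        for qseqid, hits in taxids.items()
        if hits  # skip queries with no taxids
    }


toy_tree = {2: 1, 561: 2, 562: 561, 570: 2, 2157: 1}
```

Because the LCA operation is associative and commutative, the order in which the set is iterated does not change the result.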

write_lcas(lcas: Dict[str, int], out: str) str

Write lcas to tab-delimited file: out.

Ordered columns are:

  • qseqid : query seqid

  • name : LCA name

  • rank : LCA rank

  • lca : LCA taxid

Parameters:
  • lcas (dict) – {qseqid:lca_taxid, qseqid:lca_taxid, …}

  • out (str) – </path/to/output/file.tsv>

Returns:

out

Return type:

str

autometa.taxonomy.lca.main()

autometa.taxonomy.majority_vote module

This script contains the modified majority vote algorithm used in Autometa version 1.0

autometa.taxonomy.majority_vote.is_consistent_with_other_orfs(taxid: int, rank: str, rank_counts: Dict[str, Dict], taxa_db: TaxonomyDatabase) bool

Determines whether the majority of proteins in a contig, with rank equal to or above the given rank, are common ancestors of the taxid.

If the majority are, this function returns True, otherwise it returns False.

Parameters:
  • taxid (int) – taxid to search against other taxids at rank in rank_counts.

  • rank (str) – Canonical rank to search in rank_counts. Choices: species, genus, family, order, class, phylum, superkingdom.

  • rank_counts (dict) – LCA canonical rank counts retrieved from ORFs respective to a contig. e.g. {canonical_rank: {taxid: num_hits, …}, …}

  • taxa_db (TaxonomyDatabase instance) – NCBI or GTDB subclass object of autometa.taxonomy.database.TaxonomyDatabase

Returns:

If the majority of ORFs in a contig are equal or above given rank then return True, otherwise return False.

Return type:

boolean

autometa.taxonomy.majority_vote.lowest_majority(rank_counts: Dict[str, Dict], taxa_db: TaxonomyDatabase) int

Determine the lowest majority given rank_counts by first attempting to get a taxid that leads in counts with the highest specificity in terms of canonical rank.

Parameters:
  • rank_counts (dict) – {canonical_rank:{taxid:num_hits, …}, rank2: {…}, …}

  • taxa_db (TaxonomyDatabase instance) – NCBI or GTDB subclass object of autometa.taxonomy.database.TaxonomyDatabase

Returns:

Taxid above the lowest majority threshold.

Return type:

int
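
A deliberately simplified reading of the description above is: walk the canonical ranks from most to least specific and return the taxid with the most hits at the first rank that has any. The sketch below implements only that reading. It does not accumulate counts along each taxid's lineage or apply any majority threshold, so it should not be taken as the function's actual algorithm. `leading_taxid_at_lowest_rank` is a hypothetical name.

```python
from typing import Dict

# Most specific rank first.
CANONICAL_RANKS = ["species", "genus", "family", "order",
                   "class", "phylum", "superkingdom"]


def leading_taxid_at_lowest_rank(rank_counts: Dict[str, Dict[int, int]]) -> int:
    for rank in CANONICAL_RANKS:
        counts = rank_counts.get(rank)
        if counts:
            # Taxid with the highest hit count at this rank.
            return max(counts, key=counts.get)
    # Nothing was counted at any canonical rank: fall back to root.
    return 1


toy_rank_counts = {"genus": {561: 3, 570: 1}, "phylum": {1224: 5}}
```

Here genus is the most specific rank with any hits, so the phylum counts are never consulted even though they are larger.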

autometa.taxonomy.majority_vote.main()
autometa.taxonomy.majority_vote.majority_vote(lca_fpath: str, out: str, taxa_db: TaxonomyDatabase, verbose: bool = False, orfs: Optional[str] = None) str

Wrapper for modified majority voting algorithm from Autometa 1.0

Parameters:
  • lca_fpath (str) – Path to lowest common ancestor assignments table.

  • out (str) – Path to write assigned taxids.

  • taxa_db (TaxonomyDatabase) – An instance of TaxonomyDatabase

  • verbose (bool, optional) – Increase verbosity of logging stream

  • orfs (str, optional) – Path to prodigal called orfs corresponding to LCA table computed from BLAST output

  • force (bool, optional) – Whether to overwrite existing LCA results.

Returns:

Path to assigned taxids table.

Return type:

str

autometa.taxonomy.majority_vote.rank_taxids(ctg_lcas: Dict[str, Dict[str, Dict[int, int]]], taxa_db: TaxonomyDatabase, verbose: bool = False) Dict[str, int]

Votes for taxids based on modified majority vote system where if a majority does not exist, the lowest majority is voted.

Parameters:
  • ctg_lcas (dict) – {ctg1:{canonical_rank:{taxid:num_hits,…},…}, ctg2:{…},…}

  • taxa_db (TaxonomyDatabase) – instance of NCBI or GTDB subclass of autometa.taxonomy.database.TaxonomyDatabase

  • verbose (bool, optional) – Increase verbosity of logging stream (the default is False).

Returns:

{contig:voted_taxid, contig:voted_taxid, …}

Return type:

dict

autometa.taxonomy.majority_vote.write_votes(results: Dict[str, int], out: str) str

Writes voting results to the provided out path.

Parameters:
  • results (dict) – {contig:voted_taxid, contig:voted_taxid, …}

  • out (str) – </path/to/results.tsv>.

Returns:

</path/to/results.tsv>

Return type:

str

Raises:

FileExistsError – Voting results file already exists

autometa.taxonomy.ncbi module

File containing definition of the NCBI class and containing functions useful for handling NCBI taxonomy databases

class autometa.taxonomy.ncbi.NCBI(dbdir, verbose=False)

Bases: TaxonomyDatabase

Taxonomy utilities for NCBI databases.

__repr__()

Operator overloading to return the string representation of the class object

Returns:

String representation of the class object

Return type:

str

__str__()

Operator overloading to return the directory path of the class object

Returns:

Directory path of the class object

Return type:

str

convert_accessions_to_taxids(accessions: set) Tuple[Dict[str, Set[int]], pandas.DataFrame]

Translates subject sequence ids to taxids

Parameters:

accessions (dict) – {qseqid: {sseqid, …}, …}

Returns:

{qseqid: {taxid, taxid, …}, …}, index=range, cols=[qseqid, sseqid, raw_taxid, …, cleaned_taxid]

Return type:

Tuple[Dict[str, Set[int]], pd.DataFrame]

convert_taxid_dtype(taxid: int) int

  1. Converts the given taxid to an integer and checks whether it is positive.

  2. Checks whether taxid is present in both nodes.dmp and names.dmp.

  3a. If (2) is false, checks for a corresponding taxid in merged.dmp, converts to it, then redoes (2).

  3b. If (2) is true, returns the converted taxid.

  4. If (3a) is false, looks for taxid in delnodes.dmp. If present, converts to root (taxid=1).

Parameters:

taxid (int) – identifier for a taxon in NCBI taxonomy databases - nodes.dmp, names.dmp or merged.dmp

Returns:

taxid if the taxid is a positive integer and present in either nodes.dmp or names.dmp or taxid recovered from merged.dmp

Return type:

int

Raises:
  • ValueError – Provided taxid is not a positive integer

  • DatabaseOutOfSyncError – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other

parse_delnodes() Set[int]

Parse the delnodes.dmp database. Note: This is performed when a new NCBI class instance is constructed

Returns:

{taxid, …}

Return type:

set

parse_merged() Dict[int, int]

Parse the merged.dmp database. Note: This is performed when a new NCBI class instance is constructed

Returns:

{old_taxid: new_taxid, …}

Return type:

dict

parse_names() Dict[int, str]

Parses through names.dmp database and loads taxids with scientific names

Returns:

{taxid:name, …}

Return type:

dict
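
NCBI taxonomy dump files separate fields with a tab, pipe, tab sequence and end each row with a tab and pipe. A sketch of loading scientific names from names.dmp-style lines follows; `parse_names_lines` is a hypothetical helper operating on an iterable of lines, and the toy rows are illustrative.

```python
from typing import Dict, Iterable


def parse_names_lines(lines: Iterable[str]) -> Dict[int, str]:
    """Return {taxid: name}, keeping only rows whose class is 'scientific name'."""
    names: Dict[int, str] = {}
    for line in lines:
        # Drop the trailing '\t|' row terminator, then split on '\t|\t'.
        fields = line.rstrip("\n").rstrip("|").rstrip("\t").split("\t|\t")
        taxid, name, _unique_name, name_class = fields[:4]
        if name_class == "scientific name":
            names[int(taxid)] = name
    return names


toy_names_dmp = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|\n",
    "562\t|\tE. coli\t|\t\t|\tcommon name\t|\n",
]
```

Filtering on the name class matters because a taxid can appear on several rows (synonyms, common names and so on), while the returned mapping holds one name per taxid.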

parse_nodes() Dict[int, str]

Parse the nodes.dmp database to be used later by autometa.taxonomy.ncbi.NCBI.parent() and autometa.taxonomy.ncbi.NCBI.rank(). Note: This is performed when a new NCBI class instance is constructed

Returns:

{child_taxid:{'parent':parent_taxid, 'rank':rank}, …}

Return type:

dict
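
The returned structure can be illustrated by parsing a few nodes.dmp-style rows. In the NCBI dump format the first three fields are the taxid, its parent taxid and its rank, separated by a tab, pipe, tab sequence. `parse_nodes_lines` is a hypothetical helper, and the toy rows carry only one extra field instead of the full set found in real files.

```python
from typing import Dict, Iterable, Union


def parse_nodes_lines(lines: Iterable[str]) -> Dict[int, Dict[str, Union[str, int]]]:
    """Return {child_taxid: {'parent': parent_taxid, 'rank': rank}}."""
    nodes: Dict[int, Dict[str, Union[str, int]]] = {}
    for line in lines:
        # Only the first three fields are needed; later fields are ignored.
        child, parent, rank = line.split("\t|\t")[:3]
        nodes[int(child)] = {"parent": int(parent), "rank": rank}
    return nodes


toy_nodes_dmp = [
    "1\t|\t1\t|\tno rank\t|\t\t|\n",
    "2\t|\t131567\t|\tsuperkingdom\t|\t\t|\n",
    "562\t|\t561\t|\tspecies\t|\tEC\t|\n",
]
```

Lookups such as parent() and rank() then reduce to dictionary access on this mapping.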

search_prot_accessions(accessions: set, sseqids_to_taxids: Optional[Dict[str, int]] = None, db: str = 'live') Dict[str, int]

Search prot.accession2taxid.gz and dead_prot.accession2taxid.gz

Parameters:
  • accessions (set) – Set of subject sequence ids retrieved from diamond blastp search (sseqids)

  • sseqids_to_taxids (Dict[str, int], optional) – Dictionary containing sseqids converted to taxids

  • db (str, optional) –

    selection of one of the prot accession to taxid databases from NCBI. Choices are live, dead, full

    • live: prot.accession2taxid.gz

    • full: prot.accession2taxid.FULL.gz

    • dead: dead_prot.accession2taxid.gz

Returns:

Dictionary containing sseqids converted to taxids

Return type:

Dict[str, int]

autometa.taxonomy.vote module

Script to split metagenome assembly by kingdoms given the input votes. The lineages of the provided voted taxids will also be added and written to taxonomy.tsv

autometa.taxonomy.vote.add_ranks(df: pandas.DataFrame, taxa_db: TaxonomyDatabase) pandas.DataFrame

Add canonical ranks to df and write to out

Parameters:
  • df (pd.DataFrame) – index="contig", column="taxid"

  • taxa_db (TaxonomyDatabase) – NCBI or GTDB TaxonomyDatabase instance.

Returns:

index="contig", columns=["taxid", *canonical_ranks]

Return type:

pd.DataFrame

autometa.taxonomy.vote.assign(out: str, method: str = 'majority_vote', assembly: Optional[str] = None, prot_orfs: Optional[str] = None, nucl_orfs: Optional[str] = None, blast: Optional[str] = None, lca_fpath: Optional[str] = None, dbdir: str = './autometa/databases/ncbi', dbtype: Literal['ncbi', 'gtdb'] = 'ncbi', force: bool = False, verbose: bool = False, parallel: bool = True, cpus: int = 0) pandas.DataFrame

Assign taxonomy using method and write to out.

Parameters:
  • out (str) – Path to write taxonomy table of votes

  • method (str, optional) – Method to assign contig taxonomy, by default "majority_vote". Choices include "majority_vote", …

  • assembly (str, optional) – Path to assembly fasta file (nucleotide), by default None

  • prot_orfs (str, optional) – Path to amino-acid ORFs called from assembly, by default None

  • nucl_orfs (str, optional) – Path to nucleotide ORFs called from assembly, by default None

  • blast (str, optional) – Path to blastp table, by default None

  • lca_fpath (str, optional) – Path to output of LCA analysis, by default None

  • dbdir (str, optional) – Path to NCBI databases directory, by default NCBI_DIR

  • dbtype (str, optional) – Type of Taxonomy database to use, by default ncbi

  • force (bool, optional) – Overwrite existing annotations, by default False

  • verbose (bool, optional) – Increase verbosity, by default False

  • parallel (bool, optional) – Whether to perform annotations using multiprocessing and GNU parallel, by default True

  • cpus (int, optional) – Number of cpus to use if parallel is True, by default will try to use all available.

Returns:

index="contig", columns=["taxid"]

Return type:

pd.DataFrame

Raises:
  • NotImplementedError – Provided method has not yet been implemented.

  • ValueError – Assembly file is required if no other annotations are provided.

autometa.taxonomy.vote.get(filepath_or_dataframe: Union[str, pandas.DataFrame], kingdom: str, taxa_db: TaxonomyDatabase) pandas.DataFrame

Retrieve specific kingdom voted taxa for assembly from filepath

Parameters:
  • filepath_or_dataframe (str or pd.DataFrame) – Path to tab-delimited taxonomy table, or the table itself as a DataFrame. cols=['contig', 'taxid', *canonical_ranks]

  • kingdom (str) – rank to retrieve from superkingdom column in taxonomy table.

  • taxa_db (TaxonomyDatabase) – NCBI or GTDB TaxonomyDatabase instance. This is necessary only if the taxonomy table does not already contain columns of canonical ranks.

Returns:

DataFrame of contigs pertaining to retrieved kingdom.

Return type:

pd.DataFrame

Raises:
  • FileNotFoundError – Provided filepath does not exist or is empty.

  • TableFormatError – Provided filepath does not contain the 'superkingdom' column.

  • KeyError – kingdom is absent in provided taxonomy table.

autometa.taxonomy.vote.main()
autometa.taxonomy.vote.write_ranks(taxonomy: pandas.DataFrame, assembly: str, outdir: str, rank: str = 'superkingdom', prefix: Optional[str] = None) List[str]

Write fastas split by rank

Parameters:
  • taxonomy (pd.DataFrame) – dataframe containing canonical ranks of contigs assigned from autometa.taxonomy.vote.assign(…)

  • assembly (str) – Path to assembly fasta file

  • outdir (str) – Path to output directory to write fasta files

  • rank (str, optional) – canonical rank column in taxonomy table to split by, by default β€œsuperkingdom”

  • prefix (str, optional) – Prefix each of the paths written with prefix string.

Returns:

[rank_name_fpath, …]

Return type:

list

Raises:

KeyError – rank not in taxonomy columns

Module contents