autometa.taxonomy package
Submodules
autometa.taxonomy.database module
- class autometa.taxonomy.database.TaxonomyDatabase
Bases:
ABC
TaxonomyDatabase Abstract Base Class
Abstract methods
parse_nodes(self)
parse_names(self)
parse_merged(self)
parse_delnodes(self)
convert_accessions_to_taxids(self)
e.g.
    class GTDB(TaxonomyDatabase):
        def __init__(self, …):
            self.nodes = self.parse_nodes()
            self.names = self.parse_names()
            self.merged = self.parse_merged()
            self.delnodes = self.parse_delnodes()
            …

        def parse_nodes(self):
            …

        def parse_names(self):
            …

        def parse_merged(self):
            …

        def parse_delnodes(self):
            …

        def convert_accessions_to_taxids(self, accessions):
            …
Available methods (after aforementioned implementations):
convert_taxid_dtype
name
rank
parent
lineage
is_common_ancestor
get_lineage_dataframe
Available attributes:
CANONICAL_RANKS
UNCLASSIFIED
- CANONICAL_RANKS = ['species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom', 'root']
- UNCLASSIFIED = 'unclassified'
- abstract convert_accessions_to_taxids(accessions: Dict[str, Set[str]]) -> Tuple[Dict[str, Set[int]], pandas.DataFrame]
Translates subject sequence ids to taxids
- Parameters:
accessions (dict) – {qseqid: {sseqid, …}, …}
- Returns:
{qseqid: {taxid, taxid, …}, …}, pd.DataFrame(index=range, cols=[qseqid, sseqid, raw_taxid, …, cleaned_taxid])
- Return type:
Tuple[Dict[str, Set[int]], pd.DataFrame]
- convert_taxid_dtype(taxid: int) -> int
1. Converts the given taxid to an integer and checks whether it is positive.
2. Checks whether taxid is present in both nodes.dmp and names.dmp.
3a. If (2) is false, checks for a corresponding taxid in merged.dmp, converts to it, then redoes (2).
3b. If (2) is true, returns the converted taxid.
4. If (3a) is false, looks for taxid in delnodes.dmp. If present, converts to root (taxid=1).
- Parameters:
taxid (int) – identifier for a taxon in NCBI taxonomy databases - nodes.dmp, names.dmp or merged.dmp
- Returns:
taxid if the taxid is a positive integer and present in either nodes.dmp or names.dmp or taxid recovered from merged.dmp
- Return type:
int
- Raises:
ValueError – Provided taxid is not a positive integer
DatabaseOutOfSyncError – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other
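The lookup order described for convert_taxid_dtype can be sketched with plain dictionaries. This is only an illustration under stated assumptions: resolve_taxid is a hypothetical helper, the four containers stand in for the parsed nodes.dmp, names.dmp, merged.dmp and delnodes.dmp, and LookupError stands in for DatabaseOutOfSyncError, which is not defined here.

```python
from typing import Dict, Set

ROOT_TAXID = 1


def resolve_taxid(taxid, nodes: Dict[int, dict], names: Dict[int, str],
                  merged: Dict[int, int], delnodes: Set[int]) -> int:
    """Hypothetical helper mirroring the documented lookup order."""
    # Step 1: coerce to int and require a positive value.
    taxid = int(taxid)
    if taxid <= 0:
        raise ValueError(f"taxid must be a positive integer, got {taxid}")
    # Steps 2 and 3b: present in both tables, so return it unchanged.
    if taxid in nodes and taxid in names:
        return taxid
    # Step 3a: translate through merged, then repeat the membership check.
    new_taxid = merged.get(taxid)
    if new_taxid in nodes and new_taxid in names:
        return new_taxid
    # Step 4: deleted taxids collapse to root.
    if taxid in delnodes:
        return ROOT_TAXID
    # Stand-in for DatabaseOutOfSyncError (assumption, see lead-in).
    raise LookupError(f"{taxid} not found in nodes, names, merged or delnodes")
```

With toy tables where taxid 100 was merged into 562 and taxid 200 was deleted, resolve_taxid(100, …) would yield 562 and resolve_taxid(200, …) would yield 1.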
- get_lineage_dataframe(taxids: Iterable, fillna: bool = True) -> pandas.DataFrame
Given an iterable of taxids, generate a pandas DataFrame of their canonical lineages
- Parameters:
taxids (iterable) – taxids whose lineage dataframe is being returned
fillna (bool, optional) – Whether to fill the empty cells with TaxonomyDatabase.UNCLASSIFIED or not, default True
- Returns:
index=taxid, columns=[superkingdom, phylum, class, order, family, genus, species]
- Return type:
pd.DataFrame
Example
To merge the returned DataFrame ('this_df') with another DataFrame ('your_df'), for example the one your taxids were retrieved from:
    merged_df = pd.merge(
        left=your_df,
        right=this_df,
        how='left',
        left_on=<taxid_column>,
        right_index=True,
    )
- is_common_ancestor(taxid_A: int, taxid_B: int) -> bool
Determines whether the provided taxids have a non-root common ancestor
- Parameters:
taxid_A (int) – taxid in taxonomy database
taxid_B (int) – taxid in taxonomy database
- Returns:
True if the taxids share a non-root common ancestor, else False
- Return type:
boolean
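One way to picture a non-root common-ancestor test is to collect each taxid's path toward root and intersect the two sets. The sketch below is an assumption about the idea, not the library's implementation: ancestors and share_non_root_ancestor are hypothetical names, parents is a toy {child: parent} mapping, and each taxid is counted as part of its own path.

```python
from typing import Dict, Set

ROOT_TAXID = 1


def ancestors(taxid: int, parents: Dict[int, int]) -> Set[int]:
    """Collect taxid and every ancestor up to, but excluding, root."""
    seen: Set[int] = set()
    while taxid != ROOT_TAXID and taxid not in seen:
        seen.add(taxid)
        # Unknown taxids are treated as direct children of root.
        taxid = parents.get(taxid, ROOT_TAXID)
    return seen


def share_non_root_ancestor(a: int, b: int, parents: Dict[int, int]) -> bool:
    """True when the two root-ward paths overlap somewhere below root."""
    return bool(ancestors(a, parents) & ancestors(b, parents))


# Toy tree: root(1) has two children (2 and 2157), each with descendants.
toy_parents = {2: 1, 2157: 1, 1224: 2, 561: 1224, 562: 561, 28890: 2157}
same_branch = share_non_root_ancestor(562, 1224, toy_parents)
different_branch = share_non_root_ancestor(562, 28890, toy_parents)
```

Here same_branch is True because both paths pass through taxid 2, while different_branch is False because the two paths only meet at root.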
- lineage(taxid: int, canonical: bool = True) -> List[Dict[str, Union[str, int]]]
Returns the lineage of taxids encountered when traversing to root
- Parameters:
taxid (int) – taxid in nodes.dmp, whose lineage is being returned
canonical (bool, optional) – Lineage includes both canonical and non-canonical ranks when False, and only the canonical ranks when True. Canonical ranks include: species, genus, family, order, class, phylum, superkingdom, root
- Returns:
[{'taxid': taxid, 'rank': rank, 'name': name}, …]
- Return type:
ordered list of dicts
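The traversal to root, and the effect of the canonical flag, can be sketched over the dictionary shapes documented for parse_nodes and parse_names. This is a minimal illustration: walk_lineage is a hypothetical function, the toy tables are invented, and the root entry is assumed to carry the rank 'root'.

```python
from typing import Dict, List, Union

CANONICAL_RANKS = ["species", "genus", "family", "order",
                   "class", "phylum", "superkingdom", "root"]
UNCLASSIFIED = "unclassified"


def walk_lineage(taxid: int, nodes: Dict[int, dict], names: Dict[int, str],
                 canonical: bool = True) -> List[Dict[str, Union[str, int]]]:
    """Walk from taxid to root (taxid=1), optionally keeping canonical ranks only."""
    lineage = []
    while True:
        rank = nodes[taxid]["rank"]
        if not canonical or rank in CANONICAL_RANKS:
            name = names.get(taxid, UNCLASSIFIED)
            lineage.append({"taxid": taxid, "rank": rank, "name": name})
        if taxid == 1:
            break
        taxid = nodes[taxid]["parent"]
    return lineage


# Toy tables: taxid 10 sits at a non-canonical rank ("clade").
toy_nodes = {
    1: {"parent": 1, "rank": "root"},
    2: {"parent": 1, "rank": "superkingdom"},
    10: {"parent": 2, "rank": "clade"},
    20: {"parent": 10, "rank": "genus"},
    30: {"parent": 20, "rank": "species"},
}
toy_names = {1: "root", 2: "kingdom A", 10: "clade B", 20: "genus C", 30: "species D"}
canonical_taxids = [step["taxid"] for step in walk_lineage(30, toy_nodes, toy_names)]
all_taxids = [step["taxid"] for step in walk_lineage(30, toy_nodes, toy_names, canonical=False)]
```

In this toy tree the canonical lineage skips taxid 10, while canonical=False keeps every node visited on the way to root.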
- name(taxid: int, rank: Optional[str] = None) -> str
Parses through the names.dmp in search of the given taxid and returns its name.
- Parameters:
taxid (int) – taxid whose name is being returned
rank (str, optional) – If provided, will return taxid name at rank, by default None. Must be a canonical rank, choices: species, genus, family, order, class, phylum, superkingdom. E.g. self.name(562, 'genus') would return 'Escherichia', where 562 is the taxid for Escherichia coli
- Returns:
Name of provided taxid if taxid is found in names.dmp else TaxonomyDatabase.UNCLASSIFIED
- Return type:
str
- parent(taxid: int) -> int
Retrieve the parent taxid of provided taxid.
- Parameters:
taxid (int) – child taxid to retrieve parent
- Returns:
Parent taxid if found in nodes otherwise 1
- Return type:
int
- abstract parse_delnodes() -> Set[int]
Parses delnodes.dmp so that deleted taxids can be identified and converted to root (taxid=1)
- Returns:
{taxid, …}
- Return type:
set
- abstract parse_merged() -> Dict[int, int]
Parses merged.dmp such that merged taxids may be updated with their up-to-date taxids
- Returns:
{old_taxid: new_taxid, …}
- Return type:
dict
- abstract parse_names() -> Dict[int, str]
Parses names.dmp and loads taxids with their scientific names
- Returns:
{taxid: name, …}
- Return type:
dict
- abstract parse_nodes() -> Dict[int, Dict[str, Union[str, int]]]
Parse the nodes.dmp database and set to self.nodes.
- Returns:
{child_taxid: {'parent': parent_taxid, 'rank': rank}, …}
- Return type:
dict
- rank(taxid: int) -> str
Return the respective rank of provided taxid.
- Parameters:
taxid (int) – taxid to retrieve rank from nodes
- Returns:
rank name if taxid is found in nodes else TaxonomyDatabase.UNCLASSIFIED
- Return type:
str
autometa.taxonomy.gtdb module
# License: GNU Affero General Public License v3 or later # A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.
File containing definition of the GTDB class and containing functions useful for handling GTDB taxonomy databases
- class autometa.taxonomy.gtdb.GTDB(dbdir: str, verbose: bool = True)
Bases:
TaxonomyDatabase
Taxonomy utilities for GTDB databases.
- __repr__()
Operator overloading to return the string representation of the class object
- Returns:
String representation of the class object
- Return type:
str
- __str__()
Operator overloading to return the directory path of the class object
- Returns:
Directory path of the class object
- Return type:
str
- convert_accessions_to_taxids(accessions: Dict[str, Set[str]]) -> Tuple[Dict[str, Set[int]], pandas.DataFrame]
Translates subject sequence ids to taxids
- Parameters:
accessions (dict) – {qseqid: {sseqid, …}, …}
- Returns:
{qseqid: {taxid, taxid, …}, …}, pd.DataFrame(index=range, cols=[qseqid, sseqid, raw_taxid, …, cleaned_taxid])
- Return type:
Tuple[Dict[str, Set[int]], pd.DataFrame]
- parse_delnodes() -> Set[int]
Parse the delnodes.dmp database
- Returns:
{taxid, …}
- Return type:
set
- parse_merged() -> Dict[int, int]
Parse the merged.dmp database
- Returns:
{old_taxid: new_taxid, …}
- Return type:
dict
- parse_names() -> Dict[int, str]
Parses through names.dmp database and loads taxids with scientific names
- Returns:
{taxid: name, …}
- Return type:
dict
- parse_nodes() -> Dict[int, str]
Parse the nodes.dmp database.
Note: This is performed when a new GTDB class instance is constructed
- Returns:
{child_taxid: {'parent': parent_taxid, 'rank': rank}, …}
- Return type:
dict
- search_genome_accessions(accessions: set) -> Dict[str, int]
Search taxid.map file
- Parameters:
accessions (set) – Set of subject sequence ids retrieved from diamond blastp search (sseqids)
- Returns:
Dictionary containing sseqids converted to taxids
- Return type:
Dict[str, int]
- verify_databases()
Verify that the required databases are present.
- Raises:
FileNotFoundError – One or more of the required databases were not found.
- autometa.taxonomy.gtdb.create_gtdb_db(reps_faa: str, dbdir: str) -> str
Generate a combined faa file to create the GTDB-t database.
- Parameters:
reps_faa (str) – Directory having faa file of all representative genomes. Can be tarballed.
dbdir (str) – Path to output directory.
- Returns:
Path to combined faa file. This can be used to make a diamond database.
- Return type:
str
- autometa.taxonomy.gtdb.main()
autometa.taxonomy.lca module
This script contains the LCA class containing methods to determine the Lowest Common Ancestor given a tab-delimited BLAST table, fasta file, or iterable of SeqRecords.
Note: LCA will assume the BLAST results table is in output format 6.
- class autometa.taxonomy.lca.LCA(taxonomy_db: TaxonomyDatabase, verbose: bool = False, cache: str = '')
Bases:
object
LCA class containing methods to retrieve the Lowest Common Ancestor.
LCAs may be computed given taxids, a fasta or BLAST results.
- Parameters:
dbdir (str) – Path to directory containing files: nodes.dmp, names.dmp, merged.dmp, prot.accession2taxid.gz
outdir (str) – Output directory path to serialize intermediate files to disk for later lookup
verbose (bool, optional) – Add verbosity to logging stream (the default is False).
- disable
Opposite of verbose. Used to disable tqdm module.
- Type:
bool
- tour_fp
</path/to/serialized/file/eulerian/tour.pkl.gz>
- Type:
str
- tour
Eulerian tour containing branches and leaves information from tree traversal.
- Type:
list
- level_fp
</path/to/serialized/file/level.pkl.gz>
- Type:
str
- level
Lengths from root corresponding to tour during tree traversal.
- Type:
list
- occurrence_fp
</path/to/serialized/file/level.pkl.gz>
- Type:
str
- occurrence
Contains first occurrence of each taxid while traversing tree (index in tour). e.g. {taxid: index, taxid: index, …}
- Type:
dict
- sparse_fp
</path/to/serialized/file/sparse.pkl.gz>
- Type:
str
- sparse
Precomputed LCA values corresponding to tour, level and occurrence.
- Type:
numpy.ndarray
- lca_prepared
Whether LCA internals have been computed (e.g. tour, level, occurrence, sparse).
- Type:
bool
- blast2lca(blast: str, out: str, sseqid_to_taxid_output: str = '', lca_reduction_log: str = '', force: bool = False) -> str
Determine lowest common ancestor of provided amino-acid ORFs.
- Parameters:
blast (str) – </path/to/diamond/outfmt6/blastp.tsv>.
out (str) – </path/to/output/lca.tsv>.
sseqid_to_taxid_output (str) – Path to write qseqids' sseqids and their taxid designations from NCBI databases
force (bool, optional) – Force overwrite of existing out.
- Returns:
out </path/to/output/lca.tsv>.
- Return type:
str
- lca(node1, node2)
Performs Range Minimum Query between 2 taxids.
- Parameters:
node1 (int) – taxid
node2 (int) – taxid
- Returns:
LCA taxid
- Return type:
int
- Raises:
ValueError – Provided taxid is not in the nodes.dmp tree.
- parse(lca_fpath: str, orfs_fpath: Optional[str] = None) -> Dict[str, Dict[str, Dict[int, int]]]
Retrieve and construct contig dictionary from provided lca_fpath.
- Parameters:
lca_fpath (str) – </path/to/lcas.tsv> tab-delimited ordered columns: qseqid, name, rank, lca_taxid
orfs_fpath (str, optional (required if using prodigal version <2.6)) – </path/to/prodigal/called/orfs.fasta> Note: These ORFs should correspond to the ORFs provided in the BLAST table.
- Returns:
{contig: {rank: {taxid: counts, …}, rank: {…}, …}, …}
- Return type:
dict
- Raises:
FileNotFoundError – lca_fpath does not exist.
FileNotFoundError – orfs_fpath does not exist.
ValueError – If prodigal version is under 2.6, orfs_fpath is a required input.
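The nested return shape, {contig: {rank: {taxid: counts}}}, may be easier to follow with a small sketch. It assumes ORF ids follow the Prodigal-style pattern <contig>_<n>, so the contig can be recovered by stripping the final underscore-delimited field; count_lcas_by_contig is a hypothetical helper and the rows are toy data in the documented column order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def count_lcas_by_contig(
    rows: Iterable[Tuple[str, str, str, str]]
) -> Dict[str, Dict[str, Dict[int, int]]]:
    """Tally LCA taxids per rank for each contig.

    rows are (qseqid, name, rank, lca_taxid) tuples.
    """
    contigs = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for qseqid, _name, rank, lca_taxid in rows:
        # Assumption: the ORF id is "<contig>_<orf number>".
        contig = qseqid.rsplit("_", 1)[0]
        contigs[contig][rank][int(lca_taxid)] += 1
    # Convert the nested defaultdicts into plain dicts.
    return {
        contig: {rank: dict(taxids) for rank, taxids in ranks.items()}
        for contig, ranks in contigs.items()
    }


toy_rows = [
    ("contig_1_1", "Escherichia", "genus", "561"),
    ("contig_1_2", "Escherichia", "genus", "561"),
    ("contig_1_3", "Bacteria", "superkingdom", "2"),
    ("contig_2_1", "Escherichia coli", "species", "562"),
]
toy_counts = count_lcas_by_contig(toy_rows)
```

For the toy rows, contig_1 ends up with two genus-level hits for taxid 561 and one superkingdom-level hit for taxid 2.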
- prepare_lca()
Prepare LCA internal data structures for lca(), e.g. self.tour, self.level, self.occurrence and self.sparse are all ready.
- Returns:
Prepares all LCA internals and if successful sets self.lca_prepared to True.
- Return type:
NoneType
- prepare_tree()
Performs Eulerian tour of nodes.dmp taxids and constructs three data structures:
tour: list of branches and leaves.
level: list of distances from the root.
occurrence: dict of the first occurrence (index in tour) of each taxid.
Notes
For more information on why we construct these three data structures see references below:
- Returns:
sets internals to be used for LCA lookup
- Return type:
NoneType
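The three structures can be illustrated with a small recursive Euler tour. This is a sketch under assumptions, not the method's actual code: euler_tour is a hypothetical function, children is a toy {parent: [child, …]} mapping, and a full taxonomy may call for an iterative traversal instead of recursion.

```python
from typing import Dict, List, Tuple


def euler_tour(children: Dict[int, List[int]],
               root: int = 1) -> Tuple[List[int], List[int], Dict[int, int]]:
    """Return (tour, level, occurrence) for the tree rooted at root."""
    tour: List[int] = []
    level: List[int] = []
    occurrence: Dict[int, int] = {}

    def visit(node: int, depth: int) -> None:
        # Record the first index at which this node appears in the tour.
        occurrence.setdefault(node, len(tour))
        tour.append(node)
        level.append(depth)
        for child in children.get(node, []):
            visit(child, depth + 1)
            # Returning from a child revisits the parent.
            tour.append(node)
            level.append(depth)

    visit(root, 0)
    return tour, level, occurrence


# Toy tree: 1 -> (2, 3) and 2 -> (4, 5)
toy_tour, toy_level, toy_occurrence = euler_tour({1: [2, 3], 2: [4, 5]})
# toy_tour       == [1, 2, 4, 2, 5, 2, 1, 3, 1]
# toy_level      == [0, 1, 2, 1, 2, 1, 0, 1, 0]
# toy_occurrence == {1: 0, 2: 1, 4: 2, 5: 4, 3: 7}
```

A tree with n nodes produces a tour of length 2n - 1, which is why the toy tree of five nodes yields nine entries.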
- preprocess_minimums()
Preprocesses all possible LCAs.
This constructs a sparse table to be used for LCA/Range Minimum Query using the self.level array associated with its respective Eulerian self.tour. For more information on these data structures see prepare_tree().
Sparse table size:
n = number of elements in level list
rows range = (0 to n)
columns range = (0 to log n)
- Returns:
sets self.sparse internal to be used for LCA lookup.
- Return type:
NoneType
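A sparse table answers a range-minimum query over level in constant time after O(n log n) preprocessing. The sketch below is illustrative only: build_sparse and query_lca are hypothetical names, the table is stored as one row per power of two (the transpose of the rows/columns layout described above), and the tour, level and occurrence literals describe a toy tree 1 -> (2, 3), 2 -> (4, 5).

```python
from typing import Dict, List


def build_sparse(level: List[int]) -> List[List[int]]:
    """sparse[j][i] is the index of the minimum of level[i : i + 2**j]."""
    n = len(level)
    sparse = [list(range(n))]  # windows of length 1
    for j in range(1, n.bit_length()):
        prev, half = sparse[j - 1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if level[left] <= level[right] else right)
        sparse.append(row)
    return sparse


def query_lca(a: int, b: int, tour: List[int], level: List[int],
              occurrence: Dict[int, int], sparse: List[List[int]]) -> int:
    """Range-minimum query between the first occurrences of a and b."""
    if a not in occurrence or b not in occurrence:
        raise ValueError("taxid is not in the tree")
    lo, hi = sorted((occurrence[a], occurrence[b]))
    j = (hi - lo + 1).bit_length() - 1  # largest power of two fitting the range
    left = sparse[j][lo]
    right = sparse[j][hi - (1 << j) + 1]
    return tour[left if level[left] <= level[right] else right]


toy_tour = [1, 2, 4, 2, 5, 2, 1, 3, 1]
toy_level = [0, 1, 2, 1, 2, 1, 0, 1, 0]
toy_occurrence = {1: 0, 2: 1, 4: 2, 5: 4, 3: 7}
toy_sparse = build_sparse(toy_level)
siblings_lca = query_lca(4, 5, toy_tour, toy_level, toy_occurrence, toy_sparse)  # 2
cousins_lca = query_lca(4, 3, toy_tour, toy_level, toy_occurrence, toy_sparse)   # 1
```

The two overlapping windows of length 2**j cover the whole query range, so taking the shallower of their minima gives the shallowest node visited between the two taxids, which is their LCA.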
- read_sseqid_to_taxid_table(sseqid_to_taxid_filepath: str) -> Dict[str, Set[int]]
Retrieve each qseqid's set of taxids from sseqid_to_taxid_filepath for reduction by LCA
- Parameters:
sseqid_to_taxid_filepath (str) – Path to sseqid to taxid table with columns: qseqid, sseqid, raw_taxid, merged_taxid, clean_taxid
- Returns:
Dictionary keyed by qseqid containing sets of respective clean taxid
- Return type:
Dict[str, Set[int]]
- reduce_taxids_to_lcas(taxids: Dict[str, Set[int]]) -> Tuple[Dict[str, int], pandas.DataFrame]
Retrieves the lowest common ancestor for each set of taxids in taxids
- Parameters:
taxids (dict) – {qseqid: {taxid, …}, qseqid: {taxid, …}, …}
- Returns:
{qseqid: lca, qseqid: lca, …}, pd.DataFrame(index=range, cols=[qseqid, taxids])
- Return type:
Tuple[Dict[str, int], pd.DataFrame]
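Reducing a set of taxids to one LCA amounts to folding a pairwise LCA over the set. The sketch below shows that idea only: reduce_to_lcas and pairwise_lca are hypothetical names, the pairwise step uses a simple walk over a toy {child: parent} mapping instead of a range-minimum query, and the DataFrame part of the return value is omitted.

```python
from functools import reduce
from typing import Dict, Set

ROOT_TAXID = 1


def pairwise_lca(a: int, b: int, parents: Dict[int, int]) -> int:
    """Naive LCA: mark a's path to root, then climb from b until a marked node."""
    path = set()
    node = a
    while True:
        path.add(node)
        if node == ROOT_TAXID:
            break
        node = parents.get(node, ROOT_TAXID)
    node = b
    while node not in path:
        node = parents.get(node, ROOT_TAXID)
    return node


def reduce_to_lcas(taxids: Dict[str, Set[int]],
                   parents: Dict[int, int]) -> Dict[str, int]:
    """Fold pairwise_lca over each qseqid's set of taxids."""
    return {
        qseqid: reduce(lambda a, b: pairwise_lca(a, b, parents), hits)
        for qseqid, hits in taxids.items()
        if hits  # qseqids with no taxids are skipped in this sketch
    }


toy_parents = {2: 1, 3: 1, 4: 2, 5: 2}
toy_lcas = reduce_to_lcas({"orf_1": {4, 5}, "orf_2": {4, 3}, "orf_3": {5}}, toy_parents)
```

Because the pairwise LCA is commutative and associative, the arbitrary iteration order of a set does not change the result.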
- write_lcas(lcas: Dict[str, int], out: str) -> str
Write lcas to tab-delimited file: out.
Ordered columns are:
qseqid: query seqid
name: LCA name
rank: LCA rank
lca: LCA taxid
- Parameters:
lcas (dict) – {qseqid: lca_taxid, qseqid: lca_taxid, …}
out (str) – </path/to/output/file.tsv>
- Returns:
out
- Return type:
str
- autometa.taxonomy.lca.main()
autometa.taxonomy.majority_vote module
This script contains the modified majority vote algorithm used in Autometa version 1.0
- autometa.taxonomy.majority_vote.is_consistent_with_other_orfs(taxid: int, rank: str, rank_counts: Dict[str, Dict], taxa_db: TaxonomyDatabase) -> bool
Determines whether the majority of proteins in a contig, with rank equal to or above the given rank, are common ancestors of the taxid.
If the majority are, this function returns True, otherwise it returns False.
- Parameters:
taxid (int) – taxid to search against other taxids at rank in rank_counts.
rank (str) – Canonical rank to search in rank_counts. Choices: species, genus, family, order, class, phylum, superkingdom.
rank_counts (dict) – LCA canonical rank counts retrieved from ORFs respective to a contig. e.g. {canonical_rank: {taxid: num_hits, …}, …}
taxa_db (TaxonomyDatabase) – NCBI or GTDB subclass object of autometa.taxonomy.database.TaxonomyDatabase
- Returns:
True if the majority of ORFs in the contig, with rank equal to or above the given rank, are common ancestors of taxid, otherwise False.
- Return type:
boolean
- autometa.taxonomy.majority_vote.lowest_majority(rank_counts: Dict[str, Dict], taxa_db: TaxonomyDatabase) -> int
Determine the lowest majority given rank_counts by first attempting to get a taxid that leads in counts with the highest specificity in terms of canonical rank.
- Parameters:
rank_counts (dict) – {canonical_rank: {taxid: num_hits, …}, rank2: {…}, …}
taxa_db (TaxonomyDatabase instance) – NCBI or GTDB subclass object of autometa.taxonomy.database.TaxonomyDatabase
- Returns:
Taxid above the lowest majority threshold.
- Return type:
int
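As a rough picture of "the taxid that leads in counts with the highest specificity", the sketch below scans canonical ranks from most to least specific and returns the top-scoring taxid at the first populated rank. It is a deliberate simplification and an assumption about the idea: leading_taxid is a hypothetical function, it ignores taxa_db and any consistency checks the real function may perform, and it falls back to root (taxid=1) when rank_counts is empty.

```python
from typing import Dict

CANONICAL_RANKS = ["species", "genus", "family", "order",
                   "class", "phylum", "superkingdom", "root"]
ROOT_TAXID = 1


def leading_taxid(rank_counts: Dict[str, Dict[int, int]]) -> int:
    """Return the most frequent taxid at the most specific populated rank."""
    for rank in CANONICAL_RANKS:  # ordered from most to least specific
        counts = rank_counts.get(rank)
        if counts:
            # Ties resolve to the first taxid encountered in the dict.
            return max(counts, key=counts.get)
    return ROOT_TAXID


toy_rank_counts = {"genus": {561: 3, 570: 1}, "phylum": {1224: 5}}
toy_choice = leading_taxid(toy_rank_counts)
```

With the toy counts, the genus rank is the most specific populated rank, so taxid 561 is chosen even though the phylum-level taxid has more hits.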
- autometa.taxonomy.majority_vote.main()
- autometa.taxonomy.majority_vote.majority_vote(lca_fpath: str, out: str, taxa_db: TaxonomyDatabase, verbose: bool = False, orfs: Optional[str] = None) -> str
Wrapper for modified majority voting algorithm from Autometa 1.0
- Parameters:
lca_fpath (str) – Path to lowest common ancestor assignments table.
out (str) – Path to write assigned taxids.
taxa_db (TaxonomyDatabase) – An instance of TaxonomyDatabase
verbose (bool, optional) – Increase verbosity of logging stream
orfs (str, optional) – Path to prodigal called orfs corresponding to LCA table computed from BLAST output
force (bool, optional) – Whether to overwrite existing LCA results.
- Returns:
Path to assigned taxids table.
- Return type:
str
- autometa.taxonomy.majority_vote.rank_taxids(ctg_lcas: Dict[str, Dict[str, Dict[int, int]]], taxa_db: TaxonomyDatabase, verbose: bool = False) -> Dict[str, int]
Votes for taxids based on modified majority vote system where if a majority does not exist, the lowest majority is voted.
- Parameters:
ctg_lcas (dict) – {ctg1: {canonical_rank: {taxid: num_hits, …}, …}, ctg2: {…}, …}
taxa_db (TaxonomyDatabase) – instance of NCBI or GTDB subclass of autometa.taxonomy.database.TaxonomyDatabase
verbose (bool, optional) – Increase verbosity of logging stream (the default is False).
- Returns:
{contig: voted_taxid, contig: voted_taxid, …}
- Return type:
dict
- autometa.taxonomy.majority_vote.write_votes(results: Dict[str, int], out: str) -> str
Writes voting results to the provided out path.
- Parameters:
results (dict) – {contig: voted_taxid, contig: voted_taxid, …}
out (str) – </path/to/results.tsv>.
- Returns:
</path/to/results.tsv>
- Return type:
str
- Raises:
FileExistsError – Voting results file already exists
autometa.taxonomy.ncbi module
File containing definition of the NCBI class and containing functions useful for handling NCBI taxonomy databases
- class autometa.taxonomy.ncbi.NCBI(dbdir, verbose=False)
Bases:
TaxonomyDatabase
Taxonomy utilities for NCBI databases.
- __repr__()
Operator overloading to return the string representation of the class object
- Returns:
String representation of the class object
- Return type:
str
- __str__()
Operator overloading to return the directory path of the class object
- Returns:
Directory path of the class object
- Return type:
str
- convert_accessions_to_taxids(accessions: set) -> Tuple[Dict[str, Set[int]], pandas.DataFrame]
Translates subject sequence ids to taxids
- Parameters:
accessions (dict) – {qseqid: {sseqid, …}, …}
- Returns:
{qseqid: {taxid, taxid, …}, …}, pd.DataFrame(index=range, cols=[qseqid, sseqid, raw_taxid, …, cleaned_taxid])
- Return type:
Tuple[Dict[str, Set[int]], pd.DataFrame]
- convert_taxid_dtype(taxid: int) -> int
1. Converts the given taxid to an integer and checks whether it is positive.
2. Checks whether taxid is present in both nodes.dmp and names.dmp.
3a. If (2) is false, checks for a corresponding taxid in merged.dmp, converts to it, then redoes (2).
3b. If (2) is true, returns the converted taxid.
4. If (3a) is false, looks for taxid in delnodes.dmp. If present, converts to root (taxid=1).
- Parameters:
taxid (int) – identifier for a taxon in NCBI taxonomy databases - nodes.dmp, names.dmp or merged.dmp
- Returns:
taxid if the taxid is a positive integer and present in either nodes.dmp or names.dmp or taxid recovered from merged.dmp
- Return type:
int
- Raises:
ValueError – Provided taxid is not a positive integer
DatabaseOutOfSyncError – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other
- parse_delnodes() -> Set[int]
Parse the delnodes.dmp database.
Note: This is performed when a new NCBI class instance is constructed
- Returns:
{taxid, …}
- Return type:
set
- parse_merged() -> Dict[int, int]
Parse the merged.dmp database.
Note: This is performed when a new NCBI class instance is constructed
- Returns:
{old_taxid: new_taxid, …}
- Return type:
dict
- parse_names() -> Dict[int, str]
Parses through names.dmp database and loads taxids with scientific names
- Returns:
{taxid: name, …}
- Return type:
dict
- parse_nodes() -> Dict[int, str]
Parse the nodes.dmp database to be used later by autometa.taxonomy.ncbi.NCBI.parent() and autometa.taxonomy.ncbi.NCBI.rank().
Note: This is performed when a new NCBI class instance is constructed
- Returns:
{child_taxid: {'parent': parent_taxid, 'rank': rank}, …}
- Return type:
dict
- search_prot_accessions(accessions: set, sseqids_to_taxids: Optional[Dict[str, int]] = None, db: str = 'live') -> Dict[str, int]
Search prot.accession2taxid.gz and dead_prot.accession2taxid.gz
- Parameters:
accessions (set) – Set of subject sequence ids retrieved from diamond blastp search (sseqids)
sseqids_to_taxids (Dict[str, int], optional) – Dictionary containing sseqids converted to taxids
db (str, optional) – Selection of one of the prot accession to taxid databases from NCBI. Choices are live, dead, full
live: prot.accession2taxid.gz
full: prot.accession2taxid.FULL.gz
dead: dead_prot.accession2taxid.gz
- Returns:
Dictionary containing sseqids converted to taxids
- Return type:
Dict[str, int]
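A search of this kind can be sketched as a single streaming pass over the gzipped table. This is an illustration under assumptions rather than the method's implementation: search_accessions is a hypothetical function, the file is assumed to be tab-delimited with a header row and the columns accession, accession.version, taxid and gi, and matching on either the versioned or unversioned accession is a choice made for this sketch.

```python
import gzip
from typing import Dict, Set


def search_accessions(path: str, accessions: Set[str]) -> Dict[str, int]:
    """Stream a gzipped accession-to-taxid table and keep the requested sseqids."""
    found: Dict[str, int] = {}
    with gzip.open(path, "rt") as fh:
        next(fh, None)  # skip the header row (assumed to be present)
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip malformed rows
            accession, accession_version, taxid = fields[0], fields[1], fields[2]
            if accession_version in accessions:
                found[accession_version] = int(taxid)
            elif accession in accessions:
                found[accession] = int(taxid)
    return found
```

Streaming line by line keeps memory use proportional to the number of requested accessions instead of the size of the table.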
autometa.taxonomy.vote module
Script to split metagenome assembly by kingdoms given the input votes. The lineages of the provided voted taxids will also be added and written to taxonomy.tsv
- autometa.taxonomy.vote.add_ranks(df: pandas.DataFrame, taxa_db: TaxonomyDatabase) -> pandas.DataFrame
Add canonical ranks to df
- Parameters:
df (pd.DataFrame) – index='contig', column='taxid'
taxa_db (TaxonomyDatabase) – NCBI or GTDB TaxonomyDatabase instance.
- Returns:
index='contig', columns=['taxid', *canonical_ranks]
- Return type:
pd.DataFrame
- autometa.taxonomy.vote.assign(out: str, method: str = 'majority_vote', assembly: Optional[str] = None, prot_orfs: Optional[str] = None, nucl_orfs: Optional[str] = None, blast: Optional[str] = None, lca_fpath: Optional[str] = None, dbdir: str = './autometa/databases/ncbi', dbtype: Literal['ncbi', 'gtdb'] = 'ncbi', force: bool = False, verbose: bool = False, parallel: bool = True, cpus: int = 0) -> pandas.DataFrame
Assign taxonomy using method and write to out.
- Parameters:
out (str) – Path to write taxonomy table of votes
method (str, optional) – Method to assign contig taxonomy, by default 'majority_vote'. Choices include 'majority_vote', …
assembly (str, optional) – Path to assembly fasta file (nucleotide), by default None
prot_orfs (str, optional) – Path to amino-acid ORFs called from assembly, by default None
nucl_orfs (str, optional) – Path to nucleotide ORFs called from assembly, by default None
blast (str, optional) – Path to blastp table, by default None
lca_fpath (str, optional) – Path to output of LCA analysis, by default None
dbdir (str, optional) – Path to NCBI databases directory, by default NCBI_DIR
dbtype (str, optional) – Type of Taxonomy database to use, by default ncbi
force (bool, optional) – Overwrite existing annotations, by default False
verbose (bool, optional) – Increase verbosity, by default False
parallel (bool, optional) – Whether to perform annotations using multiprocessing and GNU parallel, by default True
cpus (int, optional) – Number of cpus to use if parallel is True, by default will try to use all available.
- Returns:
index='contig', columns=['taxid']
- Return type:
pd.DataFrame
- Raises:
NotImplementedError – Provided method has not yet been implemented.
ValueError – Assembly file is required if no other annotations are provided.
- autometa.taxonomy.vote.get(filepath_or_dataframe: Union[str, pandas.DataFrame], kingdom: str, taxa_db: TaxonomyDatabase) -> pandas.DataFrame
Retrieve specific kingdom voted taxa for assembly from filepath_or_dataframe
- Parameters:
filepath_or_dataframe (str or pd.DataFrame) – Path to tab-delimited taxonomy table, or the table itself. cols=['contig', 'taxid', *canonical_ranks]
kingdom (str) – rank to retrieve from superkingdom column in taxonomy table.
taxa_db (TaxonomyDatabase) – NCBI or GTDB TaxonomyDatabase instance. This is necessary only if the taxonomy table does not already contain columns of canonical ranks.
- Returns:
DataFrame of contigs pertaining to retrieved kingdom.
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – Provided filepath does not exist or is empty.
TableFormatError – Provided filepath does not contain the 'superkingdom' column.
KeyError – kingdom is absent in provided taxonomy table.
- autometa.taxonomy.vote.main()
- autometa.taxonomy.vote.write_ranks(taxonomy: pandas.DataFrame, assembly: str, outdir: str, rank: str = 'superkingdom', prefix: Optional[str] = None) -> List[str]
Write fastas split by rank
- Parameters:
taxonomy (pd.DataFrame) – dataframe containing canonical ranks of contigs assigned from autometa.taxonomy.vote.assign(…)
assembly (str) – Path to assembly fasta file
outdir (str) – Path to output directory to write fasta files
rank (str, optional) – canonical rank column in taxonomy table to split by, by default 'superkingdom'
prefix (str, optional) – Prefix each of the paths written with prefix string.
- Returns:
[rank_name_fpath, …]
- Return type:
list
- Raises:
KeyError – rank not in taxonomy columns