autometa.binning package
Submodules
autometa.binning.large_data_mode module
Autometa large-data-mode binning by selection of taxon sets using provided upper bound and determined lower bound
- autometa.binning.large_data_mode.checkpoint(checkpoints_df: pandas.DataFrame, clustered: pandas.DataFrame, rank: str, rank_name_txt: str, completeness: float, purity: float, coverage_stddev: float, gc_content_stddev: float, cluster_method: str, norm_method: str, pca_dimensions: int, embed_dimensions: int, embed_method: str, min_contigs: int, max_partition_size: int, binning_checkpoints_fpath: str) → pandas.DataFrame
- autometa.binning.large_data_mode.cluster_by_taxon_partitioning(main: pandas.DataFrame, counts: pandas.DataFrame, markers: pandas.DataFrame, norm_method: str = 'am_clr', pca_dimensions: int = 50, embed_dimensions: int = 2, embed_method: str = 'umap', max_partition_size: int = 10000, completeness: float = 20.0, purity: float = 95.0, coverage_stddev: float = 25.0, gc_content_stddev: float = 5.0, starting_rank: str = 'superkingdom', method: str = 'dbscan', reverse_ranks: bool = False, cache: Optional[str] = None, binning_checkpoints_fpath: Optional[str] = None, n_jobs: int = -1, verbose: bool = False) → pandas.DataFrame
Perform clustering of contigs by provided method and use metrics to filter clusters that should be retained via completeness and purity thresholds.
- Parameters:
main (pd.DataFrame) – index=contig, cols=[‘coverage’, ‘gc_content’] taxa cols should be present if taxonomy is True. i.e. [taxid,superkingdom,phylum,class,order,family,genus,species]
counts (pd.DataFrame) – contig kmer counts -> index_col=’contig’, cols=[‘AAAAA’, ‘AAAAT’, …] NOTE: columns will correspond to the selected k-mer count size. e.g. 3-mers would be [‘AAA’,’AAT’, …]
markers (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]
completeness (float, optional) – completeness threshold to retain cluster (the default is 20.0). e.g. cluster completeness >= completeness
purity (float, optional) – purity threshold to retain cluster (the default is 95.0). e.g. cluster purity >= purity_cutoff
coverage_stddev (float, optional) – cluster coverage threshold to retain cluster (the default is 25.0).
gc_content_stddev (float, optional) – cluster GC content threshold to retain cluster (the default is 5.0).
starting_rank (str, optional) – Starting canonical rank at which to begin subsetting taxonomy (the default is superkingdom). Choices are superkingdom, phylum, class, order, family, genus, species.
method (str, optional) – Clustering method (the default is ‘dbscan’). choices = [‘dbscan’,’hdbscan’]
reverse_ranks (bool, optional) – False - [superkingdom,phylum,class,order,family,genus,species] (Default) True - [species,genus,family,order,class,phylum,superkingdom]
cache (str, optional) – Directory to cache intermediate results
binning_checkpoints_fpath (str, optional) – File path to binning checkpoints (checkpoints are only created if the cache argument is provided)
verbose (bool, optional) – log stats for each recursive_dbscan clustering iteration
- Returns:
main with [‘cluster’,’completeness’,’purity’] columns added
- Return type:
pd.DataFrame
- Raises:
TableFormatError – No marker information is available for contigs to be binned.
FileNotFoundError – Provided binning_checkpoints_fpath does not exist
- autometa.binning.large_data_mode.get_checkpoint_info(checkpoints_fpath: str) → Dict[str, Union[pandas.DataFrame, str]]
Retrieve checkpoint information from generated binning_checkpoints.tsv
- Parameters:
checkpoints_fpath (str) – Generated binning_checkpoints.tsv within cache directory
- Returns:
Dictionary with keys “binning_checkpoints”, “starting_rank” and “starting_rank_name_txt” mapping to the binning checkpoints table (pd.DataFrame), the starting canonical rank (str) and the starting rank name within that canonical rank (str), respectively.
- Return type:
Dict[str, Union[pd.DataFrame, str]]
- autometa.binning.large_data_mode.get_kmer_embedding(counts: pandas.DataFrame, cache_fpath: str, norm_method: str, pca_dimensions: int, embed_dimensions: int, embed_method: str) → pandas.DataFrame
Retrieve kmer embeddings for provided counts by first performing kmer normalization with norm_method, then reducing to pca_dimensions with PCA, and finally embedding the normalized, PCA-transformed kmer frequencies to embed_dimensions using embed_method.
- Parameters:
counts (pd.DataFrame) – Kmer counts where index column is ‘contig’ and each column is a kmer count.
cache_fpath (str) – Path to cache embedded kmers table for later look-up/inspection.
norm_method (str) – normalization transformation to use on kmer counts. Choices include ‘am_clr’, ‘ilr’ and ‘clr’. See :func:kmers.normalize for more details.
pca_dimensions (int) – Number of dimensions by which to initially reduce normalized kmer frequencies (Must be greater than embed_dimensions).
embed_dimensions (int) – Embedding dimensions by which to reduce normalized PCA-transformed kmer frequencies (Must be less than pca_dimensions).
embed_method (str) – Embedding method to use on normalized, PCA-transformed kmer frequencies. Choices include ‘bhsne’, ‘sksne’ and ‘umap’. See :func:kmers.embed for more details.
- Returns:
Embedded k-mer frequencies indexed by contig, with one column per embedding dimension (e.g. [‘x_1’, ‘x_2’]).
- Return type:
pd.DataFrame
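The normalize → PCA → embed pipeline can be sketched in plain numpy. The `clr` below is a generic centered log-ratio with a pseudocount (autometa's ‘am_clr’ and :func:kmers.normalize may differ in detail), PCA is done via SVD rather than the library's own routine, and the final embedding step (e.g. umap) is omitted:

```python
import numpy as np

def clr(counts: np.ndarray) -> np.ndarray:
    """Centered log-ratio transform: a sketch of one 'clr'-style
    normalization (autometa's am_clr may differ in detail)."""
    # Pseudocount so zero counts are log-transformable
    freqs = (counts + 1) / (counts + 1).sum(axis=1, keepdims=True)
    logs = np.log(freqs)
    return logs - logs.mean(axis=1, keepdims=True)

def pca_reduce(X: np.ndarray, n_dims: int) -> np.ndarray:
    """Project rows of X onto the first n_dims principal components."""
    Xc = X - X.mean(axis=0, keepdims=True)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_dims].T

rng = np.random.default_rng(42)
counts = rng.integers(0, 50, size=(100, 136))  # toy: 100 contigs x 136 k-mers
normed = clr(counts.astype(float))
reduced = pca_reduce(normed, n_dims=50)  # pca_dimensions=50, as in the default
```

A CLR-normalized row always sums to zero, which is why a subsequent PCA step is commonly applied before the nonlinear embedding.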
- autometa.binning.large_data_mode.main()
autometa.binning.large_data_mode_loginfo module
Autometa module: autometa-large-data-mode-binning-loginfo
Generates tabular metadata from logfile for large-data-mode task (e.g. slurm_job.stderr)
- autometa.binning.large_data_mode_loginfo.add_clustering_runtime_summary_info(clustering_df: pandas.DataFrame, totals: pandas.DataFrame) → pandas.DataFrame
Retrieve information about the clustering that took the longest.
- Parameters:
clustering_df (pd.DataFrame) – Clustering runtime info retrieved from logfile
totals (pd.DataFrame) – runtime totals summary table
- Returns:
runtime totals summary table updated with clustering runtime info
- Return type:
pd.DataFrame
- autometa.binning.large_data_mode_loginfo.add_embedding_runtime_summary_info(embedding_df: pandas.DataFrame, totals: pandas.DataFrame) → pandas.DataFrame
Retrieve information about the embeddings that took the longest.
- Parameters:
embedding_df (pd.DataFrame) – Embedding info retrieved from logfile
totals (pd.DataFrame) – runtime totals summary table
- Returns:
runtime totals summary table updated with embedding info
- Return type:
pd.DataFrame
- autometa.binning.large_data_mode_loginfo.format_total_times(total_times: list, max_partition_size: str) → pandas.DataFrame
Format total runtimes from timedelta objects to hours
- Parameters:
total_times (list) – Runtime totals per algorithm during large-data-mode
max_partition_size (str) – Partition size parameter retrieved from log file
- Returns:
Formatted dataframe of timedelta objects and hours per algorithm in large-data-mode run.
- Return type:
pd.DataFrame
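A minimal pandas sketch of the timedelta-to-hours conversion this function performs (the column names here are assumptions, not the packaged schema):

```python
import pandas as pd

# Toy runtime totals per algorithm as (name, timedelta) pairs; the real
# function parses these out of the large-data-mode stderr log file.
total_times = [
    ("clustering", pd.Timedelta(hours=3, minutes=30)),
    ("embedding", pd.Timedelta(minutes=45)),
]

df = pd.DataFrame(total_times, columns=["algorithm", "timedelta"])
# Convert each timedelta to fractional hours
df["hours"] = df["timedelta"].dt.total_seconds() / 3600
```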
- autometa.binning.large_data_mode_loginfo.get_loginfo(logfile: str) → Dict[str, pandas.DataFrame]
Get autometa-large-data-mode-binning runtime information
Data
“embedding”: Embeddings ranks and times
“kmer_count_normalization”: K-mer count normalization times
“clustering”: Clustering ranks and times
“skipped_taxa”: Ranks above max_partition_size
“totals”: Total times for all binning tasks
- Parameters:
logfile (str) – Path to autometa-large-data-mode-binning stderr logfile
- Returns:
Dictionary containing large-data-mode-binning information corresponding to task
- Return type:
Dict[str, pd.DataFrame]
- autometa.binning.large_data_mode_loginfo.main()
autometa.binning.recursive_dbscan module
License: GNU Affero General Public License v3 or later. A copy of the GNU AGPL v3 should have been included in this software package in LICENSE.txt.
Cluster contigs recursively searching for bins with highest completeness and purity.
- autometa.binning.recursive_dbscan.get_clusters(main: pandas.DataFrame, markers_df: pandas.DataFrame, completeness: float, purity: float, coverage_stddev: float, gc_content_stddev: float, method: str, n_jobs: int = -1, verbose: bool = False) → pandas.DataFrame
Find best clusters retained after applying metrics filters.
- Parameters:
main (pd.DataFrame) – index=contig, cols=[‘x’,’y’,’coverage’,’gc_content’]
markers_df (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]
completeness (float) – completeness threshold to retain cluster. e.g. cluster completeness >= completeness
purity (float) – purity threshold to retain cluster. e.g. cluster purity >= purity
coverage_stddev (float) – cluster coverage std.dev. threshold to retain cluster. e.g. cluster coverage std.dev. <= coverage_stddev
gc_content_stddev (float) – cluster GC content std.dev. threshold to retain cluster. e.g. cluster GC content std.dev. <= gc_content_stddev
method (str) – Clustering method to use. choices = [‘dbscan’,’hdbscan’]
verbose (bool) – log stats for each recursive_dbscan clustering iteration
- Returns:
main with [‘cluster’,’completeness’,’purity’,’coverage_stddev’,’gc_content_stddev’] columns added
- Return type:
pd.DataFrame
- autometa.binning.recursive_dbscan.main()
- autometa.binning.recursive_dbscan.recursive_dbscan(table: pandas.DataFrame, markers_df: pandas.DataFrame, completeness_cutoff: float, purity_cutoff: float, coverage_stddev_cutoff: float, gc_content_stddev_cutoff: float, n_jobs: int = -1, verbose: bool = False) → Tuple[pandas.DataFrame, pandas.DataFrame]
Carry out DBSCAN, starting at eps=0.3 and continuing until there is just one group.
Break conditions to speed up the pipeline: give up if eps has reached 1.3 and there are still no complete and pure clusters. At eps 0.3 there are often zero complete and pure clusters because the groups are too small; some are found later as the groups enlarge, but once the count drops back to zero it is a lost cause and we may as well stop. On the other hand, sometimes no groups are ever found, so we also give up if no complete/pure groups have been found by eps 1.3.
- Parameters:
table (pd.DataFrame) – Contigs with embedded k-mer frequencies (‘x’,’y’), ‘coverage’ and ‘gc_content’ columns
markers_df (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]
completeness_cutoff (float) – completeness_cutoff threshold to retain cluster (the default is 20.0). e.g. cluster completeness >= completeness_cutoff
purity_cutoff (float) – purity_cutoff threshold to retain cluster (the default is 95.0). e.g. cluster purity >= purity_cutoff
coverage_stddev_cutoff (float) – coverage_stddev_cutoff threshold to retain cluster (the default is 25.0). e.g. cluster coverage std.dev. <= coverage_stddev_cutoff
gc_content_stddev_cutoff (float) – gc_content_stddev_cutoff threshold to retain cluster (the default is 5.0). e.g. cluster gc_content std.dev. <= gc_content_stddev_cutoff
verbose (bool) – log stats for each recursive_dbscan clustering iteration.
- Returns:
(pd.DataFrame(<passed cutoffs>), pd.DataFrame(<failed cutoffs>)) DataFrames consisting of contigs that passed/failed clustering cutoffs, respectively.
- DataFrame:
index = contig, columns = [‘x’,’y’,’coverage’,’gc_content’,’cluster’,’purity’,’completeness’,’coverage_stddev’,’gc_content_stddev’]
- Return type:
2-tuple
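The eps sweep and its break conditions can be sketched with scikit-learn's DBSCAN. This is a simplified stand-in: the toy loop below counts clusters rather than evaluating completeness/purity on markers, but the sweep structure (start at 0.3, stop when everything merges or eps passes 1.3) follows the description above:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two well-separated toy blobs standing in for embedded contigs
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])

eps = 0.3
n_clusters_per_eps = []
while eps <= 1.3:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels) - {-1})  # -1 marks noise, not a cluster
    n_clusters_per_eps.append(n_clusters)
    if n_clusters <= 1:
        break  # everything merged (or nothing found): stop the sweep
    eps += 0.1
```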
- autometa.binning.recursive_dbscan.recursive_hdbscan(table: pandas.DataFrame, markers_df: pandas.DataFrame, completeness_cutoff: float, purity_cutoff: float, coverage_stddev_cutoff: float, gc_content_stddev_cutoff: float, n_jobs: int = -1, verbose: bool = False) → Tuple[pandas.DataFrame, pandas.DataFrame]
Recursively run HDBSCAN starting with defaults and iterating the min_samples and min_cluster_size until only 1 cluster is recovered.
- Parameters:
table (pd.DataFrame) – Contigs with embedded k-mer frequencies (‘x’,’y’), ‘coverage’ and ‘gc_content’ columns
markers_df (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]
completeness_cutoff (float) – completeness_cutoff threshold to retain cluster. e.g. cluster completeness >= completeness_cutoff
purity_cutoff (float) – purity_cutoff threshold to retain cluster. e.g. cluster purity >= purity_cutoff
coverage_stddev_cutoff (float) – coverage_stddev_cutoff threshold to retain cluster. e.g. cluster coverage std.dev. <= coverage_stddev_cutoff
gc_content_stddev_cutoff (float) – gc_content_stddev_cutoff threshold to retain cluster. e.g. cluster gc_content std.dev. <= gc_content_stddev_cutoff
verbose (bool) – log stats for each recursive_dbscan clustering iteration.
- Returns:
(pd.DataFrame(<passed cutoffs>), pd.DataFrame(<failed cutoffs>)) DataFrames consisting of contigs that passed/failed clustering cutoffs, respectively.
- DataFrame:
index = contig columns = [‘x_1’,’x_2’,’coverage’,’gc_content’,’cluster’,’purity’,’completeness’,’coverage_stddev’,’gc_content_stddev’]
- Return type:
2-tuple
- autometa.binning.recursive_dbscan.run_dbscan(df: pandas.DataFrame, eps: float, n_jobs: int = -1, dropcols: List[str] = ['cluster', 'purity', 'completeness', 'coverage_stddev', 'gc_content_stddev']) → pandas.DataFrame
Run clustering on df at provided eps.
- Parameters:
df (pd.DataFrame) – Contigs with embedded k-mer frequencies as [‘x_1’,’x_2’,…, ‘x_ndims’] columns and ‘coverage’ column
eps (float) – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function. See DBSCAN docs for more details.
dropcols (list, optional) – Drop columns in list from df (the default is [‘cluster’,’purity’,’completeness’,’coverage_stddev’,’gc_content_stddev’]).
- Returns:
df with ‘cluster’ column added
- Return type:
pd.DataFrame
- Raises:
BinningError – Dataframe is missing kmer/coverage annotations
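As a rough illustration of the contract (drop any stale metric columns, cluster on the remaining numeric columns, add a ‘cluster’ column), using scikit-learn's DBSCAN on toy data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(7)
# Toy table: embedded k-mer dims plus a stale 'cluster' column from a
# previous round that must be dropped before re-clustering
df = pd.DataFrame(
    {
        "x_1": np.r_[rng.normal(0, 0.05, 30), rng.normal(3, 0.05, 30)],
        "x_2": np.r_[rng.normal(0, 0.05, 30), rng.normal(3, 0.05, 30)],
        "coverage": 10.0,
        "cluster": "stale",
    },
    index=[f"contig_{i}" for i in range(60)],
)

dropcols = ["cluster", "purity", "completeness", "coverage_stddev", "gc_content_stddev"]
features = df.drop(columns=[c for c in dropcols if c in df.columns])
df["cluster"] = DBSCAN(eps=0.3, min_samples=5).fit_predict(features)
```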
- autometa.binning.recursive_dbscan.run_hdbscan(df: pandas.DataFrame, min_cluster_size: int, min_samples: int, n_jobs: int = -1) → pandas.DataFrame
Run clustering on df at provided min_cluster_size.
Notes
Reasoning for parameter: cluster_selection_method
Reasoning for parameters: min_cluster_size and min_samples
Documentation for HDBSCAN
- Parameters:
df (pd.DataFrame) – Contigs with embedded k-mer frequencies as [‘x’,’y’] columns and optionally ‘coverage’ column
min_cluster_size (int) – The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.
min_samples (int) – The number of samples in a neighborhood for a point to be considered a core point.
n_jobs (int) – Number of parallel jobs to run in core distance computations. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.
- Returns:
df with ‘cluster’ column added
- Return type:
pd.DataFrame
- Raises:
ValueError – sets usecols and dropcols may not share elements
TableFormatError – df is missing k-mer or coverage annotations.
- autometa.binning.recursive_dbscan.taxon_guided_binning(main: pandas.DataFrame, markers: pandas.DataFrame, completeness: float = 20.0, purity: float = 95.0, coverage_stddev: float = 25.0, gc_content_stddev: float = 5.0, starting_rank: str = 'superkingdom', method: str = 'dbscan', reverse_ranks: bool = False, n_jobs: int = -1, verbose: bool = False) → pandas.DataFrame
Perform clustering of contigs by provided method and use metrics to filter clusters that should be retained via completeness and purity thresholds.
- Parameters:
main (pd.DataFrame) – index=contig, cols=[‘x’,’y’, ‘coverage’, ‘gc_content’] taxa cols should be present if taxonomy is True. i.e. [taxid,superkingdom,phylum,class,order,family,genus,species]
markers (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]
completeness (float, optional) – completeness threshold to retain cluster (the default is 20.0). e.g. cluster completeness >= completeness
purity (float, optional) – purity threshold to retain cluster (the default is 95.0). e.g. cluster purity >= purity_cutoff
coverage_stddev (float, optional) – cluster coverage threshold to retain cluster (the default is 25.0).
gc_content_stddev (float, optional) – cluster GC content threshold to retain cluster (the default is 5.0).
starting_rank (str, optional) – Starting canonical rank at which to begin subsetting taxonomy (the default is superkingdom). Choices are superkingdom, phylum, class, order, family, genus, species.
method (str, optional) – Clustering method (the default is ‘dbscan’). choices = [‘dbscan’,’hdbscan’]
reverse_ranks (bool, optional) – False - [superkingdom,phylum,class,order,family,genus,species] (Default) True - [species,genus,family,order,class,phylum,superkingdom]
verbose (bool, optional) – log stats for each recursive_dbscan clustering iteration
- Returns:
main with [‘cluster’,’completeness’,’purity’] columns added
- Return type:
pd.DataFrame
- Raises:
TableFormatError – No marker information is available for contigs to be binned.
autometa.binning.summary module
License: GNU Affero General Public License v3 or later. A copy of the GNU AGPL v3 should have been included in this software package in LICENSE.txt.
Script to summarize Autometa binning results
- autometa.binning.summary.fragmentation_metric(df: pandas.DataFrame, quality_measure: float = 0.5) → int
Describes the quality of assembled genomes that are fragmented in contigs of different length.
Note
For more information see this metagenomics wiki from Matthias Scholz
- Parameters:
df (pd.DataFrame) – DataFrame to assess fragmentation within metagenome-assembled genome.
quality_measure (0 < float < 1) – Fraction of the genome the metric should cover (the default is 0.5, i.e. N50; use 0.1 for N10 or 0.9 for N90).
- Returns:
Minimum contig length to cover quality_measure of genome (i.e. percentile contig length)
- Return type:
int
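The metric above is the classic Nx statistic. A sketch consistent with the description (tie handling in the packaged implementation may differ):

```python
import pandas as pd

def fragmentation_metric(df: pd.DataFrame, quality_measure: float = 0.5) -> int:
    """Minimum contig length such that contigs at least this long cover
    `quality_measure` of the total assembly length (N50 when 0.5)."""
    lengths = df["length"].sort_values(ascending=False)
    target = lengths.sum() * quality_measure
    # First length (descending) at which the cumulative sum reaches the target
    return int(lengths[lengths.cumsum() >= target].iloc[0])

toy = pd.DataFrame({"length": [100, 200, 300, 400, 500]})
n50 = fragmentation_metric(toy)  # total=1500, target=750; 500+400=900 >= 750 -> 400
```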
- autometa.binning.summary.get_agg_stats(cluster_groups: pandas.core.groupby.generic.DataFrameGroupBy, stat_col: str) → pandas.DataFrame
Compute min, max, (length weighted) mean and median from provided stat_col
- Parameters:
cluster_groups (pd.core.groupby.generic.DataFrameGroupBy) – pandas DataFrame grouped by cluster
stat_col (str) – column on which to compute min, max, (length-weighted) mean and median
- Returns:
index=cluster, columns=[min_{stat_col}, max_{stat_col}, std_{stat_col}, length_weighted_{stat_col}]
- Return type:
pd.DataFrame
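The length-weighted mean is the one non-obvious aggregate here; a pandas groupby sketch of how such a statistic can be computed (column names follow the binning table described in this module):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cluster": ["bin_1", "bin_1", "bin_2", "bin_2"],
        "length": [1000, 3000, 500, 500],
        "gc_content": [40.0, 60.0, 30.0, 50.0],
    }
)

def length_weighted_mean(group: pd.DataFrame, stat_col: str) -> float:
    # Weight each contig's value by its length so long contigs dominate
    return (group[stat_col] * group["length"]).sum() / group["length"].sum()

weighted = df.groupby("cluster").apply(length_weighted_mean, stat_col="gc_content")
```

For bin_1 this gives (40·1000 + 60·3000) / 4000 = 55.0, whereas the unweighted mean would be 50.0.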
- autometa.binning.summary.get_metabin_stats(bin_df: pandas.DataFrame, markers: Union[str, pandas.DataFrame], cluster_col: str = 'cluster') → pandas.DataFrame
Retrieve statistics for all clusters recovered from Autometa binning.
- Parameters:
bin_df (pd.DataFrame) – Autometa binning table. index=contig, cols=[‘cluster’,’length’, ‘gc_content’, ‘coverage’, …]
markers (str,pd.DataFrame) – Path to or pd.DataFrame of markers table corresponding to contigs in bin_df
cluster_col (str, optional) – Clustering column by which to group metabins
- Returns:
dataframe consisting of various metagenome-assembled genome statistics indexed by cluster.
- Return type:
pd.DataFrame
- Raises:
TypeError – markers should be a path to or pd.DataFrame of a markers table corresponding to contigs in bin_df
ValueError – One of the required columns (cluster_col, coverage, length, gc_content) was not found in bin_df
- autometa.binning.summary.get_metabin_taxonomies(bin_df: pandas.DataFrame, taxa_db: TaxonomyDatabase, cluster_col: str = 'cluster') → pandas.DataFrame
Retrieve taxonomies of all clusters recovered from Autometa binning.
- Parameters:
bin_df (pd.DataFrame) – Autometa binning table. index=contig, cols=[‘cluster’,’length’,’taxid’, *canonical_ranks]
taxa_db (autometa.taxonomy.ncbi.TaxonomyDatabase instance) – Autometa NCBI or GTDB class instance
cluster_col (str, optional) – Clustering column by which to group metabins
- Returns:
Dataframe consisting of cluster taxonomy with taxid and canonical rank. Indexed by cluster
- Return type:
pd.DataFrame
- autometa.binning.summary.main()
- autometa.binning.summary.write_cluster_records(bin_df: pandas.DataFrame, metagenome: str, outdir: str, cluster_col: str = 'cluster') → None
Write clusters to outdir given clusters df and metagenome records
- Parameters:
bin_df (pd.DataFrame) – Autometa binning dataframe. index=’contig’, cols=[‘cluster’, …]
metagenome (str) – Path to metagenome fasta file
outdir (str) – Path to output directory to write fastas for each metagenome-assembled genome
cluster_col (str, optional) – Clustering column by which to group metabins
autometa.binning.unclustered_recruitment module
autometa.binning.utilities module
binning utilities script for autometa-binning
Script containing utility functions when performing autometa clustering/classification tasks.
- autometa.binning.utilities.add_metrics(df: pandas.DataFrame, markers_df: pandas.DataFrame) → Tuple[pandas.DataFrame, pandas.DataFrame]
Adds cluster metrics to each respective contig in df.
:math:`completeness = \frac{markers_{cluster}}{markers_{ref}} \times 100`
:math:`purity = \frac{markers_{single\text{-}copy}}{markers_{cluster}} \times 100`
:math:`\sigma_{coverage} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2}}`
:math:`\sigma_{GC\,content} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2}}`
- Parameters:
df (pd.DataFrame) – index=’contig’ cols=[‘coverage’,’gc_content’,’cluster’,’x_1’,’x_2’,…,’x_n’]
markers_df (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]
- Returns:
2-tuple – (df with added cluster metrics columns=[‘completeness’, ‘purity’, ‘coverage_stddev’, ‘gc_content_stddev’], pd.DataFrame(index=clusters, cols=[‘completeness’, ‘purity’, ‘coverage_stddev’, ‘gc_content_stddev’]))
- autometa.binning.utilities.apply_binning_metrics_filter(df: pandas.DataFrame, completeness_cutoff: float = 20.0, purity_cutoff: float = 95.0, coverage_stddev_cutoff: float = 25.0, gc_content_stddev_cutoff: float = 5.0) → pandas.DataFrame
Filter df by provided cutoff values.
- Parameters:
df (pd.DataFrame) – Dataframe containing binning metrics ‘completeness’, ‘purity’, ‘coverage_stddev’ and ‘gc_content_stddev’
completeness_cutoff (float) – completeness_cutoff threshold to retain cluster (the default is 20.0). e.g. cluster completeness >= completeness_cutoff
purity_cutoff (float) – purity_cutoff threshold to retain cluster (the default is 95.00). e.g. cluster purity >= purity_cutoff
coverage_stddev_cutoff (float) – coverage_stddev_cutoff threshold to retain cluster (the default is 25.0). e.g. cluster coverage std.dev. <= coverage_stddev_cutoff
gc_content_stddev_cutoff (float) – gc_content_stddev_cutoff threshold to retain cluster (the default is 5.0). e.g. cluster gc_content std.dev. <= gc_content_stddev_cutoff
- Returns:
Cutoff filtered df
- Return type:
pd.DataFrame
- Raises:
KeyError – One of metrics to apply cutoff does not exist in the df columns
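A minimal pandas sketch of the same cutoff filter, using the documented default thresholds:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "completeness": [85.0, 10.0, 50.0],
        "purity": [98.0, 99.0, 80.0],
        "coverage_stddev": [5.0, 2.0, 30.0],
        "gc_content_stddev": [1.0, 0.5, 6.0],
    },
    index=["bin_1", "bin_2", "bin_3"],
)

# completeness/purity are lower bounds; the std.dev. metrics are upper bounds
filtered = df[
    (df["completeness"] >= 20.0)
    & (df["purity"] >= 95.0)
    & (df["coverage_stddev"] <= 25.0)
    & (df["gc_content_stddev"] <= 5.0)
]
```

Only bin_1 survives: bin_2 fails the completeness cutoff and bin_3 fails purity and both spread cutoffs.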
- autometa.binning.utilities.filter_taxonomy(df: pandas.DataFrame, rank: str, name: str) → pandas.DataFrame
Clean taxon names (by broadcasting lowercase and replacing whitespace) then subset by all contigs under rank that are equal to name.
- Parameters:
df (pd.DataFrame) – Input dataframe containing columns of canonical ranks.
rank (str) – Canonical rank on which to apply filtering.
name (str) – Taxon in rank to retrieve.
- Returns:
DataFrame subset by df[rank] == name
- Return type:
pd.DataFrame
- Raises:
KeyError – rank not in taxonomy columns.
ValueError – Provided name not found in rank column.
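A sketch of the clean-then-subset logic (here cleaning is lowercasing plus stripping surrounding whitespace; the packaged implementation's whitespace replacement may differ in detail):

```python
import pandas as pd

df = pd.DataFrame(
    {"phylum": ["Proteobacteria", "proteobacteria ", "Firmicutes"]},
    index=["c1", "c2", "c3"],
)

rank, name = "phylum", "proteobacteria"
# Clean taxon names before comparing so casing/whitespace don't split taxa
cleaned = df[rank].str.lower().str.strip()
subset = df[cleaned == name]
```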
- autometa.binning.utilities.read_annotations(annotations: Iterable, how: str = 'inner') → pandas.DataFrame
Read in a list of contig annotations from filepaths and return all provided annotations in a single dataframe.
- Parameters:
annotations (Iterable) – Filepaths of annotations. These should all contain a ‘contig’ column to be used as the index
how (str, optional) – How to join the provided annotations. By default will take the ‘inner’ or intersection of all contigs from annotations.
- Returns:
index_col=’contig’, cols=[annotations, …]
- Return type:
pd.DataFrame
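The inner-join behavior can be sketched with functools.reduce over in-memory tables standing in for the annotation files:

```python
from functools import reduce

import pandas as pd

# Stand-ins for annotation tables that would normally be read from file,
# each indexed by 'contig'
coverage = pd.DataFrame(
    {"coverage": [10.0, 20.0]}, index=pd.Index(["c1", "c2"], name="contig")
)
gc = pd.DataFrame(
    {"gc_content": [45.0, 55.0, 60.0]}, index=pd.Index(["c1", "c2", "c3"], name="contig")
)

# 'inner' keeps only contigs present in every annotation table
annotations = reduce(lambda left, right: left.join(right, how="inner"), [coverage, gc])
```

Contig c3 is dropped because it lacks a coverage annotation; with how='outer' it would be kept with a NaN coverage.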
- autometa.binning.utilities.reindex_bin_names(df: pandas.DataFrame, cluster_col: str = 'cluster', initial_index: int = 0) → pandas.DataFrame
Re-index cluster_col using the provided initial_index as the initial index number then enumerating from this to the number of bins in cluster_col of df.
- Parameters:
df (pd.DataFrame) – Dataframe containing cluster_col
cluster_col (str, optional) – Cluster column to apply reindexing, by default “cluster”
initial_index (int, optional) – Starting index number when reindexing, by default 0
Note
The bin names will start one number above the initial_index number provided. Therefore, the default behavior is to use 0 as the initial_index meaning the first bin name will be bin_1.
Example
>>> import pandas as pd
>>> from autometa.binning.utilities import reindex_bin_names
>>> df = pd.read_csv("binning.tsv", sep='\t', index_col='contig')
>>> reindex_bin_names(df, cluster_col='cluster', initial_index=0)
             cluster  completeness  purity  coverage_stddev  gc_content_stddev
contig
k141_1102126   bin_1     90.647482   100.0          1.20951           1.461658
k141_110415    bin_1     90.647482   100.0          1.20951           1.461658
k141_1210233   bin_1     90.647482   100.0          1.20951           1.461658
k141_1227553   bin_1     90.647482   100.0          1.20951           1.461658
k141_1227735   bin_1     90.647482   100.0          1.20951           1.461658
...              ...           ...     ...              ...                ...
k141_999969      NaN           NaN     NaN              NaN                NaN
k141_99997       NaN           NaN     NaN              NaN                NaN
k141_999982      NaN           NaN     NaN              NaN                NaN
k141_999984      NaN           NaN     NaN              NaN                NaN
k141_999987      NaN           NaN     NaN              NaN                NaN
- Returns:
DataFrame of re-indexed bins in cluster_col starting at initial_index + 1
- Return type:
pd.DataFrame
- autometa.binning.utilities.write_results(results: pandas.DataFrame, binning_output: str, full_output: Optional[str] = None) → None
Write out binning results with their respective binning metrics
- Parameters:
results (pd.DataFrame) – Binning results contigs dataframe consisting of “cluster” assignments with their respective metrics and annotations
binning_output (str) – Filepath to write binning results
full_output (str, optional) – If provided, will write assignments, metrics and annotations together into full_output (filepath)
- Return type:
NoneType
- autometa.binning.utilities.zero_pad_bin_names(df: pandas.DataFrame, cluster_col: str = 'cluster') → pandas.DataFrame
Apply zero padding to cluster_col using the number of digits in the count of unique clusters in cluster_col of the df.
- Parameters:
df (pd.DataFrame) – Dataframe containing cluster_col
cluster_col (str, optional) – Cluster column to apply zero padding, by default “cluster”
- Returns:
Dataframe with cluster_col zero padded to the length of the number of clusters
- Return type:
pd.DataFrame
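A sketch of the zero-padding scheme: with 11 clusters the padding width is two digits, so bin_1 becomes bin_01 while bin_11 is unchanged. The bin-name parsing below is an assumption for illustration; only the width logic follows the description above:

```python
import pandas as pd

df = pd.DataFrame({"cluster": [f"bin_{i}" for i in range(1, 12)]})
n_clusters = df["cluster"].nunique()  # 11 unique clusters
width = len(str(n_clusters))          # -> pad numbers to 2 digits

def zero_pad(name: str) -> str:
    # Split "bin_3" into prefix "bin" and number "3", then zero pad the number
    prefix, _, num = name.rpartition("_")
    return f"{prefix}_{int(num):0{width}d}"

df["cluster"] = df["cluster"].map(zero_pad)
```

Zero padding keeps bin names in lexicographic order equal to their numeric order, which makes downstream file listings sort sensibly.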