autometa.binning package

Submodules

autometa.binning.large_data_mode module

Autometa large-data-mode binning by selection of taxon sets using provided upper bound and determined lower bound

autometa.binning.large_data_mode.checkpoint(checkpoints_df: pandas.DataFrame, clustered: pandas.DataFrame, rank: str, rank_name_txt: str, completeness: float, purity: float, coverage_stddev: float, gc_content_stddev: float, cluster_method: str, norm_method: str, pca_dimensions: int, embed_dimensions: int, embed_method: str, min_contigs: int, max_partition_size: int, binning_checkpoints_fpath: str) pandas.DataFrame
autometa.binning.large_data_mode.cluster_by_taxon_partitioning(main: pandas.DataFrame, counts: pandas.DataFrame, markers: pandas.DataFrame, norm_method: str = 'am_clr', pca_dimensions: int = 50, embed_dimensions: int = 2, embed_method: str = 'umap', max_partition_size: int = 10000, completeness: float = 20.0, purity: float = 95.0, coverage_stddev: float = 25.0, gc_content_stddev: float = 5.0, starting_rank: str = 'superkingdom', method: str = 'dbscan', reverse_ranks: bool = False, cache: Optional[str] = None, binning_checkpoints_fpath: Optional[str] = None, n_jobs: int = -1, verbose: bool = False) pandas.DataFrame

Perform clustering of contigs by provided method and use metrics to filter clusters that should be retained via completeness and purity thresholds.

Parameters:
  • main (pd.DataFrame) – index=contig, cols=[‘coverage’, ‘gc_content’] taxa cols should be present if taxonomy is True. i.e. [taxid,superkingdom,phylum,class,order,family,genus,species]

  • counts (pd.DataFrame) – contig kmer counts -> index_col=’contig’, cols=[‘AAAAA’, ‘AAAAT’, …] NOTE: columns will correspond to the selected k-mer count size. e.g. 3-mers would be [‘AAA’,’AAT’, …]

  • markers (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]

  • completeness (float, optional) – completeness threshold to retain cluster (the default is 20.0). e.g. cluster completeness >= completeness

  • purity (float, optional) – purity threshold to retain cluster (the default is 95.0). e.g. cluster purity >= purity_cutoff

  • coverage_stddev (float, optional) – cluster coverage threshold to retain cluster (the default is 25.0).

  • gc_content_stddev (float, optional) – cluster GC content threshold to retain cluster (the default is 5.0).

  • starting_rank (str, optional) – Starting canonical rank at which to begin subsetting taxonomy (the default is superkingdom). Choices are superkingdom, phylum, class, order, family, genus, species.

  • method (str, optional) – Clustering method (the default is ‘dbscan’). choices = [‘dbscan’,’hdbscan’]

  • reverse_ranks (bool, optional) – False - [superkingdom,phylum,class,order,family,genus,species] (Default) True - [species,genus,family,order,class,phylum,superkingdom]

  • cache (str, optional) – Directory to cache intermediate results

  • binning_checkpoints_fpath (str, optional) – File path to binning checkpoints (checkpoints are only created if the cache argument is provided)

  • verbose (bool, optional) – log stats for each recursive_dbscan clustering iteration

Returns:

main with [‘cluster’,’completeness’,’purity’] columns added

Return type:

pd.DataFrame

Raises:
  • TableFormatError – No marker information is available for contigs to be binned.

  • FileNotFoundError – Provided binning_checkpoints_fpath does not exist
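
Example

A minimal usage sketch (the filepaths are hypothetical; each table is expected to follow the layout described above):

>>> import pandas as pd
>>> from autometa.binning.large_data_mode import cluster_by_taxon_partitioning
>>> main = pd.read_csv('main.tsv', sep='\t', index_col='contig')
>>> counts = pd.read_csv('kmers.tsv', sep='\t', index_col='contig')
>>> markers = pd.read_csv('markers.tsv', sep='\t', index_col='contig')
>>> binned = cluster_by_taxon_partitioning(main, counts, markers, cache='cache_dir')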

autometa.binning.large_data_mode.get_checkpoint_info(checkpoints_fpath: str) Dict[str, Union[pandas.DataFrame, str]]

Retrieve checkpoint information from generated binning_checkpoints.tsv

Parameters:

checkpoints_fpath (str) – Generated binning_checkpoints.tsv within cache directory

Returns:

Binning checkpoints, starting canonical rank, and starting taxon name within that rank. keys=“binning_checkpoints”, “starting_rank”, “starting_rank_name_txt”; values=pd.DataFrame, str, str

Return type:

Dict[str, Union[pd.DataFrame, str]]
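
Example

A usage sketch (the cache path is hypothetical):

>>> from autometa.binning.large_data_mode import get_checkpoint_info
>>> info = get_checkpoint_info('cache_dir/binning_checkpoints.tsv')
>>> info['binning_checkpoints']  # pd.DataFrame of checkpointed binning assignments
>>> info['starting_rank']  # e.g. 'superkingdom'
>>> info['starting_rank_name_txt']  # e.g. 'bacteria'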

autometa.binning.large_data_mode.get_kmer_embedding(counts: pandas.DataFrame, cache_fpath: str, norm_method: str, pca_dimensions: int, embed_dimensions: int, embed_method: str) pandas.DataFrame

Retrieve kmer embeddings for provided counts by first performing kmer normalization with norm_method then PCA down to pca_dimensions until the normalized kmer frequencies are embedded to embed_dimensions using embed_method.

Parameters:
  • counts (pd.DataFrame) – Kmer counts where index column is ‘contig’ and each column is a kmer count.

  • cache_fpath (str) – Path to cache embedded kmers table for later look-up/inspection.

  • norm_method (str) – normalization transformation to use on kmer counts. Choices include ‘am_clr’, ‘ilr’ and ‘clr’. See :func:kmers.normalize for more details.

  • pca_dimensions (int) – Number of dimensions by which to initially reduce normalized kmer frequencies (Must be greater than embed_dimensions).

  • embed_dimensions (int) – Embedding dimensions by which to reduce normalized PCA-transformed kmer frequencies (Must be less than pca_dimensions).

  • embed_method (str) – Embedding method to use on normalized, PCA-transformed kmer frequencies. Choices include ‘bhsne’, ‘sksne’ and ‘umap’. See :func:kmers.embed for more details.

Returns:

Embedded k-mer frequencies (index=contig) reduced to embed_dimensions columns

Return type:

pd.DataFrame
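
Example

A usage sketch (filepaths are hypothetical; counts follows the layout described above):

>>> import pandas as pd
>>> from autometa.binning.large_data_mode import get_kmer_embedding
>>> counts = pd.read_csv('kmers.tsv', sep='\t', index_col='contig')
>>> embedded = get_kmer_embedding(
...     counts,
...     cache_fpath='cache_dir/embedded_kmers.tsv',
...     norm_method='am_clr',
...     pca_dimensions=50,
...     embed_dimensions=2,
...     embed_method='umap',
... )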

autometa.binning.large_data_mode.main()

autometa.binning.large_data_mode_loginfo module

Autometa module: autometa-large-data-mode-binning-loginfo

Generates tabular metadata from logfile for large-data-mode task (e.g. slurm_job.stderr)

autometa.binning.large_data_mode_loginfo.add_clustering_runtime_summary_info(clustering_df: pandas.DataFrame, totals: pandas.DataFrame) pandas.DataFrame

Retrieve information about the clustering that took the longest.

Parameters:
  • clustering_df (pd.DataFrame) – Clustering runtime info retrieved from logfile

  • totals (pd.DataFrame) – runtime totals summary table

Returns:

runtime totals summary table updated with clustering runtime info

Return type:

pd.DataFrame

autometa.binning.large_data_mode_loginfo.add_embedding_runtime_summary_info(embedding_df: pandas.DataFrame, totals: pandas.DataFrame) pandas.DataFrame

Retrieve information about the embeddings that took the longest.

Parameters:
  • embedding_df (pd.DataFrame) – Embedding info retrieved from logfile

  • totals (pd.DataFrame) – runtime totals summary table

Returns:

runtime totals summary table updated with embedding info

Return type:

pd.DataFrame

autometa.binning.large_data_mode_loginfo.format_total_times(total_times: list, max_partition_size: str) pandas.DataFrame

Format total runtimes from timedelta objects to hours

Parameters:
  • total_times (list) – Runtime totals per algorithm during large-data-mode

  • max_partition_size (str) – Partition size parameter retrieved from log file

Returns:

Formatted dataframe of timedelta objects and hours per algorithm in large-data-mode run.

Return type:

pd.DataFrame

autometa.binning.large_data_mode_loginfo.get_loginfo(logfile: str) Dict[str, pandas.DataFrame]

Get autometa-large-data-mode-binning runtime information

Data

  1. “embedding”: Embeddings ranks and times

  2. “kmer_count_normalization”: K-mer count normalization times

  3. “clustering”: Clustering ranks and times

  4. “skipped_taxa”: Ranks above max_partition_size

  5. “totals”: Total times for all binning tasks

Parameters:

logfile (str) – Path to autometa-large-data-mode-binning stderr logfile

Returns:

Dictionary containing large-data-mode-binning information corresponding to task

Return type:

Dict[str, pd.DataFrame]
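
Example

A usage sketch (the logfile path is hypothetical):

>>> from autometa.binning.large_data_mode_loginfo import get_loginfo
>>> loginfo = get_loginfo('slurm_job.stderr')
>>> loginfo['totals']  # total runtimes across all binning tasks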

autometa.binning.large_data_mode_loginfo.main()

autometa.binning.recursive_dbscan module

Cluster contigs recursively searching for bins with highest completeness and purity.

autometa.binning.recursive_dbscan.get_clusters(main: pandas.DataFrame, markers_df: pandas.DataFrame, completeness: float, purity: float, coverage_stddev: float, gc_content_stddev: float, method: str, n_jobs: int = -1, verbose: bool = False) pandas.DataFrame

Find best clusters retained after applying metrics filters.

Parameters:
  • main (pd.DataFrame) – index=contig, cols=[‘x’,’y’,’coverage’,’gc_content’]

  • markers_df (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]

  • completeness (float) – completeness threshold to retain cluster. e.g. cluster completeness >= completeness

  • purity (float) – purity threshold to retain cluster. e.g. cluster purity >= purity

  • coverage_stddev (float) – cluster coverage std.dev. threshold to retain cluster. e.g. cluster coverage std.dev. <= coverage_stddev

  • gc_content_stddev (float) – cluster GC content std.dev. threshold to retain cluster. e.g. cluster GC content std.dev. <= gc_content_stddev

  • method (str) – Clustering method to use. choices = [‘dbscan’,’hdbscan’]

  • verbose (bool) – log stats for each recursive_dbscan clustering iteration

Returns:

main with [‘cluster’,’completeness’,’purity’,’coverage_stddev’,’gc_content_stddev’] columns added

Return type:

pd.DataFrame

autometa.binning.recursive_dbscan.main()
autometa.binning.recursive_dbscan.recursive_dbscan(table: pandas.DataFrame, markers_df: pandas.DataFrame, completeness_cutoff: float, purity_cutoff: float, coverage_stddev_cutoff: float, gc_content_stddev_cutoff: float, n_jobs: int = -1, verbose: bool = False) Tuple[pandas.DataFrame, pandas.DataFrame]

Carry out DBSCAN, starting at eps=0.3 and continuing until there is just one group.

Break conditions to speed up the pipeline: at eps=0.3 there are often zero complete and pure clusters because the groups are too small; as eps grows, the groups enlarge enough for some clusters to pass the cutoffs. Once the count of complete/pure clusters drops back to zero after having been non-zero, continuing is a lost cause and the search stops. On the other hand, some datasets never yield any complete/pure groups, so the search also gives up if none have been found by eps=1.3. (A condensed sketch of this scan appears after this entry.)

Parameters:
  • table (pd.DataFrame) – Contigs with embedded k-mer frequencies (‘x’,’y’), ‘coverage’ and ‘gc_content’ columns

  • markers_df (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]

  • completeness_cutoff (float) – completeness_cutoff threshold to retain cluster (the default is 20.0). e.g. cluster completeness >= completeness_cutoff

  • purity_cutoff (float) – purity_cutoff threshold to retain cluster (the default is 95.0). e.g. cluster purity >= purity_cutoff

  • coverage_stddev_cutoff (float) – coverage_stddev_cutoff threshold to retain cluster (the default is 25.0). e.g. cluster coverage std.dev. <= coverage_stddev_cutoff

  • gc_content_stddev_cutoff (float) – gc_content_stddev_cutoff threshold to retain cluster (the default is 5.0). e.g. cluster gc_content std.dev. <= gc_content_stddev_cutoff

  • verbose (bool) – log stats for each recursive_dbscan clustering iteration.

Returns:

(pd.DataFrame(<passed cutoffs>), pd.DataFrame(<failed cutoffs>)) DataFrames consisting of contigs that passed/failed clustering cutoffs, respectively.

DataFrame:

index = contig columns = [‘x’,’y’,’coverage’,’gc_content’,’cluster’,’purity’,’completeness’,’coverage_stddev’,’gc_content_stddev’]

Return type:

2-tuple
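
A condensed, self-contained sketch of the eps scan described above (evaluate is a hypothetical stand-in for the completeness/purity bookkeeping; this is not the actual implementation):

def eps_scan(table, run_dbscan, evaluate, eps=0.3, step=0.1, max_eps=1.3):
    """Raise eps until the clusters collapse or the search becomes hopeless."""
    ever_found = False
    while True:
        binned = run_dbscan(table, eps)
        found = evaluate(binned)  # True if any cluster passes completeness/purity cutoffs
        ever_found = ever_found or found
        if binned['cluster'].nunique() <= 1:
            break  # everything merged into a single group
        if ever_found and not found:
            break  # complete/pure clusters were found earlier but vanished: lost cause
        if not ever_found and eps >= max_eps:
            break  # scanned up to eps 1.3 without any complete/pure cluster: give up
        eps += step
    return binned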

autometa.binning.recursive_dbscan.recursive_hdbscan(table: pandas.DataFrame, markers_df: pandas.DataFrame, completeness_cutoff: float, purity_cutoff: float, coverage_stddev_cutoff: float, gc_content_stddev_cutoff: float, n_jobs: int = -1, verbose: bool = False) Tuple[pandas.DataFrame, pandas.DataFrame]

Recursively run HDBSCAN starting with defaults and iterating the min_samples and min_cluster_size until only 1 cluster is recovered.

Parameters:
  • table (pd.DataFrame) – Contigs with embedded k-mer frequencies (‘x’,’y’), ‘coverage’ and ‘gc_content’ columns

  • markers_df (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]

  • completeness_cutoff (float) – completeness_cutoff threshold to retain cluster. e.g. cluster completeness >= completeness_cutoff

  • purity_cutoff (float) – purity_cutoff threshold to retain cluster. e.g. cluster purity >= purity_cutoff

  • coverage_stddev_cutoff (float) – coverage_stddev_cutoff threshold to retain cluster. e.g. cluster coverage std.dev. <= coverage_stddev_cutoff

  • gc_content_stddev_cutoff (float) – gc_content_stddev_cutoff threshold to retain cluster. e.g. cluster gc_content std.dev. <= gc_content_stddev_cutoff

  • verbose (bool) – log stats for each recursive_dbscan clustering iteration.

Returns:

(pd.DataFrame(<passed cutoffs>), pd.DataFrame(<failed cutoffs>)) DataFrames consisting of contigs that passed/failed clustering cutoffs, respectively.

DataFrame:

index = contig columns = [‘x_1’,’x_2’,’coverage’,’gc_content’,’cluster’,’purity’,’completeness’,’coverage_stddev’,’gc_content_stddev’]

Return type:

2-tuple

autometa.binning.recursive_dbscan.run_dbscan(df: pandas.DataFrame, eps: float, n_jobs: int = -1, dropcols: List[str] = ['cluster', 'purity', 'completeness', 'coverage_stddev', 'gc_content_stddev']) pandas.DataFrame

Run clustering on df at provided eps.

Parameters:
  • df (pd.DataFrame) – Contigs with embedded k-mer frequencies as [‘x_1’,’x_2’,…, ‘x_ndims’] columns and ‘coverage’ column

  • eps (float) – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function. See DBSCAN docs for more details.

  • dropcols (list, optional) – Drop columns in list from df (the default is [‘cluster’,’purity’,’completeness’,’coverage_stddev’,’gc_content_stddev’]).

Returns:

df with ‘cluster’ column added

Return type:

pd.DataFrame

Raises:

BinningError – Dataframe is missing kmer/coverage annotations
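
Example

A usage sketch (the filepath is hypothetical; df follows the layout described above):

>>> import pandas as pd
>>> from autometa.binning.recursive_dbscan import run_dbscan
>>> df = pd.read_csv('embedded_main.tsv', sep='\t', index_col='contig')
>>> clustered = run_dbscan(df, eps=0.3)
>>> clustered['cluster'].value_counts()  # cluster sizes at this eps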

autometa.binning.recursive_dbscan.run_hdbscan(df: pandas.DataFrame, min_cluster_size: int, min_samples: int, n_jobs: int = -1) pandas.DataFrame

Run clustering on df at provided min_cluster_size.

Parameters:
  • df (pd.DataFrame) – Contigs with embedded k-mer frequencies as [‘x’,’y’] columns and optionally ‘coverage’ column

  • min_cluster_size (int) – The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.

  • min_samples (int) – The number of samples in a neighborhood for a point to be considered a core point.

  • n_jobs (int) – Number of parallel jobs to run in core distance computations. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.

Returns:

df with ‘cluster’ column added

Return type:

pd.DataFrame

Raises:
  • ValueError – sets usecols and dropcols may not share elements

  • TableFormatError – df is missing k-mer or coverage annotations.
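
Example

A usage sketch mirroring run_dbscan above (df as described; the parameter values are illustrative only):

>>> from autometa.binning.recursive_dbscan import run_hdbscan
>>> clustered = run_hdbscan(df, min_cluster_size=2, min_samples=1)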

autometa.binning.recursive_dbscan.taxon_guided_binning(main: pandas.DataFrame, markers: pandas.DataFrame, completeness: float = 20.0, purity: float = 95.0, coverage_stddev: float = 25.0, gc_content_stddev: float = 5.0, starting_rank: str = 'superkingdom', method: str = 'dbscan', reverse_ranks: bool = False, n_jobs: int = -1, verbose: bool = False) pandas.DataFrame

Perform clustering of contigs by provided method and use metrics to filter clusters that should be retained via completeness and purity thresholds.

Parameters:
  • main (pd.DataFrame) – index=contig, cols=[‘x’,’y’, ‘coverage’, ‘gc_content’] taxa cols should be present if taxonomy is True. i.e. [taxid,superkingdom,phylum,class,order,family,genus,species]

  • markers (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]

  • completeness (float, optional) – completeness threshold to retain cluster (the default is 20.0). e.g. cluster completeness >= completeness

  • purity (float, optional) – purity threshold to retain cluster (the default is 95.0). e.g. cluster purity >= purity_cutoff

  • coverage_stddev (float, optional) – cluster coverage threshold to retain cluster (the default is 25.0).

  • gc_content_stddev (float, optional) – cluster GC content threshold to retain cluster (the default is 5.0).

  • starting_rank (str, optional) – Starting canonical rank at which to begin subsetting taxonomy (the default is superkingdom). Choices are superkingdom, phylum, class, order, family, genus, species.

  • method (str, optional) – Clustering method (the default is ‘dbscan’). choices = [‘dbscan’,’hdbscan’]

  • reverse_ranks (bool, optional) – False - [superkingdom,phylum,class,order,family,genus,species] (Default) True - [species,genus,family,order,class,phylum,superkingdom]

  • verbose (bool, optional) – log stats for each recursive_dbscan clustering iteration

Returns:

main with [‘cluster’,’completeness’,’purity’] columns added

Return type:

pd.DataFrame

Raises:

TableFormatError – No marker information is available for contigs to be binned.
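
Example

A minimal usage sketch (filepaths are hypothetical; main must contain embedded k-mer (‘x’,’y’), ‘coverage’, ‘gc_content’ and canonical-rank columns):

>>> import pandas as pd
>>> from autometa.binning.recursive_dbscan import taxon_guided_binning
>>> main = pd.read_csv('main.tsv', sep='\t', index_col='contig')
>>> markers = pd.read_csv('markers.tsv', sep='\t', index_col='contig')
>>> binned = taxon_guided_binning(main, markers, method='dbscan')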

autometa.binning.summary module

Script to summarize Autometa binning results

autometa.binning.summary.fragmentation_metric(df: pandas.DataFrame, quality_measure: float = 0.5) int

Describes the quality of assembled genomes that are fragmented in contigs of different length.

Note

For more information see this metagenomics wiki from Matthias Scholz

Parameters:
  • df (pd.DataFrame) – DataFrame to assess fragmentation within metagenome-assembled genome.

  • quality_measure (0 < float < 1) – Fraction of the genome the metric should cover (the default is 0.50, i.e. N50; use 0.1 for N10 or 0.9 for N90)

Returns:

Minimum contig length to cover quality_measure of genome (i.e. percentile contig length)

Return type:

int
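
For intuition, an Nx percentile can be computed directly from contig lengths; a minimal, self-contained sketch with toy lengths (independent of the function above):

>>> import pandas as pd
>>> lengths = pd.Series([100, 200, 300, 400, 1000]).sort_values(ascending=False)
>>> target = lengths.sum() * 0.5  # quality_measure = 0.5 -> N50
>>> lengths[lengths.cumsum() >= target].iloc[0]  # 1000: the longest contig alone covers half the assembly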

autometa.binning.summary.get_agg_stats(cluster_groups: pandas.core.groupby.generic.DataFrameGroupBy, stat_col: str) pandas.DataFrame

Compute min, max, (length weighted) mean and median from provided stat_col

Parameters:
  • cluster_groups (pd.core.groupby.generic.DataFrameGroupBy) – pandas DataFrame grouped by cluster

  • stat_col (str) – column on which to compute min, max, (length-weighted) mean and median

Returns:

index=cluster, columns=[min_{stat_col}, max_{stat_col}, std_{stat_col}, length_weighted_{stat_col}]

Return type:

pd.DataFrame
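
The length-weighted mean weights each contig’s value by its length; a toy sketch of the idea (not the implementation):

>>> import pandas as pd
>>> df = pd.DataFrame({'cluster': ['bin_1', 'bin_1'], 'length': [1000, 3000], 'gc_content': [40.0, 60.0]})
>>> df.groupby('cluster').apply(lambda g: (g['gc_content'] * g['length']).sum() / g['length'].sum())  # bin_1 -> 55.0: the longer contig dominates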

autometa.binning.summary.get_metabin_stats(bin_df: pandas.DataFrame, markers: Union[str, pandas.DataFrame], cluster_col: str = 'cluster') pandas.DataFrame

Retrieve statistics for all clusters recovered from Autometa binning.

Parameters:
  • bin_df (pd.DataFrame) – Autometa binning table. index=contig, cols=[‘cluster’,’length’, ‘gc_content’, ‘coverage’, …]

  • markers (str,pd.DataFrame) – Path to or pd.DataFrame of markers table corresponding to contigs in bin_df

  • cluster_col (str, optional) – Clustering column by which to group metabins

Returns:

dataframe consisting of various metagenome-assembled genome statistics indexed by cluster.

Return type:

pd.DataFrame

Raises:
  • TypeError – markers should be a path to or pd.DataFrame of a markers table corresponding to contigs in bin_df

  • ValueError – One of the required columns (cluster_col, coverage, length, gc_content) was not found in bin_df
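
Example

A usage sketch (filepaths are hypothetical; markers may be a filepath or DataFrame, per above):

>>> import pandas as pd
>>> from autometa.binning.summary import get_metabin_stats
>>> bin_df = pd.read_csv('binning.tsv', sep='\t', index_col='contig')
>>> stats = get_metabin_stats(bin_df, markers='markers.tsv')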

autometa.binning.summary.get_metabin_taxonomies(bin_df: pandas.DataFrame, taxa_db: TaxonomyDatabase, cluster_col: str = 'cluster') pandas.DataFrame

Retrieve taxonomies of all clusters recovered from Autometa binning.

Parameters:
  • bin_df (pd.DataFrame) – Autometa binning table. index=contig, cols=[‘cluster’,’length’,’taxid’, *canonical_ranks]

  • taxa_db (autometa.taxonomy.ncbi.TaxonomyDatabase instance) – Autometa NCBI or GTDB class instance

  • cluster_col (str, optional) – Clustering column by which to group metabins

Returns:

Dataframe consisting of cluster taxonomy with taxid and canonical rank. Indexed by cluster

Return type:

pd.DataFrame

autometa.binning.summary.main()
autometa.binning.summary.write_cluster_records(bin_df: pandas.DataFrame, metagenome: str, outdir: str, cluster_col: str = 'cluster') None

Write clusters to outdir given clusters df and metagenome records

Parameters:
  • bin_df (pd.DataFrame) – Autometa binning dataframe. index=’contig’, cols=[‘cluster’, …]

  • metagenome (str) – Path to metagenome fasta file

  • outdir (str) – Path to output directory to write fastas for each metagenome-assembled genome

  • cluster_col (str, optional) – Clustering column by which to group metabins
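
Example

A usage sketch (filepaths are hypothetical):

>>> import pandas as pd
>>> from autometa.binning.summary import write_cluster_records
>>> bin_df = pd.read_csv('binning.tsv', sep='\t', index_col='contig')
>>> write_cluster_records(bin_df, metagenome='metagenome.fna', outdir='metabins')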

autometa.binning.unclustered_recruitment module

autometa.binning.utilities module

binning utilities script for autometa-binning

Script containing utility functions when performing autometa clustering/classification tasks.

autometa.binning.utilities.add_metrics(df: pandas.DataFrame, markers_df: pandas.DataFrame) Tuple[pandas.DataFrame, pandas.DataFrame]

Adds cluster metrics to each respective contig in df.

  • :math:`completeness = \frac{markers_{cluster}}{markers_{ref}} \times 100`

  • :math:`purity = \frac{markers_{single\,copy}}{markers_{cluster}} \times 100`

  • :math:`\sigma_{coverage} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2}}`

  • :math:`\sigma_{GC\,content} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2}}`

Parameters:
  • df (pd.DataFrame) – index=’contig’ cols=[‘coverage’,’gc_content’,’cluster’,’x_1’,’x_2’,…,’x_n’]

  • markers_df (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]

Returns:

(df with added cluster metrics columns=[‘completeness’, ‘purity’, ‘coverage_stddev’, ‘gc_content_stddev’], pd.DataFrame(index=clusters, cols=[‘completeness’, ‘purity’, ‘coverage_stddev’, ‘gc_content_stddev’]))

Return type:

2-tuple
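
A toy calculation with invented numbers: if the reference marker set for a rank holds 100 markers and a cluster contains 90 of them, completeness = 90/100 × 100 = 90%; if 85 of those 90 markers occur in exactly one copy, purity = 85/90 × 100 ≈ 94.4%.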

autometa.binning.utilities.apply_binning_metrics_filter(df: pandas.DataFrame, completeness_cutoff: float = 20.0, purity_cutoff: float = 95.0, coverage_stddev_cutoff: float = 25.0, gc_content_stddev_cutoff: float = 5.0) pandas.DataFrame

Filter df by provided cutoff values.

Parameters:
  • df (pd.DataFrame) – Dataframe containing binning metrics ‘completeness’, ‘purity’, ‘coverage_stddev’ and ‘gc_content_stddev’

  • completeness_cutoff (float) – completeness_cutoff threshold to retain cluster (the default is 20.0). e.g. cluster completeness >= completeness_cutoff

  • purity_cutoff (float) – purity_cutoff threshold to retain cluster (the default is 95.00). e.g. cluster purity >= purity_cutoff

  • coverage_stddev_cutoff (float) – coverage_stddev_cutoff threshold to retain cluster (the default is 25.0). e.g. cluster coverage std.dev. <= coverage_stddev_cutoff

  • gc_content_stddev_cutoff (float) – gc_content_stddev_cutoff threshold to retain cluster (the default is 5.0). e.g. cluster gc_content std.dev. <= gc_content_stddev_cutoff

Returns:

Cutoff filtered df

Return type:

pd.DataFrame

Raises:

KeyError – One of metrics to apply cutoff does not exist in the df columns
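
The filter is equivalent to the following boolean mask at the default cutoffs (a sketch, not the implementation):

>>> passed = df[
...     (df['completeness'] >= 20.0)
...     & (df['purity'] >= 95.0)
...     & (df['coverage_stddev'] <= 25.0)
...     & (df['gc_content_stddev'] <= 5.0)
... ]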

autometa.binning.utilities.filter_taxonomy(df: pandas.DataFrame, rank: str, name: str) pandas.DataFrame

Clean taxon names (lowercase them and replace whitespace), then subset to all contigs whose rank value equals name.

Parameters:
  • df (pd.DataFrame) – Input dataframe containing columns of canonical ranks.

  • rank (str) – Canonical rank on which to apply filtering.

  • name (str) – Taxon in rank to retrieve.

Returns:

DataFrame subset by df[rank] == name

Return type:

pd.DataFrame

Raises:
  • KeyErrorrank not in taxonomy columns.

  • ValueError – Provided name not found in rank column.
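
Roughly equivalent pandas (a sketch; the exact cleaning rules may differ from the implementation):

>>> rank, name = 'phylum', 'proteobacteria'
>>> cleaned = df[rank].str.lower().str.replace(' ', '_')
>>> subset = df[cleaned == name]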

autometa.binning.utilities.read_annotations(annotations: Iterable, how: str = 'inner') pandas.DataFrame

Read in a list of contig annotations from filepaths and return all provided annotations in a single dataframe.

Parameters:
  • annotations (Iterable) – Filepaths of annotations. These should all contain a ‘contig’ column to be used as the index

  • how (str, optional) – How to join the provided annotations. By default will take the ‘inner’ or intersection of all contigs from annotations.

Returns:

index_col=’contig’, cols=[annotations, …]

Return type:

pd.DataFrame
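
Example

A usage sketch (filepaths are hypothetical; each file carries a ‘contig’ column):

>>> from autometa.binning.utilities import read_annotations
>>> annotations = read_annotations(['coverages.tsv', 'gc_content.tsv', 'taxonomy.tsv'])
>>> annotations.head()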

autometa.binning.utilities.reindex_bin_names(df: pandas.DataFrame, cluster_col: str = 'cluster', initial_index: int = 0) pandas.DataFrame

Re-index cluster_col using the provided initial_index as the initial index number then enumerating from this to the number of bins in cluster_col of df.

Parameters:
  • df (pd.DataFrame) – Dataframe containing cluster_col

  • cluster_col (str, optional) – Cluster column to apply reindexing, by default “cluster”

  • initial_index (int, optional) – Starting index number when reindexing, by default 0

Note

The bin names will start one number above the initial_index number provided. Therefore, the default behavior is to use 0 as the initial_index meaning the first bin name will be bin_1.

Example

>>> import pandas as pd
>>> from autometa.binning.utilities import reindex_bin_names
>>> df = pd.read_csv("binning.tsv", sep='\t', index_col='contig')
>>> reindex_bin_names(df, cluster_col='cluster', initial_index=0)
            cluster  completeness  purity  coverage_stddev  gc_content_stddev
contig
k141_1102126   bin_1     90.647482   100.0          1.20951           1.461658
k141_110415    bin_1     90.647482   100.0          1.20951           1.461658
k141_1210233   bin_1     90.647482   100.0          1.20951           1.461658
k141_1227553   bin_1     90.647482   100.0          1.20951           1.461658
k141_1227735   bin_1     90.647482   100.0          1.20951           1.461658
...              ...           ...     ...              ...                ...
k141_999969      NaN           NaN     NaN              NaN                NaN
k141_99997       NaN           NaN     NaN              NaN                NaN
k141_999982      NaN           NaN     NaN              NaN                NaN
k141_999984      NaN           NaN     NaN              NaN                NaN
k141_999987      NaN           NaN     NaN              NaN                NaN
Returns:

DataFrame of re-indexed bins in cluster_col starting at initial_index + 1

Return type:

pd.DataFrame

autometa.binning.utilities.write_results(results: pandas.DataFrame, binning_output: str, full_output: Optional[str] = None) None

Write out binning results with their respective binning metrics

Parameters:
  • results (pd.DataFrame) – Binning results contigs dataframe consisting of “cluster” assignments with their respective metrics and annotations

  • binning_output (str) – Filepath to write binning results

  • full_output (str, optional) – If provided, will write assignments, metrics and annotations together into full_output (filepath)

Return type:

NoneType

autometa.binning.utilities.zero_pad_bin_names(df: pandas.DataFrame, cluster_col: str = 'cluster') pandas.DataFrame

Apply zero padding to cluster_col using the length of digit corresponding to the number of unique clusters in cluster_col in the df.

Parameters:
  • df (pd.DataFrame) – Dataframe containing cluster_col

  • cluster_col (str, optional) – Cluster column to apply zero padding, by default “cluster”

Returns:

Dataframe with cluster_col zero padded to the length of the number of clusters

Return type:

pd.DataFrame
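
A sketch of the padding rule (assuming bin names of the form bin_<n>; not the implementation):

>>> import pandas as pd
>>> bins = pd.Series([f'bin_{i}' for i in range(1, 13)])
>>> width = len(str(bins.nunique()))  # 12 unique clusters -> width 2
>>> bins.str.replace(r'bin_(\d+)', lambda m: 'bin_' + m.group(1).zfill(width), regex=True).head(3)
0    bin_01
1    bin_02
2    bin_03
dtype: object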

Module contents