autometa.binning package

Submodules

autometa.binning.large_data_mode module

Autometa large-data-mode binning by selection of taxon sets using provided upper bound and determined lower bound

autometa.binning.large_data_mode.checkpoint(checkpoints_df: pandas.DataFrame, clustered: pandas.DataFrame, rank: str, rank_name_txt: str, completeness: float, purity: float, coverage_stddev: float, gc_content_stddev: float, cluster_method: str, norm_method: str, pca_dimensions: int, embed_dimensions: int, embed_method: str, min_contigs: int, max_partition_size: int, binning_checkpoints_fpath: str) pandas.DataFrame
autometa.binning.large_data_mode.cluster_by_taxon_partitioning(main: pandas.DataFrame, counts: pandas.DataFrame, markers: pandas.DataFrame, norm_method: str = 'am_clr', pca_dimensions: int = 50, embed_dimensions: int = 2, embed_method: str = 'umap', max_partition_size: int = 10000, completeness: float = 20.0, purity: float = 95.0, coverage_stddev: float = 25.0, gc_content_stddev: float = 5.0, starting_rank: str = 'superkingdom', method: str = 'dbscan', reverse_ranks: bool = False, cache: Optional[str] = None, binning_checkpoints_fpath: Optional[str] = None, n_jobs: int = -1, verbose: bool = False) pandas.DataFrame

Perform clustering of contigs by provided method and use metrics to filter clusters that should be retained via completeness and purity thresholds.

Parameters:
  • main (pd.DataFrame) – index=contig, cols=[‘coverage’, ‘gc_content’] taxa cols should be present if taxonomy is True. i.e. [taxid,superkingdom,phylum,class,order,family,genus,species]

  • counts (pd.DataFrame) – contig kmer counts -> index_col=’contig’, cols=[‘AAAAA’, ‘AAAAT’, …] NOTE: columns will correspond to the selected k-mer count size. e.g. 3-mers would be [‘AAA’,’AAT’, …]

  • markers (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]

  • completeness (float, optional) – completeness threshold to retain cluster (the default is 20.0). e.g. cluster completeness >= completeness

  • purity (float, optional) – purity threshold to retain cluster (the default is 95.0). e.g. cluster purity >= purity_cutoff

  • coverage_stddev (float, optional) – cluster coverage threshold to retain cluster (the default is 25.0).

  • gc_content_stddev (float, optional) – cluster GC content threshold to retain cluster (the default is 5.0).

  • starting_rank (str, optional) – Starting canonical rank at which to begin subsetting taxonomy (the default is superkingdom). Choices are superkingdom, phylum, class, order, family, genus, species.

  • method (str, optional) – Clustering method (the default is ‘dbscan’). choices = [‘dbscan’,’hdbscan’]

  • reverse_ranks (bool, optional) – False - [superkingdom,phylum,class,order,family,genus,species] (Default) True - [species,genus,family,order,class,phylum,superkingdom]

  • cache (str, optional) – Directory to cache intermediate results

  • binning_checkpoints_fpath (str, optional) – File path to binning checkpoints (checkpoints are only created if the cache argument is provided)

  • verbose (bool, optional) – log stats for each recursive_dbscan clustering iteration

Returns:

main with [‘cluster’,’completeness’,’purity’] columns added

Return type:

pd.DataFrame

Raises:
  • TableFormatError – No marker information is available for contigs to be binned.

  • FileNotFoundError – Provided binning_checkpoints_fpath does not exist
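
Example

A minimal usage sketch (the filepaths are hypothetical; each table is expected to follow the layout described above):

>>> import pandas as pd
>>> from autometa.binning.large_data_mode import cluster_by_taxon_partitioning
>>> main = pd.read_csv('main.tsv', sep='\t', index_col='contig')
>>> counts = pd.read_csv('kmers.tsv', sep='\t', index_col='contig')
>>> markers = pd.read_csv('markers.tsv', sep='\t', index_col='contig')
>>> binned = cluster_by_taxon_partitioning(main, counts, markers, cache='cache_dir')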

autometa.binning.large_data_mode.get_checkpoint_info(checkpoints_fpath: str) Dict[str, Union[pandas.DataFrame, str]]

Retrieve checkpoint information from generated binning_checkpoints.tsv

Parameters:

checkpoints_fpath (str) – Generated binning_checkpoints.tsv within cache directory

Returns:

Binning checkpoints, starting canonical rank, and starting taxon name within that rank. keys=“binning_checkpoints”, “starting_rank”, “starting_rank_name_txt”; values=pd.DataFrame, str, str

Return type:

Dict[str, Union[pd.DataFrame, str]]
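
Example

A usage sketch (the cache path is hypothetical):

>>> from autometa.binning.large_data_mode import get_checkpoint_info
>>> info = get_checkpoint_info('cache_dir/binning_checkpoints.tsv')
>>> info['binning_checkpoints']  # pd.DataFrame of checkpointed binning assignments
>>> info['starting_rank']  # e.g. 'superkingdom'
>>> info['starting_rank_name_txt']  # e.g. 'bacteria'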

autometa.binning.large_data_mode.get_kmer_embedding(counts: pandas.DataFrame, cache_fpath: str, norm_method: str, pca_dimensions: int, embed_dimensions: int, embed_method: str) pandas.DataFrame

Retrieve kmer embeddings for provided counts by first performing kmer normalization with norm_method then PCA down to pca_dimensions until the normalized kmer frequencies are embedded to embed_dimensions using embed_method.

Parameters:
  • counts (pd.DataFrame) – Kmer counts where index column is ‘contig’ and each column is a kmer count.

  • cache_fpath (str) – Path to cache embedded kmers table for later look-up/inspection.

  • norm_method (str) – normalization transformation to use on kmer counts. Choices include ‘am_clr’, ‘ilr’ and ‘clr’. See :func:kmers.normalize for more details.

  • pca_dimensions (int) – Number of dimensions by which to initially reduce normalized kmer frequencies (Must be greater than embed_dimensions).

  • embed_dimensions (int) – Embedding dimensions by which to reduce normalized PCA-transformed kmer frequencies (Must be less than pca_dimensions).

  • embed_method (str) – Embedding method to use on normalized, PCA-transformed kmer frequencies. Choices include ‘bhsne’, ‘sksne’ and ‘umap’. See :func:kmers.embed for more details.

Returns:

Embedded k-mer frequencies (index=contig) reduced to embed_dimensions columns

Return type:

pd.DataFrame
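
Example

A usage sketch (filepaths are hypothetical; counts follows the layout described above):

>>> import pandas as pd
>>> from autometa.binning.large_data_mode import get_kmer_embedding
>>> counts = pd.read_csv('kmers.tsv', sep='\t', index_col='contig')
>>> embedded = get_kmer_embedding(
...     counts,
...     cache_fpath='cache_dir/embedded_kmers.tsv',
...     norm_method='am_clr',
...     pca_dimensions=50,
...     embed_dimensions=2,
...     embed_method='umap',
... )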

autometa.binning.large_data_mode.main()

autometa.binning.large_data_mode_loginfo module

Autometa module: autometa-large-data-mode-binning-loginfo

Generates tabular metadata from logfile for large-data-mode task (e.g. slurm_job.stderr)

autometa.binning.large_data_mode_loginfo.add_clustering_runtime_summary_info(clustering_df: pandas.DataFrame, totals: pandas.DataFrame) pandas.DataFrame

Retrieve information about the clustering that took the longest.

Parameters:
  • clustering_df (pd.DataFrame) – Clustering runtime info retrieved from logfile

  • totals (pd.DataFrame) – runtime totals summary table

Returns:

runtime totals summary table updated with clustering runtime info

Return type:

pd.DataFrame

autometa.binning.large_data_mode_loginfo.add_embedding_runtime_summary_info(embedding_df: pandas.DataFrame, totals: pandas.DataFrame) pandas.DataFrame

Retrieve information about the embeddings that took the longest.

Parameters:
  • embedding_df (pd.DataFrame) – Embedding info retrieved from logfile

  • totals (pd.DataFrame) – runtime totals summary table

Returns:

runtime totals summary table updated with embedding info

Return type:

pd.DataFrame

autometa.binning.large_data_mode_loginfo.format_total_times(total_times: list, max_partition_size: str) pandas.DataFrame

Format total runtimes from timedelta objects to hours

Parameters:
  • total_times (list) – Runtime totals per algorithm during large-data-mode

  • max_partition_size (str) – Partition size parameter retrieved from log file

Returns:

Formatted dataframe of timedelta objects and hours per algorithm in large-data-mode run.

Return type:

pd.DataFrame

autometa.binning.large_data_mode_loginfo.get_loginfo(logfile: str) Dict[str, pandas.DataFrame]

Get autometa-large-data-mode-binning runtime information

Data

  1. “embedding”: Embeddings ranks and times

  2. “kmer_count_normalization”: K-mer count normalization times

  3. “clustering”: Clustering ranks and times

  4. “skipped_taxa”: Ranks above max_partition_size

  5. “totals”: Total times for all binning tasks

Parameters:

logfile (str) – Path to autometa-large-data-mode-binning stderr logfile

Returns:

Dictionary containing large-data-mode-binning information corresponding to task

Return type:

Dict[str, pd.DataFrame]
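
Example

A usage sketch (the logfile path is hypothetical):

>>> from autometa.binning.large_data_mode_loginfo import get_loginfo
>>> loginfo = get_loginfo('slurm_job.stderr')
>>> loginfo['totals']  # total runtimes across all binning tasks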

autometa.binning.large_data_mode_loginfo.main()

autometa.binning.recursive_dbscan module

Cluster contigs recursively searching for bins with highest completeness and purity.

autometa.binning.recursive_dbscan.get_clusters(main: pandas.DataFrame, markers_df: pandas.DataFrame, completeness: float, purity: float, coverage_stddev: float, gc_content_stddev: float, method: str, n_jobs: int = -1, verbose: bool = False) pandas.DataFrame

Find best clusters retained after applying metrics filters.

Parameters:
  • main (pd.DataFrame) – index=contig, cols=[‘x’,’y’,’coverage’,’gc_content’]

  • markers_df (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]

  • completeness (float) – completeness threshold to retain cluster. e.g. cluster completeness >= completeness

  • purity (float) – purity threshold to retain cluster. e.g. cluster purity >= purity

  • coverage_stddev (float) – cluster coverage std.dev. threshold to retain cluster. e.g. cluster coverage std.dev. <= coverage_stddev

  • gc_content_stddev (float) – cluster GC content std.dev. threshold to retain cluster. e.g. cluster GC content std.dev. <= gc_content_stddev

  • method (str) – Clustering method to use. choices = [‘dbscan’,’hdbscan’]

  • verbose (bool) – log stats for each recursive_dbscan clustering iteration

Returns:

main with [‘cluster’,’completeness’,’purity’,’coverage_stddev’,’gc_content_stddev’] columns added

Return type:

pd.DataFrame

autometa.binning.recursive_dbscan.main()
autometa.binning.recursive_dbscan.recursive_dbscan(table: pandas.DataFrame, markers_df: pandas.DataFrame, completeness_cutoff: float, purity_cutoff: float, coverage_stddev_cutoff: float, gc_content_stddev_cutoff: float, n_jobs: int = -1, verbose: bool = False) Tuple[pandas.DataFrame, pandas.DataFrame]

Carry out DBSCAN, starting at eps=0.3 and continuing until there is just one group.

Break conditions to speed up the pipeline: at eps=0.3 there are often zero complete and pure clusters because the groups are too small; as eps grows, the groups enlarge enough for some clusters to pass the cutoffs. Once the count of complete/pure clusters drops back to zero after having been non-zero, continuing is a lost cause and the search stops. On the other hand, some datasets never yield any complete/pure groups, so the search also gives up if none have been found by eps=1.3. (A condensed sketch of this scan appears after this entry.)

Parameters:
  • table (pd.DataFrame) – Contigs with embedded k-mer frequencies (‘x’,’y’), ‘coverage’ and ‘gc_content’ columns

  • markers_df (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]

  • completeness_cutoff (float) – completeness_cutoff threshold to retain cluster (the default is 20.0). e.g. cluster completeness >= completeness_cutoff

  • purity_cutoff (float) – purity_cutoff threshold to retain cluster (the default is 95.0). e.g. cluster purity >= purity_cutoff

  • coverage_stddev_cutoff (float) – coverage_stddev_cutoff threshold to retain cluster (the default is 25.0). e.g. cluster coverage std.dev. <= coverage_stddev_cutoff

  • gc_content_stddev_cutoff (float) – gc_content_stddev_cutoff threshold to retain cluster (the default is 5.0). e.g. cluster gc_content std.dev. <= gc_content_stddev_cutoff

  • verbose (bool) – log stats for each recursive_dbscan clustering iteration.

Returns:

(pd.DataFrame(<passed cutoffs>), pd.DataFrame(<failed cutoffs>)) DataFrames consisting of contigs that passed/failed clustering cutoffs, respectively.

DataFrame:

index = contig columns = [‘x’,’y’,’coverage’,’gc_content’,’cluster’,’purity’,’completeness’,’coverage_stddev’,’gc_content_stddev’]

Return type:

2-tuple
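
A condensed, self-contained sketch of the eps scan described above (evaluate is a hypothetical stand-in for the completeness/purity bookkeeping; this is not the actual implementation):

def eps_scan(table, run_dbscan, evaluate, eps=0.3, step=0.1, max_eps=1.3):
    """Raise eps until the clusters collapse or the search becomes hopeless."""
    ever_found = False
    while True:
        binned = run_dbscan(table, eps)
        found = evaluate(binned)  # True if any cluster passes completeness/purity cutoffs
        ever_found = ever_found or found
        if binned['cluster'].nunique() <= 1:
            break  # everything merged into a single group
        if ever_found and not found:
            break  # complete/pure clusters were found earlier but vanished: lost cause
        if not ever_found and eps >= max_eps:
            break  # scanned up to eps 1.3 without any complete/pure cluster: give up
        eps += step
    return binned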

autometa.binning.recursive_dbscan.recursive_hdbscan(table: pandas.DataFrame, markers_df: pandas.DataFrame, completeness_cutoff: float, purity_cutoff: float, coverage_stddev_cutoff: float, gc_content_stddev_cutoff: float, n_jobs: int = -1, verbose: bool = False) Tuple[pandas.DataFrame, pandas.DataFrame]

Recursively run HDBSCAN starting with defaults and iterating the min_samples and min_cluster_size until only 1 cluster is recovered.

Parameters:
  • table (pd.DataFrame) – Contigs with embedded k-mer frequencies (‘x’,’y’), ‘coverage’ and ‘gc_content’ columns

  • markers_df (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]

  • completeness_cutoff (float) – completeness_cutoff threshold to retain cluster. e.g. cluster completeness >= completeness_cutoff

  • purity_cutoff (float) – purity_cutoff threshold to retain cluster. e.g. cluster purity >= purity_cutoff

  • coverage_stddev_cutoff (float) – coverage_stddev_cutoff threshold to retain cluster. e.g. cluster coverage std.dev. <= coverage_stddev_cutoff

  • gc_content_stddev_cutoff (float) – gc_content_stddev_cutoff threshold to retain cluster. e.g. cluster gc_content std.dev. <= gc_content_stddev_cutoff

  • verbose (bool) – log stats for each recursive_dbscan clustering iteration.

Returns:

(pd.DataFrame(<passed cutoffs>), pd.DataFrame(<failed cutoffs>)) DataFrames consisting of contigs that passed/failed clustering cutoffs, respectively.

DataFrame:

index = contig columns = [‘x_1’,’x_2’,’coverage’,’gc_content’,’cluster’,’purity’,’completeness’,’coverage_stddev’,’gc_content_stddev’]

Return type:

2-tuple

autometa.binning.recursive_dbscan.run_dbscan(df: pandas.DataFrame, eps: float, n_jobs: int = -1, dropcols: List[str] = ['cluster', 'purity', 'completeness', 'coverage_stddev', 'gc_content_stddev']) pandas.DataFrame

Run clustering on df at provided eps.

Parameters:
  • df (pd.DataFrame) – Contigs with embedded k-mer frequencies as [‘x_1’,’x_2’,…, ‘x_ndims’] columns and ‘coverage’ column

  • eps (float) – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function. See DBSCAN docs for more details.

  • dropcols (list, optional) – Drop columns in list from df (the default is [‘cluster’,’purity’,’completeness’,’coverage_stddev’,’gc_content_stddev’]).

Returns:

df with ‘cluster’ column added

Return type:

pd.DataFrame

Raises:

BinningError – Dataframe is missing kmer/coverage annotations
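
Example

A usage sketch (the filepath is hypothetical; df follows the layout described above):

>>> import pandas as pd
>>> from autometa.binning.recursive_dbscan import run_dbscan
>>> df = pd.read_csv('embedded_main.tsv', sep='\t', index_col='contig')
>>> clustered = run_dbscan(df, eps=0.3)
>>> clustered['cluster'].value_counts()  # cluster sizes at this eps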

autometa.binning.recursive_dbscan.run_hdbscan(df: pandas.DataFrame, min_cluster_size: int, min_samples: int, n_jobs: int = -1) pandas.DataFrame

Run clustering on df at provided min_cluster_size.

Parameters:
  • df (pd.DataFrame) – Contigs with embedded k-mer frequencies as [‘x’,’y’] columns and optionally ‘coverage’ column

  • min_cluster_size (int) – The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.

  • min_samples (int) – The number of samples in a neighborhood for a point to be considered a core point.

  • n_jobs (int) – Number of parallel jobs to run in core distance computations. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.

Returns:

df with ‘cluster’ column added

Return type:

pd.DataFrame

Raises:
  • ValueError – sets usecols and dropcols may not share elements

  • TableFormatError – df is missing k-mer or coverage annotations.
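
Example

A usage sketch mirroring run_dbscan above (df as described; the parameter values are illustrative only):

>>> from autometa.binning.recursive_dbscan import run_hdbscan
>>> clustered = run_hdbscan(df, min_cluster_size=2, min_samples=1)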

autometa.binning.recursive_dbscan.taxon_guided_binning(main: pandas.DataFrame, markers: pandas.DataFrame, completeness: float = 20.0, purity: float = 95.0, coverage_stddev: float = 25.0, gc_content_stddev: float = 5.0, starting_rank: str = 'superkingdom', method: str = 'dbscan', reverse_ranks: bool = False, n_jobs: int = -1, verbose: bool = False) pandas.DataFrame

Perform clustering of contigs by provided method and use metrics to filter clusters that should be retained via completeness and purity thresholds.

Parameters:
  • main (pd.DataFrame) – index=contig, cols=[‘x’,’y’, ‘coverage’, ‘gc_content’] taxa cols should be present if taxonomy is True. i.e. [taxid,superkingdom,phylum,class,order,family,genus,species]

  • markers (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]

  • completeness (float, optional) – completeness threshold to retain cluster (the default is 20.0). e.g. cluster completeness >= completeness

  • purity (float, optional) – purity threshold to retain cluster (the default is 95.0). e.g. cluster purity >= purity_cutoff

  • coverage_stddev (float, optional) – cluster coverage threshold to retain cluster (the default is 25.0).

  • gc_content_stddev (float, optional) – cluster GC content threshold to retain cluster (the default is 5.0).

  • starting_rank (str, optional) – Starting canonical rank at which to begin subsetting taxonomy (the default is superkingdom). Choices are superkingdom, phylum, class, order, family, genus, species.

  • method (str, optional) – Clustering method (the default is ‘dbscan’). choices = [‘dbscan’,’hdbscan’]

  • reverse_ranks (bool, optional) – False - [superkingdom,phylum,class,order,family,genus,species] (Default) True - [species,genus,family,order,class,phylum,superkingdom]

  • verbose (bool, optional) – log stats for each recursive_dbscan clustering iteration

Returns:

main with [‘cluster’,’completeness’,’purity’] columns added

Return type:

pd.DataFrame

Raises:

TableFormatError – No marker information is available for contigs to be binned.
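
Example

A minimal usage sketch (filepaths are hypothetical; main must contain embedded k-mer (‘x’,’y’), ‘coverage’, ‘gc_content’ and canonical-rank columns):

>>> import pandas as pd
>>> from autometa.binning.recursive_dbscan import taxon_guided_binning
>>> main = pd.read_csv('main.tsv', sep='\t', index_col='contig')
>>> markers = pd.read_csv('markers.tsv', sep='\t', index_col='contig')
>>> binned = taxon_guided_binning(main, markers, method='dbscan')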

autometa.binning.summary module

Script to summarize Autometa binning results

autometa.binning.summary.fragmentation_metric(df: pandas.DataFrame, quality_measure: float = 0.5) int

Describes the quality of assembled genomes that are fragmented in contigs of different length.

Note

For more information see this metagenomics wiki from Matthias Scholz

Parameters:
  • df (pd.DataFrame) – DataFrame to assess fragmentation within metagenome-assembled genome.

  • quality_measure (0 < float < 1) – Fraction of the genome the metric should cover (the default is 0.50, i.e. N50; use 0.1 for N10 or 0.9 for N90)

Returns:

Minimum contig length to cover quality_measure of genome (i.e. percentile contig length)

Return type:

int
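
For intuition, an Nx percentile can be computed directly from contig lengths; a minimal, self-contained sketch with toy lengths (independent of the function above):

>>> import pandas as pd
>>> lengths = pd.Series([100, 200, 300, 400, 1000]).sort_values(ascending=False)
>>> target = lengths.sum() * 0.5  # quality_measure = 0.5 -> N50
>>> lengths[lengths.cumsum() >= target].iloc[0]  # 1000: the longest contig alone covers half the assembly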

autometa.binning.summary.get_agg_stats(cluster_groups: pandas.core.groupby.generic.DataFrameGroupBy, stat_col: str) pandas.DataFrame

Compute min, max, (length weighted) mean and median from provided stat_col

Parameters:
  • cluster_groups (pd.core.groupby.generic.DataFrameGroupBy) – pandas DataFrame grouped by cluster

  • stat_col (str) – column on which to compute min, max, (length-weighted) mean and median

Returns:

index=cluster, columns=[min_{stat_col}, max_{stat_col}, std_{stat_col}, length_weighted_{stat_col}]

Return type:

pd.DataFrame
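
The length-weighted mean weights each contig’s value by its length; a toy sketch of the idea (not the implementation):

>>> import pandas as pd
>>> df = pd.DataFrame({'cluster': ['bin_1', 'bin_1'], 'length': [1000, 3000], 'gc_content': [40.0, 60.0]})
>>> df.groupby('cluster').apply(lambda g: (g['gc_content'] * g['length']).sum() / g['length'].sum())  # bin_1 -> 55.0: the longer contig dominates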

autometa.binning.summary.get_metabin_stats(bin_df: pandas.DataFrame, markers: Union[str, pandas.DataFrame], cluster_col: str = 'cluster') pandas.DataFrame

Retrieve statistics for all clusters recovered from Autometa binning.

Parameters:
  • bin_df (pd.DataFrame) – Autometa binning table. index=contig, cols=[‘cluster’,’length’, ‘gc_content’, ‘coverage’, …]

  • markers (str,pd.DataFrame) – Path to or pd.DataFrame of markers table corresponding to contigs in bin_df

  • cluster_col (str, optional) – Clustering column by which to group metabins

Returns:

dataframe consisting of various metagenome-assembled genome statistics indexed by cluster.

Return type:

pd.DataFrame

Raises:
  • TypeError – markers should be a path to or pd.DataFrame of a markers table corresponding to contigs in bin_df

  • ValueError – One of the required columns (cluster_col, coverage, length, gc_content) was not found in bin_df
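
Example

A usage sketch (filepaths are hypothetical; markers may be a filepath or DataFrame, per above):

>>> import pandas as pd
>>> from autometa.binning.summary import get_metabin_stats
>>> bin_df = pd.read_csv('binning.tsv', sep='\t', index_col='contig')
>>> stats = get_metabin_stats(bin_df, markers='markers.tsv')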

autometa.binning.summary.get_metabin_taxonomies(bin_df: pandas.DataFrame, taxa_db: TaxonomyDatabase, cluster_col: str = 'cluster') pandas.DataFrame

Retrieve taxonomies of all clusters recovered from Autometa binning.

Parameters:
  • bin_df (pd.DataFrame) – Autometa binning table. index=contig, cols=[‘cluster’,’length’,’taxid’, *canonical_ranks]

  • taxa_db (autometa.taxonomy.ncbi.TaxonomyDatabase instance) – Autometa NCBI or GTDB class instance

  • cluster_col (str, optional) – Clustering column by which to group metabins

Returns:

Dataframe consisting of cluster taxonomy with taxid and canonical rank. Indexed by cluster

Return type:

pd.DataFrame

autometa.binning.summary.main()
autometa.binning.summary.write_cluster_records(bin_df: pandas.DataFrame, metagenome: str, outdir: str, cluster_col: str = 'cluster') None

Write clusters to outdir given clusters df and metagenome records

Parameters:
  • bin_df (pd.DataFrame) – Autometa binning dataframe. index=’contig’, cols=[‘cluster’, …]

  • metagenome (str) – Path to metagenome fasta file

  • outdir (str) – Path to output directory to write fastas for each metagenome-assembled genome

  • cluster_col (str, optional) – Clustering column by which to group metabins
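
Example

A usage sketch (filepaths are hypothetical):

>>> import pandas as pd
>>> from autometa.binning.summary import write_cluster_records
>>> bin_df = pd.read_csv('binning.tsv', sep='\t', index_col='contig')
>>> write_cluster_records(bin_df, metagenome='metagenome.fna', outdir='metabins')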

autometa.binning.unclustered_recruitment module

autometa.binning.utilities module

binning utilities script for autometa-binning

Script containing utility functions when performing autometa clustering/classification tasks.

autometa.binning.utilities.add_metrics(df: pandas.DataFrame, markers_df: pandas.DataFrame) Tuple[pandas.DataFrame, pandas.DataFrame]

Adds cluster metrics to each respective contig in df.

  • :math:`completeness = \frac{markers_{cluster}}{markers_{ref}} \times 100`

  • :math:`purity = \frac{markers_{single\,copy}}{markers_{cluster}} \times 100`

  • :math:`\sigma_{coverage} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2}}`

  • :math:`\sigma_{GC\,content} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2}}`

Parameters:
  • df (pd.DataFrame) – index=’contig’ cols=[‘coverage’,’gc_content’,’cluster’,’x_1’,’x_2’,…,’x_n’]

  • markers_df (pd.DataFrame) – wide format, i.e. index=contig cols=[marker,marker,…]

Returns:

(df with added cluster metrics columns=[‘completeness’, ‘purity’, ‘coverage_stddev’, ‘gc_content_stddev’], pd.DataFrame(index=clusters, cols=[‘completeness’, ‘purity’, ‘coverage_stddev’, ‘gc_content_stddev’]))

Return type:

2-tuple
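
A toy calculation with invented numbers: if the reference marker set for a rank holds 100 markers and a cluster contains 90 of them, completeness = 90/100 × 100 = 90%; if 85 of those 90 markers occur in exactly one copy, purity = 85/90 × 100 ≈ 94.4%.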

autometa.binning.utilities.apply_binning_metrics_filter(df: pandas.DataFrame, completeness_cutoff: float = 20.0, purity_cutoff: float = 95.0, coverage_stddev_cutoff: float = 25.0, gc_content_stddev_cutoff: float = 5.0) pandas.DataFrame

Filter df by provided cutoff values.

Parameters:
  • df (pd.DataFrame) – Dataframe containing binning metrics ‘completeness’, ‘purity’, ‘coverage_stddev’ and ‘gc_content_stddev’

  • completeness_cutoff (float) – completeness_cutoff threshold to retain cluster (the default is 20.0). e.g. cluster completeness >= completeness_cutoff

  • purity_cutoff (float) – purity_cutoff threshold to retain cluster (the default is 95.00). e.g. cluster purity >= purity_cutoff

  • coverage_stddev_cutoff (float) – coverage_stddev_cutoff threshold to retain cluster (the default is 25.0). e.g. cluster coverage std.dev. <= coverage_stddev_cutoff

  • gc_content_stddev_cutoff (float) – gc_content_stddev_cutoff threshold to retain cluster (the default is 5.0). e.g. cluster gc_content std.dev. <= gc_content_stddev_cutoff

Returns:

Cutoff filtered df

Return type:

pd.DataFrame

Raises:

KeyError – One of metrics to apply cutoff does not exist in the df columns
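
The filter is equivalent to the following boolean mask at the default cutoffs (a sketch, not the implementation):

>>> passed = df[
...     (df['completeness'] >= 20.0)
...     & (df['purity'] >= 95.0)
...     & (df['coverage_stddev'] <= 25.0)
...     & (df['gc_content_stddev'] <= 5.0)
... ]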

autometa.binning.utilities.filter_taxonomy(df: pandas.DataFrame, rank: str, name: str) pandas.DataFrame

Clean taxon names (lowercase them and replace whitespace), then subset to all contigs whose rank value equals name.

Parameters:
  • df (pd.DataFrame) – Input dataframe containing columns of canonical ranks.

  • rank (str) – Canonical rank on which to apply filtering.

  • name (str) – Taxon in rank to retrieve.

Returns:

DataFrame subset by df[rank] == name

Return type:

pd.DataFrame

Raises:
  • KeyErrorrank not in taxonomy columns.

  • ValueError – Provided name not found in rank column.
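
Roughly equivalent pandas (a sketch; the exact cleaning rules may differ from the implementation):

>>> rank, name = 'phylum', 'proteobacteria'
>>> cleaned = df[rank].str.lower().str.replace(' ', '_')
>>> subset = df[cleaned == name]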

autometa.binning.utilities.read_annotations(annotations: Iterable, how: str = 'inner') pandas.DataFrame

Read in a list of contig annotations from filepaths and return all provided annotations in a single dataframe.

Parameters:
  • annotations (Iterable) – Filepaths of annotations. These should all contain a ‘contig’ column to be used as the index

  • how (str, optional) – How to join the provided annotations. By default will take the ‘inner’ or intersection of all contigs from annotations.

Returns:

index_col=’contig’, cols=[annotations, …]

Return type:

pd.DataFrame
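
Example

A usage sketch (filepaths are hypothetical; each file carries a ‘contig’ column):

>>> from autometa.binning.utilities import read_annotations
>>> annotations = read_annotations(['coverages.tsv', 'gc_content.tsv', 'taxonomy.tsv'])
>>> annotations.head()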

autometa.binning.utilities.reindex_bin_names(df: pandas.DataFrame, cluster_col: str = 'cluster', initial_index: int = 0) pandas.DataFrame

Re-index cluster_col using the provided initial_index as the initial index number then enumerating from this to the number of bins in cluster_col of df.

Parameters:
  • df (pd.DataFrame) – Dataframe containing cluster_col

  • cluster_col (str, optional) – Cluster column to apply reindexing, by default “cluster”

  • initial_index (int, optional) – Starting index number when reindexing, by default 0

Note

The bin names will start one number above the initial_index number provided. Therefore, the default behavior is to use 0 as the initial_index meaning the first bin name will be bin_1.

Example

>>> import pandas as pd
>>> from autometa.binning.utilities import reindex_bin_names
>>> df = pd.read_csv("binning.tsv", sep='\t', index_col='contig')
>>> reindex_bin_names(df, cluster_col='cluster', initial_index=0)
            cluster  completeness  purity  coverage_stddev  gc_content_stddev
contig
k141_1102126   bin_1     90.647482   100.0          1.20951           1.461658
k141_110415    bin_1     90.647482   100.0          1.20951           1.461658
k141_1210233   bin_1     90.647482   100.0          1.20951           1.461658
k141_1227553   bin_1     90.647482   100.0          1.20951           1.461658
k141_1227735   bin_1     90.647482   100.0          1.20951           1.461658
...              ...           ...     ...              ...                ...
k141_999969      NaN           NaN     NaN              NaN                NaN
k141_99997       NaN           NaN     NaN              NaN                NaN
k141_999982      NaN           NaN     NaN              NaN                NaN
k141_999984      NaN           NaN     NaN              NaN                NaN
k141_999987      NaN           NaN     NaN              NaN                NaN
Returns:

DataFrame of re-indexed bins in cluster_col starting at initial_index + 1

Return type:

pd.DataFrame

autometa.binning.utilities.write_results(results: pandas.DataFrame, binning_output: str, full_output: Optional[str] = None) None

Write out binning results with their respective binning metrics

Parameters:
  • results (pd.DataFrame) – Binning results contigs dataframe consisting of “cluster” assignments with their respective metrics and annotations

  • binning_output (str) – Filepath to write binning results

  • full_output (str, optional) – If provided, will write assignments, metrics and annotations together into full_output (filepath)

Return type:

NoneType

autometa.binning.utilities.zero_pad_bin_names(df: pandas.DataFrame, cluster_col: str = 'cluster') pandas.DataFrame

Apply zero padding to cluster_col using the length of digit corresponding to the number of unique clusters in cluster_col in the df.

Parameters:
  • df (pd.DataFrame) – Dataframe containing cluster_col

  • cluster_col (str, optional) – Cluster column to apply zero padding, by default “cluster”

Returns:

Dataframe with cluster_col zero padded to the length of the number of clusters

Return type:

pd.DataFrame
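
A sketch of the padding rule (assuming bin names of the form bin_<n>; not the implementation):

>>> import pandas as pd
>>> bins = pd.Series([f'bin_{i}' for i in range(1, 13)])
>>> width = len(str(bins.nunique()))  # 12 unique clusters -> width 2
>>> bins.str.replace(r'bin_(\d+)', lambda m: 'bin_' + m.group(1).zfill(width), regex=True).head(3)
0    bin_01
1    bin_02
2    bin_03
dtype: object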

Module contents