autometa.validation package

Submodules

autometa.validation.assess_metagenome_deconvolution module

autometa.validation.benchmark module

# License: GNU Affero General Public License v3 or later
# A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.

Autometa taxon-profiling, clustering and binning-classification evaluation benchmarking.

Script to benchmark Autometa taxon-profiling, clustering and binning-classification results using clustering and classification evaluation metrics.

class autometa.validation.benchmark.Labels(true, pred)

Bases: tuple

pred

Alias for field number 1

true

Alias for field number 0

class autometa.validation.benchmark.Targets(true, pred, target_names)

Bases: tuple

pred

Alias for field number 1

target_names

Alias for field number 2

true

Alias for field number 0

autometa.validation.benchmark.compute_classification_metrics(labels: namedtuple) dict

Retrieve a classification report using scikit-learn’s metrics.classification_report method.

Parameters:

labels (namedtuple) – Labels namedtuple containing 'true' and 'pred' fields

Returns:

Computed classification metrics corresponding to labels

Return type:

dict
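The core of this computation can be sketched directly with scikit-learn. compute_classification_metrics_sketch below is a hypothetical stand-in, and the zero_division argument is an assumption, not necessarily what Autometa passes:

```python
from collections import namedtuple

from sklearn.metrics import classification_report

Labels = namedtuple("Labels", ["true", "pred"])

def compute_classification_metrics_sketch(labels) -> dict:
    # output_dict=True returns per-class precision/recall/f1-score/support
    # plus "macro avg" and "weighted avg" summaries as a nested dict.
    return classification_report(
        labels.true, labels.pred, output_dict=True, zero_division=0
    )

labels = Labels(true=["562", "562", "1280"], pred=["562", "1280", "1280"])
report = compute_classification_metrics_sketch(labels)
```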

autometa.validation.benchmark.compute_clustering_metrics(labels: NamedTuple, average_method: str) Dict[str, float]

Calculate various clustering performance metrics listed below.

Note

Some of these clustering performance evaluation metrics adjust for chance. This is discussed in more detail in the scikit-learn user guide under “Adjustment for chance in clustering performance evaluation”.

That analysis suggests that the adjusted rand index and adjusted mutual info score are the most robust and reliable metrics as the number of clusters increases.

Parameters:
  • labels (Labels) – Labels NamedTuple with Labels.pred (predictions) and Labels.true (reference) fields

  • average_method (str) – Normalizer to select for normalized mutual information score clustering metric. choices: min, geometric, arithmetic, max

Returns:

Computed clustering evaluation metrics keyed by metric name

Return type:

Dict[str, float]

Raises:
  • ValueError – The input arguments are not the correct type (pd.DataFrame or str)

  • ValueError – The provided reference community and predictions do not match!
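The chance-adjusted metrics can be sketched with scikit-learn as follows; the exact metric set and key names Autometa reports may differ from this hypothetical stand-in:

```python
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
)

def compute_clustering_metrics_sketch(true_labels, pred_labels, average_method="arithmetic"):
    # The adjusted scores correct for chance agreement, which matters
    # as the number of clusters grows (see the note above).
    return {
        "adjusted rand index": adjusted_rand_score(true_labels, pred_labels),
        "adjusted mutual info score": adjusted_mutual_info_score(true_labels, pred_labels),
        "normalized mutual info score": normalized_mutual_info_score(
            true_labels, pred_labels, average_method=average_method
        ),
    }

# A perfect binning (up to cluster relabeling) scores 1.0 on all three.
metrics = compute_clustering_metrics_sketch(
    ["genomeA", "genomeA", "genomeB", "genomeB"],
    ["bin1", "bin1", "bin2", "bin2"],
)
```

The average_method argument is forwarded only to the normalized mutual information score, mirroring the parameter documented above.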

autometa.validation.benchmark.evaluate_binning_classification(predictions: Iterable, reference: str) pandas.DataFrame
autometa.validation.benchmark.evaluate_classification(predictions: Iterable, reference: str, ncbi: Union[str, NCBI], keep_averages=['weighted avg', 'samples avg']) Tuple[pandas.DataFrame, List[Dict[str, str]]]

Evaluate classification predictions against provided reference

Parameters:
  • predictions (Iterable) – Paths to taxonomic predictions (tab-delimited files of contig and taxid columns)

  • reference (str) – Path to ground truths (tab-delimited file containing at least contig and taxid columns)

  • ncbi (Union[str, NCBI]) – Path to NCBI databases directory or instance of autometa NCBI class

  • keep_averages (list, optional) – averages to keep from the classification report, by default ["weighted avg", "samples avg"]

Returns:

Metrics

Return type:

Tuple[pd.DataFrame, List[dict]]

autometa.validation.benchmark.evaluate_clustering(predictions: Iterable, reference: str, average_method: str) pandas.DataFrame

Evaluate clustering performance of predictions against reference

Parameters:
  • predictions (Iterable) – Paths to binning predictions. Paths should be tab-delimited files with 'cluster' and 'contig' columns.

  • reference (str) – Path to ground truth reference genome assignments. Should be a tab-delimited file with 'contig' and 'reference_genome' columns.

  • average_method (str) – Normalizer to select for normalized mutual information score clustering metric. choices: min, geometric, arithmetic, max

Returns:

Dataframe of clustering metrics indexed by 'dataset', computed as each prediction's basename

Return type:

pd.DataFrame

autometa.validation.benchmark.get_categorical_labels(predictions: str, reference: Union[str, pandas.DataFrame]) NamedTuple

Retrieve categorical labels from predictions and reference

Parameters:
  • predictions (str) – Path to tab-delimited file containing contig clustering predictions with columns 'contig' and 'cluster'.

  • reference (Union[str, pd.DataFrame]) – Path to tab-delimited file containing ground truth reference genome assignments with columns 'contig' and 'reference_genome'.

Returns:

Labels namedtuple with 'true' and 'pred' fields containing respective dataframes of categorical values

Return type:

NamedTuple

Raises:
  • ValueError – Provided reference is not a pd.DataFrame or path to ground-truth reference genome assignments file.

  • ValueError – The provided reference community contigs do not match the predictions contigs
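The alignment step can be sketched as a pandas inner merge on the shared contig column. get_categorical_labels_sketch is a hypothetical stand-in, not the actual implementation:

```python
from collections import namedtuple

import pandas as pd

Labels = namedtuple("Labels", ["true", "pred"])

def get_categorical_labels_sketch(predictions: pd.DataFrame, reference: pd.DataFrame):
    # Align predictions and reference on their shared contigs, then expose
    # the aligned columns as categorical label vectors.
    merged = pd.merge(reference, predictions, on="contig", how="inner")
    if merged.empty:
        raise ValueError(
            "The provided reference community contigs do not match the predictions contigs"
        )
    return Labels(true=merged["reference_genome"], pred=merged["cluster"])

predictions = pd.DataFrame({"contig": ["c1", "c2"], "cluster": ["bin1", "bin2"]})
reference = pd.DataFrame({"contig": ["c1", "c2"], "reference_genome": ["gA", "gB"]})
labels = get_categorical_labels_sketch(predictions, reference)
```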

autometa.validation.benchmark.get_target_labels(prediction: str, reference: Union[str, pandas.DataFrame], ncbi: Union[str, NCBI]) namedtuple

Retrieve taxid lineage as target labels from merge of reference and prediction.

Note

The exact label value matters for these metrics, as we are looking at the available target labels for classification (not clustering).

Parameters:
  • prediction (str) – Path to contig taxid predictions

  • reference (Union[str, pd.DataFrame]) – Path to ground truth contig taxids

  • ncbi (Union[str, NCBI]) – Path to NCBI databases directory or instance of autometa NCBI class.

Returns:

Targets namedtuple with fields 'true', 'pred' and 'target_names'

Return type:

namedtuple

Raises:
  • ValueError – Provided reference is not a pd.DataFrame or path to reference assignments file.

  • ValueError – The provided reference community and predictions do not match

autometa.validation.benchmark.main()
autometa.validation.benchmark.write_reports(reports: Iterable[Dict], outdir: str, ncbi: NCBI) None

Write each taxid multi-label classification report in reports to outdir

Parameters:
  • reports (Iterable[Dict]) – List of classification report dicts from each classification benchmarking evaluation

  • outdir (str) – Directory path to write reports

  • ncbi (NCBI) – autometa.taxonomy.ncbi.NCBI instance for taxid name and rank look-up.

Return type:

NoneType

autometa.validation.build_protein_marker_aln module

autometa.validation.calculate_f1_scores module

autometa.validation.cami module

Autometa module to format Autometa results to bioboxes data format (compatible with CAMI tools)

Reformats Autometa binning results such that they may be submitted to CAMI for benchmarking assessment

autometa.validation.cami.format_genome_binning(df: pandas.DataFrame, sample_id: str, version: str = '0.9.0') str

Format autometa genome binning results to be compatible with specified BioBox version.

Parameters:
  • df (pd.DataFrame) – (To-be-formatted) genome binning results, i.e. pd.DataFrame(index_col=range(0, n_rows), columns=[contig, taxid, …])

  • sample_id (str) – Sample identifier, not the generating user or program name. It MUST match the regular expression [A-Za-z0-9._]+.

  • version (str, optional) – Biobox version to format results, by default "0.9.0"

Returns:

Formatted results ready to be written to a file path

Return type:

str

Raises:
  • NotImplementedError – Specified version is not implemented

  • TypeError – genome_binning must be a path to a genome binning results file or a pandas DataFrame

  • ValueError – genome_binning does not contain the required columns
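A bioboxes binning file consists of @-prefixed header lines followed by a @@ column header and tab-separated rows. The sketch below is an assumption based on the bioboxes binning data format, not Autometa’s exact output, and format_genome_binning_sketch is a hypothetical stand-in:

```python
import pandas as pd

def format_genome_binning_sketch(df: pd.DataFrame, sample_id: str, version: str = "0.9.0") -> str:
    # Hedged sketch of the bioboxes binning layout; consult the bioboxes
    # specification for the authoritative header fields.
    if version != "0.9.0":
        raise NotImplementedError(f"version {version} is not implemented")
    header = f"@Version:{version}\n@SampleID:{sample_id}\n@@SEQUENCEID\tBINID\n"
    body = df[["contig", "cluster"]].to_csv(sep="\t", index=False, header=False)
    return header + body

df = pd.DataFrame({"contig": ["c1", "c2"], "cluster": ["bin1", "bin1"]})
text = format_genome_binning_sketch(df, sample_id="sample_0")
```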

autometa.validation.cami.format_profiling(df: pandas.DataFrame, sample_id: str, version: str) str
autometa.validation.cami.format_taxon_binning(df: pandas.DataFrame, sample_id: str, version: str = '0.9.0') str

Format autometa taxon binning results to be compatible with specified BioBox version.

Parameters:
  • df (pd.DataFrame) – (To-be-formatted) taxon binning results, i.e. pd.DataFrame(index_col=range(0, n_rows), columns=[contig, taxid, …])

  • sample_id (str) – Sample identifier, not the generating user or program name. It MUST match the regular expression [A-Za-z0-9._]+.

  • version (str, optional) – Biobox version to format results, by default "0.9.0"

Returns:

Formatted results ready to be written to a file path

Return type:

str

Raises:
  • NotImplementedError – Specified version is not implemented

  • TypeError – taxon_binning must be a path to a taxon binning results file or a pandas DataFrame

  • ValueError – taxon_binning does not contain the required columns

autometa.validation.cami.get_biobox_format(predictions: Union[str, pandas.DataFrame], sample_id: str, results_type: Literal['profiling', 'genome_binning', 'taxon_binning'], version: str) str
autometa.validation.cami.main()

autometa.validation.cluster_process module

Processes metagenome assembled genomes from autometa binning results

autometa.validation.cluster_process.assess_assembly(seq_record_list)
autometa.validation.cluster_process.main(args)
autometa.validation.cluster_process.run_command(command_string, stdout_path=None)

autometa.validation.cluster_process_docker module

autometa.validation.cluster_taxonomy module

autometa.validation.compile_reference_training_table module

autometa.validation.confidence_vs_accuracy module

autometa.validation.datasets module

# License: GNU Affero General Public License v3 or later
# A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.

Pulls data from a Google Drive dataset containing simulated or synthetic communities

autometa.validation.datasets.download(community_type: str, community_sizes: list, file_names: list, dir_path: str) None

Downloads the files specified in a dictionary.

Parameters:
  • community_type (str) – Type of community dataset to download (simulated or synthetic)

  • community_sizes (list) – Size(s) of the dataset(s) to download

  • file_names (list) – Name(s) of the file(s) to download

  • dir_path (str) – Output directory path to which the file(s) are downloaded

Returns:

The download is performed via gdown

Return type:

None

autometa.validation.datasets.main()

autometa.validation.download_random_bacterial_genomes module

autometa.validation.length_vs_accuracy module

autometa.validation.make_simulated_metagenome module

autometa.validation.make_simulated_metagenome_control_fasta module

autometa.validation.reference_genome_from_quast module

autometa.validation.show_clusters module

autometa.validation.summarize_f1_stats module

autometa.validation.tabulate_bins module

autometa.validation.tabulate_metaquast_alignments module

autometa.validation.vizualize_assembly_graph_by_bin module

Module contents