autometa.validation package
Submodules
autometa.validation.assess_metagenome_deconvolution module
autometa.validation.benchmark module
# License: GNU Affero General Public License v3 or later
# A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.
Autometa taxon-profiling, clustering and binning-classification evaluation benchmarking.
Script to benchmark Autometa taxon-profiling, clustering and binning-classification results using clustering and classification evaluation metrics.
- class autometa.validation.benchmark.Labels(true, pred)
Bases:
tuple
- pred
Alias for field number 1
- true
Alias for field number 0
- class autometa.validation.benchmark.Targets(true, pred, target_names)
Bases:
tuple
- pred
Alias for field number 1
- target_names
Alias for field number 2
- true
Alias for field number 0
- autometa.validation.benchmark.compute_classification_metrics(labels: namedtuple) → dict
Retrieve a classification report using scikit-learn's metrics.classification_report method.
- Parameters:
labels (namedtuple) – Labels namedtuple containing "true" and "pred" fields
- Returns:
Computed classification metrics corresponding to labels
- Return type:
dict
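Under the hood this is likely a thin wrapper over scikit-learn; a minimal sketch under that assumption (the sample labels are invented for illustration, and the real function may pass different keyword arguments):

```python
from collections import namedtuple

from sklearn.metrics import classification_report

Labels = namedtuple("Labels", ["true", "pred"])


def compute_classification_metrics(labels: Labels) -> dict:
    # output_dict=True returns nested dicts of precision/recall/f1-score
    # per class, plus "accuracy", "macro avg" and "weighted avg" entries.
    return classification_report(
        labels.true, labels.pred, output_dict=True, zero_division=0
    )


labels = Labels(true=["a", "a", "b", "b"], pred=["a", "b", "b", "b"])
report = compute_classification_metrics(labels)
```

The nested-dict layout makes it straightforward to collect reports across datasets into a single pandas DataFrame, as the evaluate_* functions below do.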
- autometa.validation.benchmark.compute_clustering_metrics(labels: NamedTuple, average_method: str) → Dict[str, float]
Calculate the clustering performance metrics listed below.
Note
Some of these clustering performance evaluation metrics adjust for chance. This is discussed in more detail in the scikit-learn user guide, under "Adjustment for chance in clustering performance evaluation".
That analysis suggests the most robust and reliable metrics to use as the number of clusters increases are the adjusted Rand index and the adjusted mutual information score.
- Parameters:
labels (Labels) – Labels NamedTuple with Labels.pred (predictions) and Labels.true (reference) namespaces
average_method (str) – Normalizer to select for the normalized mutual information score clustering metric. Choices: min, geometric, arithmetic, max
- Returns:
Computed clustering evaluation metrics keyed by metric name
- Return type:
Dict[str, float]
- Raises:
ValueError – The input arguments are not the correct type (pd.DataFrame or str)
ValueError – The provided reference community and predictions do not match!
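A sketch of such a metric collection using scikit-learn's metrics module; the exact set of metrics and key names here are assumptions, not Autometa's verbatim output:

```python
from collections import namedtuple
from typing import Dict

from sklearn import metrics

Labels = namedtuple("Labels", ["true", "pred"])


def compute_clustering_metrics(
    labels: Labels, average_method: str = "arithmetic"
) -> Dict[str, float]:
    # The "adjusted" metrics correct for chance agreement, which matters
    # increasingly as the number of clusters grows.
    return {
        "adjusted rand score": metrics.adjusted_rand_score(labels.true, labels.pred),
        "adjusted mutual info score": metrics.adjusted_mutual_info_score(
            labels.true, labels.pred, average_method=average_method
        ),
        "normalized mutual info score": metrics.normalized_mutual_info_score(
            labels.true, labels.pred, average_method=average_method
        ),
        "homogeneity score": metrics.homogeneity_score(labels.true, labels.pred),
        "completeness score": metrics.completeness_score(labels.true, labels.pred),
        "v-measure": metrics.v_measure_score(labels.true, labels.pred),
    }


# Identical partitions under permuted cluster names score perfectly
# on permutation-invariant metrics.
labels = Labels(true=[0, 0, 1, 1], pred=[1, 1, 0, 0])
scores = compute_clustering_metrics(labels)
```

Note that all of these metrics are invariant to the actual cluster label values; only the partition structure matters, which is exactly why they suit clustering (but not classification) evaluation.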
- autometa.validation.benchmark.evaluate_binning_classification(predictions: Iterable, reference: str) → pandas.DataFrame
- autometa.validation.benchmark.evaluate_classification(predictions: Iterable, reference: str, ncbi: Union[str, NCBI], keep_averages=['weighted avg', 'samples avg']) → Tuple[pandas.DataFrame, List[Dict[str, str]]]
Evaluate classification predictions against the provided reference
- Parameters:
predictions (Iterable) – Paths to taxonomic predictions (tab-delimited files of contig and taxid columns)
reference (str) – Path to ground truths (tab-delimited file containing at least contig and taxid columns)
ncbi (Union[str, NCBI]) – Path to NCBI databases directory or an instance of the autometa NCBI class
keep_averages (list, optional) – Averages to keep from the classification report, by default ['weighted avg', 'samples avg']
- Returns:
Metrics
- Return type:
Tuple[pd.DataFrame, List[dict]]
- autometa.validation.benchmark.evaluate_clustering(predictions: Iterable, reference: str, average_method: str) → pandas.DataFrame
Evaluate clustering performance of predictions against the reference
- Parameters:
predictions (Iterable) – Paths to binning predictions. Paths should be tab-delimited files with "cluster" and "contig" columns.
reference (str) – Path to ground truth reference genome assignments. Should be a tab-delimited file with "contig" and "reference_genome" columns.
average_method (str) – Normalizer to select for the normalized mutual information score clustering metric. Choices: min, geometric, arithmetic, max
- Returns:
Dataframe of clustering metrics indexed by "dataset", computed as each prediction's basename
- Return type:
pd.DataFrame
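The per-dataset loop can be sketched as follows; this simplified version computes only the adjusted Rand index, while the real function computes the full metric set and honors average_method:

```python
import os
import tempfile

import pandas as pd
from sklearn import metrics


def evaluate_clustering(predictions, reference: str) -> pd.DataFrame:
    # Align each prediction table with the reference on "contig",
    # compute metrics, and index the result rows by file basename.
    ref = pd.read_csv(reference, sep="\t").set_index("contig")
    rows = {}
    for path in predictions:
        pred = pd.read_csv(path, sep="\t").set_index("contig")
        merged = ref.join(pred, how="inner")
        rows[os.path.basename(path)] = {
            "adjusted rand score": metrics.adjusted_rand_score(
                merged["reference_genome"], merged["cluster"]
            ),
        }
    df = pd.DataFrame.from_dict(rows, orient="index")
    df.index.name = "dataset"
    return df


# Toy tab-delimited inputs matching the documented column layout
tmpdir = tempfile.mkdtemp()
ref_path = os.path.join(tmpdir, "reference.tsv")
pred_path = os.path.join(tmpdir, "binning.tsv")
pd.DataFrame(
    {"contig": ["c1", "c2", "c3"], "reference_genome": ["g1", "g1", "g2"]}
).to_csv(ref_path, sep="\t", index=False)
pd.DataFrame(
    {"contig": ["c1", "c2", "c3"], "cluster": ["bin1", "bin1", "bin2"]}
).to_csv(pred_path, sep="\t", index=False)
df = evaluate_clustering([pred_path], ref_path)
```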
- autometa.validation.benchmark.get_categorical_labels(predictions: str, reference: Union[str, pandas.DataFrame]) → NamedTuple
Retrieve categorical labels from predictions and reference
- Parameters:
predictions (str) – Path to tab-delimited file containing contig clustering predictions with columns "contig" and "cluster".
reference (Union[str, pd.DataFrame]) – Path to tab-delimited file containing ground truth reference genome assignments with columns "contig" and "reference_genome".
- Returns:
Labels namedtuple with "true" and "pred" fields containing respective dataframes of categorical values
- Return type:
NamedTuple
- Raises:
ValueError – Provided reference is not a pd.DataFrame or a path to a ground-truth reference genome assignments file.
ValueError – The provided reference community contigs do not match the prediction contigs
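The alignment step likely amounts to an inner merge on the contig column; a simplified sketch taking in-memory DataFrames (the real function accepts a file path for predictions and performs stricter mismatch checks):

```python
from collections import namedtuple

import pandas as pd

Labels = namedtuple("Labels", ["true", "pred"])


def get_categorical_labels(
    predictions: pd.DataFrame, reference: pd.DataFrame
) -> Labels:
    # Inner-merge on "contig" so true/pred labels line up row-for-row.
    merged = pd.merge(reference, predictions, on="contig", how="inner")
    if merged.empty:
        raise ValueError(
            "The provided reference community contigs do not match the prediction contigs"
        )
    return Labels(true=merged["reference_genome"], pred=merged["cluster"])


reference = pd.DataFrame({"contig": ["c1", "c2"], "reference_genome": ["g1", "g2"]})
predictions = pd.DataFrame({"contig": ["c1", "c2"], "cluster": ["bin1", "bin2"]})
labels = get_categorical_labels(predictions, reference)
```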
- autometa.validation.benchmark.get_target_labels(prediction: str, reference: Union[str, pandas.DataFrame], ncbi: Union[str, NCBI]) → namedtuple
Retrieve taxid lineage as target labels from the merge of reference and prediction.
Note
The exact label value matters for these metrics, as we are looking at the available target labels for classification (not clustering)
- Parameters:
prediction (str) – Path to contig taxid predictions
reference (Union[str, pd.DataFrame]) – Path to ground truth contig taxids
ncbi (Union[str, NCBI]) – Path to NCBI databases directory or an instance of the autometa NCBI class.
- Returns:
Targets namedtuple with fields "true", "pred" and "target_names"
- Return type:
namedtuple
- Raises:
ValueError – Provided reference is not a pd.DataFrame or a path to a reference assignments file.
ValueError – The provided reference community and predictions do not match
- autometa.validation.benchmark.main()
- autometa.validation.benchmark.write_reports(reports: Iterable[Dict], outdir: str, ncbi: NCBI) → None
Write the taxid multi-label classification reports in reports
- Parameters:
reports (Iterable[Dict]) – List of classification report dicts from each classification benchmarking evaluation
outdir (str) – Directory path to write reports
ncbi (NCBI) – autometa.taxonomy.ncbi.NCBI instance for taxid name and rank look-up.
- Return type:
NoneType
autometa.validation.build_protein_marker_aln module
autometa.validation.calculate_f1_scores module
autometa.validation.cami module
Autometa module to format Autometa results to bioboxes data format (compatible with CAMI tools)
Reformats Autometa binning results such that they may be submitted to CAMI for benchmarking assessment
- autometa.validation.cami.format_genome_binning(df: pandas.DataFrame, sample_id: str, version: str = '0.9.0') → str
Format autometa genome binning results to be compatible with the specified BioBox version.
- Parameters:
genome_binning (Union[str, pd.DataFrame]) – Path to (to-be-formatted) genome_binning results or pd.DataFrame(index_col=range(0, n_rows), columns=[contig, taxid, ...])
sample_id (str) – Sample identifier, not the generating user or program name. It MUST match the regular expression [A-Za-z0-9._]+.
version (str, optional) – Biobox version to format results, by default '0.9'
- Returns:
Formatted results ready to be written to a file path
- Return type:
str
- Raises:
NotImplementedError – Specified version is not implemented
TypeError – genome_binning must be a path to a taxon results file or a pandas DataFrame
ValueError – genome_binning does not contain the required columns
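A hedged sketch of what a bioboxes-style genome binning formatter could look like; the header keys follow the bioboxes binning format, and the input column names (contig, cluster) are assumptions rather than Autometa's verbatim implementation:

```python
import pandas as pd


def format_genome_binning(
    df: pd.DataFrame, sample_id: str, version: str = "0.9.0"
) -> str:
    # Bioboxes binning output: a small @-prefixed header followed by
    # tab-delimited SEQUENCEID/BINID rows.
    header = f"@Version:{version}\n@SampleID:{sample_id}\n@@SEQUENCEID\tBINID\n"
    body = "\n".join(
        f"{row.contig}\t{row.cluster}" for row in df.itertuples(index=False)
    )
    return header + body + "\n"


df = pd.DataFrame({"contig": ["c1", "c2"], "cluster": ["bin1", "bin2"]})
text = format_genome_binning(df, sample_id="sample_0")
```

The taxon binning variant below follows the same pattern with a TAXID column in place of BINID.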
- autometa.validation.cami.format_profiling(df: pandas.DataFrame, sample_id: str, version: str) → str
- autometa.validation.cami.format_taxon_binning(df: pandas.DataFrame, sample_id: str, version: str = '0.9.0') → str
Format autometa taxon binning results to be compatible with the specified BioBox version.
- Parameters:
taxon_binning (Union[str, pd.DataFrame]) – Path to (to-be-formatted) taxon_binning results or pd.DataFrame(index_col=range(0, n_rows), columns=[contig, taxid, ...])
sample_id (str) – Sample identifier, not the generating user or program name. It MUST match the regular expression [A-Za-z0-9._]+.
version (str, optional) – Biobox version to format results, by default '0.9'
- Returns:
Formatted results ready to be written to a file path
- Return type:
str
- Raises:
NotImplementedError – Specified version is not implemented
TypeError – taxon_binning must be a path to a taxon results file or a pandas DataFrame
ValueError – taxon_binning does not contain the required columns
- autometa.validation.cami.get_biobox_format(predictions: Union[str, pandas.DataFrame], sample_id: str, results_type: Literal['profiling', 'genome_binning', 'taxon_binning'], version: str) → str
- autometa.validation.cami.main()
autometa.validation.cluster_process module
Processes metagenome-assembled genomes from autometa binning results
- autometa.validation.cluster_process.assess_assembly(seq_record_list)
- autometa.validation.cluster_process.main(args)
- autometa.validation.cluster_process.run_command(command_string, stdout_path=None)
autometa.validation.cluster_process_docker module
autometa.validation.cluster_taxonomy module
autometa.validation.compile_reference_training_table module
autometa.validation.confidence_vs_accuracy module
autometa.validation.datasets module
# License: GNU Affero General Public License v3 or later
# A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.
Pull data from a Google Drive dataset of simulated or synthetic communities.
- autometa.validation.datasets.download(community_type: str, community_sizes: list, file_names: list, dir_path: str) → None
Downloads the files specified in a dictionary.
- Parameters:
community_type (str) – The type of dataset to download
community_sizes (list) – The size(s) of dataset to download
file_names (list) – The file(s) to download
dir_path (str) – Output path to which the file(s) are downloaded
- Returns:
None; downloads are completed through gdown
- Return type:
None
- autometa.validation.datasets.main()
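Presumably the files are fetched with gdown; a sketch of the pattern with a hypothetical file_ids mapping in place of the module's internal lookup by community_type and community_sizes:

```python
import os


def drive_url(file_id: str) -> str:
    # gdown accepts this direct-download URL form for a Drive file id.
    return f"https://drive.google.com/uc?id={file_id}"


def download(file_ids: dict, dir_path: str) -> None:
    # file_ids maps output file names to Google Drive file ids
    # (hypothetical: the real module resolves these from community_type,
    # community_sizes and file_names).
    import gdown  # third-party dependency, imported lazily

    os.makedirs(dir_path, exist_ok=True)
    for file_name, file_id in file_ids.items():
        gdown.download(
            drive_url(file_id), os.path.join(dir_path, file_name), quiet=False
        )
```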