autometa.common package

Submodules

autometa.common.coverage module

# License: GNU Affero General Public License v3 or later # A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.

Calculates coverage of contigs

autometa.common.coverage.from_spades_names(records: List[Bio.SeqRecord.SeqRecord]) → pandas.DataFrame

Retrieve coverages from SPAdes scaffolds headers.

Example SPAdes header : NODE_83_length_162517_cov_224.639

Parameters:: records (List[SeqRecord]) – [SeqRecord,…]
Returns:: index=contig, name=’coverage’, dtype=float
Return type:: pd.DataFrame

autometa.common.coverage.get(fasta: str, out: str, from_spades: bool = False, fwd_reads: List[str] = None, rev_reads: List[str] = None, se_reads: List[str] = None, sam: str = None, bam: str = None, bed: str = None, cpus: int = 1) → pandas.DataFrame

Get coverages for assembly fasta file using provided files or if the metagenome assembly was generated from SPAdes, use the k-mer coverages provided in each contig’s header by specifying from_spades=True.

Either fwd_reads and rev_reads and/or se_reads or,`sam`, or bam, or bed must be provided if from_spades=False.

Note

Will begin coverage calculation based on files provided checking in the following order:

bed
bam
sam
fwd_reads and rev_reads and se_reads

Event sequence to calculate contig coverages:

align reads to generate alignment.sam
sort samfile to generate alignment.bam
calculate assembly coverages to generate alignment.bed
calculate contig coverages to generate coverage.tsv

Parameters:

fasta (str) – </path/to/assembly.fasta>
out (str) – </path/to/output/coverages.tsv>
from_spades (bool, optional) – If True, will attempt to parse record ids for coverage information. This is only compatible with SPAdes assemblies. (the Default is False).
fwd_reads (List[str], optional) – [</path/to/forward_reads.fastq>, …]
rev_reads (List[str], optional) – [</path/to/reverse_reads.fastq>, …]
se_reads (List[str], optional) – [</path/to/single_end_reads.fastq>, …]
sam (str, optional) – </path/to/alignments.sam>
bam (str, optional) – </path/to/alignments.bam>
bed (str, optional) – </path/to/alignments.bed>
cpus (int, optional) – Number of cpus to use for coverage calculation.

Returns:

index=contig cols=[‘coverage’]

Return type:

pd.DataFrame

autometa.common.coverage.main()

autometa.common.exceptions module

# License: GNU Affero General Public License v3 or later # A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.

File containing customized AutometaErrors for more specific exception handling

exception autometa.common.exceptions.AutometaError

Bases: Exception

Base class for Autometa Errors.

exception autometa.common.exceptions.BinningError

Bases: AutometaError

BinningError exception class.

Exception called when issues arise during or after the binning process.

This is usually a result of no clusters being recovered.

exception autometa.common.exceptions.ChecksumMismatchError

Bases: AutometaError

ChecksumMismatchError exception class

Exception called when checksums do not match.

exception autometa.common.exceptions.DatabaseOutOfSyncError(value)

Bases: AutometaError

Raised when NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other :param AutometaError: Base class for other exceptions :type AutometaError: class

__str__(): Operator overloading to return the text message written while raising the error, rather than the message of __str__ by base exception :returns: Message written alongside raising the exception :rtype: str

exception autometa.common.exceptions.ExternalToolError(cmd, err)

Bases: AutometaError

Raised when samtools sort is not executed properly.

Parameters:: AutometaError (class) – Base class for other exceptions

exception autometa.common.exceptions.TableFormatError

Bases: AutometaError

TableFormatError exception class.

Exception called when Table format is incorrect.

This is usually a result of a table missing the ‘contig’ column as this is often used as the index.

autometa.common.kmers module

# License: GNU Affero General Public License v3 or later # A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.

Count, normalize and embed k-mers given nucleotide sequences

autometa.common.kmers.autometa_clr(df: pandas.DataFrame) → pandas.DataFrame

Normalize k-mers by Centered Log Ratio transformation

Steps

Drop any k-mers not present for all contigs

Drop any contigs not containing any kmer counts

Fill any remaining na values with 0

Normalize the k-mer count by the total count of all k-mers for a given contig

Add 1 as 0 can not be utilized for CLR

Perform CLR transformation log(norm. value / geometric mean norm. value)

param df:: K-mers Dataframe where index_col=’contig’ and column values are k-mer frequencies.
type df:: pd.DataFrame

References

Aitchison, J. The Statistical Analysis of Compositional Data (1986)
Pawlowsky-Glahn, Egozcue, Tolosana-Delgado. Lecture Notes on Compositional Data Analysis (2011)
Why ILR is preferred stats stackexchange discussion
Use of CLR transformation prior to PCA stats stackexchange discussion
Lecture notes on Compositional Data Analysis (CoDa) PDF

returns:: index=’contig’, cols=[kmer, kmer, …] Columns have been transformed by CLR normalization.
rtype:: pd.DataFrame

autometa.common.kmers.count(assembly: str, size: int = 5, out: str = None, force: bool = False, verbose: bool = True, cpus: int = 2) → pandas.DataFrame

Counts k-mer frequencies for provided assembly file

First we make a dictionary of all the possible k-mers (discounting reverse complements). Each k-mer’s count is updated by index when encountered in the record.

Parameters:

assembly (str) – Description of parameter assembly.
size (int, optional) – length of k-mer to count size (the default is 5).
out (str, optional) – Path to write k-mer counts table.
force (bool, optional) – Whether to overwrite existing out k-mer counts table (the default is False).
verbose (bool, optional) – Enable progress bar verbose (the default is True).
cpus (int, optional) – Number of cpus to use. (the default will use all available).

Returns:

index_col=’contig’, tab-delimited, cols=unique_kmers i.e. 5-mer columns=[AAAAA, AAAAT, AAAAC, AAAAG, …, GCCGC]

Return type:

pandas.DataFrames

Raises:

TypeError – size must be an int

autometa.common.kmers.embed(kmers: Union[str, pandas.DataFrame], out: Optional[str] = None, force: bool = False, embed_dimensions: int = 2, pca_dimensions: int = 50, method: str = 'bhsne', perplexity: float = 30.0, seed: int = 42, n_jobs: int = -1, **method_kwargs: Dict[str, Any]) → pandas.DataFrame

Embed k-mers using provided method.

Notes

Parameters:

kmers (str or pd.DataFrame) – </path/to/input/kmers.normalized.tsv>
out (str, optional) – </path/to/output/kmers.out.tsv> If provided will write to out.
force (bool, optional) – Whether to overwrite existing out file.
embed_dimensions (int, optional) –
embed_dimensions` to embed k-mer frequencies (the default is 2).

The output embedded kmers will follow columns of x_1 to x_{embed_dimensions}

NOTE: The columns are 1-indexed, i.e. at x_1 not x_0
pca_dimensions (int, optional) – Reduce k-mer frequencies dimensions to pca_dimensions (the default is 50). If zero, will skip this step.
method (str, optional) – embedding method to use (the default is ‘bhsne’). choices include sksne, bhsne, umap, trimap and densmap.
perplexity (float, optional) – hyperparameter used to tune sksne and bhsne (the default is 30.0).
seed (int, optional) – Seed to use for method. Allows for reproducibility from random state.
n_jobs (int, optional) –
Used with sksne, densmap and umap, (the default is -1 which will attempt to use all available CPUs)
Note

For n_jobs below -1, (CPUS + 1 + n_jobs) are used. For example with n_jobs=-2, all CPUs but one are used.
- scikit-learn TSNE n_jobs glossary
- UMAP and DensMAP’s
invocation use this with pynndescent

**method_kwargs : Dict[str, Any], optional

Other keyword arguments (kwargs) to be supplied to respective method.

Set UMAP(verbose=True, output_dens=True) using **method_kwargs >>> embed_df = kmers.embed(

norm_df, method=’densmap’, embed_dimensions=2, n_jobs=None, **{

‘verbose’: True, ‘output_dens’: True,

}

)

NOTE: Setting duplicate arguments will result in an error

Here we specify UMAP(densmap=True) using method='densmap' and also attempt to overwrite to UMAP(densmap=False) with the method_kwargs, **{'densmap':False}, resulting in a TypeError.
>>> embed_df = kmers.embed(
    df,
    method='densmap',
    embed_dimensions=2,
    n_jobs=4,
    **{'densmap': False}
)
TypeError: umap.umap_.UMAP() got multiple values for keyword argument 'densmap'
Typically, you will not require the use of method_kwargs as this is only available for applying advanced parameter settings to any of the available embedding methods.

Returns:

out dataframe with index=’contig’ and cols=[‘x’,’y’,’z’]

Return type:

pd.DataFrame

Raises:

TypeError – Provided kmers is not a str or pd.DataFrame.
TableFormatError – Provided kmers or out are not formatted correctly for use.
ValueError – Provided method is not an available choice.
FileNotFoundError – kmers type must be a pd.DataFrame or filepath.

autometa.common.kmers.init_kmers(kmer_size: int = 5) → Dict[str, int]

Initialize k-mers from kmer_size. Respective reverse complements will be removed.

Parameters:: kmer_size (int, optional) – pattern size of k-mer to intialize dict (the default is 5).
Returns:: {kmer:index, …}
Return type:: dict

autometa.common.kmers.load(kmers_fpath: str) → pandas.DataFrame

Load in a previously counted k-mer frequencies table.

Parameters:

kmers_fpath (str) – Path to kmer frequency table

Returns:

index=’contig’, cols=[kmer, kmer, …]

Return type:

pd.DataFrame

Raises:

FileNotFoundError – kmers_fpath does not exist or is empty
TableFormatError – kmers_fpath file format is invalid

autometa.common.kmers.main()

autometa.common.kmers.mp_counter(assembly: str, ref_kmers: Dict[str, int], cpus: int = 2) → List

Multiprocessing k-mer counter used in count. (Should not be used directly).

Parameters:

assembly (str) – </path/to/assembly.fasta> (nucleotides)
ref_kmers (dict) – {kmer:index, …}
cpus (int, optional) – Number of cpus to use. (the default will use all available).

Returns:

[{record:counts}, {record:counts}, …]

Return type:

list

autometa.common.kmers.normalize(df: pandas.DataFrame, method: str = 'am_clr', out: Optional[str] = None, force: bool = False) → pandas.DataFrame

Normalize raw k-mer counts by center or isometric log-ratio transform.

Parameters:

df (pd.DataFrame) – k-mer counts dataframe. i.e. for 3-mers; Index=’contig’, columns=[AAA, AAT, …]
method (str, optional) – Normalize k-mer counts using CLR or ILR transformation (the default is Autometa’s CLR implementation). choices = [‘ilr’, ‘clr’, ‘am_clr’] Other transformations come from the skbio.stats.composition module
out (str, optional) – Path to write normalized k-mers.
force (bool, optional) – Whether to overwrite existing out file path, by default False.

Returns:

Normalized counts using provided method.

Return type:

pd.DataFrame

Raises:

ValueError – Provided method is not available.

autometa.common.kmers.record_counter(args: Tuple[Bio.SeqIO.SeqRecord, Dict[str, int]]) → Dict[str, List[int]]

single record counter used when multiprocessing.

Parameters:: args (2-tuple) – (record, ref_kmers) - record : SeqIO.SeqRecord - ref_kmers : {kmer:index, …}
Returns:: {contig:[count,count,…]} count index is respective to ref_kmers.keys()
Return type:: dict

autometa.common.kmers.seq_counter(assembly: str, ref_kmers: Dict[str, int], verbose: bool = True) → Dict[str, List[int]]

Sequentially count k-mer frequencies.

Parameters:

assembly (str) – </path/to/assembly.fasta> (nucleotides)
ref_kmers (dict) – {kmer:index, …}
verbose (bool, optional) – enable progress bar verbose (the default is True).

Returns:

{contig:[count,count,…]} count index is respective to ref_kmers.keys()

Return type:

dict

autometa.common.markers module

# License: GNU Affero General Public License v3 or later # A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.

Autometa Marker class consisting of various methods to annotate sequences with marker sets depending on sequence set taxonomy

autometa.common.markers.get(kingdom: str, orfs: str, hmmdb: Optional[str] = None, cutoffs: Optional[str] = None, dbdir: str = './autometa/databases/markers', scans: Optional[str] = None, out: Optional[str] = None, force: bool = False, cpus: int = 8, parallel: bool = True, gnu_parallel: bool = False, seed: int = 42) → pandas.DataFrame

Retrieve contigs’ markers from markers database that pass cutoffs filter.

Parameters:

kingdom (str) – kingdom to annotate markers choices = [‘bacteria’, ‘archaea’]
orfs (str) – Path to amino-acid ORFs file
dbdir – Optional directory containing hmmdb and cutoffs files
hmmdb – Path to marker genes database file, previously hmmpressed.
cutoffs – Path to marker genes cutoff tsv.
scans (str, optional) – Path to existing hmmscan table to filter by cutoffs
out (str, optional) – Path to write annotated markers table.
force (bool, optional) – Whether to overwrite existing out file path, by default False.
cpus (int, optional) – Number of cores to use if running in parallel, by default all available.
parallel (bool, optional) – Whether to run hmmscan using its parallel option, by default True.
gnu_parallel (bool, optional) – Whether to run hmmscan using gnu parallel, by default False.
seed (int, optional) – Seed to pass into hmmscan for determinism, by default 42.

Returns:

wide - pd.DataFrame(index_col=contig, columns=[PFAM,…])
long - pd.DataFrame(index_col=contig, columns=[‘sacc’,’count’])
list - {contig:[pfam,pfam,…],contig:[…],…}
counts - {contig:count, contig:count,…}

Return type:

pd.Dataframe or dict

Raises:

ValueError – Why the exception is raised.

autometa.common.markers.load(fpath, format='wide')

Read markers table into specified format.

Parameters:

fpath (str) – </path/to/kingdom.markers.tsv>
format (str, optional) –
- wide - index=contig, cols=[domain sacc,..] (default)
- long - index=contig, cols=[‘sacc’,’count’]
- list - {contig:[sacc,…],…}
- counts - {contig:len([sacc,…]), …}

Returns:

wide - index=contig, cols=[domain sacc,..] (default)
long - index=contig, cols=[‘sacc’,’count’]
list - {contig:[sacc,…],…}
counts - {contig:len([sacc,…]), …}

Return type:

pd.DataFrame or dict

Raises:

FileNotFoundError – Provided fpath does not exist
ValueError – Provided format is not in choices: choices = wide, long, list or counts

autometa.common.markers.main()

autometa.common.metagenome module

# License: GNU Affero General Public License v3 or later # A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.

Script containing Metagenome class for general handling of metagenome assembly

class autometa.common.metagenome.Metagenome(assembly)

Bases: object

Autometa Metagenome Class.

Parameters:: assembly (str) – </path/to/metagenome/assembly.fasta>

sequences

[seq,…]

Type:: list

seqrecords

[SeqRecord,…]

Type:: list

nseqs

Number of sequences in assembly.

Type:: int

length_weighted_gc

Length weighted average GC% of assembly.

Type:: float

size

Total assembly size in bp.

Type:: int

largest_seq

id of longest sequence in assembly

Type:: str

\* self.fragmentation_metric()

\* self.describe()

\* self.length_filter()

describe() → pandas.DataFrame

Return dataframe of details.

Columns

# assembly : Assembly input into Metagenome(…) [index column] # nseqs : Number of sequences in assembly # size : Size or total sum of all sequence lengths # N50 : # N10 : # N90 : # length_weighted_gc_content : Length weighted average GC content # largest_seq : Largest sequence in assembly

rtype:: pd.DataFrame

fragmentation_metric(quality_measure: float = 0.5) → int

Describes the quality of assembled genomes that are fragmented in contigs of different length.

Note

For more information see this metagenomics wiki from Matthias Scholz

Parameters:: quality_measure (0 < float < 1) – Description of parameter quality_measure (the default is .50). I.e. default measure is N50, but could use .1 for N10 or .9 for N90
Returns:: Minimum contig length to cover quality_measure of genome (i.e. length weighted median)
Return type:: int

gc_content() → pandas.DataFrame

Retrieves GC content from sequences in assembly

Returns:: index=”contig”, columns=[“gc_content”,”length”]
Return type:: pd.DataFrame

property largest_seq: str

Retrieve the name of the largest sequence in the provided assembly.

Returns:: record ID of the largest sequence in assembly.
Return type:: str

length_filter(out: str, cutoff: int = 3000, force: bool = False)

Filters sequences by length with provided cutoff.

Note

A WARNING will be emitted and the original metagenome will be returned if no contigs pass the length filter cutoff.

Parameters:

out (str) – Path to write length filtered output fasta file.
cutoff (int, optional) – Lengths above or equal to cutoff that will be retained (the default is 3000).
force (bool, optional) – Overwrite existing out file (the default is False).

Returns:

autometa Metagenome object with only assembly sequences above the cutoff threshold.

Return type:

Metagenome

Raises:

TypeError – cutoff value must be a float or integer
ValueError – cutoff value must be a positive real number
FileExistsError – filepath consisting of sequences that passed filter already exists

property length_weighted_gc: float

Retrieve the length weighted average GC percentage of provided assembly.

Returns:: GC percentage weighted by contig length.
Return type:: float

property nseqs: int

Retrieve the number of sequences in provided assembly.

Returns:: Number of sequences parsed from assembly
Return type:: int

property seqrecords: list

Retrieve SeqRecord objects from provided assembly.

Returns:: [SeqRecord, SeqRecord, …]
Return type:: list

property sequences: list

Retrieve the sequences from provided assembly.

Returns:: [seq, seq, …]
Return type:: list

property size: int

Retrieve the summation of sizes for each contig in the provided assembly.

Returns:: Total summation of contig sizes in assembly
Return type:: int

autometa.common.metagenome.main()

autometa.common.utilities module

# License: GNU Affero General Public License v3 or later # A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.

File containing common utilities functions to be used by Autometa scripts.

autometa.common.utilities.calc_checksum(fpath: str) → str

Retrieve md5 checksum from provided fpath.

See:
https://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file

fpathstr
</path/to/file>

str
space-delimited hexdigest of fpath using md5sum and basename of fpath. e.g. ‘hash filename

‘

FileNotFoundError
Provided fpath does not exist

TypeError
fpath is not a string

autometa.common.utilities.file_length(fpath: str, approximate: bool = False) → int

Retrieve the number of lines in fpath

See: https://stackoverflow.com/q/845058/13118765

Parameters:

fpath (str) – Description of parameter fpath.
approximate (bool) – If True, will approximate the length of the file from the file size.

Returns:

Number of lines in fpath

Return type:

int

Raises:

FileNotFoundError – provided fpath does not exist

autometa.common.utilities.gunzip(infpath: str, outfpath: str, delete_original: bool = False, block_size: int = 65536) → str

Decompress gzipped infpath to outfpath and write checksum of outfpath upon successful decompression.

Parameters:

infpath (str) – </path/to/file.gz>
outfpath (str) – </path/to/file>
delete_original (bool) – Will delete the original file after successfully decompressing infpath (Default is False).
block_size (int) – Amount of infpath to read in to memory before writing to outfpath (Default is 65536 bytes).

Returns:

</path/to/file>

Return type:

str

Raises:

FileExistsError – outfpath already exists and is not empty

autometa.common.utilities.internet_is_connected(host: str = '8.8.8.8', port: int = 53, timeout: int = 2) → bool

autometa.common.utilities.is_gz_file(filepath) → bool

Check if the given file is gzipped compressed or not.

Parameters:: filepath (str) – Filepath to check
Returns:: True if file is gzipped else False
Return type:: bool

autometa.common.utilities.make_pickle(obj: Any, outfpath: str) → str

Serialize a python object (obj) to outfpath. Note: Opposite of unpickle()

Parameters:

obj (any) – Python object to serialize to outfpath.
outfpath (str) – </path/to/pickled/file>.

Returns:

</path/to/pickled/file.pkl>

Return type:

str

Raises:

ExceptionName – Why the exception is raised.

autometa.common.utilities.ncbi_is_connected(filepath: str = 'rsync://ftp.ncbi.nlm.nih.gov/genbank/GB_Release_Number') → bool

Check if ncbi databases are reachable. This can be used instead of a check for internet connection.

Parameters:

filepath (string) – filepath to NCBI’s rsync server. Default is rsync://ftp.ncbi.nlm.nih.gov/genbank/GB_Release_Number, which should be a very small file that is unlikely to move. This may need to be updated if NCBI changes their file organization.
Outputs –
------- –
False (True or) – True if the rsync server can be contacted without an error False if the rsync process returns any error

autometa.common.utilities.read_checksum(fpath: str) → str

Read checksum from provided checksum formatted fpath.

Note: See write_checksum for how a checksum file is generated.

Parameters:

fpath (str) – </path/to/file.md5>

Returns:

checksum retrieved from fpath.

Return type:

str

Raises:

TypeError – Provided fpath was not a string.
FileNotFoundError – Provided fpath does not exist.

autometa.common.utilities.tarchive_results(outfpath: str, src_dirpath: str) → str

Generate a tar archive of Autometa Results

See: https://stackoverflow.com/questions/2032403/how-to-create-full-compressed-tar-file-using-python

Parameters:

outfpath (str) – </path/to/output/tar/archive.tar.gz || </path/to/output/tar/archive.tgz
src_dirpath (str) – </paths/to/directory/to/archive>

Returns:

</path/to/output/tar/archive.tar.gz || </path/to/output/tar/archive.tgz

Return type:

str

Raises:

FileExistsError – outfpath already exists

autometa.common.utilities.timeit(func: function) → function

Time function run time (to be used as a decorator). I.e. when defining a function use python’s decorator syntax

Example

@timeit
def your_function(args):
    ...

Notes

See: https://docs.python.org/2/library/functools.html#functools.wraps

Parameters:: func (function) – function to decorate timer
Returns:: timer decorated func.
Return type:: function

autometa.common.utilities.unpickle(fpath: str) → Any

Load a serialized fpath from make_pickle().

Parameters:: fpath (str) – </path/to/file.pkl>.
Returns:: Python object that was serialized to file via make_pickle()
Return type:: any
Raises:: ExceptionName – Why the exception is raised.

autometa.common.utilities.untar(tarchive: str, outdir: str, member: Optional[str] = None) → str

Decompress a tar archive (may be gzipped or bzipped). passing in member requires an outdir also be provided.

See: https://docs.python.org/3.8/library/tarfile.html#module-tarfile

Parameters:

tarchive (str) – </path/tarchive.tar.[compression]>
outdir (str) – </path/to/output/directory>
member (str, optional) – member file to extract.

Returns:

</path/to/extracted/member/file> if member else </path/to/output/directory>

Return type:

str

Raises:

IsADirectoryError – outdir already exists
ValueError – tarchive is not a tar archive
KeyError – member was not found in tarchive

autometa.common.utilities.write_checksum(infpath: str, outfpath: str) → str

Calculate checksum for infpath and write to outfpath.

Parameters:

infpath (str) – </path/to/input/file>
outfpath (str) – </path/to/output/checksum/file>

Returns:

Description of returned object.

Return type:

NoneType

Raises:

FileNotFoundError – Provided infpath does not exist
TypeError – infpath or outfpath is not a string

autometa.common package

Subpackages

Submodules

autometa.common.coverage module

autometa.common.exceptions module

autometa.common.kmers module

Steps

autometa.common.markers module

autometa.common.metagenome module

Columns

autometa.common.utilities module

Module contents