autometa.common package
Subpackages
- autometa.common.external package
- Submodules
- autometa.common.external.bedtools module
- autometa.common.external.bowtie module
- autometa.common.external.diamond module
- autometa.common.external.hmmscan module
- autometa.common.external.hmmsearch module
- autometa.common.external.prodigal module
- autometa.common.external.samtools module
- Module contents
Submodules
autometa.common.coverage module
# License: GNU Affero General Public License v3 or later # A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.
Calculates coverage of contigs
- autometa.common.coverage.from_spades_names(records: List[Bio.SeqRecord.SeqRecord]) -> pandas.DataFrame
Retrieve coverages from SPAdes scaffolds headers.
Example SPAdes header : NODE_83_length_162517_cov_224.639
- Parameters:
records (List[SeqRecord]) – [SeqRecord, ...]
- Returns:
index=contig, name='coverage', dtype=float
- Return type:
pd.DataFrame
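The header convention shown above can be parsed with a few lines of standard-library Python. This is a minimal sketch, not Autometa's implementation: the helper name `coverage_from_spades_header` is hypothetical, the second header is made up for illustration, and the sketch assumes the coverage value is the last field of the header.

```python
def coverage_from_spades_header(header: str) -> float:
    """Return the k-mer coverage encoded at the end of a SPAdes header."""
    # Split on the last "_cov_" so earlier underscores in the header are ignored.
    _prefix, separator, coverage = header.rpartition("_cov_")
    if not separator:
        raise ValueError(f"no '_cov_' field found in header: {header!r}")
    return float(coverage)


# The first header is the example from the docstring; the second is illustrative.
headers = ["NODE_83_length_162517_cov_224.639", "NODE_1_length_500_cov_3.5"]
coverages = {header: coverage_from_spades_header(header) for header in headers}
```

Headers that carry extra fields after the coverage value would need a stricter pattern than this sketch uses.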
- autometa.common.coverage.get(fasta: str, out: str, from_spades: bool = False, fwd_reads: List[str] = None, rev_reads: List[str] = None, se_reads: List[str] = None, sam: str = None, bam: str = None, bed: str = None, cpus: int = 1) -> pandas.DataFrame
Get coverages for the assembly fasta file using the provided files. If the metagenome assembly was generated by SPAdes, the k-mer coverages provided in each contig's header can be used instead by specifying from_spades=True.
If from_spades=False, at least one of the following must be provided: fwd_reads and rev_reads and/or se_reads, sam, bam, or bed.
Note
Coverage calculation begins from the provided files, which are checked in the following order:
1. bed
2. bam
3. sam
4. fwd_reads and rev_reads and se_reads
Event sequence to calculate contig coverages:
1. align reads to generate alignment.sam
2. sort samfile to generate alignment.bam
3. calculate assembly coverages to generate alignment.bed
4. calculate contig coverages to generate coverage.tsv
- Parameters:
fasta (str) – </path/to/assembly.fasta>
out (str) – </path/to/output/coverages.tsv>
from_spades (bool, optional) – If True, will attempt to parse record ids for coverage information. This is only compatible with SPAdes assemblies (the default is False).
fwd_reads (List[str], optional) – [</path/to/forward_reads.fastq>, ...]
rev_reads (List[str], optional) – [</path/to/reverse_reads.fastq>, ...]
se_reads (List[str], optional) – [</path/to/single_end_reads.fastq>, ...]
sam (str, optional) – </path/to/alignments.sam>
bam (str, optional) – </path/to/alignments.bam>
bed (str, optional) – </path/to/alignments.bed>
cpus (int, optional) – Number of cpus to use for coverage calculation.
- Returns:
index=contig, cols=['coverage']
- Return type:
pd.DataFrame
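The input precedence described in the note can be sketched as a small dispatch helper. The function name `first_available_input` and its return labels are hypothetical; only the checking order (bed, bam, sam, then reads) is taken from the description above.

```python
from typing import List, Optional


def first_available_input(
    bed: Optional[str] = None,
    bam: Optional[str] = None,
    sam: Optional[str] = None,
    fwd_reads: Optional[List[str]] = None,
    rev_reads: Optional[List[str]] = None,
    se_reads: Optional[List[str]] = None,
) -> str:
    """Return a label for the most processed input that was provided."""
    # Later pipeline products are checked first so earlier steps can be skipped.
    for label, value in (("bed", bed), ("bam", bam), ("sam", sam)):
        if value:
            return label
    # Paired-end reads need both mates; single-end reads are enough on their own.
    if (fwd_reads and rev_reads) or se_reads:
        return "reads"
    raise ValueError("provide reads, sam, bam or bed when from_spades=False")
```

For example, passing both a BAM and a SAM file would start from the BAM file, so the alignment and sorting steps would not be repeated.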
- autometa.common.coverage.main()
autometa.common.exceptions module
File containing customized AutometaErrors for more specific exception handling
- exception autometa.common.exceptions.AutometaError
Bases:
Exception
Base class for Autometa Errors.
- exception autometa.common.exceptions.BinningError
Bases:
AutometaError
BinningError exception class.
Exception called when issues arise during or after the binning process.
This is usually a result of no clusters being recovered.
- exception autometa.common.exceptions.ChecksumMismatchError
Bases:
AutometaError
ChecksumMismatchError exception class.
Exception called when checksums do not match.
- exception autometa.common.exceptions.DatabaseOutOfSyncError(value)
Bases:
AutometaError
Raised when the NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other.
- Parameters:
AutometaError (class) – Base class for other exceptions
- __str__()
Operator overloading to return the text message written while raising the error, rather than the message of __str__ by the base exception.
- Returns:
Message written alongside raising the exception
- Return type:
str
- exception autometa.common.exceptions.ExternalToolError(cmd, err)
Bases:
AutometaError
Raised when an external tool command (for example, samtools sort) is not executed properly.
- Parameters:
AutometaError (class) – Base class for other exceptions
- exception autometa.common.exceptions.TableFormatError
Bases:
AutometaError
TableFormatError exception class.
Exception called when Table format is incorrect.
This is usually a result of a table missing the 'contig' column, as this is often used as the index.
autometa.common.kmers module
Count, normalize and embed k-mers given nucleotide sequences
- autometa.common.kmers.autometa_clr(df: pandas.DataFrame) -> pandas.DataFrame
Normalize k-mers by Centered Log Ratio transformation
Steps
1. Drop any k-mers not present for all contigs
2. Drop any contigs not containing any k-mer counts
3. Fill any remaining NA values with 0
4. Normalize the k-mer count by the total count of all k-mers for a given contig
5. Add 1, as 0 cannot be used for CLR
6. Perform CLR transformation log(norm. value / geometric mean norm. value)
- Parameters:
df (pd.DataFrame) – K-mers DataFrame where index_col='contig' and column values are k-mer frequencies.
References
Aitchison, J. The Statistical Analysis of Compositional Data (1986)
Pawlowsky-Glahn, Egozcue, Tolosana-Delgado. Lecture Notes on Compositional Data Analysis (2011)
Why ILR is preferred stats stackexchange discussion
Use of CLR transformation prior to PCA stats stackexchange discussion
Lecture notes on Compositional Data Analysis (CoDa) PDF
- Returns:
index='contig', cols=[kmer, kmer, ...] Columns have been transformed by CLR normalization.
- Return type:
pd.DataFrame
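The normalization and CLR steps can be illustrated for a single contig with the standard library alone. This is a sketch under stated assumptions rather than Autometa's code: the helper name `clr_row` is hypothetical, and the pseudocount of 1 is assumed to be applied to each raw count before dividing by the contig's total. If the pseudocount is instead added after normalization, only the line computing `fractions` changes.

```python
import math
from typing import List


def clr_row(counts: List[int]) -> List[float]:
    """Centered log-ratio transform of one contig's k-mer counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("contigs without any k-mer counts should be dropped first")
    # Assumption: the pseudocount is added to raw counts so that log(0) never occurs.
    fractions = [(count + 1) / total for count in counts]
    # The log of the geometric mean equals the mean of the logs.
    log_geometric_mean = sum(math.log(value) for value in fractions) / len(fractions)
    # log(value / geometric mean) == log(value) - log(geometric mean)
    return [math.log(value) - log_geometric_mean for value in fractions]


transformed = clr_row([0, 3, 7])
```

A useful sanity check is that CLR-transformed values for a contig sum to zero, up to floating-point error.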
- autometa.common.kmers.count(assembly: str, size: int = 5, out: str = None, force: bool = False, verbose: bool = True, cpus: int = 2) -> pandas.DataFrame
Counts k-mer frequencies for provided assembly file
First we make a dictionary of all the possible k-mers (discounting reverse complements). Each k-mer's count is updated by index when encountered in the record.
- Parameters:
assembly (str) – </path/to/assembly.fasta> (nucleotides)
size (int, optional) – Length of k-mer to count (the default is 5).
out (str, optional) – Path to write k-mer counts table.
force (bool, optional) – Whether to overwrite existing out k-mer counts table (the default is False).
verbose (bool, optional) – Enable progress bar (the default is True).
cpus (int, optional) – Number of cpus to use (the default will use all available).
- Returns:
index_col='contig', tab-delimited, cols=unique_kmers, i.e. 5-mer columns=[AAAAA, AAAAT, AAAAC, AAAAG, ..., GCCGC]
- Return type:
pandas.DataFrame
- Raises:
TypeError – size must be an int
- autometa.common.kmers.embed(kmers: Union[str, pandas.DataFrame], out: Optional[str] = None, force: bool = False, embed_dimensions: int = 2, pca_dimensions: int = 50, method: str = 'bhsne', perplexity: float = 30.0, seed: int = 42, n_jobs: int = -1, **method_kwargs: Dict[str, Any]) -> pandas.DataFrame
Embed k-mers using provided method.
Notes
- Parameters:
kmers (str or pd.DataFrame) – </path/to/input/kmers.normalized.tsv>
out (str, optional) – </path/to/output/kmers.out.tsv> If provided, the embedding will be written to out.
force (bool, optional) – Whether to overwrite existing out file.
embed_dimensions (int, optional) –
Number of dimensions in which to embed k-mer frequencies (the default is 2).
The output embedded k-mers will follow columns of x_1 to x_{embed_dimensions}.
NOTE: The columns are 1-indexed, i.e. they start at x_1, not x_0.
pca_dimensions (int, optional) – Reduce k-mer frequencies dimensions to pca_dimensions (the default is 50). If zero, will skip this step.
method (str, optional) – Embedding method to use (the default is 'bhsne'). Choices include sksne, bhsne, umap, trimap and densmap.
perplexity (float, optional) – Hyperparameter used to tune sksne and bhsne (the default is 30.0).
seed (int, optional) – Seed to use for method. Allows for reproducibility from random state.
n_jobs (int, optional) –
Used with sksne, densmap and umap (the default is -1, which will attempt to use all available CPUs).
Note
For n_jobs below -1, (CPUS + 1 + n_jobs) are used. For example, with n_jobs=-2, all CPUs but one are used.
- scikit-learn TSNE n_jobs glossary
- UMAP and DensMAP's invocation use this with pynndescent
**method_kwargs (Dict[str, Any], optional) –
Other keyword arguments (kwargs) to be supplied to the respective method.
Set UMAP(verbose=True, output_dens=True) using **method_kwargs:
>>> embed_df = kmers.embed(
...     norm_df,
...     method='densmap',
...     embed_dimensions=2,
...     n_jobs=None,
...     **{
...         'verbose': True,
...         'output_dens': True,
...     }
... )
NOTE: Setting duplicate arguments will result in an error.
Here we specify UMAP(densmap=True) using method='densmap' and also attempt to overwrite it to UMAP(densmap=False) with the method_kwargs, **{'densmap': False}, resulting in a TypeError:
>>> embed_df = kmers.embed(
...     df,
...     method='densmap',
...     embed_dimensions=2,
...     n_jobs=4,
...     **{'densmap': False}
... )
TypeError: umap.umap_.UMAP() got multiple values for keyword argument 'densmap'
Typically, you will not need method_kwargs, as it is only provided for applying advanced parameter settings to any of the available embedding methods.
- Returns:
out dataframe with index='contig' and cols=['x_1', 'x_2', ..., 'x_{embed_dimensions}']
- Return type:
pd.DataFrame
- Raises:
TypeError – Provided kmers is not a str or pd.DataFrame.
TableFormatError – Provided kmers or out are not formatted correctly for use.
ValueError – Provided method is not an available choice.
FileNotFoundError – kmers type must be a pd.DataFrame or filepath.
- autometa.common.kmers.init_kmers(kmer_size: int = 5) -> Dict[str, int]
Initialize k-mers from kmer_size. Respective reverse complements will be removed.
- Parameters:
kmer_size (int, optional) – Pattern size of k-mer to initialize dict (the default is 5).
- Returns:
{kmer:index, ...}
- Return type:
dict
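One way to build such a dictionary is to enumerate every k-mer and skip any whose reverse complement has already been assigned an index. The sketch below is illustrative only: the names `reverse_complement` and `canonical_kmers` are hypothetical, and the A, T, C, G enumeration order is an assumption inferred from the example column order given for count().

```python
from itertools import product
from typing import Dict

# Translation table for complementing nucleotides (A<->T, C<->G).
COMPLEMENT = str.maketrans("ATCG", "TAGC")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of an uppercase nucleotide k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical_kmers(kmer_size: int = 5) -> Dict[str, int]:
    """Map k-mers to column indices, keeping one k-mer per reverse-complement pair."""
    kmers: Dict[str, int] = {}
    for letters in product("ATCG", repeat=kmer_size):
        kmer = "".join(letters)
        # Skip this k-mer when its reverse complement already has an index.
        if reverse_complement(kmer) in kmers:
            continue
        kmers[kmer] = len(kmers)
    return kmers


five_mers = canonical_kmers(5)
```

For odd k no k-mer equals its own reverse complement, so exactly half of the 4^k possibilities remain (512 for 5-mers).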
- autometa.common.kmers.load(kmers_fpath: str) -> pandas.DataFrame
Load in a previously counted k-mer frequencies table.
- Parameters:
kmers_fpath (str) – Path to k-mer frequency table
- Returns:
index='contig', cols=[kmer, kmer, ...]
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – kmers_fpath does not exist or is empty
TableFormatError – kmers_fpath file format is invalid
- autometa.common.kmers.main()
- autometa.common.kmers.mp_counter(assembly: str, ref_kmers: Dict[str, int], cpus: int = 2) -> List
Multiprocessing k-mer counter used in count. (Should not be used directly).
- Parameters:
assembly (str) – </path/to/assembly.fasta> (nucleotides)
ref_kmers (dict) – {kmer:index, ...}
cpus (int, optional) – Number of cpus to use (the default will use all available).
- Returns:
[{record:counts}, {record:counts}, ...]
- Return type:
list
- autometa.common.kmers.normalize(df: pandas.DataFrame, method: str = 'am_clr', out: Optional[str] = None, force: bool = False) -> pandas.DataFrame
Normalize raw k-mer counts by centered or isometric log-ratio transform.
- Parameters:
df (pd.DataFrame) – k-mer counts dataframe, e.g. for 3-mers: index='contig', columns=[AAA, AAT, ...]
method (str, optional) – Normalize k-mer counts using CLR or ILR transformation (the default is Autometa's CLR implementation). choices = ['ilr', 'clr', 'am_clr']. Other transformations come from the skbio.stats.composition module.
out (str, optional) – Path to write normalized k-mers.
force (bool, optional) – Whether to overwrite existing out file path, by default False.
- Returns:
Normalized counts using provided method.
- Return type:
pd.DataFrame
- Raises:
ValueError – Provided method is not available.
- autometa.common.kmers.record_counter(args: Tuple[Bio.SeqIO.SeqRecord, Dict[str, int]]) -> Dict[str, List[int]]
Single-record counter used when multiprocessing.
- Parameters:
args (2-tuple) – (record, ref_kmers) - record : SeqIO.SeqRecord - ref_kmers : {kmer:index, ...}
- Returns:
{contig:[count,count,...]} count index is respective to ref_kmers.keys()
- Return type:
dict
- autometa.common.kmers.seq_counter(assembly: str, ref_kmers: Dict[str, int], verbose: bool = True) -> Dict[str, List[int]]
Sequentially count k-mer frequencies.
- Parameters:
assembly (str) – </path/to/assembly.fasta> (nucleotides)
ref_kmers (dict) – {kmer:index, ...}
verbose (bool, optional) – Enable progress bar (the default is True).
- Returns:
{contig:[count,count,...]} count index is respective to ref_kmers.keys()
- Return type:
dict
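The {contig:[count, ...]} structure returned by the counters can be sketched with a sliding window over one sequence. This is a simplified illustration, not the library's implementation: the function name `count_kmers` is hypothetical, plain strings stand in for SeqRecord objects, and windows containing ambiguous bases are assumed to be skipped.

```python
from typing import Dict, List

# Translation table for complementing nucleotides (A<->T, C<->G).
COMPLEMENT = str.maketrans("ATCG", "TAGC")


def count_kmers(sequence: str, ref_kmers: Dict[str, int]) -> List[int]:
    """Count k-mers in one sequence; list positions follow the ref_kmers indices."""
    kmer_size = len(next(iter(ref_kmers)))
    counts = [0] * len(ref_kmers)
    sequence = sequence.upper()
    for start in range(len(sequence) - kmer_size + 1):
        kmer = sequence[start : start + kmer_size]
        index = ref_kmers.get(kmer)
        if index is None:
            # The reverse complement shares a column with its canonical k-mer.
            index = ref_kmers.get(kmer.translate(COMPLEMENT)[::-1])
        if index is not None:  # assumption: windows with ambiguous bases are skipped
            counts[index] += 1
    return counts


# Canonical 2-mers written out by hand, purely for illustration.
ref_kmers = {"AA": 0, "AT": 1, "AC": 2, "AG": 3, "TA": 4,
             "TC": 5, "TG": 6, "CC": 7, "CG": 8, "GC": 9}
counts = {"contig_1": count_kmers("AATTGC", ref_kmers)}
```

In this example the window TT is absent from ref_kmers, so it is counted under its reverse complement AA.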
autometa.common.markers module
Autometa Marker class consisting of various methods to annotate sequences with marker sets depending on sequence set taxonomy
- autometa.common.markers.get(kingdom: str, orfs: str, hmmdb: Optional[str] = None, cutoffs: Optional[str] = None, dbdir: str = './autometa/databases/markers', scans: Optional[str] = None, out: Optional[str] = None, force: bool = False, cpus: int = 8, parallel: bool = True, gnu_parallel: bool = False, seed: int = 42) -> pandas.DataFrame
Retrieve contigs' markers from markers database that pass cutoffs filter.
- Parameters:
kingdom (str) – Kingdom to annotate markers. choices = ['bacteria', 'archaea']
orfs (str) – Path to amino-acid ORFs file
dbdir (str, optional) – Directory containing hmmdb and cutoffs files
hmmdb (str, optional) – Path to marker genes database file, previously hmmpressed.
cutoffs (str, optional) – Path to marker genes cutoff tsv.
scans (str, optional) – Path to existing hmmscan table to filter by cutoffs
out (str, optional) – Path to write annotated markers table.
force (bool, optional) – Whether to overwrite existing out file path, by default False.
cpus (int, optional) – Number of cores to use if running in parallel, by default all available.
parallel (bool, optional) – Whether to run hmmscan using its parallel option, by default True.
gnu_parallel (bool, optional) – Whether to run hmmscan using gnu parallel, by default False.
seed (int, optional) – Seed to pass into hmmscan for determinism, by default 42.
- Returns:
wide - pd.DataFrame(index_col=contig, columns=[PFAM, ...])
long - pd.DataFrame(index_col=contig, columns=['sacc', 'count'])
list - {contig:[pfam,pfam,...],contig:[...],...}
counts - {contig:count, contig:count,...}
- Return type:
pd.DataFrame or dict
- Raises:
ValueError – Why the exception is raised.
- autometa.common.markers.load(fpath, format='wide')
Read markers table into specified format.
- Parameters:
fpath (str) – </path/to/kingdom.markers.tsv>
format (str, optional) –
wide - index=contig, cols=[domain sacc,..] (default)
long - index=contig, cols=['sacc', 'count']
list - {contig:[sacc,...],...}
counts - {contig:len([sacc,...]), ...}
- Returns:
wide - index=contig, cols=[domain sacc,..] (default)
long - index=contig, cols=['sacc', 'count']
list - {contig:[sacc,...],...}
counts - {contig:len([sacc,...]), ...}
- Return type:
pd.DataFrame or dict
- Raises:
FileNotFoundError – Provided fpath does not exist
ValueError – Provided format is not in choices: choices = wide, long, list or counts
- autometa.common.markers.main()
autometa.common.metagenome module
Script containing Metagenome class for general handling of metagenome assembly
- class autometa.common.metagenome.Metagenome(assembly)
Bases:
object
Autometa Metagenome Class.
- Parameters:
assembly (str) – </path/to/metagenome/assembly.fasta>
- sequences
[seq, ...]
- Type:
list
- seqrecords
[SeqRecord, ...]
- Type:
list
- nseqs
Number of sequences in assembly.
- Type:
int
- length_weighted_gc
Length weighted average GC% of assembly.
- Type:
float
- size
Total assembly size in bp.
- Type:
int
- largest_seq
id of longest sequence in assembly
- Type:
str
- self.fragmentation_metric()
- self.describe()
- self.length_filter()
- describe() -> pandas.DataFrame
Return dataframe of details.
Columns
- assembly : Assembly input into Metagenome(...) [index column]
- nseqs : Number of sequences in assembly
- size : Size or total sum of all sequence lengths
- N50 :
- N10 :
- N90 :
- length_weighted_gc_content : Length weighted average GC content
- largest_seq : Largest sequence in assembly
- Return type:
pd.DataFrame
- fragmentation_metric(quality_measure: float = 0.5) -> int
Describes the quality of assembled genomes that are fragmented in contigs of different length.
Note
For more information see this metagenomics wiki from Matthias Scholz
- Parameters:
quality_measure (0 < float < 1) – Fraction of the total assembly size to cover (the default is 0.5). I.e. the default measure is N50, but 0.1 could be used for N10 or 0.9 for N90.
- Returns:
Minimum contig length to cover quality_measure of genome (i.e. length weighted median)
- Return type:
int
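The usual way to compute this kind of metric is to sort contig lengths from longest to shortest and accumulate them until the requested fraction of the assembly is covered. The sketch below follows that standard N50-style definition on a plain list of lengths. It is a standalone function rather than the Metagenome method, and the example lengths are made up.

```python
from typing import Iterable


def fragmentation_metric(lengths: Iterable[int], quality_measure: float = 0.5) -> int:
    """Return the contig length at which the running total reaches the target fraction."""
    if not 0 < quality_measure < 1:
        raise ValueError("quality_measure must be between 0 and 1 (exclusive)")
    ordered = sorted(lengths, reverse=True)
    target = quality_measure * sum(ordered)
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= target:
            return length
    raise ValueError("no contig lengths were provided")


contig_lengths = [80, 70, 50, 40, 30, 20]  # illustrative values, total = 290
n50 = fragmentation_metric(contig_lengths)
n10 = fragmentation_metric(contig_lengths, quality_measure=0.1)
n90 = fragmentation_metric(contig_lengths, quality_measure=0.9)
```

With these lengths, half of the assembly (145 bp) is reached after the two longest contigs, so the N50 is 70.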
- gc_content() -> pandas.DataFrame
Retrieves GC content from sequences in assembly
- Returns:
index='contig', columns=['gc_content', 'length']
- Return type:
pd.DataFrame
- property largest_seq: str
Retrieve the name of the largest sequence in the provided assembly.
- Returns:
record ID of the largest sequence in assembly.
- Return type:
str
- length_filter(out: str, cutoff: int = 3000, force: bool = False)
Filters sequences by length with provided cutoff.
Note
A WARNING will be emitted and the original metagenome will be returned if no contigs pass the length filter cutoff.
- Parameters:
out (str) – Path to write length filtered output fasta file.
cutoff (int, optional) – Sequences with lengths greater than or equal to cutoff will be retained (the default is 3000).
force (bool, optional) – Overwrite existing out file (the default is False).
- Returns:
Autometa Metagenome object with only assembly sequences at or above the cutoff threshold.
- Return type:
Metagenome
- Raises:
TypeError – cutoff value must be a float or integer
ValueError – cutoff value must be a positive real number
FileExistsError – filepath consisting of sequences that passed filter already exists
- property length_weighted_gc: float
Retrieve the length weighted average GC percentage of provided assembly.
- Returns:
GC percentage weighted by contig length.
- Return type:
float
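A length-weighted average gives longer contigs proportionally more influence on the assembly-wide GC percentage. A minimal sketch of the calculation, using plain strings instead of SeqRecord objects and a hypothetical standalone function:

```python
from typing import Iterable


def length_weighted_gc(sequences: Iterable[str]) -> float:
    """Return the GC percentage of an assembly, weighting each contig by its length."""
    weighted_sum = 0.0
    total_length = 0
    for sequence in sequences:
        sequence = sequence.upper()
        if not sequence:
            continue  # empty sequences carry no weight
        gc_percent = 100 * (sequence.count("G") + sequence.count("C")) / len(sequence)
        weighted_sum += gc_percent * len(sequence)
        total_length += len(sequence)
    if total_length == 0:
        raise ValueError("no sequence data was provided")
    return weighted_sum / total_length


# A 4 bp contig at 100% GC and an 8 bp contig at 0% GC.
assembly_gc = length_weighted_gc(["GGCC", "ATATATAT"])
```

Here the unweighted mean of the two contigs would be 50%, whereas the length-weighted value is one third of 100% because the AT-only contig is twice as long.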
- property nseqs: int
Retrieve the number of sequences in provided assembly.
- Returns:
Number of sequences parsed from assembly
- Return type:
int
- property seqrecords: list
Retrieve SeqRecord objects from provided assembly.
- Returns:
[SeqRecord, SeqRecord, ...]
- Return type:
list
- property sequences: list
Retrieve the sequences from provided assembly.
- Returns:
[seq, seq, ...]
- Return type:
list
- property size: int
Retrieve the summation of sizes for each contig in the provided assembly.
- Returns:
Total summation of contig sizes in assembly
- Return type:
int
- autometa.common.metagenome.main()
autometa.common.utilities module
File containing common utilities functions to be used by Autometa scripts.
- autometa.common.utilities.calc_checksum(fpath: str) -> str
Retrieve md5 checksum from provided fpath.
- Parameters:
fpath (str) – </path/to/file>
- Returns:
space-delimited hexdigest of fpath using md5sum and basename of fpath, e.g. 'hash filename\n'
- Return type:
str
- Raises:
FileNotFoundError – Provided fpath does not exist
TypeError – fpath is not a string
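A checksum line of this shape can be produced with hashlib. The sketch below is an approximation of the documented behaviour, not the library's code: the function name `md5_checksum_line` and the `block_size` parameter are hypothetical, and a single space is used as the delimiter, following the description above.

```python
import hashlib
import os


def md5_checksum_line(fpath: str, block_size: int = 65536) -> str:
    """Return the md5 hexdigest and basename of fpath as one space-delimited line."""
    if not isinstance(fpath, str):
        raise TypeError(f"fpath must be a string, got {type(fpath).__name__}")
    if not os.path.isfile(fpath):
        raise FileNotFoundError(fpath)
    digest = hashlib.md5()
    with open(fpath, "rb") as handle:
        # Hash the file block by block so large files are never fully loaded into memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return f"{digest.hexdigest()} {os.path.basename(fpath)}\n"
```

Reading in fixed-size blocks keeps memory use constant regardless of file size, which matters for multi-gigabyte database files.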
- autometa.common.utilities.file_length(fpath: str, approximate: bool = False) -> int
Retrieve the number of lines in fpath
See: https://stackoverflow.com/q/845058/13118765
- Parameters:
fpath (str) – Path to the file whose lines will be counted.
approximate (bool) – If True, will approximate the length of the file from the file size.
- Returns:
Number of lines in fpath
- Return type:
int
- Raises:
FileNotFoundError – provided fpath does not exist
- autometa.common.utilities.gunzip(infpath: str, outfpath: str, delete_original: bool = False, block_size: int = 65536) -> str
Decompress gzipped infpath to outfpath and write checksum of outfpath upon successful decompression.
- Parameters:
infpath (str) – </path/to/file.gz>
outfpath (str) – </path/to/file>
delete_original (bool) – Will delete the original file after successfully decompressing infpath (the default is False).
block_size (int) – Amount of infpath to read into memory before writing to outfpath (the default is 65536 bytes).
- Returns:
</path/to/file>
- Return type:
str
- Raises:
FileExistsError – outfpath already exists and is not empty
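The role of block_size can be seen in a stripped-down decompression loop. This sketch covers only the block-wise copy and the FileExistsError check. It does not write a checksum or delete the original file, and the name `gunzip_file` is hypothetical.

```python
import gzip
import os


def gunzip_file(infpath: str, outfpath: str, block_size: int = 65536) -> str:
    """Decompress gzipped infpath to outfpath, reading block_size bytes at a time."""
    if os.path.exists(outfpath) and os.path.getsize(outfpath) > 0:
        raise FileExistsError(outfpath)
    with gzip.open(infpath, "rb") as compressed, open(outfpath, "wb") as out:
        # read() returns b"" at end of file, which stops the iterator.
        for block in iter(lambda: compressed.read(block_size), b""):
            out.write(block)
    return outfpath
```

A larger block_size means fewer read and write calls at the cost of more memory held per iteration.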
- autometa.common.utilities.internet_is_connected(host: str = '8.8.8.8', port: int = 53, timeout: int = 2) -> bool
- autometa.common.utilities.is_gz_file(filepath) -> bool
Check if the given file is gzip compressed or not.
- Parameters:
filepath (str) – Filepath to check
- Returns:
True if file is gzipped else False
- Return type:
bool
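A common way to perform this kind of check is to compare the first two bytes of the file against the gzip magic number (0x1f 0x8b) instead of trusting the file extension. Whether Autometa uses exactly this approach is an assumption, and the name `looks_gzipped` is hypothetical.

```python
GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip member


def looks_gzipped(filepath: str) -> bool:
    """Return True when the file starts with the gzip magic number."""
    with open(filepath, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

Because only two bytes are read, the check is cheap even for very large files. It also returns False for empty files.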
- autometa.common.utilities.make_pickle(obj: Any, outfpath: str) -> str
Serialize a python object (obj) to outfpath. Note: Opposite of unpickle()
- Parameters:
obj (any) – Python object to serialize to outfpath.
outfpath (str) – </path/to/pickled/file>.
- Returns:
</path/to/pickled/file.pkl>
- Return type:
str
- Raises:
ExceptionName – Why the exception is raised.
- autometa.common.utilities.ncbi_is_connected(filepath: str = 'rsync://ftp.ncbi.nlm.nih.gov/genbank/GB_Release_Number') -> bool
Check if ncbi databases are reachable. This can be used instead of a check for internet connection.
- Parameters:
filepath (str) – Filepath on NCBI's rsync server. Default is rsync://ftp.ncbi.nlm.nih.gov/genbank/GB_Release_Number, which should be a very small file that is unlikely to move. This may need to be updated if NCBI changes their file organization.
- Returns:
True if the rsync server can be contacted without an error, False if the rsync process returns any error.
- Return type:
bool
- autometa.common.utilities.read_checksum(fpath: str) -> str
Read checksum from provided checksum formatted fpath.
Note: See write_checksum for how a checksum file is generated.
- Parameters:
fpath (str) – </path/to/file.md5>
- Returns:
checksum retrieved from fpath.
- Return type:
str
- Raises:
TypeError – Provided fpath was not a string.
FileNotFoundError – Provided fpath does not exist.
- autometa.common.utilities.tarchive_results(outfpath: str, src_dirpath: str) -> str
Generate a tar archive of Autometa Results
See: https://stackoverflow.com/questions/2032403/how-to-create-full-compressed-tar-file-using-python
- Parameters:
outfpath (str) – </path/to/output/tar/archive.tar.gz || </path/to/output/tar/archive.tgz
src_dirpath (str) – </paths/to/directory/to/archive>
- Returns:
</path/to/output/tar/archive.tar.gz || </path/to/output/tar/archive.tgz
- Return type:
str
- Raises:
FileExistsError – outfpath already exists
- autometa.common.utilities.timeit(func: function) -> function
Time function run time (to be used as a decorator), i.e. when defining a function, use Python's decorator syntax.
Example
@timeit
def your_function(args):
    ...
Notes
See: https://docs.python.org/2/library/functools.html#functools.wraps
- Parameters:
func (function) – Function to decorate with the timer
- Returns:
timer decorated func.
- Return type:
function
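A timing decorator built on functools.wraps, as referenced in the notes, might look like the following. This is a generic sketch, not Autometa's implementation. In particular, reporting the elapsed time with print is an assumption made for brevity, and the decorated `add` function is purely illustrative.

```python
import functools
import time


def timeit(func):
    """Report how long each call to func takes."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether func returned normally or raised.
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} took {elapsed:.4f} seconds")

    return wrapper


@timeit
def add(a, b):
    """Add two numbers."""
    return a + b


result = add(1, 2)
```

Without functools.wraps, the decorated function would report its name as wrapper, which makes log messages and tracebacks harder to read.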
- autometa.common.utilities.unpickle(fpath: str) -> Any
Load a serialized fpath from make_pickle().
- Parameters:
fpath (str) – </path/to/file.pkl>.
- Returns:
Python object that was serialized to file via make_pickle()
- Return type:
any
- Raises:
ExceptionName – Why the exception is raised.
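The relationship between make_pickle() and unpickle() is a plain serialization round trip. The sketch below shows that pattern with the standard pickle module. The helper names `save_object` and `load_object` are hypothetical stand-ins, and any extra checks the real functions perform are not reproduced here.

```python
import pickle
from typing import Any


def save_object(obj: Any, outfpath: str) -> str:
    """Serialize obj to outfpath and return the path that was written."""
    with open(outfpath, "wb") as handle:
        pickle.dump(obj, handle)
    return outfpath


def load_object(fpath: str) -> Any:
    """Load an object previously written by save_object."""
    with open(fpath, "rb") as handle:
        return pickle.load(handle)
```

Loading a pickle can execute arbitrary code, so only files from a trusted source should be unpickled.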
- autometa.common.utilities.untar(tarchive: str, outdir: str, member: Optional[str] = None) -> str
Decompress a tar archive (may be gzipped or bzipped). Passing in member requires that an outdir also be provided.
See: https://docs.python.org/3.8/library/tarfile.html#module-tarfile
- Parameters:
tarchive (str) – </path/tarchive.tar.[compression]>
outdir (str) – </path/to/output/directory>
member (str, optional) – Member file to extract.
- Returns:
</path/to/extracted/member/file> if member else </path/to/output/directory>
- Return type:
str
- Raises:
IsADirectoryError – outdir already exists
ValueError – tarchive is not a tar archive
KeyError – member was not found in tarchive
- autometa.common.utilities.write_checksum(infpath: str, outfpath: str) -> str
Calculate checksum for infpath and write to outfpath.
- Parameters:
infpath (str) – </path/to/input/file>
outfpath (str) – </path/to/output/checksum/file>
- Returns:
Description of returned object.
- Return type:
NoneType
- Raises:
FileNotFoundError – Provided infpath does not exist
TypeError – infpath or outfpath is not a string