autometa.common.external package
Submodules
autometa.common.external.bedtools module
License: GNU Affero General Public License v3 or later. A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.
Script containing wrapper functions for bedtools.
- autometa.common.external.bedtools.genomecov(ibam: str, out: str, force: bool = False) → str
Run bedtools genomecov with input ibam and lengths to retrieve metagenome coverages.
- Parameters:
ibam (str) – </path/to/indexed/BAM/file.ibam>. Note: BAM must be sorted by position.
out (str) – </path/to/alignment.bed>. The bedtools genomecov output is a tab-delimited file with the following columns:
1. Chromosome
2. Depth of coverage
3. Number of bases on chromosome with that coverage
4. Size of chromosome
5. Fraction of bases on that chromosome with that coverage
See also: http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html
force (bool) – force overwrite of out if it already exists (default is False).
- Returns:
</path/to/alignment.bed>
- Return type:
str
- Raises:
FileExistsError – out file already exists and force is False
OSError – Why the exception is raised.
- autometa.common.external.bedtools.main()
- autometa.common.external.bedtools.parse(bed: str, out: Optional[str] = None, force: bool = False) → pandas.DataFrame
Calculate coverages from bed file.
- Parameters:
bed (str) – </path/to/file.bed>
out (str) – if provided, will write to out, i.e. </path/to/coverage.tsv>
force (bool) – force overwrite of out if it already exists (default is False).
- Returns:
index='contig', col='coverage'
- Return type:
pd.DataFrame
- Raises:
ValueError – out incorrectly formatted to be read as pandas DataFrame.
FileNotFoundError – bed does not exist
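The five histogram columns listed for genomecov can be reduced to one coverage value per contig. The sketch below is a minimal, standard-library illustration and not the module's implementation: it assumes that coverage means the length-weighted mean depth (the sum of depth × bases divided by the contig size) and that the genome-wide summary rows, which bedtools labels genome, should be skipped. The function name mean_coverages is hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def mean_coverages(handle: TextIO) -> Dict[str, float]:
    """Return {contig: mean depth} from a bedtools genomecov histogram."""
    depth_products = defaultdict(float)  # sum of depth * bases per contig
    sizes = {}  # contig -> contig size (column 4)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        contig, depth, bases, size = row[0], int(row[1]), int(row[2]), int(row[3])
        if contig == "genome":
            continue  # genome-wide summary rows, not a contig
        depth_products[contig] += depth * bases
        sizes[contig] = size
    return {contig: depth_products[contig] / sizes[contig] for contig in sizes}


# Tiny in-memory histogram: 10 bases at depth 0 and 90 bases at depth 2.
example = (
    "contig_1\t0\t10\t100\t0.1\n"
    "contig_1\t2\t90\t100\t0.9\n"
    "genome\t0\t10\t100\t0.1\n"
    "genome\t2\t90\t100\t0.9\n"
)
coverages = mean_coverages(io.StringIO(example))
```

Whether zero-depth positions should count toward the denominator is a design choice; dividing by the number of covered bases instead would give a higher value for partially covered contigs.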
autometa.common.external.bowtie module
Script containing wrapper functions for bowtie2.
- autometa.common.external.bowtie.align(db: str, sam: str, fwd_reads: Optional[List[str]] = None, rev_reads: Optional[List[str]] = None, se_reads: Optional[List[str]] = None, cpus: int = 0, **kwargs) → str
Align reads to bowtie2-index db (at least one *_reads argument is required).
- Parameters:
db (str) – </path/to/prefix/bowtie2/database>, i.e. db.{#}.bt2
sam (str) – </path/to/out.sam>
fwd_reads (list, optional) – [</path/to/forward_reads.fastq>, …]
rev_reads (list, optional) – [</path/to/reverse_reads.fastq>, …]
se_reads (list, optional) – [</path/to/single_end_reads.fastq>, …]
cpus (int, optional) – Num. processors to use (the default is 0).
**kwargs (dict, optional) – Additional optional args to supply to bowtie2. Must be in the format key=flag, value=flag-value.
- Returns:
</path/to/out.sam>
- Return type:
str
- Raises:
ChildProcessError – bowtie2 failed
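The key=flag, value=flag-value convention for **kwargs can be illustrated with a small sketch. This is only an assumption about how such arguments might be flattened into a command line; kwargs_to_flags is a hypothetical helper, and treating None or an empty string as "flag without a value" is a choice made for this example rather than documented behavior. Flags that start with a dash are not valid Python identifiers, so they are passed by unpacking a dict.

```python
import shlex
from typing import List


def kwargs_to_flags(**kwargs) -> List[str]:
    """Flatten {flag: value} pairs into a list of command-line tokens."""
    tokens: List[str] = []
    for flag, value in kwargs.items():
        tokens.append(flag)
        # Assumption: None or "" marks a switch that takes no value.
        if value not in (None, ""):
            tokens.append(str(value))
    return tokens


extra = kwargs_to_flags(**{"--seed": 42, "--very-sensitive": None})
# Placeholder paths; nothing is executed here.
cmd = ["bowtie2", "-x", "db", "-S", "out.sam", "-U", "reads.fastq", *extra]
command_line = shlex.join(cmd)
```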
- autometa.common.external.bowtie.build(assembly: str, out: str) → str
Build bowtie2 index.
- Parameters:
assembly (str) – </path/to/assembly.fasta>
out (str) – </path/to/output/database>. Note: Indices written will resemble </path/to/output/database.{#}.bt2>
- Returns:
</path/to/output/database>
- Return type:
str
- Raises:
ChildProcessError – bowtie2-build failed
- autometa.common.external.bowtie.main()
- autometa.common.external.bowtie.run(cmd: str) → bool
Run cmd via subprocess.
- Parameters:
cmd (str) – Executable input str
- Returns:
True if subprocess.call returns a zero return code, else False
- Return type:
bool
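The return convention described for run (True only when the subprocess exits cleanly) can be sketched as follows. This is an illustrative stand-in and not the module's code: run_command is a hypothetical name, and splitting the string with shlex instead of invoking a shell is an assumption made for this example.

```python
import shlex
import subprocess
import sys


def run_command(cmd: str) -> bool:
    """Run cmd and report whether it exited with return code 0."""
    returncode = subprocess.call(shlex.split(cmd))
    return returncode == 0


# Use the current Python interpreter as a portable stand-in for an external tool.
python = shlex.quote(sys.executable)
succeeded = run_command(f'{python} -c "raise SystemExit(0)"')
failed = run_command(f'{python} -c "raise SystemExit(3)"')
```

Callers that need the error output as well as the status would typically use subprocess.run with check=True and handle subprocess.CalledProcessError.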
autometa.common.external.diamond module
Class and functions related to running diamond on metagenome sequences
- autometa.common.external.diamond.blast(fasta: str, database: str, outfpath: str, blast_type: str = 'blastp', evalue: float = 1e-05, maxtargetseqs: int = 200, cpus: int = 2, tmpdir: Optional[str] = None, force: bool = False, verbose: bool = False) → str
Performs diamond blastp search using query sequence against diamond formatted database
- Parameters:
fasta (str) – Path to fasta file having the query sequences. Should be amino acid sequences in case of BLASTP and nucleotide sequences in case of BLASTX
database (str) – Path to diamond formatted database
outfpath (str) – Path to output file
blast_type (str, optional) – blastp to align protein query sequences against a protein reference database, blastx to align translated DNA query sequences against a protein reference database, by default 'blastp'
evalue (float, optional) – cutoff e-value to count hit as significant, by default 1e-5
maxtargetseqs (int, optional) – max number of target sequences to retrieve per query by diamond, by default 200
cpus (int, optional) – Number of processors to be used, by default uses all the processors of the system
tmpdir (str, optional) – Path to temporary directory. By default, same as the output directory
force (bool, optional) – overwrite existing diamond results, by default False
verbose (bool, optional) – log progress to terminal, by default False
- Returns:
Path to BLAST results
- Return type:
str
- Raises:
FileNotFoundError – fasta file does not exist
ValueError – provided blast_type is not 'blastp' or 'blastx'
subprocess.CalledProcessError – Failed to run blast
- autometa.common.external.diamond.makedatabase(fasta: str, database: str, cpus: int = 2) → str
Creates a database against which the query sequence would be blasted
- Parameters:
fasta (str) – Path to fasta file whose database needs to be made, e.g. '<path/to/fasta/file>'
database (str) – Path to the output diamond formatted database file, e.g. '<path/to/database/file>'
cpus (int, optional) – Number of processors to be used. By default uses all the processors of the system
- Returns:
Path to diamond formatted database
- Return type:
str
- Raises:
subprocess.CalledProcessError – Failed to create diamond formatted database
- autometa.common.external.diamond.parse(results: str, bitscore_filter: float = 0.9, verbose: bool = False) → Dict[str, Set[str]]
Retrieve diamond results from output table
- Parameters:
results (str) – Path to BLASTP output file in outfmt6
bitscore_filter (0 < float <= 1, optional) – Bitscore filter applied to each sseqid, by default 0.9. Used to determine whether the bitscore is above a threshold value. For example, if it is 0.9 then only bitscores >= 0.9 * the top bitscore are accepted
verbose (bool, optional) – log progress to terminal, by default False
- Returns:
{qseqid: {sseqid, sseqid, …}, …}
- Return type:
dict
- Raises:
FileNotFoundError – diamond results table does not exist
ValueError – bitscore_filter value is not a float or not in range of 0 to 1
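The bitscore_filter rule (keep hits whose bitscore is at least bitscore_filter times the top bitscore) can be sketched with the standard library. This is a hedged illustration, not the module's implementation: filter_hits is a hypothetical name, the top bitscore is assumed to be the maximum per qseqid, and the input is assumed to follow the standard 12-column tabular layout in which qseqid, sseqid and bitscore are columns 1, 2 and 12.

```python
import io
from collections import defaultdict
from typing import Dict, List, Set, TextIO, Tuple


def filter_hits(handle: TextIO, bitscore_filter: float = 0.9) -> Dict[str, Set[str]]:
    """Return {qseqid: {sseqid, ...}} for hits near each query's top bitscore."""
    if not isinstance(bitscore_filter, float) or not 0 < bitscore_filter <= 1:
        raise ValueError(f"bitscore_filter must be a float in (0, 1]: {bitscore_filter!r}")
    hits: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        qseqid, sseqid, bitscore = fields[0], fields[1], float(fields[11])
        hits[qseqid].append((sseqid, bitscore))
    kept: Dict[str, Set[str]] = {}
    for qseqid, scored in hits.items():
        top = max(score for _, score in scored)
        kept[qseqid] = {sseqid for sseqid, score in scored if score >= bitscore_filter * top}
    return kept


# Three made-up hits for one query; only the bitscore column matters here.
rows = [
    ("orf_1", "subject_A", "200.0"),
    ("orf_1", "subject_B", "185.0"),
    ("orf_1", "subject_C", "100.0"),
]
table = "".join(f"{q}\t{s}\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t{b}\n" for q, s, b in rows)
filtered = filter_hits(io.StringIO(table), bitscore_filter=0.9)
```

With a top bitscore of 200 and a filter of 0.9, the threshold is 180, so subject_A and subject_B are kept and subject_C is dropped.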
autometa.common.external.hmmscan module
Functions related to running hmmer on metagenome sequences
- autometa.common.external.hmmscan.annotate_parallel(orfs, hmmdb, outfpath, cpus, seed=42)
- autometa.common.external.hmmscan.annotate_sequential(orfs, hmmdb, outfpath, cpus, seed=42)
- autometa.common.external.hmmscan.filter_tblout_markers(infpath: str, cutoffs: str, outfpath: Optional[str] = None, orfs: Optional[str] = None, force: bool = False) → pandas.DataFrame
Filter markers from hmmscan tblout output table using provided cutoff values file.
- Parameters:
infpath (str) – Path to hmmscan tblout output file
cutoffs (str) – Path to marker set inclusion cutoffs
outfpath (str, optional) – Path to write filtered markers to tab-delimited file
orfs (str, optional) – Path to Prodigal-called ORFs, </path/to/prodigal/called/orfs.fasta>. By default, will attempt to translate recovered qseqids to contigs.
force (bool, optional) – Overwrite existing outfpath (the default is False).
- Returns:
</path/to/output.markers.tsv>
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – infpath or cutoffs not found
FileExistsError – outfpath already exists and force=False
AssertionError – No returned markers pass the cutoff thresholds, i.e. final df is empty.
- autometa.common.external.hmmscan.hmmpress(fpath)
Runs hmmpress on fpath.
- Parameters:
fpath (str) – </path/to/kingdom.markers.hmm>
- Returns:
</path/to/hmmpressed/kingdom.markers.hmm>
- Return type:
str
- Raises:
FileNotFoundError – fpath not found.
subprocess.CalledProcessError – hmmpress failed
- autometa.common.external.hmmscan.main()
- autometa.common.external.hmmscan.read_domtblout(fpath: str) → pandas.DataFrame
Read hmmscan domtblout-format results into pandas DataFrame
For more detailed column descriptions see the "tabular output formats" section in the [HMMER manual](http://eddylab.org/software/hmmer/Userguide.pdf#tabular-output-formats "HMMER Manual")
- Parameters:
fpath (str) – Path to hmmscan domtblout file
- Returns:
index=range(0,n_hits) cols=…
- Return type:
pd.DataFrame
- autometa.common.external.hmmscan.read_tblout(infpath: str) → pandas.DataFrame
Read hmmscan tblout-format results into pd.DataFrame
For more detailed column descriptions see the "tabular output formats" section in the [HMMER manual](http://eddylab.org/software/hmmer/Userguide.pdf#tabular-output-formats "HMMER Manual")
- Parameters:
infpath (str) – Path to hmmscan_results.tblout
- Returns:
DataFrame of raw hmmscan results
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – Path to infpath was not found
- autometa.common.external.hmmscan.run(orfs, hmmdb, outfpath, cpus=0, force=False, parallel=True, gnu_parallel=False, seed=42) → str
Runs hmmscan on dataset ORFs and provided hmm database.
Note
Only one of parallel and gnu_parallel may be provided as True
- Parameters:
orfs (str) – </path/to/orfs.faa>
hmmdb (str) – </path/to/hmmpressed/database.hmm>
outfpath (str) – </path/to/output.hmmscan.tsv>
cpus (int, optional) – Num. cpus to use. 0 will run as many cpus as possible (the default is 0).
force (bool, optional) – Overwrite existing outfpath (the default is False).
parallel (bool, optional) – Will use multithreaded parallelization offered by hmmscan (the default is True).
gnu_parallel (bool, optional) – Will parallelize hmmscan using GNU parallel (the default is False).
seed (int, optional) – set RNG seed to <n> (if 0: one-time arbitrary seed) (the default is 42).
- Returns:
</path/to/output.hmmscan.tsv>
- Return type:
str
- Raises:
ValueError – Both parallel and gnu_parallel were provided as True
FileExistsError – outfpath already exists
subprocess.CalledProcessError – hmmscan failed
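The interplay of parallel, gnu_parallel, cpus and seed can be sketched as a command builder. This is an assumption-laden illustration rather than the module's code: build_hmmscan_command is a hypothetical name, the GNU parallel branch is not modeled, cpus=0 is resolved with os.cpu_count(), and writing results with --tblout is a guess about the output format. The hmmscan options --cpu, --seed and --tblout themselves are standard HMMER options.

```python
import os
from typing import List


def build_hmmscan_command(
    orfs: str,
    hmmdb: str,
    outfpath: str,
    cpus: int = 0,
    parallel: bool = True,
    gnu_parallel: bool = False,
    seed: int = 42,
) -> List[str]:
    """Validate the mode flags and return an hmmscan argument list."""
    if parallel and gnu_parallel:
        raise ValueError("Only one of parallel and gnu_parallel may be True")
    # Assumption: cpus=0 means "use every available CPU".
    n_cpus = cpus if cpus > 0 else (os.cpu_count() or 1)
    return [
        "hmmscan",
        "--cpu", str(n_cpus),
        "--seed", str(seed),
        "--tblout", outfpath,
        hmmdb,
        orfs,
    ]


# Placeholder paths; the command is only constructed, never executed.
command = build_hmmscan_command("orfs.faa", "markers.hmm", "out.hmmscan.tsv", cpus=4)
```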
autometa.common.external.hmmsearch module
Module to filter the domtbl file from hmmsearch --domtblout <filepath> using provided cutoffs
- autometa.common.external.hmmsearch.filter_domtblout(infpath: str, cutoffs: str, orfs: str, outfpath: Optional[str] = None) → pandas.DataFrame
- autometa.common.external.hmmsearch.main()
autometa.common.external.prodigal module
Functions to retrieve orfs from provided assembly using prodigal
- autometa.common.external.prodigal.aggregate_orfs(search_str: str, outfpath: str) → None
- autometa.common.external.prodigal.annotate_parallel(assembly: str, prots_out: str, nucls_out: str, cpus: int) → None
- autometa.common.external.prodigal.annotate_sequential(assembly: str, prots_out: str, nucls_out: str) → None
- autometa.common.external.prodigal.contigs_from_headers(fpath: str) → Mapping[str, str]
Get ORF id to contig id translations using prodigal assigned ID from description.
First determines if all of ID=3495691_2 from description is in header. '3495691_2' represents the 2nd gene in the 3,495,691st sequence.
Example
#: prodigal versions < 2.6 record
>>> record.id
'k119_1383959_3495691_2'
>>> record.description
'k119_1383959_3495691_2 # 688 # 1446 # 1 # ID=3495691_2;partial=01;start_type=ATG;rbs_motif=None;rbs_spacer=None'
>>> record.description.split('#')[-1].split(';')[0].strip()
'ID=3495691_2'
>>> orf_id = '3495691_2'
>>> record.id.replace(f'_{orf_id}', '')
'k119_1383959'
#: prodigal versions >= 2.6 record
>>> record.id
'k119_1383959_2'
>>> record.id.rsplit('_',1)[0]
'k119_1383959'
- Parameters:
fpath (str) – </path/to/prodigal/called/orfs.fasta>
- Returns:
contigs translated from prodigal ORF description. {orf_id:contig_id, …}
- Return type:
dict
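The two header styles in the example for contigs_from_headers can be handled by one small function. The sketch below is an illustration built only from that docstring example, not the module's implementation: contig_from_orf is a hypothetical name, and it strips the suffix with endswith instead of str.replace so that a matching substring elsewhere in the id is left untouched.

```python
def contig_from_orf(record_id: str, description: str) -> str:
    """Translate a Prodigal ORF id to its contig id using the description."""
    # The last '#'-delimited field starts with e.g. 'ID=3495691_2;partial=01;...'
    id_field = description.split("#")[-1].split(";")[0].strip()
    orf_id = id_field.replace("ID=", "", 1)
    suffix = f"_{orf_id}"
    if record_id.endswith(suffix):
        # Style where the full ID from the description is appended to the contig id.
        return record_id[: -len(suffix)]
    # Style where only the gene number is appended to the contig id.
    return record_id.rsplit("_", 1)[0]


attributes = "ID=3495691_2;partial=01;start_type=ATG;rbs_motif=None;rbs_spacer=None"
old_style = contig_from_orf(
    "k119_1383959_3495691_2",
    f"k119_1383959_3495691_2 # 688 # 1446 # 1 # {attributes}",
)
new_style = contig_from_orf(
    "k119_1383959_2",
    f"k119_1383959_2 # 688 # 1446 # 1 # {attributes}",
)
```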
- autometa.common.external.prodigal.main()
- autometa.common.external.prodigal.orf_records_from_contigs(contigs: Union[List, Set], fpath: str) → List[Bio.SeqIO.SeqRecord]
Retrieve list of ORF headers from contigs. Prodigal annotated ORFs are required as the input fpath.
- Parameters:
contigs (iterable) – iterable of contigs from which to retrieve ORFs
fpath (str) – </path/to/prodigal/called/orfs.fasta>
- Returns:
ORF SeqIO.SeqRecords from provided contigs, i.e. [SeqRecord, …]
- Return type:
list
- autometa.common.external.prodigal.run(assembly: str, nucls_out: str, prots_out: str, force: bool = False, cpus: int = 0) → Tuple[str, str]
Calls ORFs from provided input assembly
- Parameters:
assembly (str) – </path/to/assembly.fasta>
nucls_out (str) – </path/to/nucls.out>
prots_out (str) – </path/to/prots.out>
force (bool) – overwrite nucls_out and prots_out if they already exist (the default is False).
cpus (int) – Number of CPUs to use. The default (cpus=0) will use as many CPUs as possible.
- Returns:
(nucls_out, prots_out)
- Return type:
2-Tuple
- Raises:
FileExistsError – nucls_out or prots_out already exists
subprocess.CalledProcessError – prodigal failed
ChildProcessError – nucls_out or prots_out not written
IOError – nucls_out or prots_out incorrectly formatted
autometa.common.external.samtools module
Script containing wrapper functions for samtools
- autometa.common.external.samtools.main()
- autometa.common.external.samtools.sort(sam, bam, cpus=2)
Views then sorts sam file by leftmost coordinates and outputs to bam.
- Parameters:
sam (str) – </path/to/alignment.sam>
bam (str) – </path/to/output/alignment.bam>
cpus (int, optional) – Number of processors to be used. By default uses all the processors of the system
- Raises:
TypeError – cpus must be an integer greater than zero
FileNotFoundError – Specified path is incorrect or the file is empty
ExternalToolError – Samtools did not run successfully, returns subprocess traceback and command run
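The "views then sorts" behavior and the cpus check can be sketched as a command builder. This is a hypothetical illustration, not the wrapper's actual code: build_sort_command is an invented name, and the exact pipeline is an assumption (samtools view with -b for BAM output, piped into samtools sort reading from standard input, with -@ setting the thread count for both).

```python
import shlex


def build_sort_command(sam: str, bam: str, cpus: int = 2) -> str:
    """Return a shell pipeline that converts SAM to a coordinate-sorted BAM."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(cpus, bool) or not isinstance(cpus, int) or cpus <= 0:
        raise TypeError(f"cpus must be an integer greater than zero: {cpus!r}")
    view = f"samtools view -@ {cpus} -b {shlex.quote(sam)}"
    sort = f"samtools sort -@ {cpus} -o {shlex.quote(bam)} -"
    return f"{view} | {sort}"


# Placeholder paths; the pipeline is only constructed, never executed.
pipeline = build_sort_command("alignment.sam", "alignment.bam", cpus=4)
```

Coordinate order is the default for samtools sort, which matches the "leftmost coordinates" wording and the position-sorted BAM that genomecov expects.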