autometa.common.external package

Submodules

autometa.common.external.bedtools module

License: GNU Affero General Public License v3 or later. A copy of GNU AGPL v3 should have been included in this software package in LICENSE.txt.

Script containing wrapper functions for bedtools.

autometa.common.external.bedtools.genomecov(ibam: str, out: str, force: bool = False) str

Run bedtools genomecov on the input ibam to retrieve metagenome coverages.

Parameters:
  • ibam (str) – </path/to/indexed/BAM/file.ibam>. Note: BAM must be sorted by position.

  • out (str) – </path/to/alignment.bed>. The bedtools genomecov output is a tab-delimited file with the following columns:

      1. Chromosome
      2. Depth of coverage
      3. Number of bases on chromosome with that coverage
      4. Size of chromosome
      5. Fraction of bases on that chromosome with that coverage

    See also: http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html

  • force (bool) – force overwrite of out if it already exists (default is False).

Returns:

</path/to/alignment.bed>

Return type:

str

Raises:
  • FileExistsError – out file already exists and force is False

  • OSError – Why the exception is raised.

autometa.common.external.bedtools.main()
autometa.common.external.bedtools.parse(bed: str, out: Optional[str] = None, force: bool = False) pandas.DataFrame

Calculate coverages from bed file.

Parameters:
  • bed (str) – </path/to/file.bed>

  • out (str) – if provided will write to out. I.e. </path/to/coverage.tsv>

  • force (bool) – force overwrite of out if it already exists (default is False).

Returns:

DataFrame with index='contig' and column='coverage'

Return type:

pd.DataFrame

Raises:
  • ValueError – out incorrectly formatted to be read as pandas DataFrame.

  • FileNotFoundError – bed does not exist
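The five genomecov columns are enough to derive one coverage value per contig. The following is a minimal sketch, assuming coverage means the depth-weighted mean (the sum of depth × bases, divided by contig length). Whether parse computes exactly this is an assumption, and coverage_from_genomecov is a hypothetical helper, not part of autometa. It uses only the standard library and returns a dict rather than a pandas DataFrame.

```python
import csv
import io
from collections import defaultdict


def coverage_from_genomecov(handle):
    """Return {contig: mean depth} from bedtools genomecov histogram rows."""
    weighted = defaultdict(int)  # contig -> sum(depth * bases at that depth)
    lengths = {}  # contig -> contig length
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        contig, depth, bases, length = row[0], int(row[1]), int(row[2]), int(row[3])
        if contig == "genome":
            # bedtools appends a whole-genome summary; it is not a contig
            continue
        weighted[contig] += depth * bases
        lengths[contig] = length
    return {contig: weighted[contig] / lengths[contig] for contig in lengths}


# Illustrative rows only (not real data)
example = io.StringIO(
    "contig_1\t0\t20\t100\t0.2\n"
    "contig_1\t5\t80\t100\t0.8\n"
    "contig_2\t10\t50\t50\t1\n"
)
coverages = coverage_from_genomecov(example)
```

Here contig_1 has 80 of its 100 bases covered at depth 5, so its mean depth is 4.0.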

autometa.common.external.bowtie module

Script containing wrapper functions for bowtie2.

autometa.common.external.bowtie.align(db: str, sam: str, fwd_reads: Optional[List[str]] = None, rev_reads: Optional[List[str]] = None, se_reads: Optional[List[str]] = None, cpus: int = 0, **kwargs) str

Align reads to bowtie2-index db (at least one *_reads argument is required).

Parameters:
  • db (str) – </path/to/prefix/bowtie2/database>. I.e. db.{#}.bt2

  • sam (str) – </path/to/out.sam>

  • fwd_reads (list, optional) – [</path/to/forward_reads.fastq>, …]

  • rev_reads (list, optional) – [</path/to/reverse_reads.fastq>, …]

  • se_reads (list, optional) – [</path/to/single_end_reads.fastq>, …]

  • cpus (int, optional) – Num. processors to use (the default is 0).

  • **kwargs (dict, optional) – Additional optional arguments to supply to bowtie2. Each item must be in the format key = flag, value = flag-value.

Returns:

</path/to/out.sam>

Return type:

str

Raises:

ChildProcessError – bowtie2 failed
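To illustrate how the read arguments and **kwargs could map onto a bowtie2 call, here is a minimal sketch. bowtie2_command is a hypothetical helper, not the function documented above. It assumes each kwarg key is already a complete flag (for example "--seed"), and it raises ValueError for missing reads, which is also an assumption. The -x, -S, -1, -2, -U and -p options are standard bowtie2 options.

```python
from typing import List, Optional


def bowtie2_command(
    db: str,
    sam: str,
    fwd_reads: Optional[List[str]] = None,
    rev_reads: Optional[List[str]] = None,
    se_reads: Optional[List[str]] = None,
    cpus: int = 0,
    **kwargs,
) -> List[str]:
    """Assemble a bowtie2 argument list (hypothetical helper, not autometa's)."""
    if not (fwd_reads or rev_reads or se_reads):
        raise ValueError("at least one *_reads argument is required")
    cmd = ["bowtie2", "-x", db, "-S", sam]
    # bowtie2 accepts comma-separated file lists for -1, -2 and -U
    for flag, reads in (("-1", fwd_reads), ("-2", rev_reads), ("-U", se_reads)):
        if reads:
            cmd.extend([flag, ",".join(reads)])
    if cpus:
        cmd.extend(["-p", str(cpus)])
    # Assumption: each kwarg key is already a full flag such as "--seed"
    for flag, value in kwargs.items():
        cmd.extend([flag, str(value)])
    return cmd


cmd = bowtie2_command(
    "db",
    "out.sam",
    fwd_reads=["a_1.fq", "b_1.fq"],
    rev_reads=["a_2.fq", "b_2.fq"],
    cpus=4,
    **{"--seed": 42},
)
```

Returning a list instead of a single string lets the result be passed to subprocess without shell quoting concerns.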

autometa.common.external.bowtie.build(assembly: str, out: str) str

Build bowtie2 index.

Parameters:
  • assembly (str) – </path/to/assembly.fasta>

  • out (str) – </path/to/output/database> Note: Indices written will resemble </path/to/output/database.{#}.bt2>

Returns:

</path/to/output/database>

Return type:

str

Raises:

ChildProcessError – bowtie2-build failed

autometa.common.external.bowtie.main()
autometa.common.external.bowtie.run(cmd: str) bool

Run cmd via subprocess.

Parameters:

cmd (str) – Executable input str

Returns:

True if subprocess.call returns a zero return code, else False

Return type:

bool
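The return value can be pictured with a small sketch. run_cmd is a hypothetical stand-in for the function above. It assumes that "no returncode" means an exit status of 0, and it splits the command with shlex instead of invoking a shell, which may differ from the actual implementation.

```python
import shlex
import subprocess
import sys


def run_cmd(cmd: str) -> bool:
    """Run cmd and report success as a boolean (hypothetical stand-in)."""
    # subprocess.call returns the exit status; 0 means success
    returncode = subprocess.call(shlex.split(cmd))
    return returncode == 0


# Use the current Python interpreter as a portable example command
python = shlex.quote(sys.executable)
ok = run_cmd(f"{python} -c pass")
failed = run_cmd(f"{python} -c 'raise SystemExit(3)'")
```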

autometa.common.external.diamond module

Class and functions related to running diamond on metagenome sequences

autometa.common.external.diamond.blast(fasta: str, database: str, outfpath: str, blast_type: str = 'blastp', evalue: float = 1e-05, maxtargetseqs: int = 200, cpus: int = 2, tmpdir: Optional[str] = None, force: bool = False, verbose: bool = False) str

Performs a diamond search (blastp or blastx, depending on blast_type) of the query sequences against a diamond-formatted database

Parameters:
  • fasta (str) – Path to fasta file having the query sequences. Should be amino acid sequences in case of BLASTP and nucleotide sequences in case of BLASTX

  • database (str) – Path to diamond formatted database

  • outfpath (str) – Path to output file

  • blast_type (str, optional) – blastp to align protein query sequences against a protein reference database, blastx to align translated DNA query sequences against a protein reference database, by default 'blastp'

  • evalue (float, optional) – cutoff e-value to count hit as significant, by default float('1e-5')

  • maxtargetseqs (int, optional) – max number of target sequences to retrieve per query by diamond, by default 200

  • cpus (int, optional) – Number of processors to be used, by default uses all the processors of the system

  • tmpdir (str, optional) – Path to temporary directory. By default, same as the output directory

  • force (bool, optional) – overwrite existing diamond results, by default False

  • verbose (bool, optional) – log progress to terminal, by default False

Returns:

Path to BLAST results

Return type:

str

Raises:
  • FileNotFoundError – fasta file does not exist

  • ValueError – provided blast_type is not 'blastp' or 'blastx'

  • subprocess.CalledProcessError – Failed to run blast
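The parameters above map naturally onto diamond command-line options. Below is a minimal sketch of that mapping. diamond_command is a hypothetical helper, and the choice of tabular output (--outfmt 6) is an assumption based on the parse function expecting outfmt6. The real function may assemble its command differently.

```python
from typing import List, Optional


def diamond_command(
    fasta: str,
    database: str,
    outfpath: str,
    blast_type: str = "blastp",
    evalue: float = 1e-5,
    maxtargetseqs: int = 200,
    cpus: int = 2,
    tmpdir: Optional[str] = None,
) -> List[str]:
    """Assemble a diamond argument list (hypothetical helper, not autometa's)."""
    if blast_type not in {"blastp", "blastx"}:
        raise ValueError(f"blast_type must be blastp or blastx, got {blast_type}")
    cmd = [
        "diamond", blast_type,
        "--query", fasta,
        "--db", database,
        "--out", outfpath,
        "--evalue", str(evalue),
        "--max-target-seqs", str(maxtargetseqs),
        "--threads", str(cpus),
        "--outfmt", "6",  # assumption: tabular output, as read by parse
    ]
    if tmpdir:
        cmd.extend(["--tmpdir", tmpdir])
    return cmd


cmd = diamond_command("orfs.faa", "nr.dmnd", "hits.tsv", cpus=4, tmpdir="tmp")
```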

autometa.common.external.diamond.makedatabase(fasta: str, database: str, cpus: int = 2) str

Creates a database against which the query sequence would be blasted

Parameters:
  • fasta (str) – Path to fasta file whose database needs to be made, e.g. '<path/to/fasta/file>'

  • database (str) – Path to the output diamond formatted database file, e.g. '<path/to/database/file>'

  • cpus (int, optional) – Number of processors to be used. By default uses all the processors of the system

Returns:

Path to diamond formatted database

Return type:

str

Raises:

subprocess.CalledProcessError – Failed to create diamond formatted database

autometa.common.external.diamond.parse(results: str, bitscore_filter: float = 0.9, verbose: bool = False) Dict[str, Set[str]]

Retrieve diamond results from output table

Parameters:
  • results (str) – Path to BLASTP output file in outfmt6

  • bitscore_filter (0 < float <= 1, optional) – Bitscore filter applied to each sseqid, by default 0.9. Used to determine whether the bitscore is above a threshold value. For example, if it is 0.9, then only bitscores >= 0.9 * the top bitscore are accepted.

  • verbose (bool, optional) – log progress to terminal, by default False

Returns:

{qseqid: {sseqid, sseqid, …}, …}

Return type:

dict

Raises:
  • FileNotFoundError – diamond results table does not exist

  • ValueError – bitscore_filter value is not a float or not in range of 0 to 1
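The bitscore filter can be sketched with the standard library. This is a minimal illustration, not the documented implementation: filter_hits is a hypothetical helper, it assumes the "top bitscore" is the maximum bitscore per qseqid, and it relies on the standard outfmt6 column order (qseqid first, sseqid second, bitscore twelfth).

```python
import io
from collections import defaultdict
from typing import Dict, Set, TextIO


def filter_hits(handle: TextIO, bitscore_filter: float = 0.9) -> Dict[str, Set[str]]:
    """Keep sseqids whose bitscore is within bitscore_filter of the query's best."""
    if not isinstance(bitscore_filter, float) or not 0 < bitscore_filter <= 1:
        raise ValueError("bitscore_filter must be a float in the range (0, 1]")
    rows = []
    best = {}  # qseqid -> top bitscore (assumed to be per query)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        qseqid, sseqid, bitscore = fields[0], fields[1], float(fields[11])
        rows.append((qseqid, sseqid, bitscore))
        best[qseqid] = max(best.get(qseqid, bitscore), bitscore)
    hits = defaultdict(set)
    for qseqid, sseqid, bitscore in rows:
        if bitscore >= bitscore_filter * best[qseqid]:
            hits[qseqid].add(sseqid)
    return dict(hits)


# Illustrative rows only (not real data)
table = io.StringIO(
    "q1\ts1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200\n"
    "q1\ts2\t88.0\t100\t12\t0\t1\t100\t1\t100\t1e-45\t185\n"
    "q1\ts3\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-10\t100\n"
    "q2\ts4\t70.0\t100\t30\t0\t1\t100\t1\t100\t1e-08\t50\n"
)
hits = filter_hits(table, bitscore_filter=0.9)
```

With a top bitscore of 200 for q1, the threshold is 180, so s1 and s2 are kept and s3 is dropped.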

autometa.common.external.hmmscan module

Functions related to running hmmer on metagenome sequences

autometa.common.external.hmmscan.annotate_parallel(orfs, hmmdb, outfpath, cpus, seed=42)
autometa.common.external.hmmscan.annotate_sequential(orfs, hmmdb, outfpath, cpus, seed=42)
autometa.common.external.hmmscan.filter_tblout_markers(infpath: str, cutoffs: str, outfpath: Optional[str] = None, orfs: Optional[str] = None, force: bool = False) pandas.DataFrame

Filter markers from hmmscan tblout output table using provided cutoff values file.

Parameters:
  • infpath (str) – Path to hmmscan tblout output file

  • cutoffs (str) – Path to marker set inclusion cutoffs

  • outfpath (str, optional) – Path to write filtered markers to tab-delimited file

  • orfs (str, optional) – Path to prodigal-called ORFs, i.e. </path/to/prodigal/called/orfs.fasta>. By default, will attempt to translate recovered qseqids to contigs.

  • force (bool, optional) – Overwrite existing outfpath (the default is False).

Returns:

</path/to/output.markers.tsv>

Return type:

pd.DataFrame

Raises:
  • FileNotFoundError – infpath or cutoffs not found

  • FileExistsError – outfpath already exists and force=False

  • AssertionError – No returned markers pass the cutoff thresholds. I.e. final df is empty.

autometa.common.external.hmmscan.hmmpress(fpath)

Runs hmmpress on fpath.

Parameters:

fpath (str) – </path/to/kingdom.markers.hmm>

Returns:

</path/to/hmmpressed/kingdom.markers.hmm>

Return type:

str

Raises:
  • FileNotFoundError – fpath not found.

  • subprocess.CalledProcessError – hmmpress failed

autometa.common.external.hmmscan.main()
autometa.common.external.hmmscan.read_domtblout(fpath: str) pandas.DataFrame

Read hmmscan domtblout-format results into pandas DataFrame

For more detailed column descriptions see the 'tabular output formats' section in the [HMMER manual](http://eddylab.org/software/hmmer/Userguide.pdf#tabular-output-formats "HMMER Manual")

Parameters:

fpath (str) – Path to hmmscan domtblout file

Returns:

index=range(0,n_hits) cols=…

Return type:

pd.DataFrame

autometa.common.external.hmmscan.read_tblout(infpath: str) pandas.DataFrame

Read hmmscan tblout-format results into pd.DataFrame

For more detailed column descriptions see the 'tabular output formats' section in the [HMMER manual](http://eddylab.org/software/hmmer/Userguide.pdf#tabular-output-formats "HMMER Manual")

Parameters:

infpath (str) – Path to hmmscan_results.tblout

Returns:

DataFrame of raw hmmscan results

Return type:

pd.DataFrame

Raises:

FileNotFoundError – Path to infpath was not found
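A tblout file is whitespace-delimited, with comment lines starting with '#' and a free-text description as the final field. The sketch below shows one way to split such lines using only the standard library. read_tblout_rows and the short column names in TBLOUT_COLUMNS are hypothetical, chosen here for illustration, and the sketch returns a list of dicts rather than the pandas DataFrame described above.

```python
import io

# Hypothetical short names for the 18 fixed tblout fields plus the description
TBLOUT_COLUMNS = [
    "sname", "sacc", "qname", "qacc",
    "full_evalue", "full_score", "full_bias",
    "best_evalue", "best_score", "best_bias",
    "exp", "reg", "clu", "ov", "env", "dom", "rep", "inc",
    "description",
]


def read_tblout_rows(handle):
    """Parse hmmscan tblout lines into a list of dicts (values kept as str)."""
    rows = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        # Split on the first 18 whitespace runs so that the description,
        # which may itself contain spaces, stays in one piece
        fields = line.strip().split(maxsplit=18)
        rows.append(dict(zip(TBLOUT_COLUMNS, fields)))
    return rows


# Illustrative content only (not real hmmscan output)
tblout = io.StringIO(
    "# target name accession query name ...\n"
    "marker_A PF00001.1 orf_1 - 1e-30 100.5 0.1 2e-30 99.8 0.1 "
    "1.0 1 0 0 1 1 1 1 Some marker description\n"
)
rows = read_tblout_rows(tblout)
```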

autometa.common.external.hmmscan.run(orfs, hmmdb, outfpath, cpus=0, force=False, parallel=True, gnu_parallel=False, seed=42) str

Runs hmmscan on dataset ORFs and provided hmm database.

Note

Only one of parallel and gnu_parallel may be provided as True

Parameters:
  • orfs (str) – </path/to/orfs.faa>

  • hmmdb (str) – </path/to/hmmpressed/database.hmm>

  • outfpath (str) – </path/to/output.hmmscan.tsv>

  • cpus (int, optional) – Num. cpus to use. 0 will run as many cpus as possible (the default is 0).

  • force (bool, optional) – Overwrite existing outfpath (the default is False).

  • parallel (bool, optional) – Will use multithreaded parallelization offered by hmmscan (the default is True).

  • gnu_parallel (bool, optional) – Will parallelize hmmscan using GNU parallel (the default is False).

  • seed (int, optional) – set RNG seed to <n> (if 0: one-time arbitrary seed) (the default is 42).

Returns:

</path/to/output.hmmscan.tsv>

Return type:

str

Raises:
  • ValueError – Both parallel and gnu_parallel were provided as True

  • FileExistsError – outfpath already exists

  • subprocess.CalledProcessError – hmmscan failed

autometa.common.external.hmmsearch module

Module to filter the domtbl file from hmmsearch --domtblout <filepath> using provided cutoffs

autometa.common.external.hmmsearch.filter_domtblout(infpath: str, cutoffs: str, orfs: str, outfpath: Optional[str] = None) pandas.DataFrame
autometa.common.external.hmmsearch.main()

autometa.common.external.prodigal module


Functions to retrieve orfs from provided assembly using prodigal

autometa.common.external.prodigal.aggregate_orfs(search_str: str, outfpath: str) None
autometa.common.external.prodigal.annotate_parallel(assembly: str, prots_out: str, nucls_out: str, cpus: int) None
autometa.common.external.prodigal.annotate_sequential(assembly: str, prots_out: str, nucls_out: str) None
autometa.common.external.prodigal.contigs_from_headers(fpath: str) Mapping[str, str]

Get ORF id to contig id translations using prodigal assigned ID from description.

First determines if all of ID=3495691_2 from description is in header. "3495691_2" represents the 2nd gene in the 3,495,691st sequence.

Example

#: prodigal versions < 2.6 record
>>> record.id
'k119_1383959_3495691_2'

>>> record.description
'k119_1383959_3495691_2 # 688 # 1446 # 1 # ID=3495691_2;partial=01;start_type=ATG;rbs_motif=None;rbs_spacer=None'

>>> record.description.split('#')[-1].split(';')[0].strip()
'ID=3495691_2'

>>> orf_id = '3495691_2'

>>> record.id.replace(f'_{orf_id}', '')
'k119_1383959'

#: prodigal versions >= 2.6 record
>>> record.id
'k119_1383959_2'
>>> record.id.rsplit('_', 1)[0]
'k119_1383959'

Parameters:

fpath (str) – </path/to/prodigal/called/orfs.fasta>

Returns:

contigs translated from prodigal ORF description. {orf_id:contig_id, …}

Return type:

dict
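The two header styles described above can be handled in one pass over the FASTA headers. The following is a minimal sketch, not the documented implementation: contigs_from_fasta_headers is a hypothetical helper that reads header lines directly instead of using Bio.SeqIO, and it checks for the older style with a suffix test rather than a string replacement. The second header in the example is illustrative, since the source does not show a full description for prodigal >= 2.6.

```python
import io
from typing import Dict, TextIO


def contigs_from_fasta_headers(handle: TextIO) -> Dict[str, str]:
    """Map each ORF id to its contig id using prodigal FASTA headers."""
    translations = {}
    for line in handle:
        if not line.startswith(">"):
            continue  # sequence lines carry no header information
        description = line[1:].strip()
        record_id = description.split()[0]
        # The last '#'-delimited field starts with 'ID=<...>;partial=...'
        id_field = description.split("#")[-1].split(";")[0].strip()
        orf_id = id_field.replace("ID=", "")
        if record_id.endswith(f"_{orf_id}"):
            # prodigal < 2.6: the whole ID is appended to the contig name
            contig_id = record_id[: -len(orf_id) - 1]
        else:
            # prodigal >= 2.6: only the gene number is appended
            contig_id = record_id.rsplit("_", 1)[0]
        translations[record_id] = contig_id
    return translations


old_style = io.StringIO(
    ">k119_1383959_3495691_2 # 688 # 1446 # 1 # ID=3495691_2;partial=01;"
    "start_type=ATG;rbs_motif=None;rbs_spacer=None\nMKV\n"
)
# Illustrative header for the newer style (ID value is an assumption)
new_style = io.StringIO(">k119_1383959_2 # 688 # 1446 # 1 # ID=1_2;partial=01\nMKV\n")

old_map = contigs_from_fasta_headers(old_style)
new_map = contigs_from_fasta_headers(new_style)
```

Both calls resolve the ORF to the contig k119_1383959. A contig name that happens to end with the same digits as the ID would be ambiguous under this suffix test.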

autometa.common.external.prodigal.main()
autometa.common.external.prodigal.orf_records_from_contigs(contigs: Union[List, Set], fpath: str) List[Bio.SeqIO.SeqRecord]

Retrieve list of ORF headers from contigs. Prodigal-annotated ORFs are required as the input fpath.

Parameters:
  • contigs (iterable) – iterable of contigs from which to retrieve ORFs

  • fpath (str) – </path/to/prodigal/called/orfs.fasta>

Returns:

ORF SeqIO.SeqRecords from provided contigs. i.e. [SeqRecord, …]

Return type:

list

autometa.common.external.prodigal.run(assembly: str, nucls_out: str, prots_out: str, force: bool = False, cpus: int = 0) Tuple[str, str]

Calls ORFs from provided input assembly

Parameters:
  • assembly (str) – </path/to/assembly.fasta>

  • nucls_out (str) – </path/to/nucls.out>

  • prots_out (str) – </path/to/prots.out>

  • force (bool) – overwrite nucls_out and prots_out if they already exist (the default is False).

  • cpus (int) – num cpus to use. Default (cpus=0) will run as many `cpus` as possible

Returns:

(nucls_out, prots_out)

Return type:

2-Tuple

Raises:
  • FileExistsError – nucls_out or prots_out already exists

  • subprocess.CalledProcessError – prodigal Failed

  • ChildProcessError – nucls_out or prots_out not written

  • IOError – nucls_out or prots_out incorrectly formatted

autometa.common.external.samtools module

Script containing wrapper functions for samtools

autometa.common.external.samtools.main()
autometa.common.external.samtools.sort(sam, bam, cpus=2)

Views then sorts sam file by leftmost coordinates and outputs to bam.

Parameters:
  • sam (str) – </path/to/alignment.sam>

  • bam (str) – </path/to/output/alignment.bam>

  • cpus (int, optional) – Number of processors to be used. By default uses all the processors of the system

Raises:
  • TypeError – cpus must be an integer greater than zero

  • FileNotFoundError – Specified path is incorrect or the file is empty

  • ExternalToolError – Samtools did not run successfully, returns subprocess traceback and command run

Module contents