Bash Workflow
Getting Started
Compute Environment Setup
If you have not previously installed/used Conda, you can get it using the Miniconda installer appropriate to your system, here: https://docs.conda.io/en/latest/miniconda.html
After installing Conda, running the following command will create a minimal Conda environment named "autometa".
conda env create --file=https://raw.githubusercontent.com/KwanLab/Autometa/main/autometa-env.yml
If you receive the message…
CondaValueError: prefix already exists:
…it means you have already created the environment. If you want to overwrite/update
the environment, add the --force
flag to the end of the command.
conda env create --file=https://raw.githubusercontent.com/KwanLab/Autometa/main/autometa-env.yml --force
Once Conda has finished creating the environment, be sure to activate it:
conda activate autometa
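To sanity-check the setup, you can confirm the environment is active and that the Autometa entrypoints used later in this guide are on your PATH:
conda env list          # the active environment is marked with '*'
which autometa-config   # should resolve to a path inside the autometa environment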
Download Workflow Template
To run Autometa using the bash workflow, you simply need to download the workflow template and configure it to your metagenome's specifications.
Here are a few download commands if you do not want to navigate to the workflow on GitHub:
via curl
curl -o autometa.sh https://raw.githubusercontent.com/KwanLab/Autometa/main/workflows/autometa.sh
via wget
wget https://raw.githubusercontent.com/KwanLab/Autometa/main/workflows/autometa.sh
Note
The autometa-large-data-mode workflow is also available and is configured similarly to the autometa bash workflow.
Configure Required Inputs
The Autometa bash workflow requires the following input file and directory paths (a sketch of how these are supplied in the template follows the list). To see how to prepare each input, see Data preparation below.
- Assembly (assembly)
- Alignments (bam)
- ORFs (orfs)
- Diamond blastp results table (blast)
- NCBI database directory (ncbi)
- Input sample name (simpleName)
- Output directory (outdir)
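Inside autometa.sh these inputs are supplied as plain shell variables. As a sketch (all paths below are placeholders; the assignment style matches the simpleName and outdir examples later in this guide):
assembly="/path/to/metagenome.fna"        # metagenome assembly
bam="/path/to/alignments.bam"             # read alignments to the assembly
orfs="/path/to/metagenome.orfs.faa"       # prodigal amino-acid ORFs
blast="/path/to/blastp.tsv"               # diamond blastp results table
ncbi="/path/to/ncbi/database/directory"   # NCBI database directory
simpleName="MySample"                     # sample name used to prefix outputs
outdir="MySampleAutometaResults"          # where results are written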
Data preparation
- Metagenome Assembly (assembly)
- Alignments Preparation (bam)
- ORFs (orfs)
- Diamond blastp Preparation (blast)
- NCBI Preparation (ncbi)
Metagenome Assembly
You will first need to assemble your shotgun metagenome to provide to Autometa as input.
The following is a typical workflow for metagenome assembly (a sketch of these steps follows the list):

1. Trim adapter sequences from the reads. We usually use Trimmomatic.
2. Quality check the trimmed reads to ensure the adapters have been removed. We usually use FastQC.
3. Assemble the trimmed reads. We usually use metaSPAdes, which is part of the SPAdes package.
4. Check the quality of your assembly (optional). We usually use metaQUAST for this (use the --min-contig 1 option to get an accurate N50). This tool can compute a variety of assembly statistics, one of which is N50; this can often be useful for selecting an appropriate length cutoff value for pre-processing the metagenome.
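As a rough sketch of the steps above (file names and trimming parameters are illustrative placeholders; all four tools are available via bioconda, and paired-end reads are assumed):

# 1. Trim adapter sequences (Trimmomatic in paired-end mode)
trimmomatic PE forward_reads.fastq.gz reverse_reads.fastq.gz \
    fwd_paired.fastq.gz fwd_unpaired.fastq.gz \
    rev_paired.fastq.gz rev_unpaired.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 MINLEN:36

# 2. Quality check the trimmed reads
fastqc fwd_paired.fastq.gz rev_paired.fastq.gz

# 3. Assemble the trimmed reads with metaSPAdes
spades.py --meta -1 fwd_paired.fastq.gz -2 rev_paired.fastq.gz -o metagenome_assembly

# 4. (Optional) compute assembly statistics, including N50
metaquast.py --min-contig 1 metagenome_assembly/contigs.fasta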
Alignments Preparation
Note
The following example requires bwa, kart and samtools, e.g.:
conda install -c bioconda bwa kart samtools
# First index the metagenome assembly
# -b : block size for the bwtsw algorithm (effective with -a bwtsw) [default=10000000]
bwa index -b 550000000000 metagenome.fna

# Now perform alignments (we are using kart, but you can use another alignment tool if you'd like)
# -i  : path to input metagenome
# -t  : number of CPUs to use
# -f  : path to forward paired-end reads
# -f2 : path to reverse paired-end reads
# -o  : path to alignments output
kart -i metagenome.fna -t 20 \
    -f /path/to/forward_reads.fastq.gz \
    -f2 /path/to/reverse_reads.fastq.gz \
    -o alignments.sam

# Now sort alignments and convert to bam format
# -@ : number of CPUs to use
# -m : amount of memory to use per thread
samtools sort -@ 40 -m 10G alignments.sam -o alignments.bam
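Optionally, you can sanity-check the resulting alignments before handing them to Autometa:
samtools quickcheck alignments.bam && echo "bam OK"   # verify the file is intact
samtools flagstat alignments.bam                      # basic mapping statistics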
ORFs
Note
The following example requires prodigal, e.g.:
conda install -c bioconda prodigal
# The amino-acid ORFs file written by -a is required as input to the bash workflow
prodigal -i metagenome.fna \
    -f "gbk" \
    -d "metagenome.orfs.fna" \
    -o "metagenome.orfs.gbk" \
    -a "metagenome.orfs.faa" \
    -s "metagenome.all_orfs.txt"
Diamond blastp Preparation
Note
The following example requires diamond, e.g.:
conda install -c bioconda diamond
# --query : prodigal amino-acid ORFs (see prodigal output from above)
# --db    : diamond-formatted nr database (see the NCBI Preparation section below)
# --out   : this generated file is required as input to the bash workflow
diamond blastp \
    --query "metagenome.orfs.faa" \
    --db /path/to/nr.dmnd \
    --threads <num cpus to use> \
    --out blastp.tsv
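diamond writes BLAST tabular output (--outfmt 6) by default, which matches the "results table" this input expects. You can peek at the first few hits to confirm the run produced alignments:
head -n 5 blastp.tsv   # tab-separated columns: qseqid, sseqid, pident, ...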
NCBI Preparation
If you are running Autometa for the first time, you'll have to download the NCBI databases.
# First configure where you want to download the NCBI databases
autometa-config \
    --section databases --option ncbi \
    --value <path/to/your/ncbi/database/directory>
# Now download and format the NCBI databases
autometa-update-databases --update-ncbi
Note
You can check the default config paths using autometa-config --print.
See autometa-update-databases -h and autometa-config -h for the full list of options.
The previous command will download the following NCBI databases:
- Non-redundant nr database
- prot.accession2taxid.gz
- nodes.dmp, names.dmp and merged.dmp (found within taxdump.tar.gz)
Input Sample Name
A crucial step prior to running the Autometa bash workflow is specifying the metagenome sample name and where to output Autometa's results.
# Default
simpleName="TemplateAssemblyName"
# Replace with your sample name
simpleName="MySample"
Note
The simpleName that is provided will be used as a prefix for all of the resulting Autometa output files.
Output directory
Immediately following the simpleName parameter, you will need to specify where to write all results.
# Default
outdir="AutometaOutdir"
# Replace with your output directory...
outdir="MySampleAutometaResults"
Running the pipeline
After you have finished configuring and double-checking your parameter settings, you may run the pipeline via bash:
bash autometa.sh
or submit the pipeline to a queue, for example with Slurm:
sbatch autometa.sh
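If you are instead running on a single machine without a scheduler, a minimal sketch for keeping a log and detaching from the terminal:
nohup bash autometa.sh > autometa.log 2>&1 &   # run in background, capture stdout/stderr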
Caution
Make sure your autometa conda environment is activated, or the autometa entrypoints will not be available.
Additional parameters
You can also adjust other pipeline parameters that ultimately control how binning is performed. These are located at the top of the workflow, just under the required inputs (a sketch with the documented defaults follows the list).
- length_cutoff: Smallest contig you want binned (default is 3000 bp)
- kmer_size: k-mer size to use
- norm_method: Which k-mer frequency normalization method to use. See Advanced Usage section for details
- pca_dimensions: Number of dimensions to which to reduce the initial k-mer frequencies matrix (default is 50). See Advanced Usage section for details
- embed_method: Choices are sksne, bhsne, umap, densmap, trimap (default is bhsne). See Advanced Usage section for details
- embed_dimensions: Final dimensions of the k-mer frequencies matrix (default is 2). See Advanced Usage section for details
- cluster_method: Clustering method for grouping contigs. Choices are "dbscan" and "hdbscan" (default is "dbscan"). See Advanced Usage section for details
- binning_starting_rank: Taxonomic rank at which to start binning. Choices are superkingdom, phylum, class, order, family, genus, species (default is superkingdom). See Advanced Usage section for details
- classification_method: Classification method to use for the unclustered recruitment step. Choices are decision_tree and random_forest (default is decision_tree). See Advanced Usage section for details
- completeness: Minimum completeness needed to keep a cluster (default is at least 20% complete, e.g. 20). See Advanced Usage section for details
- purity: Minimum purity needed to keep a cluster (default is at least 95% pure, e.g. 95). See Advanced Usage section for details
- cov_stddev_limit: Which clusters to keep depending on the coverage standard deviation (default is 25%, e.g. 25). See Advanced Usage section for details
- gc_stddev_limit: Which clusters to keep depending on the GC% standard deviation (default is 5%, e.g. 5). See Advanced Usage section for details
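For reference, these parameters are set as shell variables near the top of autometa.sh. A sketch using only the defaults documented above (assignment style assumed from the simpleName and outdir examples; kmer_size and norm_method are omitted since their defaults are not listed here):
length_cutoff=3000                      # bp
pca_dimensions=50
embed_method="bhsne"
embed_dimensions=2
cluster_method="dbscan"
binning_starting_rank="superkingdom"
classification_method="decision_tree"
completeness=20                         # percent
purity=95                               # percent
cov_stddev_limit=25                     # percent
gc_stddev_limit=5                       # percent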
Note
If you are configuring an Autometa job using the autometa-large-data-mode.sh template, there will be an additional parameter called max_partition_size (default=10,000). This is the maximum size partition the Autometa clustering algorithm will consider. Any taxon partitions larger than this setting will be skipped.