🐚 Bash Workflow 🐚

Getting Started

  1. Compute Environment Setup

  2. Download Workflow Template

  3. Configure Required Inputs

Compute Environment Setup

If you have not previously installed/used mamba, you can get it from Mambaforge.

You may either create a new mamba environment named "autometa"...

mamba create -n autometa -c conda-forge -c bioconda autometa
# Then, once mamba has finished creating the environment
# you may activate it:
mamba activate autometa

... or install Autometa into any of your existing environments.

This installs Autometa into your currently active environment:

mamba install -c conda-forge -c bioconda autometa

The next command installs Autometa into the specified environment:

mamba install -n <your-env-name> -c conda-forge -c bioconda autometa

Download Workflow Template

To run Autometa using the bash workflow, you simply need to download the workflow template and configure it to your metagenome's specifications.

Here are a few download commands if you do not want to navigate to the workflow on GitHub:

via curl

curl -o autometa.sh https://raw.githubusercontent.com/KwanLab/Autometa/main/workflows/autometa.sh

via wget

wget https://raw.githubusercontent.com/KwanLab/Autometa/main/workflows/autometa.sh

Note

The autometa-large-data-mode workflow is also available and is configured similarly to the autometa bash workflow.

Configure Required Inputs

The Autometa bash workflow requires the following input file and directory paths. To see how to prepare each input, see Data preparation below. A sketch of the corresponding variable assignments follows the list.

  1. Assembly (assembly)

  2. Alignments (bam)

  3. ORFs (orfs)

  4. Diamond blastp results table (blast)

  5. NCBI database directory (ncbi)

  6. Input sample name (simpleName)

  7. Output directory (outdir)
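
As a rough sketch, the corresponding variable assignments near the top of autometa.sh might look like the following. Every path shown is a hypothetical placeholder, to be replaced with the files described in Data preparation:

# Hypothetical example values -- replace each path with your own files
assembly="/path/to/metagenome.fna"            # metagenome assembly (FASTA)
bam="/path/to/alignments.bam"                 # sorted read alignments to the assembly
orfs="/path/to/metagenome.orfs.faa"           # ORF amino-acid sequences from Prodigal
blast="/path/to/blastp.tsv"                   # Diamond blastp results table (TSV)
ncbi="/path/to/your/ncbi/database/directory"  # NCBI database directory
simpleName="MySample"                         # sample name used to prefix output files
outdir="MySampleAutometaResults"              # directory where results will be written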

Data preparation

  1. Metagenome Assembly (assembly)

  2. Alignments Preparation (bam)

  3. ORFs (orfs)

  4. Diamond blastp Preparation (blast)

  5. NCBI Preparation (ncbi)

Metagenome Assembly

You will first need to assemble your shotgun metagenome to provide to Autometa as input.

The following is a typical workflow for metagenome assembly (a command sketch follows the list):

  1. Trim adapter sequences from the reads

    We usually use Trimmomatic.

  2. Quality check the trimmed reads to ensure the adapters have been removed

    We usually use FastQC.

  3. Assemble the trimmed reads

    We usually use MetaSPAdes which is a part of the SPAdes package.

  4. Check the quality of your assembly (Optional)

    We usually use metaQuast for this (use the --min-contig 1 option to get an accurate N50). This tool computes a variety of assembly statistics, one of which is N50. This can often be useful for selecting an appropriate length cutoff value for pre-processing the metagenome.
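
Strung together, a minimal sketch of these four steps might look like the commands below. The read files, adapter file, output names, and thread counts are hypothetical placeholders; adjust them to your data and hardware:

# 1. Trim adapter sequences (Trimmomatic; adapters.fa is a placeholder for your adapter file)
trimmomatic PE -threads 16 \
    forward_reads.fastq.gz reverse_reads.fastq.gz \
    fwd_paired.fastq.gz fwd_unpaired.fastq.gz \
    rev_paired.fastq.gz rev_unpaired.fastq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10

# 2. Quality check the trimmed reads (FastQC)
mkdir -p fastqc_results
fastqc -o fastqc_results fwd_paired.fastq.gz rev_paired.fastq.gz

# 3. Assemble the trimmed reads (metaSPAdes)
metaspades.py -1 fwd_paired.fastq.gz -2 rev_paired.fastq.gz -t 20 -o metaspades_output

# 4. (Optional) Check assembly quality (metaQuast; --min-contig 1 for an accurate N50)
metaquast.py --min-contig 1 -o metaquast_output metaspades_output/contigs.fasta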

Alignments Preparation

Note

The following example requires bwa, kart, and samtools:

mamba install -c bioconda bwa kart samtools

# First index the metagenome assembly
# -b : block size for the bwtsw algorithm (effective with -a bwtsw) [default: 10000000]
bwa index \
    -b 550000000000 \
    metagenome.fna   # Path to input metagenome

# Now perform alignments (we are using kart, but you can use another alignment tool if you'd like)
# -i  : path to input metagenome
# -t  : number of CPUs to use
# -f  : path to forward paired-end reads
# -f2 : path to reverse paired-end reads
# -o  : path to alignments output (SAM)
kart \
    -i metagenome.fna \
    -t 20 \
    -f /path/to/forward_reads.fastq.gz \
    -f2 /path/to/reverse_reads.fastq.gz \
    -o alignments.sam

# Now sort alignments and convert to bam format
# -@ : number of CPUs to use
# -m : maximum memory per thread
samtools sort \
    -@ 40 \
    -m 10G \
    alignments.sam \
    -o alignments.bam

ORFs

Note

The following example requires prodigal (e.g. mamba install -c bioconda prodigal).

# -a : the generated amino-acid ORFs file is required as input to the bash workflow
prodigal -i metagenome.fna \
    -f "gbk" \
    -d "metagenome.orfs.fna" \
    -o "metagenome.orfs.gbk" \
    -a "metagenome.orfs.faa" \
    -s "metagenome.all_orfs.txt"

Diamond blastp Preparation

Note

The following example requires diamond (e.g. mamba install -c bioconda diamond).

# --query : ORF amino-acid sequences from Prodigal (see above)
# --db    : path to the formatted nr database (see NCBI section)
# --out   : the generated results table is required as input to the bash workflow
diamond blastp \
    --query "metagenome.orfs.faa" \
    --db /path/to/nr.dmnd \
    --threads <num cpus to use> \
    --out blastp.tsv

NCBI Preparation

If you are running Autometa for the first time, you'll have to download the NCBI databases.

# First configure where you want to download the NCBI databases
autometa-config \
    --section databases \
    --option ncbi \
    --value <path/to/your/ncbi/database/directory>

# Now download and format the NCBI databases
autometa-update-databases --update-ncbi

Note

You can check the default config paths using autometa-config --print.

See autometa-update-databases -h and autometa-config -h for a full list of options.

The previous command will download and format the required NCBI databases, including the non-redundant (nr) protein database used in the Diamond blastp step above.

Input Sample Name

A crucial step prior to running the Autometa bash workflow is specifying the metagenome sample name and where to output Autometa's results.

# Default
simpleName="TemplateAssemblyName"
# Replace with your sample name
simpleName="MySample"

Note

The provided simpleName will be used as a prefix for all of the resulting Autometa output files.

Output directory

Immediately following the simpleName parameter, you will need to specify where to write all results.

# Default
outdir="AutometaOutdir"
# Replace with your output directory...
outdir="MySampleAutometaResults"

Running the pipeline

After you have finished configuring and double-checking your parameter settings, you may run the pipeline via bash:

bash autometa.sh

or submit the pipeline to a job queue. For example, with Slurm:

sbatch autometa.sh

Caution

Make sure your autometa mamba environment is activated, or the autometa entrypoints will not be available.
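
For example, a minimal launch (assuming the environment was created as shown in Compute Environment Setup) could look like:

# Activate the environment so the autometa entrypoints are on your PATH,
# then launch the workflow (or submit it with sbatch instead of bash)
mamba activate autometa
bash autometa.sh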

Additional parameters

You can also adjust other pipeline parameters that ultimately control how binning is performed. These are located at the top of the workflow, just under the required inputs. A sketch of this parameter block follows the list below.

length_cutoff : Smallest contig you want binned (default is 3000bp)

kmer_size : kmer size to use

norm_method : Which kmer frequency normalization method to use. See Advanced Usage section for details

pca_dimensions : Number of dimensions to which the initial k-mer frequency matrix is reduced (default is 50). See Advanced Usage section for details

embed_method : Choices are sksne, bhsne, umap, densmap, trimap (default is bhsne). See Advanced Usage section for details

embed_dimensions : Final dimensions of the kmer frequencies matrix (default is 2). See Advanced Usage section for details

cluster_method : Clustering method used to cluster contigs. Choices are "dbscan" and "hdbscan" (default is "dbscan"). See Advanced Usage section for details

binning_starting_rank : Which taxonomic rank to start the binning from. Choices are superkingdom, phylum, class, order, family, genus, species (default is superkingdom). See Advanced Usage section for details

classification_method : Which classification method to use for the unclustered recruitment step. Choices are decision_tree and random_forest (default is decision_tree). See Advanced Usage section for details

completeness : Minimum completeness needed to keep a cluster (default is at least 20% complete, e.g. 20). See Advanced Usage section for details

purity : Minimum purity needed to keep a cluster (default is at least 95% pure, e.g. 95). See Advanced Usage section for details

cov_stddev_limit : Maximum coverage standard deviation (as a percentage) allowed for a cluster to be kept (default is 25%, e.g. 25). See Advanced Usage section for details

gc_stddev_limit : Maximum GC% standard deviation allowed for a cluster to be kept (default is 5%, e.g. 5). See Advanced Usage section for details
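
As a rough sketch, the corresponding parameter block near the top of autometa.sh might look like the following, filled in with the defaults listed above. Check your downloaded template for the exact variable names and for any parameters not shown here:

# Default values as described above; kmer_size and norm_method are left
# as placeholders since their defaults are not listed in this section
length_cutoff=3000                        # smallest contig (bp) to bin
kmer_size=<your-kmer-size>
norm_method=<your-normalization-method>   # see Advanced Usage
pca_dimensions=50
embed_method="bhsne"                      # sksne, bhsne, umap, densmap or trimap
embed_dimensions=2
cluster_method="dbscan"                   # dbscan or hdbscan
binning_starting_rank="superkingdom"
classification_method="decision_tree"     # decision_tree or random_forest
completeness=20                           # minimum % completeness to keep a cluster
purity=95                                 # minimum % purity to keep a cluster
cov_stddev_limit=25                       # coverage std. dev. limit (%)
gc_stddev_limit=5                         # GC% std. dev. limit (%)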

Note

If you are configuring an autometa job using the autometa-large-data-mode.sh template, there will be an additional parameter called max_partition_size (default is 10,000). This is the maximum partition size the Autometa clustering algorithm will consider; any taxon partitions larger than this setting will be skipped.