Bash Workflow
Getting Started
Compute Environment Setup
If you have not previously installed/used Conda, you can get it using the Miniconda installer appropriate to your system, here: https://docs.conda.io/en/latest/miniconda.html
After installing Conda, running the following command will create a minimal Conda environment named "autometa".
conda env create --file=https://raw.githubusercontent.com/KwanLab/Autometa/main/autometa-env.yml
If you receive the message…
CondaValueError: prefix already exists:
…it means you have already created the environment. If you want to overwrite/update
the environment, add the --force
flag to the end of the command.
conda env create --file=https://raw.githubusercontent.com/KwanLab/Autometa/main/autometa-env.yml --force
Once Conda has finished creating the environment, be sure to activate it:
conda activate autometa
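To sanity-check the setup, you can confirm the environment is active and that the Autometa entrypoints used later in this guide are on your PATH:
conda env list          # the active environment is marked with '*'
which autometa-config   # should resolve to a path inside the autometa environment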
Download Workflow Template
To run Autometa using the bash workflow, you simply need to download the workflow template and configure it to your metagenome's specifications.
Here are a few download commands if you do not want to navigate to the workflow on GitHub:
via curl
curl -o autometa.sh https://raw.githubusercontent.com/KwanLab/Autometa/main/workflows/autometa.sh
via wget
wget https://raw.githubusercontent.com/KwanLab/Autometa/main/workflows/autometa.sh
Note
The autometa-large-data-mode workflow is also available and is configured similarly to the autometa bash workflow.
Configure Required Inputs
The Autometa bash workflow requires the following input file and directory paths (a sketch of how these are supplied in the template follows the list). To see how to prepare each input, see Data preparation below.
- Assembly (assembly)
- Alignments (bam)
- ORFs (orfs)
- Diamond blastp results table (blast)
- NCBI database directory (ncbi)
- Input sample name (simpleName)
- Output directory (outdir)
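Inside autometa.sh these inputs are supplied as plain shell variables. As a sketch (all paths below are placeholders; the assignment style matches the simpleName and outdir examples later in this guide):
assembly="/path/to/metagenome.fna"        # metagenome assembly
bam="/path/to/alignments.bam"             # read alignments to the assembly
orfs="/path/to/metagenome.orfs.faa"       # prodigal amino-acid ORFs
blast="/path/to/blastp.tsv"               # diamond blastp results table
ncbi="/path/to/ncbi/database/directory"   # NCBI database directory
simpleName="MySample"                     # sample name used to prefix outputs
outdir="MySampleAutometaResults"          # where results are written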
Data preparation
- Metagenome Assembly (assembly)
- Alignments Preparation (bam)
- ORFs (orfs)
- Diamond blastp Preparation (blast)
- NCBI Preparation (ncbi)
Metagenome Assembly
You will first need to assemble your shotgun metagenome to provide to Autometa as input.
The following is a typical workflow for metagenome assembly (a sketch of these steps follows the list):

1. Trim adapter sequences from the reads. We usually use Trimmomatic.
2. Quality check the trimmed reads to ensure the adapters have been removed. We usually use FastQC.
3. Assemble the trimmed reads. We usually use metaSPAdes, which is part of the SPAdes package.
4. Check the quality of your assembly (optional). We usually use metaQUAST for this (use the --min-contig 1 option to get an accurate N50). This tool can compute a variety of assembly statistics, one of which is N50; this can often be useful for selecting an appropriate length cutoff value for pre-processing the metagenome.
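As a rough sketch of the steps above (file names and trimming parameters are illustrative placeholders; all four tools are available via bioconda, and paired-end reads are assumed):

# 1. Trim adapter sequences (Trimmomatic in paired-end mode)
trimmomatic PE forward_reads.fastq.gz reverse_reads.fastq.gz \
    fwd_paired.fastq.gz fwd_unpaired.fastq.gz \
    rev_paired.fastq.gz rev_unpaired.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 MINLEN:36

# 2. Quality check the trimmed reads
fastqc fwd_paired.fastq.gz rev_paired.fastq.gz

# 3. Assemble the trimmed reads with metaSPAdes
spades.py --meta -1 fwd_paired.fastq.gz -2 rev_paired.fastq.gz -o metagenome_assembly

# 4. (Optional) compute assembly statistics, including N50
metaquast.py --min-contig 1 metagenome_assembly/contigs.fasta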
Alignments Preparation
Note
The following example requires bwa, kart and samtools, e.g.:
conda install -c bioconda bwa kart samtools
# First index the metagenome assembly
# -b : block size for the bwtsw algorithm (effective with -a bwtsw) [default=10000000]
bwa index -b 550000000000 metagenome.fna

# Now perform alignments (we are using kart, but you can use another alignment tool if you'd like)
# -i  : path to input metagenome
# -t  : number of CPUs to use
# -f  : path to forward paired-end reads
# -f2 : path to reverse paired-end reads
# -o  : path to alignments output
kart -i metagenome.fna -t 20 \
    -f /path/to/forward_reads.fastq.gz \
    -f2 /path/to/reverse_reads.fastq.gz \
    -o alignments.sam

# Now sort alignments and convert to bam format
# -@ : number of CPUs to use
# -m : amount of memory to use per thread
samtools sort -@ 40 -m 10G alignments.sam -o alignments.bam
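Optionally, you can sanity-check the resulting alignments before handing them to Autometa:
samtools quickcheck alignments.bam && echo "bam OK"   # verify the file is intact
samtools flagstat alignments.bam                      # basic mapping statistics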
ORFs
Note
The following example requires prodigal, e.g.:
conda install -c bioconda prodigal
# The amino-acid ORFs file written by -a is required as input to the bash workflow
prodigal -i metagenome.fna \
    -f "gbk" \
    -d "metagenome.orfs.fna" \
    -o "metagenome.orfs.gbk" \
    -a "metagenome.orfs.faa" \
    -s "metagenome.all_orfs.txt"
Diamond blastp Preparation
Note
The following example requires diamond, e.g.:
conda install -c bioconda diamond
# --query : prodigal amino-acid ORFs (see prodigal output from above)
# --db    : diamond-formatted nr database (see the NCBI Preparation section below)
# --out   : this generated file is required as input to the bash workflow
diamond blastp \
    --query "metagenome.orfs.faa" \
    --db /path/to/nr.dmnd \
    --threads <num cpus to use> \
    --out blastp.tsv
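diamond writes BLAST tabular output (--outfmt 6) by default, which matches the "results table" this input expects. You can peek at the first few hits to confirm the run produced alignments:
head -n 5 blastp.tsv   # tab-separated columns: qseqid, sseqid, pident, ...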
NCBI Preparation
If you are running Autometa for the first time, you'll have to download the NCBI databases.
# First configure where you want to download the NCBI databases
autometa-config \
    --section databases --option ncbi \
    --value <path/to/your/ncbi/database/directory>
# Now download and format the NCBI databases
autometa-update-databases --update-ncbi
Note
You can check the default config paths using autometa-config --print.
See autometa-update-databases -h and autometa-config -h for the full list of options.
The previous command will download the following NCBI databases:
- Non-redundant nr database
- prot.accession2taxid.gz
- nodes.dmp, names.dmp and merged.dmp (found within taxdump.tar.gz)
Input Sample Name
A crucial step prior to running the Autometa bash workflow is specifying the metagenome sample name and where to output Autometa's results.
# Default
simpleName="TemplateAssemblyName"
# Replace with your sample name
simpleName="MySample"
Note
The simpleName that is provided will be used as a prefix for all of the resulting Autometa output files.
Output directory
Immediately following the simpleName parameter, you will need to specify where to write all results.
# Default
outdir="AutometaOutdir"
# Replace with your output directory...
outdir="MySampleAutometaResults"
Running the pipeline
After you have finished configuring and double-checking your parameter settings, you may run the pipeline via bash:
bash autometa.sh
or submit the pipeline to a queue, for example with Slurm:
sbatch autometa.sh
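If you are instead running on a single machine without a scheduler, a minimal sketch for keeping a log and detaching from the terminal:
nohup bash autometa.sh > autometa.log 2>&1 &   # run in background, capture stdout/stderr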
Caution
Make sure your autometa conda environment is activated, or the autometa entrypoints will not be available.
Additional parameters
You can also adjust other pipeline parameters that ultimately control how binning is performed. These are located at the top of the workflow, just under the required inputs (a sketch with the documented defaults follows the list).
- length_cutoff: Smallest contig you want binned (default is 3000 bp)
- kmer_size: k-mer size to use
- norm_method: Which k-mer frequency normalization method to use. See Advanced Usage section for details
- pca_dimensions: Number of dimensions to which to reduce the initial k-mer frequencies matrix (default is 50). See Advanced Usage section for details
- embed_method: Choices are sksne, bhsne, umap, densmap, trimap (default is bhsne). See Advanced Usage section for details
- embed_dimensions: Final dimensions of the k-mer frequencies matrix (default is 2). See Advanced Usage section for details
- cluster_method: Clustering method for grouping contigs. Choices are "dbscan" and "hdbscan" (default is "dbscan"). See Advanced Usage section for details
- binning_starting_rank: Taxonomic rank at which to start binning. Choices are superkingdom, phylum, class, order, family, genus, species (default is superkingdom). See Advanced Usage section for details
- classification_method: Classification method to use for the unclustered recruitment step. Choices are decision_tree and random_forest (default is decision_tree). See Advanced Usage section for details
- completeness: Minimum completeness needed to keep a cluster (default is at least 20% complete, e.g. 20). See Advanced Usage section for details
- purity: Minimum purity needed to keep a cluster (default is at least 95% pure, e.g. 95). See Advanced Usage section for details
- cov_stddev_limit: Which clusters to keep depending on the coverage standard deviation (default is 25%, e.g. 25). See Advanced Usage section for details
- gc_stddev_limit: Which clusters to keep depending on the GC% standard deviation (default is 5%, e.g. 5). See Advanced Usage section for details
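For reference, these parameters are set as shell variables near the top of autometa.sh. A sketch using only the defaults documented above (assignment style assumed from the simpleName and outdir examples; kmer_size and norm_method are omitted since their defaults are not listed here):
length_cutoff=3000                      # bp
pca_dimensions=50
embed_method="bhsne"
embed_dimensions=2
cluster_method="dbscan"
binning_starting_rank="superkingdom"
classification_method="decision_tree"
completeness=20                         # percent
purity=95                               # percent
cov_stddev_limit=25                     # percent
gc_stddev_limit=5                       # percent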
Note
If you are configuring an Autometa job using the autometa-large-data-mode.sh template, there will be an additional parameter called max_partition_size (default=10,000). This is the maximum size partition the Autometa clustering algorithm will consider. Any taxon partitions larger than this setting will be skipped.