Databases
If you are running Autometa for the first time, you will need to download and format a few databases. You may do this manually or with Autometa's helper scripts. To use the helper scripts, you will first need to install Autometa (see Installation).
Each of the following sections uses a pair of commands: the first tells Autometa where the database directory should be, and the second downloads and formats the database at that configured path.
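Because every database follows the same two-step pattern, the pair of commands can be wrapped in a small shell function. This is only a sketch: `configure_and_update` is a hypothetical helper name, not part of Autometa, and it assumes the option name (`markers`, `ncbi` or `gtdb`) matches the suffix of the corresponding `--update-` flag, as it does in the commands shown on this page.

```shell
# configure_and_update OPTION DB_DIR
# OPTION is one of: markers, ncbi, gtdb (the options used on this page).
# Hypothetical helper; not shipped with Autometa.
configure_and_update() {
    option="$1"
    db_dir="$2"
    # Step 1: record the database directory in Autometa's config
    autometa-config \
        --section databases --option "$option" \
        --value "$db_dir" || return 1
    # Step 2: download and format the database at the configured path
    autometa-update-databases "--update-${option}"
}

# Example call (the path is a placeholder):
# configure_and_update markers "$HOME/autometa_dbs/markers"
```

The function stops after the first step if `autometa-config` fails, so a mistyped option does not trigger a download.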
Markers
# Point Autometa to where you would like your markers database directory
autometa-config \
--section databases --option markers \
--value <path/to/your/markers/database/directory>
# Update your markers database directory
autometa-update-databases --update-markers
Links to these markers files and their associated cutoff values are below:
NCBI
# First configure where you want to download the NCBI databases
autometa-config \
--section databases --option ncbi \
--value <path/to/your/ncbi/database/directory>
# Now download and format the NCBI databases
autometa-update-databases --update-ncbi
Note
You can check the config paths using autometa-config --print.
See autometa-update-databases -h and autometa-config -h for the full list of options.
The previous command will download the following NCBI databases:
- Non-redundant nr database
- prot.accession2taxid.gz
- nodes.dmp, names.dmp, merged.dmp and delnodes.dmp - Found within
After these files are downloaded, the taxdump.tar.gz tarball's files are extracted and the non-redundant protein database (nr.gz)
is formatted as a diamond database (i.e. nr.dmnd). This will significantly speed up the diamond blastp searches.
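If you prefer to do this step by hand, the extraction and formatting might look roughly like the sketch below. It assumes `taxdump.tar.gz` and `nr.gz` already sit in the same directory, that the tarball stores the `.dmp` files at its top level, and that `diamond` is on your `PATH`. The function names are hypothetical, and the exact options Autometa passes to `diamond makedb` may differ from the minimal ones shown here.

```shell
# Extract only the four taxonomy files listed above from taxdump.tar.gz.
# Hypothetical helper; assumes the .dmp files are top-level tar members.
extract_taxdump() {
    ncbi_dir="$1"
    tar -xzf "${ncbi_dir}/taxdump.tar.gz" -C "${ncbi_dir}" \
        nodes.dmp names.dmp merged.dmp delnodes.dmp
}

# Build nr.dmnd from nr.gz (diamond reads gzip-compressed FASTA directly).
# Hypothetical helper; Autometa's own invocation may use additional options.
format_nr() {
    ncbi_dir="$1"
    diamond makedb --in "${ncbi_dir}/nr.gz" --db "${ncbi_dir}/nr"
}

# Example calls (the path is a placeholder):
# extract_taxdump /path/to/your/ncbi/database/directory
# format_nr /path/to/your/ncbi/database/directory
```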
Genome Taxonomy Database (GTDB)
If you would like to incorporate the benefits of using the Genome Taxonomy Database, you can either run the following script or manually download the respective databases. GTDB version 220 or later is required.
# First configure where you want to download the GTDB databases
autometa-config \
--section databases --option gtdb \
--value <path/to/your/gtdb/database/directory>
# To use a specific GTDB release
autometa-config \
--section gtdb --option release \
--value latest
# Or a version number like `--value 220`, or `--value 220.0`, etc.
# Download and format the configured GTDB databases release
autometa-update-databases --update-gtdb
Note
You can check the default config paths using autometa-config --print.
See autometa-update-databases -h and autometa-config -h for the full list of options.
The previous command will download the following GTDB databases and format them for use by Autometa. The filenames will be modified to include the release version number for reproducibility.
The original files:
- Amino acid sequences of representative genomes
- gtdb-taxdump.tar.gz from shenwei356/gtdb-taxdump
The initial download and formatting of the GTDB databases can take some time. The GTDB databases are large, and downloading/formatting requires ~283 GB of hard disk space.
For version 220, the file sizes are approximately:
77 MB gtdb-taxdump-version-220.tar.gz
67 GB gtdb_proteins_aa_reps-version-220.tar.gz
149 GB autometa_formatted_gtdb-version-220.0.dmnd
103 MB ./gtdb_taxdump-version-220/
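Given the space requirement, it can be worth checking free disk space before starting the download. The following is a minimal sketch, not part of Autometa: `has_free_gb` is a hypothetical helper, `GTDB_DIR` is a placeholder variable for your configured GTDB directory, and 283 is the approximate figure quoted above.

```shell
# has_free_gb DIR REQUIRED_GB
# Succeeds if the filesystem holding DIR has at least REQUIRED_GB GiB available.
has_free_gb() {
    dir="$1"
    required_gb="$2"
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    avail_kb="$(df -Pk "$dir" | awk 'NR==2 {print $4}')"
    [ "$avail_kb" -ge $((required_gb * 1024 * 1024)) ]
}

# GTDB_DIR is a placeholder; point it at your configured GTDB directory.
GTDB_DIR="${GTDB_DIR:-.}"
if has_free_gb "$GTDB_DIR" 283; then
    echo "Enough free space in ${GTDB_DIR} for the GTDB download."
else
    echo "Less than 283 GB free in ${GTDB_DIR}; free up space before updating." >&2
fi
```

The check only reports the result and does not run `autometa-update-databases` itself, so it is safe to run at any time.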