Databases

If you are running Autometa for the first time, you will need to download and format a few databases. You may do this manually or with Autometa’s helper scripts. To use the helper scripts, you must first install Autometa (see Installation).

Each of the following sections first configures Autometa with the path to a database directory and then updates the database at that path.

Markers

# Point Autometa to where you would like to store your markers database directory
autometa-config \
    --section databases --option markers \
    --value <path/to/your/markers/database/directory>

# Update your markers database directory
autometa-update-databases --update-markers

Links to these markers files and their associated cutoff values are below:

  • bacteria single-copy-markers - link

  • bacteria single-copy-markers cutoffs - link

  • archaea single-copy-markers - link

  • archaea single-copy-markers cutoffs - link

NCBI

# First configure where you want to download the NCBI databases
autometa-config \
    --section databases --option ncbi \
    --value <path/to/your/ncbi/database/directory>

# Now download and format the NCBI databases
autometa-update-databases --update-ncbi

Note

You can check the config paths using autometa-config --print.

See autometa-update-databases -h and autometa-config -h for the full list of options.

The previous command will download the following NCBI databases:

After these files are downloaded, the contents of the taxdump.tar.gz tarball are extracted and the non-redundant protein database (nr.gz) is formatted as a diamond database (i.e. nr.dmnd). This significantly speeds up the diamond blastp searches.
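To confirm that the download and formatting steps finished, one option is to check the configured NCBI directory for the expected outputs. The following is a minimal sketch, not part of Autometa: check_ncbi_dir is a hypothetical helper, and the file list is an assumption (nr.dmnd is named above; nodes.dmp, names.dmp and merged.dmp are files normally found inside NCBI's taxdump.tar.gz, and they are assumed here to be extracted directly into the configured directory).

```shell
# Hypothetical helper: report which expected NCBI files are missing from a directory.
# The file names and their location are assumptions, see the paragraph above.
check_ncbi_dir() {
    local dir="$1"
    local missing=0
    local name
    for name in nr.dmnd nodes.dmp names.dmp merged.dmp; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "${dir}/${name}" ]; then
            echo "missing or empty: ${dir}/${name}"
            missing=1
        fi
    done
    # Exit status 0 when everything was found, 1 otherwise
    return "${missing}"
}
```

For example, check_ncbi_dir <path/to/your/ncbi/database/directory> prints one line per missing file and returns a non-zero status if any file is absent.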

Genome Taxonomy Database (GTDB)

To use the Genome Taxonomy Database, you can either run the following commands or download the respective databases manually. GTDB release 220 or later is required.
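Since only GTDB 220 or later is supported, it may help to validate a release value before writing it to the config. The following is a minimal sketch under that assumption: is_supported_gtdb_release is a hypothetical helper, not an Autometa command, and it accepts the value forms used in this section (latest, 220, 220.0). Whether Autometa performs a similar check itself is not stated here.

```shell
# Hypothetical helper: accept "latest" or a release number >= 220 (e.g. 220, 220.0).
is_supported_gtdb_release() {
    local release="$1"
    local major
    case "${release}" in
        latest) return 0 ;;
        ''|*[!0-9.]*) return 1 ;;   # empty, or contains characters other than digits and dots
    esac
    major="${release%%.*}"          # keep the part before the first dot: "220.0" -> "220"
    [ -n "${major}" ] && [ "${major}" -ge 220 ]
}
```

For example, is_supported_gtdb_release 220.0 succeeds, while is_supported_gtdb_release 214 returns a non-zero status.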

# First configure where you want to download the GTDB databases
autometa-config \
    --section databases --option gtdb \
    --value <path/to/your/gtdb/database/directory>

# To use a specific GTDB release, set `--value` to `latest`
# or to a version number such as `220` or `220.0`
autometa-config \
    --section gtdb --option release \
    --value latest

# Download and format the configured GTDB database release
autometa-update-databases --update-gtdb

Note

You can check the default config paths using autometa-config --print.

See autometa-update-databases -h and autometa-config -h for the full list of options.

The previous command will download the following GTDB databases and format them for use by Autometa. The filenames will be modified to include the release version number for reproducibility.

The original files

The initial download and formatting of the GTDB databases can take some time. The GTDB databases are large, and downloading/formatting requires ~283 GB of hard disk space.

For version 220, the file sizes are approximately:

  • 77 MB gtdb-taxdump-version-220.tar.gz

  • 67 GB gtdb_proteins_aa_reps-version-220.tar.gz

  • 149 GB autometa_formatted_gtdb-version-220.0.dmnd

  • 103 MB ./gtdb_taxdump-version-220/
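Given the ~283 GB requirement noted above, it can be worth checking free disk space before starting the GTDB download. The following is a minimal sketch: has_free_space is a hypothetical helper, not an Autometa command, and it counts 1 GB as 1024 * 1024 KB. The directory passed to it must already exist.

```shell
# Hypothetical helper: succeed only if the filesystem holding DIR has at least
# REQUIRED_GB gigabytes available.
has_free_space() {
    local dir="$1"
    local required_gb="$2"
    local available_kb
    # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available"
    available_kb="$(df -Pk "${dir}" | awk 'NR == 2 { print $4 }')"
    [ -n "${available_kb}" ] && [ "${available_kb}" -ge $(( required_gb * 1024 * 1024 )) ]
}
```

For example, has_free_space <path/to/your/gtdb/database/directory> 283 returns a non-zero status when less than 283 GB is available, so it can guard the autometa-update-databases --update-gtdb call.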