large_data_mode.py

usage: large_data_mode.py

Autometa large-data-mode binning by contig set selection using
`--max-partition-size`

optional arguments:
  -h, --help            show this help message and exit
  --kmers filepath      Path to k-mer counts table (default: None)
  --coverages filepath  Path to metagenome coverages table (default: None)
  --gc-content filepath
                        Path to metagenome GC contents table (default: None)
  --markers filepath    Path to Autometa annotated markers table (default:
                        None)
  --taxonomy filepath   Path to Autometa assigned taxonomies table (default:
                        None)
  --output-binning filepath
                        Path to write Autometa binning results (default: None)
  --output-main filepath
                        Path to write Autometa main table used during/after
                        binning (default: None)
  --clustering-method {dbscan,hdbscan}
                        Clustering algorithm to use for recursive binning.
                        (default: dbscan)
  --completeness 0 < float <= 100
                        Completeness cutoff to retain a cluster, i.e. cluster
                        completeness >= `--completeness` (default: 20.0)
  --purity 0 < float <= 100
                        Purity cutoff to retain a cluster, i.e. cluster purity
                        >= `--purity` (default: 95.0)
  --cov-stddev-limit float
                        Coverage standard deviation limit to retain a cluster,
                        i.e. cluster coverage standard deviation <=
                        `--cov-stddev-limit` (default: 25.0)
  --gc-stddev-limit float
                        GC content standard deviation limit to retain a
                        cluster, i.e. cluster GC content standard deviation <=
                        `--gc-stddev-limit` (default: 5.0)
  --norm-method {am_clr,ilr,clr}
                        k-mer normalization method to apply to k-mer counts
                        (default: am_clr)
  --pca-dims int        Number of PCA dimensions to which normalized k-mer
                        frequencies are reduced prior to embedding (default:
                        50)
  --embed-method {bhsne,umap,sksne,trimap}
                        k-mer embedding method to apply to normalized k-mer
                        frequencies (default: bhsne)
  --embed-dims int      Number of embedding dimensions to which the normalized
                        k-mers table is reduced after PCA (default: 2)
  --max-partition-size int
                        Maximum number of contigs to consider for a recursive
                        binning batch. (default: 10000)
  --starting-rank {superkingdom,phylum,class,order,family,genus,species}
                        Canonical rank at which to begin subsetting taxonomy
                        (default: superkingdom)
  --reverse-ranks       Reverse the order in which taxonomy is split by
                        canonical rank. When `--reverse-ranks` is given,
                        contigs are split in the order species, genus, family,
                        order, class, phylum, superkingdom. (default: False)
  --cache dirpath       Directory to store intermediate checkpoint files during
                        binning. If this is provided and the job fails, the
                        script will attempt to resume from the checkpoints in
                        this cache directory. (default: None)
  --binning-checkpoints filepath
                        File path to store intermediate contig binning results
                        (the `--cache` argument is required for this feature).
                        If `--cache` is provided without this argument, a
                        binning checkpoints file will be created. (default:
                        None)
  --rank-filter {superkingdom,phylum,class,order,family,genus,species}
                        Canonical rank (taxonomy column) to subset by the
                        value given to `--rank-name-filter` (default:
                        superkingdom)
  --rank-name-filter RANK_NAME_FILTER
                        Only retrieve contigs whose `--rank-filter` column
                        matches this name (default: bacteria)
  --verbose             Log debug information (default: False)
  --cpus int            Number of cores for the clustering method to use; -1
                        tries to use as many as are available (default: -1)
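
example: cluster retention cutoffs

The four cutoff options above (`--completeness`, `--purity`,
`--cov-stddev-limit`, `--gc-stddev-limit`) each state a comparison that a
cluster must satisfy to be retained. The following is a minimal sketch of how
those comparisons could be combined, assuming a cluster must pass all four at
once. The function name `retains_cluster` and its signature are hypothetical
and are not part of Autometa; only the comparison operators and default values
are taken from the help text.

```python
def retains_cluster(
    completeness: float,
    purity: float,
    cov_stddev: float,
    gc_stddev: float,
    *,
    completeness_cutoff: float = 20.0,  # --completeness default
    purity_cutoff: float = 95.0,        # --purity default
    cov_stddev_limit: float = 25.0,     # --cov-stddev-limit default
    gc_stddev_limit: float = 5.0,       # --gc-stddev-limit default
) -> bool:
    """Return True if a cluster passes all four documented cutoffs.

    Assumption: the cutoffs are combined with logical AND. The help text
    only states each comparison individually.
    """
    return (
        completeness >= completeness_cutoff
        and purity >= purity_cutoff
        and cov_stddev <= cov_stddev_limit
        and gc_stddev <= gc_stddev_limit
    )


if __name__ == "__main__":
    # Hypothetical cluster metrics, chosen only for illustration.
    print(retains_cluster(85.0, 97.5, 10.0, 2.0))
    print(retains_cluster(85.0, 90.0, 10.0, 2.0))
```

Under these assumptions the first call passes every cutoff, while the second
fails only the purity comparison (90.0 is below the default of 95.0).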