Usage

Installation

⚙️ To use AMPcombi, first install it using:

  • Using conda:

    conda create -n ampcombi python==3.11 mmseqs2==15.6f452 ampcombi
    

    or

    conda env create -f environment.yml
    
  • Using singularity and docker:

    singularity pull ampcombi:0.2.2--pyhdfd78af_0
    
  • From git repository:

    git clone https://github.com/Darcy220606/AMPcombi.git
    

📜 For full usage documentation of AMPcombi and its submodules, please refer to the help documentation, accessed with:

ampcombi --help

Submodules

parse_tables

The parse_tables submodule parses and filters the output files generated by the different AMP prediction tools described in About. It then aligns the amino acid sequences against a reference database to retrieve structural and functional metadata for similar AMPs. One of three databases can be chosen: DRAMP (the default), APD, or UniRef100. If a custom database is required, pass the path to its folder, e.g. ref_database, to --amp_database_dir ./ref_database/. 💡 The folder must contain the database in FASTA format with the file extension *.fasta. Additionally, the submodule estimates the physicochemical properties of the entire prepropeptide sequence of the recovered AMP hits.

Many filtering parameters have defaults chosen during our development process; feel free to adjust them to thresholds that suit your dataset. For a full list of the available options and their defaults, please refer to the help documentation of the submodule:

ampcombi parse_tables --help

Example Usage (1)

ampcombi parse_tables \
--amp_results path/to/my/result_folder/ \          #required
--faa path/to/sample_faa_files/ \                  #required
--gbk path/to/sample_gbk_or_gbff_files/ \          #required
--interproscan_output path/to/interproscan_files \ #optional
--sample_list sample_1 sample_2 \                  #required
--contig_metadata path/to/contig_metadata.tsv \    #optional
--amp_database 'DRAMP' \
--<tool_1>_file '.tsv' \
--<tool_2>_file '.txt' \
--log true \
--threads 10

In this case, we use the --amp_results option to supply AMP tool prediction results from many samples in a folder format. The folder must follow this structure:

amp_results/
├── tool_1/
│   ├── sample_1/
│   │   └── sample_1.tsv
│   └── sample_2/
│       └── sample_2.tsv
├── tool_2/
│   ├── sample_1/
│   │   └── sample_1.txt
│   └── sample_2/
│       └── sample_2.txt
└── tool_3/
    ├── sample_1/
    │   └── sample_1.fasta
    └── sample_2/
        └── sample_2.fasta

  • --<tool>_file The <tool> should be changed to one of the following: ampir, macrel, amplify, neubi, hmmsearch, ensemblamppred, ampgram, amptransformer. The argument value should be the suffix of the files generated by that tool. Defaults are assigned for each tool, but the user can change them to match the extensions of their input files. An example of the input files can be found here.

  • --contig_metadata A *.tsv file that must contain the sample name in the first column and the contig ID/name in the second column. Note: Column headers will be overwritten. An example of the input file can be found here.

  • --faa A folder containing annotated files of the AMP hits with the suffix *.faa. These can be generated by any annotation tool (e.g., PROKKA or PYRODIGAL). Note: The file names must include the sample name, for example, <samplename>.faa. An example of the input file can be found here.

  • --gbk A folder containing annotated files of the AMP hits with the suffix *.gbk or *.gbff. These can be generated by any annotation tool (e.g., PROKKA or PYRODIGAL). Note: The file names must include the sample name, for example, <samplename>.gbk or <samplename>.gbff. An example of the input file can be found here.

  • --amp_database The database used for AMP prediction. Can be either 'DRAMP', 'APD' or 'UniRef100'.

  • --interproscan_output A path to a directory or file that contains the results generated by running InterProScan on the annotated sequences (*.faa). Note: The file names must match <sample_name>.tsv. Additionally, coding sequences classified as 'ribosomal proteins' can be filtered out using: --interproscan_filter 'ribosomal proteins,ribosomal', which is done by default. An example of the input file can be found here. An example of how to run InterProScan to prepare the files is provided in test.
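The folder layout expected by --amp_results can be checked before a run. The following is a minimal sketch and not part of AMPcombi: the function name missing_result_files and the tool-to-suffix mapping are illustrative assumptions; only the <tool>/<sample>/<sample><suffix> layout is taken from the structure shown above.

```python
from pathlib import Path


def missing_result_files(amp_results, tool_suffixes, samples):
    """Return the expected result files that are absent under amp_results.

    Assumes the layout <amp_results>/<tool>/<sample>/<sample><suffix>.
    tool_suffixes maps a tool folder name to its file suffix,
    e.g. {"tool_1": ".tsv"} (hypothetical values).
    """
    root = Path(amp_results)
    missing = []
    for tool, suffix in tool_suffixes.items():
        for sample in samples:
            expected = root / tool / sample / f"{sample}{suffix}"
            if not expected.is_file():
                missing.append(expected)
    return missing
```

An empty list means every tool folder holds one file per sample; any returned path points at a file that still needs to be produced or renamed.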

Example Usage (2)

ampcombi parse_tables \
--path_list path_to_sample_1_tool_1.tsv path_to_sample_1_tool_2.txt \
--sample_list sample_1 \
--faa path/to/sample_faa_files/sample_1.faa \
--gbk path/to/sample_gbk_or_gbff_files/sample_1.<gbk><gbff> \
--<tool_1>_file '.tsv' \
--<tool_2>_file '.txt'

In this case, we use the --path_list option to supply AMP tool prediction results from a single sample in a list format.
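When many samples are processed one at a time, the call above can be assembled programmatically. This is a sketch under stated assumptions: the helper build_parse_tables_command, the file names and the tool-to-suffix mapping are hypothetical; only the option names (--path_list, --sample_list, --faa, --gbk, --<tool>_file) come from the example above.

```python
import shlex


def build_parse_tables_command(sample, result_files, faa, gbk, tool_suffixes):
    """Assemble an 'ampcombi parse_tables' call for one sample as an argument list.

    tool_suffixes maps a tool name (e.g. "ampir") to its file suffix and is
    turned into --<tool>_file options.
    """
    cmd = ["ampcombi", "parse_tables", "--path_list", *result_files,
           "--sample_list", sample, "--faa", faa, "--gbk", gbk]
    for tool, suffix in tool_suffixes.items():
        cmd += [f"--{tool}_file", suffix]
    return cmd


cmd = build_parse_tables_command(
    "sample_1",
    ["sample_1_ampir.tsv", "sample_1_macrel.txt"],  # hypothetical file names
    "faa/sample_1.faa",
    "gbk/sample_1.gbk",
    {"ampir": ".tsv", "macrel": ".txt"},  # hypothetical suffixes
)
# Printable shell form; pass cmd to subprocess.run to actually execute it.
print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting problems with paths that contain spaces.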

Some optional parameters that can be tweaked:

| Parameter | Description | Default | Example value |
| --- | --- | --- | --- |
| --amp_cutoff | Probability cutoff to filter AMPs (not applicable for hmmsearch) | 0.0 | 0.5 |
| --hmm_evalue | E-value cutoff to filter AMPs (only applicable for hmmsearch) | None | 0.05 |
| --db_evalue | E-value cutoff to filter database classifications: any hit with an E-value below this will have its database classification removed | None | 0.05 |
| --aminoacid_length | Cutoff to filter AMP hits by the length of the amino acid sequence | 100 | 60 |
| --window_size_stop_codon | The size of the window in which to look for stop codons downstream and upstream of the CDS hits | 60 | 40 |
| --window_size_transporter | The size of the window in which to look for a 'transporter', e.g. an ABC transporter, downstream and upstream of the CDS hits | 11 | 20 |
| --remove_stop_codons | Removes any AMP hits that have no stop codon in the window downstream or upstream of the CDS set by '--window_size_stop_codon'. Must be turned on if hits are to be removed | False | True |
| --sample_metadata | Path to a TSV file containing sample metadata, e.g. 'path/to/sample_metadata.tsv'. The metadata table can hold more information for sample identification, which will be added to the output summary. The table needs to contain the sample names in the first column. | None | ./sample_metadata.tsv/ |
| --contig_metadata | Path to a TSV file containing contig metadata, e.g. 'path/to/contig_metadata.tsv'. The metadata table can hold more information for contig classification, which will be added to the output summary. The table needs to contain the sample names in the first column and the contig_ID in the second column. The metadata table can be the output from MMseqs2, pydamage, and MetaWrap. | None | ./contig_metadata.tsv/ |
| --interproscan_filter | A comma-separated list of keywords describing proteins that are not required in the analysis. | 'ribosomal protein,ribosomal proteins,ribosome protein,ribosomal rna,Ribosomal protein,RIBOSOMAL PROTEIN' | '16S' |
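The --contig_metadata table only has two fixed requirements: the sample name in the first column and the contig ID in the second. A minimal sketch of writing such a file follows; the helper write_contig_metadata and the header names are illustrative assumptions (the headers are overwritten by AMPcombi anyway, as noted earlier).

```python
import csv


def write_contig_metadata(path, rows, extra_columns=()):
    """Write a contig metadata TSV: sample name first, contig ID second.

    rows is an iterable of (sample, contig_id, *extra_values) tuples.
    Header names are placeholders, not names required by AMPcombi.
    """
    header = ["sample_id", "contig_id", *extra_columns]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        for row in rows:
            # Reject rows that would shift columns out of position.
            if len(row) != len(header):
                raise ValueError(f"expected {len(header)} fields, got {len(row)}: {row!r}")
            writer.writerow(row)
```

For example, write_contig_metadata("contig_metadata.tsv", [("sample_1", "contig_1", "Bacteria")], extra_columns=["taxonomy"]) produces a three-column table whose extra column is hypothetical.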

Output

The output will be written into your working directory, containing the following files and folders:

<pwd>/
├── amp_DRAMP_database/
│   ├── mmseqs2/
│   │   ├── ref_DB
│   │   ├── ref_DB_h
│   │   ├── ref_DB_h.dbtype
│   │   ├── ref_DB_h.index
│   │   ├── ref_DB.dbtype
│   │   ├── ref_DB.index
│   │   ├── ref_DB.lookup
│   │   └── ref_DB.source
│   ├── general_amps_<Date>_clean.fasta
│   └── general_amps_<Date>.tsv
├── sample_1/
│   ├── contig_gbks/
│   ├── sample_1_amp.faa
│   ├── sample_1_ampcombi.tsv
│   ├── sample_1_mmseqs_matches.txt
│   └── sample_1_ampcombi.log
├── sample_2/
│   ├── contig_gbks/
│   ├── sample_2_amp.faa
│   ├── sample_2_ampcombi.tsv
│   ├── sample_2_mmseqs_matches.txt
│   └── sample_2_ampcombi.log
└── Ampcombi_parse_tables.log

complete

The complete submodule allows AMPcombi to be integrated into portable pipelines, for example nf-core/funcscan, that can parallelize the processing of data. It takes the output of parse_tables as input and merges all sample tables into one final *.tsv.

To get a full list of options available and their defaults please refer to the help documentation of the submodule:

ampcombi complete --help

Example Usage (1)

ampcombi complete \
--summaries_directory path/to/ampcombi_parse_tables_results_folder/

In this case we use the --summaries_directory option to supply the samples' result folder from ampcombi parse_tables. It should contain the folder structure produced by ampcombi parse_tables inside a parent folder, for example one named ./ampcombi/….

Example Usage (2)

ampcombi complete \
--summaries_files path/to/ampcombi_parse_tables/sample_1_ampcombi.tsv path/to/ampcombi_parse_tables/sample_2_ampcombi.tsv

In this case we use the --summaries_files option to supply the AMPcombi summary files produced by ampcombi parse_tables in a list format.
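The list passed to --summaries_files can be gathered from the parse_tables output folder. The sketch below assumes the <sample>/<sample>_ampcombi.tsv layout shown in the parse_tables output; the helper names collect_summary_files and build_complete_command are hypothetical and not part of AMPcombi.

```python
from pathlib import Path


def collect_summary_files(results_dir):
    """Return the per-sample *_ampcombi.tsv files one level below results_dir.

    Assumes the layout <results_dir>/<sample>/<sample>_ampcombi.tsv.
    """
    return sorted(Path(results_dir).glob("*/*_ampcombi.tsv"))


def build_complete_command(summary_files):
    """Assemble an 'ampcombi complete --summaries_files ...' argument list."""
    if not summary_files:
        raise ValueError("no summary files found")
    return ["ampcombi", "complete", "--summaries_files", *map(str, summary_files)]
```

The returned list can be handed to subprocess.run; sorting keeps the order of the summary files reproducible between runs.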

Output

The output will be written into your working directory, containing the following files:

<pwd>/
├── Ampcombi_summary.tsv
└── Ampcombi_complete.log

cluster

The cluster submodule clusters the output from complete (i.e., Ampcombi_summary.tsv) into subclasses of similar AMP families. It relies primarily on MMseqs2 cluster v.15.6f452. Only the parameters deemed important for the purpose of AMPcombi are exposed as optional arguments.

To get a full list of options available and their defaults please refer to the help documentation of the submodule:

ampcombi cluster --help

Example Usage

ampcombi cluster \
--ampcombi_summary path/to/Ampcombi_summary.tsv

The --ampcombi_summary option takes the summary table Ampcombi_summary.tsv produced by ampcombi complete as input.

Some optional parameters that can be tweaked:

| Parameter | Description | Default | Example value |
| --- | --- | --- | --- |
| --cluster_cov_mode | Sets the coverage mode of the MMseqs2 cluster module. See the MMseqs2 documentation for details. | 0 | 2 |
| --cluster_mode | Sets the cluster mode of the MMseqs2 cluster module. See the MMseqs2 documentation for details. | 1 | 2 |
| --cluster_coverage | Sets the coverage threshold of the MMseqs2 cluster module. See the MMseqs2 documentation for details. | 0.8 | 0.9 |
| --cluster_seq_id | Sets the sequence identity threshold of the MMseqs2 cluster module. See the MMseqs2 documentation for details. | 0.4 | 0.7 |
| --cluster_sensitivity | Sets the alignment sensitivity of the MMseqs2 cluster module. See the MMseqs2 documentation for details. | 4.0 | 7.0 |
| --cluster_remove_singletons | Removes any hits that did not form a cluster. | True | False |
| --cluster_retain_label | Removes any cluster that only has a certain label in the sample name. For example, if you have sample labels with 'S1_metaspades' and 'S1_megahit', you can retain clusters that have samples with the suffix '_megahit' by running --cluster_retain_label megahit. | '' | 'megahit' |
| --cluster_min_member | Removes any cluster with fewer hits than the value given here. | 3 | 1 |
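To make the effect of --cluster_min_member and --cluster_remove_singletons concrete, the sketch below applies the same kind of size filter to a list of (hit, cluster) pairs. It illustrates what the thresholds mean and is not AMPcombi's implementation; the function name filter_clusters and the way the two options are combined (singleton removal enforces a minimum size of 2) are assumptions.

```python
from collections import Counter


def filter_clusters(hits, min_member=3, remove_singletons=True):
    """Drop hits whose cluster has too few members.

    hits is a list of (hit_id, cluster_id) pairs.
    """
    # Count how many hits fall into each cluster.
    sizes = Counter(cluster_id for _, cluster_id in hits)
    # Assumption: removing singletons means requiring at least two members.
    threshold = max(min_member, 2) if remove_singletons else min_member
    return [(hit, cluster) for hit, cluster in hits if sizes[cluster] >= threshold]


hits = [("a", 1), ("b", 1), ("c", 1), ("d", 2), ("e", 2), ("f", 3)]
kept = filter_clusters(hits)  # defaults taken from the table: 3 and True
```

With the defaults only the three hits of cluster 1 remain; lowering min_member to 1 and disabling singleton removal keeps every hit.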

Output

The output will be written into your working directory, containing the following files:

<pwd>/
  ├── Ampcombi_summary_cluster.tsv
  ├── Ampcombi_summary_cluster_representative_seq.tsv
  └── Ampcombi_cluster.log

  • Ampcombi_summary_cluster.tsv includes the contents of the complete summary plus a column with cluster IDs.

  • Ampcombi_summary_cluster_representative_seq.tsv includes the table with all the representative hits from each cluster.

signal_peptide

The signal_peptide submodule predicts whether the filtered and clustered AMP hits carry a signal peptide. This only works if the user installs SignalP separately. For licensing reasons, SignalP can only be downloaded and used by academic users; other users are requested to contact DTU Health Technology Software Package before using it. For further details about the usage of SignalP, please refer to its documentation.

To get a full list of options available and their defaults please refer to the help documentation of the submodule:

ampcombi signal_peptide --help

Example Usage

ampcombi signal_peptide \
--signalp_model path/to/signalp_model/ \
--ampcombi_cluster path/to/Ampcombi_summary_cluster.tsv \
--log true

The --ampcombi_cluster option takes the summary table produced by ampcombi cluster or ampcombi complete (Ampcombi_summary_cluster.tsv or Ampcombi_summary.tsv) as input.

Output

The output will be written into your working directory, containing the following files:

<pwd>/
  ├── Ampcombi_summary_cluster_SP.tsv
  ├── Ampcombi_summary_cluster_SP_onlyclusterswithSP.tsv
  ├── signalp/
  │   ├── output_*.png
  │   ├── prediction_results_index.tsv
  │   ├── prediction_results.tsv
  │   └── representative_seq.txt
  └── Ampcombi_signalpeptide.log

  • Ampcombi_summary_cluster_SP.tsv includes the contents of the cluster summary plus a column with yes/no indicating the presence of a signal peptide sequence.

  • Ampcombi_summary_cluster_SP_onlyclusterswithSP.tsv includes the same contents and yes/no column, but clusters are retained only if they contain at least one hit with a signal peptide.

  • signalp is a directory containing the results from SignalP, including *.png files showing the location of the predicted signal peptide.

The prediction_results.tsv file contains a table with the location and identity of the signal peptide. The prediction_results_index.tsv file contains a table that assigns an index number to every hit found in Ampcombi_summary_cluster_SP_onlyclusterswithSP.tsv. This index can be used to rename the files generated by running LocalColabFold on the AMP cluster representatives found in Ampcombi_summary_cluster_representative_seq.tsv for further downstream analysis of the secondary structure.
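The rule behind the onlyclusterswithSP table (a cluster is kept only if at least one of its hits has a signal peptide) can be sketched as follows. This is an illustration, not AMPcombi's code: the function name and the dictionary keys "cluster_id" and "signal_peptide" are hypothetical and may differ from the actual column names in the output tables.

```python
def clusters_with_signal_peptide(rows):
    """Keep rows that belong to clusters with at least one signal-peptide hit.

    rows is a list of dicts with the hypothetical keys "cluster_id" and
    "signal_peptide" (values "yes" or "no").
    """
    # Clusters in which any hit is flagged "yes".
    positive = {row["cluster_id"] for row in rows if row["signal_peptide"] == "yes"}
    # Keep every member of those clusters, including the "no" hits.
    return [row for row in rows if row["cluster_id"] in positive]


rows = [
    {"hit": "a", "cluster_id": 1, "signal_peptide": "yes"},
    {"hit": "b", "cluster_id": 1, "signal_peptide": "no"},
    {"hit": "c", "cluster_id": 2, "signal_peptide": "no"},
]
retained = clusters_with_signal_peptide(rows)
```

Here both hits of cluster 1 are retained, because one of them carries a signal peptide, while cluster 2 is dropped entirely.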