Usage

Installation

⚙️ To use AMPcombi, first install it via any of the three options:

Using conda with correctly set-up bioconda channel (see Bioconda usage docs):

conda create -n ampcombi python==3.11 mmseqs2==15.6f452 ampcombi

or

conda env create -f environment.yml

Using singularity or docker:

singularity pull ampcombi:3.0.0--pyhdfd78af_0

Clone GitHub repository:

git clone https://github.com/paleobiotechnology/AMPcombi.git

📜 For full parameter list and usage documentation of AMPcombi and its submodules, please refer to the CLI help message accessed by:

ampcombi --help

Submodules

parse_tables

The parse_tables submodule is used to parse and filter the output files generated by the different AMP prediction tools described in About. It further aligns the amino acid sequences to different reference databases to retrieve structural and functional metadata for similar AMPs. Additionally, the physicochemical properties of the entire prepropeptide sequence of the recovered AMP hits are estimated.

One of three reference databases (DRAMP, APD, or UniRef100) can be chosen via the --amp_database parameter, with DRAMP set as the default. The database will be auto-downloaded if it is not provided via the --amp_database_dir parameter.

Note

A pre-downloaded or custom database can be provided using the flag --amp_database_dir with the path to the database folder (e.g. --amp_database_dir ./ref_database/).

💡 The folder must contain the database in fasta format with file extension *.fasta and the parameter --amp_database still needs to be set to the correct database (DRAMP, APD, or UniRef100).

We have set default values for many filtering parameters as we saw fit for most use cases. However, feel free to adjust them to your dataset-specific thresholds.

To get a full list of available options and their default values please refer to the help documentation of the parse_tables submodule:

ampcombi parse_tables --help

Example Usage (1)

# Required: --amp_results, --faa, --gbk, --sample_list
# Optional: --interproscan_output, --contig_metadata
ampcombi parse_tables \
--amp_results path/to/my/result_folder/ \
--faa path/to/sample_faa_files/ \
--gbk path/to/sample_gbk_or_gbff_files/ \
--interproscan_output path/to/interproscan_files \
--sample_list sample_1 sample_2 \
--contig_metadata path/to/contig_metadata.tsv \
--amp_database 'DRAMP' \
--<tool_1>_file '.tsv' \
--<tool_2>_file '.txt' \
--log \
--threads 10

Explanation of parameters:

--amp_results

In this case, we use the --amp_results option to supply AMP prediction tool results from many samples in a folder format. The folder must follow this structure:

amp_results/
├── tool_1/
│   ├── sample_1/
│   │   └── sample_1.tsv
│   └── sample_2/
│       └── sample_2.tsv
├── tool_2/
│   ├── sample_1/
│   │   └── sample_1.txt
│   └── sample_2/
│       └── sample_2.txt
└── tool_3/
    ├── sample_1/
    │   └── sample_1.fasta
    └── sample_2/
        └── sample_2.fasta
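
Layout mistakes in this folder are easy to make, so it can help to check it before running parse_tables. The following is a minimal sketch, not part of AMPcombi: the function name `find_missing_sample_dirs` is hypothetical, and it assumes only the `<tool>/<sample>/` nesting shown in the tree above.

```python
from pathlib import Path


def find_missing_sample_dirs(amp_results, samples):
    """Return '<tool>/<sample>' entries that are missing or contain no files.

    Hypothetical helper (not part of AMPcombi): it only checks the
    <amp_results>/<tool>/<sample>/ nesting shown in the tree above.
    """
    root = Path(amp_results)
    missing = []
    # Every direct subfolder of amp_results is treated as one tool.
    for tool_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for sample in samples:
            sample_dir = tool_dir / sample
            has_files = sample_dir.is_dir() and any(
                f.is_file() for f in sample_dir.iterdir()
            )
            if not has_files:
                missing.append(f"{tool_dir.name}/{sample}")
    return missing
```

For example, `find_missing_sample_dirs("path/to/my/result_folder/", ["sample_1", "sample_2"])` returns an empty list when every tool folder holds a non-empty folder for each sample.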

--<tool>_file

The <tool> should be changed to one of the following: ampir, macrel, amplify, neubi, hmmsearch, ensemblamppred, ampgram, amptransformer. The argument value should be the suffix of the files generated by that tool, e.g. '.tsv'. Defaults are assigned for each tool, but the user can change these defaults according to their input file extensions. An example of the input files can be found here.

--contig_metadata

A TSV file that must contain the sample name in the first column and the contig ID/name in the second column. Note: Column headers will be overwritten. An example of the input file can be found here.

--faa

A folder containing annotated files of the AMP hits with the suffix *.faa. This can be generated by any annotation tool (e.g., Prokka or Pyrodigal). Note: The file names must include the sample name, for example, <samplename>.faa. An example of the input file can be found here.

--gbk

A folder containing annotated files of the AMP hits with the suffix *.gbk or *.gbff. This can be generated by any annotation tool (e.g., Prokka or Pyrodigal). Note: The file names must include the sample name, for example, <samplename>.gbk or <samplename>.gbff. An example of the input file can be found here.

--amp_database

The reference database used to classify the AMP hits. Can be one of 'DRAMP', 'APD', or 'UniRef100'.

--interproscan_output

A path to a directory or file that contains the results generated by running InterProScan on the annotated sequences (*.faa). Note: The file names must match <sample_name>.tsv. Additionally, coding sequences classified as ‘ribosomal proteins’ can be filtered out using: --interproscan_filter 'ribosomal proteins,ribosomal', which is done by default. An example of the input file can be found here. An example of how to run InterProScan to prepare the files is provided in Test runs.
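
The --contig_metadata requirements above (sample name in the first column, contig ID in the second) can be checked before a run. The sketch below is a hypothetical pre-flight check, not an AMPcombi function. It assumes the file has a header row, which is skipped because the headers are overwritten anyway.

```python
import csv


def check_contig_metadata(tsv_path, samples):
    """Return a list of problems found in a contig metadata TSV.

    Hypothetical pre-flight check (not part of AMPcombi). Assumes a header
    row, the sample name in column 1, and the contig ID in column 2.
    """
    problems = []
    seen_samples = set()
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row
        for line_no, row in enumerate(reader, start=2):
            if not row:
                continue  # ignore completely empty lines
            if len(row) < 2 or not row[0].strip() or not row[1].strip():
                problems.append(f"line {line_no}: expected sample name and contig ID")
                continue
            seen_samples.add(row[0].strip())
    # Every sample passed to --sample_list should appear at least once.
    for sample in samples:
        if sample not in seen_samples:
            problems.append(f"sample '{sample}' has no contig metadata rows")
    return problems
```

An empty return value means that every row has both columns filled and every listed sample appears at least once.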

Example Usage (2)

ampcombi parse_tables \
--path_list <path/to/sample_1_tool_1>.tsv <path/to/sample_1_tool_2>.txt \
--sample_list sample_1 \
--faa path/to/sample_faa_files/sample_1.faa \
--gbk path/to/sample_gbk_or_gbff_files/sample_1.<gbk,gbff> \
--<tool_1>_file '.tsv' \
--<tool_2>_file '.txt'

Explanation of parameters:

--path_list

In this case, we use the --path_list option to supply AMP prediction tool results from a single sample in a list format.

Some optional parameters that can be tweaked:

Parameter	Description	Default	Different example value
`--amp_cutoff`	Probability cutoff to filter AMPs by probability (not applicable for hmmsearch)	0.0	0.5
`--hmm_evalue`	E-value cutoff to filter AMPs (only applicable for hmmsearch)	None	0.05
`--db_evalue`	E-value cutoff to filter database classifications - any hit with an E-value below this will have its database classification removed	None	0.05
`--aminoacid_length`	Length cutoff to filter AMP hits by the length of the amino acid sequence	100	60
`--window_size_stop_codon`	The length of the window size required to look for stop codons downstream and upstream of the CDS hits	60	40
`--window_size_transporter`	The length of the window size required to look for a ‘transporter’ e.g. ABC transporter downstream and upstream of the CDS hits	11	20
`--remove_stop_codons`	Removes all AMP hits that don’t have a stop codon found in the window downstream or upstream of the CDS assigned by `--window_size_stop_codon`. Must be turned on if hits are to be removed.	False	True
`--sample_metadata`	Path to a TSV file containing sample metadata, e.g. `path/to/sample_metadata.tsv`. The metadata table can have more information for sample identification that will be added to the output summary. The table needs to contain the sample names in the first column.	None	./sample_metadata.tsv
`--contig_metadata`	Path to a TSV file containing contig metadata, e.g. `path/to/contig_metadata.tsv`. The metadata table can have more information for contig classification that will be added to the output summary. The table needs to contain the sample names in the first column and the contig_ID in the second column. The metadata table can be the output from MMseqs2, pydamage, and MetaWrap.	None	./contig_metadata.tsv
`--write_gbk`	Write a GBK file to disk containing contigs of filtered AMPs (e.g. if they include stop codons and transporter proteins in the vicinity). File name: `<sample>_filtered_AMP_contigs.gbk`	None	None
`--interproscan_filter`	A comma-separated list of keywords describing proteins that should be excluded from the analysis.	‘ribosomal protein,ribosomal proteins,ribosome protein,ribosomal rna,Ribosomal protein,RIBOSOMAL PROTEIN’	‘16S’

Output

The output will be written into your working directory, containing the following files and folders:

<pwd>/
├── amp_DRAMP_database/
│   ├── mmseqs2/
│   │   ├── ref_DB
│   │   ├── ref_DB_h
│   │   ├── ref_DB_h.dbtype
│   │   ├── ref_DB_h.index
│   │   ├── ref_DB.dbtype
│   │   ├── ref_DB.index
│   │   ├── ref_DB.lookup
│   │   └── ref_DB.source
│   ├── general_amps_<Date>_clean.fasta
│   └── general_amps_<Date>.tsv
├── sample_1/
│   ├── sample_1_filtered_AMP_contigs.gbk
│   ├── sample_1_amp.faa
│   ├── sample_1_ampcombi.tsv
│   ├── sample_1_mmseqs_matches.txt
│   └── sample_1_ampcombi.log
├── sample_2/
│   ├── sample_2_filtered_AMP_contigs.gbk
│   ├── sample_2_amp.faa
│   ├── sample_2_ampcombi.tsv
│   ├── sample_2_mmseqs_matches.txt
│   └── sample_2_ampcombi.log
└── Ampcombi_parse_tables.log

complete

The complete submodule allows AMPcombi to be integrated into portable pipelines (e.g. nf-core/funcscan, which can screen (meta)genome sequences with multiple tools simultaneously). The complete submodule takes the output from parse_tables as input and combines all sample tables into one final TSV file.

To get a full list of options available and their default values please refer to the help documentation of the submodule:

ampcombi complete --help

Example Usage (1)

ampcombi complete \
--summaries_directory path/to/ampcombi_parse_tables_results_folder/

In this case, we use the --summaries_directory option to supply the results folder from ampcombi parse_tables. It should contain the folder structure produced by ampcombi parse_tables inside a parent folder, for example named ./ampcombi/….

Example Usage (2)

ampcombi complete \
--summaries_files path/to/ampcombi_parse_tables/sample_1_ampcombi.tsv path/to/ampcombi_parse_tables/sample_2_ampcombi.tsv

In this case, we use the --summaries_files option to supply the per-sample summary files from ampcombi parse_tables in a list format.
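
When there are many samples, the file list for --summaries_files can be assembled programmatically. The sketch below is an illustration only: both helper names are hypothetical, and it assumes the parse_tables output layout described earlier, with one `<sample>/<sample>_ampcombi.tsv` file per sample folder.

```python
from pathlib import Path


def collect_summary_files(parse_tables_dir):
    """Return sorted paths of all <sample>/<sample>_ampcombi.tsv files.

    Hypothetical helper (not part of AMPcombi); assumes the parse_tables
    output layout with one folder per sample.
    """
    root = Path(parse_tables_dir)
    return sorted(str(p) for p in root.glob("*/*_ampcombi.tsv"))


def build_complete_command(parse_tables_dir):
    """Build the argument list for `ampcombi complete --summaries_files ...`."""
    return [
        "ampcombi",
        "complete",
        "--summaries_files",
        *collect_summary_files(parse_tables_dir),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` or joined into a shell command.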

Output

The output will be written into your working directory, containing the following files:

<pwd>/
├── Ampcombi_summary.tsv
└── Ampcombi_complete.log

Description of columns in Ampcombi_summary.tsv:

Column	Description
`sample_id`	Sample ID as given by the user in `--sample_list`
`CDS_id`	ID of the coding sequence (CDS) as annotated in input GBK file
`prob_amplify`	Probability of correct AMP prediction as given by AMPlify (value range 0-1)
`prob_ampir`	Probability of correct AMP prediction as given by ampir (value range 0-1)
`prob_macrel`	Probability of correct AMP prediction as given by Macrel (value range 0-1)
`aa_sequence`	Amino-acid sequence of the annotated AMP
`accession`	Accession number(s) as provided by the optional InterProScan results files
`description`	Protein description as provided by the optional InterProScan results files
`interpro_accession`	InterProScan accession number(s) as provided by the optional InterProScan results files
`interpro_description`	Additional protein description as provided by the optional InterProScan results files
`query`	If AMP hit was found in reference database (DRAMP, APD, or UniRef100): ID of the coding sequence (is redundant with `CDS_id`)
`target`	If AMP hit was found in reference database (DRAMP, APD, or UniRef100): ID of the target sequence in reference database
`evalue`	If AMP hit was found in reference database (DRAMP, APD, or UniRef100): E-value of the alignment by MMseqs2
`pident`	If AMP hit was found in reference database (DRAMP, APD, or UniRef100): Percent of identical matches by MMseqs2
`nident`	If AMP hit was found in reference database (DRAMP, APD, or UniRef100): Number of identical matches by MMseqs2
`tlen`	If AMP hit was found in reference database (DRAMP, APD, or UniRef100): Target amino acid sequence length
`tstart`	If AMP hit was found in reference database (DRAMP, APD, or UniRef100): 1-indexed alignment start position in target sequence
`tend`	If AMP hit was found in reference database (DRAMP, APD, or UniRef100): 1-indexed alignment end position in target sequence
`taln`	If AMP hit was found in reference database (DRAMP, APD, or UniRef100): Alignment sequence (amino acids, gaps)
`theader`	If AMP hit was found in reference database (DRAMP, APD, or UniRef100): ID of the target sequence in reference database (is redundant with `target`)
`alnlen`	If AMP hit was found in reference database (DRAMP, APD, or UniRef100): Alignment length
`qcov`	If AMP hit was found in reference database (DRAMP, APD, or UniRef100): Fraction of coverage of the query sequence
`tcov`	If AMP hit was found in reference database (DRAMP, APD, or UniRef100): Fraction of coverage of the target sequence
`DRAMP_ID`	If AMP hit was found in DRAMP database: ID of the target sequence in reference database (is redundant with `target`)
`Sequence`	If AMP hit was found in DRAMP database: Target sequence
`Name`	If AMP hit was found in DRAMP database: Target protein name
`Swiss_Prot_Entry`	If AMP hit was found in DRAMP database: Target protein UniProt/Swiss-Prot ID (if available)
`Family`	If AMP hit was found in DRAMP database: Protein family information (if available)
`Gene`	If AMP hit was found in DRAMP database: Associated gene (if available)
`Source`	If AMP hit was found in DRAMP database: Biological or synthetic source of the reference hit (if available)
`PDB_ID`	If AMP hit was found in DRAMP database: Protein Data Bank ID (if available)
`Target_Organism`	If AMP hit was found in DRAMP database: More information on the target organism (if available)
`molecular_weight`	Molecular weight as identified by Biopython (ProteinAnalysis)
`helix_fraction`	Fraction of amino acids in helix secondary structure as identified by Biopython (ProteinAnalysis)
`turn_fraction`	Fraction of amino acids in turn secondary structure as identified by Biopython (ProteinAnalysis)
`sheet_fraction`	Fraction of amino acids in beta sheet secondary structure as identified by Biopython (ProteinAnalysis)
`isoelectric_point`	Isoelectric point as identified by Biopython (ProteinAnalysis)
`hydrophobicity`	Hydrophobicity as identified by Biopython (ProteinAnalysis)
`transporter_protein`	Presence or absence of transporter protein in the genomic vicinity of the AMP
`contig_id`	Contig ID of the AMP
`CDS_start`	AMP CDS start position on contig
`CDS_end`	AMP CDS end position on contig
`CDS_dir`	Forward or reverse AMP CDS on contig
`CDS_stop_codon_found`	DNA sequence of stop codon in the vicinity of the AMP CDS if present
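
The summary table can be filtered further after the run using the columns described above. The sketch below keeps rows whose probability for one tool meets a threshold. It is a post-hoc illustration rather than an AMPcombi feature: the function name is hypothetical, and it assumes the file has a header row with the column names listed above and that missing values are empty or non-numeric.

```python
import csv


def filter_by_probability(summary_tsv, column="prob_ampir", min_prob=0.5):
    """Return rows of a summary TSV whose `column` value is >= min_prob.

    Post-hoc filtering sketch (not an AMPcombi feature). Rows where the
    value is empty, missing, or not numeric are skipped.
    """
    kept = []
    with open(summary_tsv, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                prob = float(row.get(column) or "nan")
            except ValueError:
                continue  # non-numeric placeholder
            if prob >= min_prob:  # NaN compares as False, so it is dropped
                kept.append(row)
    return kept
```

Each returned row is a dictionary keyed by column name, so `row["CDS_id"]` or `row["aa_sequence"]` can be read directly.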

cluster

The cluster submodule clusters the output from complete (i.e., Ampcombi_summary.tsv) into subclasses of similar AMP families. This relies primarily on MMseqs2 cluster v.15.6f452. Only the MMseqs2 parameters deemed most relevant to AMPcombi are exposed as optional arguments.

To get a full list of available options and their defaults please refer to the help documentation of the submodule:

ampcombi cluster --help

Example Usage

ampcombi cluster \
--ampcombi_summary path/to/Ampcombi_summary.tsv

The --ampcombi_summary parameter takes the output of ampcombi complete (i.e. the summary file Ampcombi_summary.tsv).

Some optional parameters that can be tweaked:

Parameter	Description	Default	Different example value
`--cluster_cov_mode`	This assigns the coverage mode to the mmseqs2 cluster module. More information can be obtained in mmseqs2 docs here.	0	2
`--cluster_mode`	This assigns the cluster mode to the mmseqs2 cluster module. More information can be obtained in mmseqs2 docs here.	1	2
`--cluster_coverage`	This assigns the coverage to the mmseqs2 cluster module. More information can be obtained in mmseqs2 docs here.	0.8	0.9
`--cluster_seq_id`	This assigns the sequence identity to the mmseqs2 cluster module. More information can be obtained in mmseqs2 docs here.	0.4	0.7
`--cluster_sensitivity`	This assigns sensitivity of alignment to the mmseqs2 cluster module. More information can be obtained in mmseqs2 docs here.	4.0	7.0
`--cluster_keep_singletons`	This keeps any hits that did not form a cluster.	False	True
`--cluster_retain_label`	This retains only clusters that have a certain label in the sample name. For example, if you have sample labels with ‘S1_metaspades’ and ‘S1_megahit’, you can retain clusters that have samples with suffix ‘_megahit’ by running `--cluster_retain_label megahit`.	‘’	‘megahit’
`--cluster_min_member`	This removes any cluster with fewer hits than the number assigned here.	3	1

Output

The output will be written into your working directory, containing the following files:

<pwd>/
  ├── Ampcombi_summary_cluster.tsv
  ├── Ampcombi_summary_cluster_representative_seq.tsv
  └── Ampcombi_cluster.log

Ampcombi_summary_cluster.tsv includes the contents of the complete summary (Ampcombi_summary.tsv) plus two additional columns:
- seq_headers: Sequence header of the representative AMP of the cluster
- cluster_id: ID of the cluster to which the AMP belongs

Ampcombi_summary_cluster_representative_seq.tsv:
- This file contains a short summary of the identified clusters, i.e. the header of their representative AMP sequence (seq_headers), the cluster ID (index), and the size of the cluster (total_cluster_members).
- Clusters of interest can be investigated in further detail in the comprehensive summary file Ampcombi_summary_cluster.tsv described above.
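
To look at clusters of interest in more detail, the rows of Ampcombi_summary_cluster.tsv can be grouped by cluster_id. The following is a minimal sketch with a hypothetical function name. It assumes the file has a header row containing the `cluster_id`, `sample_id`, and `CDS_id` columns described in this document.

```python
import csv
from collections import defaultdict


def group_by_cluster(cluster_tsv):
    """Map each cluster_id to the (sample_id, CDS_id) pairs it contains.

    Hypothetical helper (not part of AMPcombi); relies only on the
    cluster_id, sample_id, and CDS_id columns.
    """
    clusters = defaultdict(list)
    with open(cluster_tsv, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            clusters[row["cluster_id"]].append((row["sample_id"], row["CDS_id"]))
    return dict(clusters)
```

The length of each list gives the cluster size, which can be compared with total_cluster_members in the representative sequence file.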

signal_peptide

The signal_peptide submodule predicts whether a signal peptide is present in the filtered and clustered AMP hits. This only works if the user installs SignalP separately. SignalP may only be downloaded and used by academic users according to its license; other users are requested to contact DTU Health Technology Software Package before using it. For further details about the usage of SignalP, please refer to its documentation.

To get a full list of options available and their default values please refer to the help documentation of the submodule:

ampcombi signal_peptide --help

Example Usage

ampcombi signal_peptide \
--signalp_model path/to/signalp_model/ \
--ampcombi_cluster path/to/Ampcombi_summary_cluster.tsv \
--log

The --ampcombi_cluster parameter takes the output of ampcombi complete or ampcombi cluster (i.e. the file Ampcombi_summary.tsv or Ampcombi_summary_cluster.tsv).

Output

The output will be written into your working directory, containing the following files:

<pwd>/
  ├── Ampcombi_summary_cluster_SP.tsv
  ├── Ampcombi_summary_cluster_SP_onlyclusterswithSP.tsv
  ├── signalp/
  │   ├── output_*.png
  │   ├── prediction_results_index.tsv
  │   ├── prediction_results.tsv
  │   └── representative_seq.txt
  └── Ampcombi_signalpeptide.log

Ampcombi_summary_cluster_SP.tsv includes the contents of the cluster summary plus a column with yes/no indicating the presence or absence of a signal peptide sequence.

Ampcombi_summary_cluster_SP_onlyclusterswithSP.tsv includes the contents of the cluster summary plus a column with yes/no indicating the presence or absence of a signal peptide sequence. In this case, clusters are retained only if they contain at least one hit with a signal peptide.

The signalp directory contains the results from SignalP:
- output_*.png: results in PNG format showing the location of the predicted signal peptide.
- prediction_results.tsv: a table with the location of the signal peptide and the identity.
- prediction_results_index.tsv: a table that gives an index number to every hit found in Ampcombi_summary_cluster_SP_onlyclusterswithSP.tsv. This can be used to rename the files generated by running LocalColabFold on the AMP cluster representatives found in Ampcombi_summary_cluster_representative_seq.tsv for further downstream analysis on the secondary structure.