Usage
Installation
To use AMPcombi, first install it using:
Using conda:
conda create -n ampcombi python==3.11 mmseqs2==15.6f452 ampcombi
or
conda env create -f environment.yml
Using singularity and docker:
singularity pull ampcombi:0.2.2--pyhdfd78af_0
From git repository:
git clone https://github.com/Darcy220606/AMPcombi.git
For full usage documentation of AMPcombi and its submodules, please refer to the help documentation accessed by:
ampcombi --help
Submodules
parse_tables
The parse_tables submodule is used to parse and filter the output files generated by the different AMP prediction tools described in About.
It further aligns the amino acid sequences to a reference database to retrieve structural and functional metadata for similar AMPs.
One of the following three databases can be chosen: DRAMP, APD, or UniRef100. DRAMP is the default.
If a custom database is required, pass the path to its folder (e.g. ref_database) to --amp_database_dir ./ref_database/.
Note: The folder must contain the database in fasta format with the file extension *.fasta.
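A custom database folder can be sanity-checked before it is passed to --amp_database_dir. The following is a minimal sketch rather than part of AMPcombi: the folder name, file name and peptide record are illustrative assumptions, and the script only confirms that at least one *.fasta file with sequence records is present.

```shell
set -eu

# Hypothetical demo folder with one example peptide record; use your own folder instead.
db_dir="$(mktemp -d)/ref_database"
mkdir -p "$db_dir"
printf '>custom_amp_1\nGIGKFLHSAKKFGKAFVGEIMNS\n' > "$db_dir/custom_amps.fasta"

# Count the *.fasta files in the folder.
n_fasta=0
for fasta_file in "$db_dir"/*.fasta; do
    if [ -f "$fasta_file" ]; then n_fasta=$((n_fasta + 1)); fi
done

# Count the sequence records (header lines starting with ">").
n_records="$(cat "$db_dir"/*.fasta | awk '/^>/ {n++} END {print n+0}')"
echo "fasta files: $n_fasta, sequence records: $n_records"
```

If either count is zero, the folder does not yet meet the requirement described above.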
Additionally, it estimates the physicochemical properties of the entire prepropeptide sequence of the recovered AMP hits.
Many filtering parameters have defaults chosen during development, but feel free to adjust them to the thresholds that suit your dataset. For a full list of the available options and their defaults, please refer to the help documentation of the submodule:
ampcombi parse_tables --help
Example Usage (1)
# Required: --amp_results, --faa, --gbk, --sample_list
# Optional: --interproscan_output, --contig_metadata
ampcombi parse_tables \
--amp_results path/to/my/result_folder/ \
--faa path/to/sample_faa_files/ \
--gbk path/to/sample_gbk_or_gbff_files/ \
--interproscan_output path/to/interproscan_files \
--sample_list sample_1 sample_2 \
--contig_metadata path/to/contig_metadata.tsv \
--amp_database 'DRAMP' \
--<tool_1>_file '.tsv' \
--<tool_2>_file '.txt' \
--log true \
--threads 10
In this case, we use the --amp_results option to supply AMP tool prediction results from many samples in a folder format.
The folder must follow this structure:
amp_results/
├── tool_1/
│   ├── sample_1/
│   │   └── sample_1.tsv
│   └── sample_2/
│       └── sample_2.tsv
├── tool_2/
│   ├── sample_1/
│   │   └── sample_1.txt
│   └── sample_2/
│       └── sample_2.txt
└── tool_3/
    ├── sample_1/
    │   └── sample_1.fasta
    └── sample_2/
        └── sample_2.fasta
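The layout above can be checked before running parse_tables. This is a minimal sketch under stated assumptions, not an AMPcombi feature: it builds a small hypothetical demo tree in a temporary directory and reports every <tool>/<sample>/ folder that holds no file named after its sample.

```shell
set -eu

# Hypothetical demo tree; point results_dir at your own folder instead.
results_dir="$(mktemp -d)/amp_results"
mkdir -p "$results_dir/tool_1/sample_1" "$results_dir/tool_2/sample_1"
touch "$results_dir/tool_1/sample_1/sample_1.tsv"   # tool_2/sample_1 is left empty on purpose

missing=0
for sample_dir in "$results_dir"/*/*/; do
    sample="$(basename "$sample_dir")"
    found=0
    # Look for any file called <sample>.<suffix> inside the sample folder.
    for result_file in "$sample_dir$sample".*; do
        if [ -f "$result_file" ]; then found=1; fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no result file for $sample in $sample_dir" >&2
        missing=$((missing + 1))
    fi
done
echo "sample folders without results: $missing"
```

In the demo tree one folder is deliberately left empty, so the script reports a single missing result.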
--<tool>_file Replace <tool> with one of the following: ampir, macrel, amplify, neubi, hmmsearch, ensemblamppred, ampgram, amptransformer. The argument value should be the suffix of the files generated by that tool. Defaults are assigned for each tool, but they can be changed to match your input file extensions. An example of the input files can be found here.
--contig_metadata A *.tsv file that must contain the sample name in the first column and the contig ID/name in the second column. Note: Column headers will be overwritten. An example of the input file can be found here.
--faa A folder containing the annotated files of the AMP hits with the suffix *.faa. These can be generated by any annotation tool (e.g., PROKKA or PYRODIGAL). Note: The file names must include the sample name, for example <samplename>.faa. An example of the input file can be found here.
--gbk A folder containing the annotated files of the AMP hits with the suffix *.gbk or *.gbff. These can be generated by any annotation tool (e.g., PROKKA or PYRODIGAL). Note: The file names must include the sample name, for example <samplename>.gbk or <samplename>.gbff. An example of the input file can be found here.
--amp_database The reference database that the AMP hits are aligned against. Can be either 'DRAMP', 'APD' or 'UniRef100'.
--interproscan_output A path to a directory or file that contains the results generated by running InterProScan on the annotated sequences (*.faa). Note: The file names must match <sample_name>.tsv. Additionally, coding sequences classified as 'ribosomal proteins' can be filtered out using --interproscan_filter 'ribosomal proteins,ribosomal', which is done by default. An example of the input file can be found here. An example of how to run InterProScan to prepare the files is provided in test.
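The two-column requirement for --contig_metadata can be illustrated with a small hand-made table. This is a sketch only: the sample names, contig names and the third "taxonomy" column are hypothetical, and the check simply confirms that every row has a non-empty sample name and contig ID.

```shell
set -eu

# Hypothetical table: sample name first, contig ID second, any extra metadata afterwards.
metadata_file="$(mktemp -d)/contig_metadata.tsv"
printf '%s\t%s\t%s\n' \
    "sample_name" "contig_id" "taxonomy" \
    "sample_1" "contig_1" "Bacillus" \
    "sample_1" "contig_2" "unclassified" \
    "sample_2" "contig_1" "Streptomyces" > "$metadata_file"

# Count rows that lack a sample name or a contig ID in the first two columns.
bad_rows="$(awk -F '\t' 'NF < 2 || $1 == "" || $2 == "" {n++} END {print n+0}' "$metadata_file")"
echo "rows failing the two-column check: $bad_rows"
```

Because the column headers are overwritten, the header names used here carry no meaning; only the column order matters.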
Example Usage (2)
ampcombi parse_tables \
--path_list path_to_sample_1_tool_1.tsv path_to_sample_1_tool_2.txt \
--sample_list sample_1 \
--faa path/to/sample_faa_files/sample_1.faa \
--gbk path/to/sample_gbk_or_gbff_files/sample_1.<gbk|gbff> \
--<tool_1>_file '.tsv' \
--<tool_2>_file '.txt'
In this case, we use the --path_list option to supply AMP tool prediction results from a single sample in a list format.
Some optional parameters that can be tweaked:
Parameter | Description | Default | Allowed values |
---|---|---|---|
--amp_cutoff | Probability cutoff to filter AMPs by probability (not applicable to hmmsearch) | 0.0 | 0.5 |
--hmm_evalue | E-value cutoff to filter AMPs (only applicable to hmmsearch) | None | 0.05 |
--db_evalue | E-value cutoff to filter database classifications; any hit with an E-value below this will have its database classification removed | None | 0.05 |
--aminoacid_length | Length cutoff to filter AMP hits by the length of the amino acid sequence | 100 | 60 |
--window_size_stop_codon | Size of the window in which to look for stop codons downstream and upstream of the CDS hits | 60 | 40 |
--window_size_transporter | Size of the window in which to look for a 'transporter' (e.g. an ABC transporter) downstream and upstream of the CDS hits | 11 | 20 |
--remove_stop_codons | Removes any AMP hit that has no stop codon in the window downstream or upstream of the CDS, as set by '--window_size_stop_codon'. Must be turned on if hits are to be removed | False | True |
--sample_metadata | Path to a tsv file containing sample metadata, e.g. 'path/to/sample_metadata.tsv'. The metadata table can hold additional information for sample identification, which will be added to the output summary. The table must contain the sample names in the first column. | None | ./sample_metadata.tsv |
--contig_metadata | Path to a tsv file containing contig metadata, e.g. 'path/to/contig_metadata.tsv'. The metadata table can hold additional information for contig classification, which will be added to the output summary. The table must contain the sample names in the first column and the contig_ID in the second column. The metadata table can be the output from MMseqs2, pydamage, and MetaWrap. | None | ./contig_metadata.tsv |
--interproscan_filter | A comma-separated list of keywords describing proteins that are not required in the analysis. | 'ribosomal protein,ribosomal proteins,ribosome protein,ribosomal rna,Ribosomal protein,RIBOSOMAL PROTEIN' | '16S' |
Output
The output will be written into your working directory, containing the following files and folders:
<pwd>/
├── amp_DRAMP_database/
│   ├── mmseqs2/
│   │   ├── ref_DB
│   │   ├── ref_DB_h
│   │   ├── ref_DB_h.dbtype
│   │   ├── ref_DB_h.index
│   │   ├── ref_DB.dbtype
│   │   ├── ref_DB.index
│   │   ├── ref_DB.lookup
│   │   └── ref_DB.source
│   ├── general_amps_<Date>_clean.fasta
│   └── general_amps_<Date>.tsv
├── sample_1/
│   ├── contig_gbks/
│   ├── sample_1_amp.faa
│   ├── sample_1_ampcombi.tsv
│   ├── sample_1_mmseqs_matches.txt
│   └── sample_1_ampcombi.log
├── sample_2/
│   ├── contig_gbks/
│   ├── sample_2_amp.faa
│   ├── sample_2_ampcombi.tsv
│   ├── sample_2_mmseqs_matches.txt
│   └── sample_2_ampcombi.log
└── Ampcombi_parse_tables.log
complete
The complete submodule allows AMPcombi to be integrated into portable pipelines, for example nf-core/funcscan, that can parallelize data processing.
It takes the output from parse_tables as input and merges all sample tables into one final *.tsv.
To get a full list of options available and their defaults please refer to the help documentation of the submodule:
ampcombi complete --help
Example Usage (1)
ampcombi complete \
--summaries_directory path/to/ampcombi_parse_tables_results_folder/
In this case we use the --summaries_directory option to supply the samples' result folder from ampcombi parse_tables, which should contain the folder structure from ampcombi parse_tables in a parent folder, for example named ./ampcombi/....
Example Usage (2)
ampcombi complete \
--summaries_files path/to/ampcombi_parse_tables/sample_1_ampcombi.tsv path/to/ampcombi_parse_tables/sample_2_ampcombi.tsv
In this case we use the --summaries_files option to supply the AMPcombi summary files produced by ampcombi parse_tables in a list format.
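When there are many samples, the list of summary files can be collected rather than typed by hand. The sketch below is an illustration and not part of AMPcombi: it creates a hypothetical demo folder, gathers every *_ampcombi.tsv file with find, and prints the resulting command instead of running it.

```shell
set -eu

# Hypothetical demo folder standing in for the parse_tables output.
parse_dir="$(mktemp -d)"
mkdir -p "$parse_dir/sample_1" "$parse_dir/sample_2"
touch "$parse_dir/sample_1/sample_1_ampcombi.tsv" "$parse_dir/sample_2/sample_2_ampcombi.tsv"

# Gather the per-sample summaries in a stable order, one path per line.
summaries="$(find "$parse_dir" -type f -name '*_ampcombi.tsv' | sort)"
n_summaries="$(printf '%s\n' "$summaries" | awk 'NF {n++} END {print n+0}')"

# Print the command rather than running it, so it can be inspected first.
# Word splitting on $summaries is intentional; it assumes paths without spaces.
echo ampcombi complete --summaries_files $summaries
echo "summary files found: $n_summaries"
```

Printing the command first makes it easy to confirm that only the intended per-sample tables were picked up.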
Output
The output will be written into your working directory, containing the following files:
<pwd>/
├── Ampcombi_summary.tsv
└── Ampcombi_complete.log
cluster
The cluster submodule clusters the output from complete (i.e., Ampcombi_summary.tsv) into subclasses of similar AMP families.
This relies primarily on MMseqs2 cluster v.15.6f452.
Only some parameters that were deemed important for the purpose of AMPcombi were incorporated as optional arguments.
To get a full list of options available and their defaults please refer to the help documentation of the submodule:
ampcombi cluster --help
Example Usage
ampcombi cluster \
--ampcombi_summary path/to/Ampcombi_summary.tsv
The --ampcombi_summary option takes the summary table produced by ampcombi complete (Ampcombi_summary.tsv) as input.
Some optional parameters that can be tweaked:
Parameter | Description | Default | Allowed values |
---|---|---|---|
 | This sets the coverage mode of the mmseqs2 cluster module. More information can be obtained in mmseqs2 docs here. | 0 | 2 |
 | This sets the cluster mode of the mmseqs2 cluster module. More information can be obtained in mmseqs2 docs here. | 1 | 2 |
 | This sets the coverage of the mmseqs2 cluster module. More information can be obtained in mmseqs2 docs here. | 0.8 | 0.9 |
 | This sets the sequence identity threshold of the mmseqs2 cluster module. More information can be obtained in mmseqs2 docs here. | 0.4 | 0.7 |
 | This sets the alignment sensitivity of the mmseqs2 cluster module. More information can be obtained in mmseqs2 docs here. | 4.0 | 7.0 |
 | This removes any hits that did not form a cluster. | True | False |
 | This removes any cluster that only has a certain label in the sample name. For example, if you have sample labels with 'S1_metaspades' and 'S1_megahit', you can retain the clusters that have samples with the suffix '_megahit' by setting this parameter to 'megahit'. | '' | 'megahit' |
 | This removes any cluster with fewer hits than the number set here. | 3 | 1 |
Output
The output will be written into your working directory, containing the following files:
<pwd>/
├── Ampcombi_summary_cluster.tsv
├── Ampcombi_summary_cluster_representative_seq.tsv
└── Ampcombi_cluster.log
signal_peptide
The signal_peptide submodule predicts whether the filtered and clustered AMP hits carry a signal peptide. This only works if the user installs SignalP separately. For licensing reasons, SignalP can only be downloaded and used by academic users; other users are requested to contact DTU Health Technology Software Package before using it. For further details about the usage of SignalP, please refer to its documentation.
To get a full list of options available and their defaults please refer to the help documentation of the submodule:
ampcombi signal_peptide --help
Example Usage
ampcombi signal_peptide \
--signalp_model path/to/signalp_model/ \
--ampcombi_cluster path/to/Ampcombi_summary_cluster.tsv \
--log true
The --ampcombi_cluster option takes the summary table produced by ampcombi cluster or ampcombi complete (Ampcombi_summary_cluster.tsv or Ampcombi_summary.tsv) as input.
Output
The output will be written into your working directory, containing the following files:
<pwd>/
├── Ampcombi_summary_cluster_SP.tsv
├── Ampcombi_summary_cluster_SP_onlyclusterswithSP.tsv
├── signalp/
│   ├── output_*.png
│   ├── prediction_results_index.tsv
│   ├── prediction_results.tsv
│   └── representative_seq.txt
└── Ampcombi_signalpeptide.log
Ampcombi_summary_cluster_SP.tsv includes the contents of the cluster summary plus a column with yes/no indicating the presence of a signal peptide sequence.
Ampcombi_summary_cluster_SP_onlyclusterswithSP.tsv includes the contents of the cluster summary plus a column with yes/no indicating the presence of a signal peptide sequence, but retains only the clusters that contain at least one hit with a signal peptide.
The signalp directory contains the results from SignalP in *.png format, showing the location of the predicted signal peptide.
The prediction_results.tsv file contains a table with the location and identity of the signal peptide. The prediction_results_index.tsv file contains a table that assigns an index number to every hit found in Ampcombi_summary_cluster_SP_onlyclusterswithSP.tsv. This index can be used to rename the files generated by running LocalColabFold on the AMP cluster representatives found in Ampcombi_summary_cluster_representative_seq.tsv, for further downstream analysis of the secondary structure.