Background to SeMPI


A large variety of bioactive compounds in use today originate from secondary metabolism, carried out by proteins encoded in gene clusters. As DNA sequencing prices fall, predicting new compounds by genome mining becomes ever more significant. In addition, many compounds are known to be secondary metabolites, but neither the producing species nor the responsible gene clusters have been identified yet. To address this challenge, we created the Secondary Metabolite Prediction and Identification Pipeline (SeMPI).

The structural prediction is performed by our in-house software ModPKSFinder, which predicts modular type I PKS structures. In future work we will extend our prediction server to further cluster types. The prediction is then compared with already annotated molecules. Currently this is done through a pipeline to StreptomeDB, a large database of metabolites produced by Streptomyces. We plan to extend the pipeline to other suitable databases.

Post-modifications of the metabolites, which cannot be predicted so far, complicate comparison by standard methods such as molecular fingerprints. We therefore developed a tool that estimates the possible structures of the metabolites in their first synthesis steps. These initial structures are then compared to the prediction.

Genome Mining

The PKS gene cluster is located in the genome sequence by antiSMASH 3.0 [1]. If you would like to see the output of the gene cluster annotation, we highly recommend visiting the antiSMASH website and running their excellent cluster analysis tool on your sequence.

Structure Prediction

The ModPKSFinder polyketide chain prediction comprises the following steps:

(1) Identification of modular PKS gene clusters within the genome sequence (antiSMASH 3.0), as mentioned above.
(2) Parsing of active domains.
(3) Prediction of the domain arrangement within the PKS protein. The domain arrangement is based on the DEBS subunit interaction described by Broadhurst et al. [2].
(4) Delimitation of modules.
(5) Prediction of the acyl monomers malonyl (M), methylmalonyl (MM), ethylmalonyl (EM) and methoxymalonyl (MoM).
(6) Prediction of β-keto reduction and delivery of the final polyketide chain.
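Steps (4) and (6) can be illustrated with a minimal Python sketch. This is not the ModPKSFinder implementation. It assumes that the domains arrive as an ordered list of the common PKS domain abbreviations (KS, AT, KR, DH, ER, ACP, TE), that a module is closed by its ACP domain, and that the β-keto state follows from which reductive domains are present. The function names are hypothetical.

```python
# Hypothetical sketch, not ModPKSFinder: split an ordered domain list into
# modules and derive the beta-keto reduction state of each module.

def delimit_modules(domains):
    """Group domains into modules; assumption: each module ends at its ACP."""
    modules, current = [], []
    for domain in domains:
        current.append(domain)
        if domain == "ACP":
            modules.append(current)
            current = []
    if current:
        if modules:
            # Trailing domains (e.g. a terminal TE) join the last module.
            modules[-1].extend(current)
        else:
            modules.append(current)
    return modules


def reduction_state(module):
    """Map the reductive domains of a module to the state of the beta carbon."""
    found = set(module)
    if {"KR", "DH", "ER"} <= found:
        return "methylene"    # fully reduced
    if {"KR", "DH"} <= found:
        return "double bond"  # reduced and dehydrated
    if "KR" in found:
        return "hydroxyl"     # keto group reduced to an alcohol
    return "keto"             # no reduction


if __name__ == "__main__":
    example = ["AT", "ACP",
               "KS", "AT", "KR", "ACP",
               "KS", "AT", "DH", "KR", "ACP",
               "KS", "AT", "DH", "ER", "KR", "ACP", "TE"]
    for module in delimit_modules(example):
        if "KS" in module:  # skip the loading module, which has no KS
            print(module, reduction_state(module))
```

The loading module is skipped in the example because it performs no condensation and so has no β-keto group to reduce.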

Metabolite databases

The StreptomeDB 2.0 database [3] contains over 4000 compounds produced by streptomycetes and therefore represents a large source of possible secondary metabolites. Of course, this also limits our prediction to this genus. However, considering that 455 compounds annotated in the MIBiG database [4] (as of September 3rd, 2016) originate from a Streptomyces organism, it is a reasonable choice for a single database. In future versions we will include as many relevant databases as possible in order to establish a broad identification protocol.

Database Processing

Post-modifications of the metabolites, such as glycosylation or cyclization, cannot be predicted yet. The compounds from the databases therefore need to be processed in a way that allows them to be compared to the predicted structure independently of post-modifications. This is achieved by identifying possible type I PKS starter units and then creating a set of unique paths through the metabolite.
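One way to picture the path creation is a depth-first enumeration of simple paths (no atom visited twice) from a candidate starter atom. The sketch below is illustrative only and is not the SeMPI code. It assumes the metabolite is already available as an adjacency mapping from atom index to neighboring atom indices and that the starter atom is known. The function name and the max_len parameter are hypothetical.

```python
# Hypothetical sketch: enumerate simple paths through a molecular graph,
# starting at an assumed starter atom.

def simple_paths(adjacency, start, max_len):
    """Return every simple path from `start` that cannot be extended further
    or that has reached `max_len` atoms."""
    paths = []

    def walk(path):
        extended = False
        if len(path) < max_len:
            for neighbor in adjacency[path[-1]]:
                if neighbor not in path:  # never revisit an atom
                    extended = True
                    walk(path + [neighbor])
        if not extended:
            paths.append(tuple(path))

    walk([start])
    return paths


if __name__ == "__main__":
    # Branched chain 0-1-2-3 with a side atom 4 attached to atom 2.
    branched = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2]}
    print(simple_paths(branched, start=0, max_len=10))
    # prints [(0, 1, 2, 3), (0, 1, 2, 4)]
```

In a ring, the same routine yields one path per direction of travel, which is one reason a cyclized metabolite can give several candidate paths.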

Comparison algorithm

Each unique path is compared to the predicted structure using the following scores.

| Score name | Explanation | Examples in figure |
|---|---|---|
| Length score | Compares the length of the main path. +1 for each matching segment; -1 for each segment too long or too short. | not shown |
| Main atom score | Compares the type of atoms in the main chain. +1 for each matching atom; -1 for each non-matching atom. (Currently the paths are created without oxygen (O) or sulfur (S) atoms in the main chain, as we do not expect them in an initial synthesis step.) | a1 (-1) |
| Main bond score | Compares the type of bonds in the main chain. +1 for each matching bond; -1 for each non-matching bond. | b1 (+1), b2 (-1), b3 (-1), b4 (-1) |
| Neighbor score | Compares whether each path segment has neighbors at all (atom and bond types are not compared). +1 if both structures have no neighbor; +1 if both structures have a neighbor; 0 if one path has one neighbor and the other has two; -1 if one path has none and the other has one. | c1 (+1), c2 (-1), c3 (+1), c4 (+1), c5 (+1), c6 (0) |
| Side chain score | Provides more detailed information about the neighbors. +1 for each side-chain segment where both atom and bond match; -1 for each segment where they do not. | d1 (-1), d2 (+1), d3 (+1 for the first segment, -1 for the second segment) |
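The first four scores can be sketched in Python as follows. This is a hedged illustration of the rules as stated above, not the SeMPI implementation. The Segment class and all function names are hypothetical, and the paths are assumed to be aligned position by position from the starter unit. Cases the rules leave open (for example, no neighbor on one path and two on the other) are treated as -1 here, which is an assumption. The side chain score is omitted.

```python
# Hypothetical sketch of the length, main atom, main bond and neighbor scores.
from dataclasses import dataclass


@dataclass
class Segment:
    atom: str       # element symbol of the main-chain atom
    bond: str       # bond to the next main-chain atom, e.g. "single"
    neighbors: int  # number of side-chain neighbors on this atom


def length_score(pred, cand):
    """+1 per segment both paths share, -1 per surplus or missing segment."""
    return min(len(pred), len(cand)) - abs(len(pred) - len(cand))


def main_atom_score(pred, cand):
    return sum(1 if p.atom == c.atom else -1 for p, c in zip(pred, cand))


def main_bond_score(pred, cand):
    return sum(1 if p.bond == c.bond else -1 for p, c in zip(pred, cand))


def neighbor_score(pred, cand):
    total = 0
    for p, c in zip(pred, cand):
        if p.neighbors == c.neighbors:
            total += 1   # both without neighbors, or same neighbor count
        elif p.neighbors and c.neighbors:
            total += 0   # both have neighbors, but a different number
        else:
            total -= 1   # assumption: any none-versus-some mismatch is -1
    return total


if __name__ == "__main__":
    predicted = [Segment("C", "single", 1), Segment("C", "double", 0),
                 Segment("C", "single", 1)]
    candidate = [Segment("C", "single", 1), Segment("C", "single", 0),
                 Segment("N", "single", 2), Segment("C", "single", 0)]
    print(length_score(predicted, candidate),     # 2
          main_atom_score(predicted, candidate),  # 1
          main_bond_score(predicted, candidate),  # 1
          neighbor_score(predicted, candidate))   # 2
```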

Similarity ranking

The output is a ranking of metabolites by similarity to the predicted structure. We hope this overview helps researchers link metabolites of unknown origin to the submitted cluster. A detailed explanation of the result page is given in the help section.
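As a rough illustration of such a ranking, the sketch below sorts candidate metabolites by the sum of their individual scores. How SeMPI actually combines the scores is not stated here, so the plain sum is an assumption. The function name, compound names and score values are hypothetical.

```python
# Hypothetical sketch: rank candidates by the (assumed) sum of their scores.

def rank_candidates(scores):
    """scores maps a compound name to a dict of individual scores.
    Returns (name, total) pairs, best first; ties are ordered by name."""
    totals = {name: sum(parts.values()) for name, parts in scores.items()}
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    example = {
        "compound_b": {"length": 3, "main_atom": 1},
        "compound_a": {"length": 5, "main_atom": 4},
        "compound_c": {"length": 3, "main_atom": 1},
    }
    for name, total in rank_candidates(example):
        print(name, total)
```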


The SeMPI pipeline is based on various open source tools:

The ModPKSFinder prediction software depends on:

antiSMASH 3.0

Clustal Omega

Python 2.7

The pipeline to StreptomeDB, including initial sequence prediction, is based on:

rdkit (python package)

PostgreSQL 9.6


  1. [1] Weber, T. et al. antiSMASH 3.0: a comprehensive resource for the genome mining of biosynthetic gene clusters. Nucleic Acids Research 43, W237-W243 (2015).
  2. [2] Broadhurst, R.W. et al. The structure of docking domains in modular polyketide synthases. Chemistry & Biology 10, 723-731 (2003).
  3. [3] Klementz, D. et al. StreptomeDB 2.0: an extended resource of natural products produced by streptomycetes. Nucleic Acids Research 44, D509-D514 (2015).
  4. [4] Medema, M.H. et al. Minimum Information about a Biosynthetic Gene cluster. Nature Chemical Biology 11, 625-631 (2015).
