Input file format

Our server currently processes nucleotide sequences that are expected to contain at least one functional gene cluster. For convenience, the sequence is also extracted automatically from gbk files. Consider the following points before submitting a file, or if the pipeline does not process your file:

The file must contain one DNA sequence. Fasta files with multiple entries must be combined into a single entry or submitted to the pipeline separately.
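
As a rough illustration of the second option, the following sketch splits a multi-entry fasta text into single-entry texts that can be submitted one by one. The function names and the example headers are hypothetical and not part of SeMPI.

```python
def split_fasta(text):
    """Return a list of (header, sequence) tuples, one per fasta entry."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_single_entry_files(text):
    """Format every entry as its own fasta text (hypothetical helper)."""
    return [f">{header}\n{seq}\n" for header, seq in split_fasta(text)]


# Made-up two-entry input for demonstration only.
example = ">contig_1\nATGC\nGGTA\n>contig_2\nTTGACC\n"
for part in to_single_entry_files(example):
    print(part)
```

Each returned string can be saved to its own file or pasted into the input box on its own.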

If you want to use a sequence from NCBI, the easiest way is to open the desired record, switch to fasta format, click the send button, and then choose the complete record.

If you can only provide protein sequences, you must back-translate them into DNA (this also applies to gbk files in which the DNA sequence is missing). EMBOSS Backtranseq is one of the best tools for this purpose and gives good results for multiple organisms.
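
To illustrate what back-translation means, here is a deliberately naive sketch that maps every amino acid to one fixed codon. The codon chosen per amino acid is an arbitrary assumption made for this example and does not reflect the codon usage of any particular organism; a dedicated tool such as the one named above should be preferred for real submissions.

```python
# One fixed codon per amino acid (standard genetic code).
# The specific choice among synonymous codons is an illustrative assumption.
CODON_FOR = {
    "A": "GCC", "R": "CGC", "N": "AAC", "D": "GAC", "C": "TGC",
    "Q": "CAG", "E": "GAG", "G": "GGC", "H": "CAC", "I": "ATC",
    "L": "CTG", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCG",
    "S": "TCC", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTC",
    "*": "TGA",  # stop codon
}


def back_translate(protein):
    """Return one possible DNA sequence encoding the given protein."""
    protein = protein.strip().upper()
    unknown = sorted(set(protein) - set(CODON_FOR))
    if unknown:
        raise ValueError(f"unsupported residue(s): {', '.join(unknown)}")
    return "".join(CODON_FOR[aa] for aa in protein)


print(back_translate("MKV*"))
```

Because many different DNA sequences encode the same protein, the result is only one of many valid back-translations.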

If you still do not get reasonable results, we advise you to submit your file to antiSMASH first to check whether functional clusters can be detected in the sequence.


To start a search in SeMPI, upload a fasta or gbk file, or paste the fasta/gbk content into the box below, and press the start button. The box also accepts a raw DNA sequence, which makes it possible, for example, to copy and paste directly from NCBI. You can also load one of the example inputs: we provide the files Avermectin.fna and Borrelidin.gbk for this purpose. Both are available in the MIBiG database. A loaded example file is displayed in the input box and can be used directly for a test run. Whole-genome files are too big for this approach but can be submitted as files, for example Streptomyces avermitilis (NC_003155.5). If you do not want to wait for the processing, you can also jump to an already processed result page for these examples by pressing the output button.
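
If you want to check a piece of text before pasting it, the three accepted kinds of input can be told apart roughly as in the sketch below. This is an assumption about how such a check could look, not a description of how SeMPI inspects its input; the function name is hypothetical, and digits are ignored in raw sequences because sequence views often carry position numbers.

```python
import re


def guess_input_format(text):
    """Classify pasted text as 'fasta', 'genbank', 'raw' or 'unknown'."""
    stripped = text.lstrip()
    if stripped.startswith(">"):
        return "fasta"      # fasta entries begin with a '>' header line
    if stripped.startswith("LOCUS"):
        return "genbank"    # GenBank flat files begin with a LOCUS line
    # Drop whitespace and position numbers, then test for plain nucleotides.
    compact = re.sub(r"[\s\d]", "", stripped)
    if compact and re.fullmatch(r"[ACGTNacgtn]+", compact):
        return "raw"
    return "unknown"


print(guess_input_format(">example\nACGTACGT"))
print(guess_input_format("  1 acgtacgt ggccttaa\n 21 ttgacc"))
```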


The processing time depends on the file size (whole genomes take about 10 minutes). When processing has finished, SeMPI will load the results page as shown below.

Please do not reload the processing page.

The easiest way to load the output later is to bookmark the processing page or the result link provided. The page will only be available once processing has finished. If the file could not be processed, you will be redirected to an error page.

Error page

There are two main reasons why your file could not be processed. First, if antiSMASH could not find any gene cluster in the file you submitted, the following error will be shown:

In this case, we highly recommend submitting your file to antiSMASH to check whether gene clusters are indeed present.

Second, if gene clusters are present but we cannot process them, we will show a report that looks like this:

Position and type are derived from the initial antiSMASH genome mining. In most cases, the reason for exclusion column will give you an idea of why a particular cluster was not processed. Currently, we focus exclusively on modular PKS clusters.


If multiple gene clusters could be processed, they are shown in the panel above the result for the current cluster. The numbering follows the order in which the clusters appear in the submitted file. A basic report on the clusters found is provided as well; it is similar to the error report and helps you identify why a cluster was processed.

The result page shows the structure predicted for your DNA sequence together with the 10 best-matching metabolites of the streptomeDB. The main panel at the top shows the predicted structure and its SMILES. The basic information for the structure prediction is given on the left side.
First column: The module annotation and ordering as calculated by our algorithm. Identified domains are: ketoacyl synthase (KS / KSQ (glutamine mutant)), acyltransferase (AT), acyl carrier protein (ACP), ketoreductase (KR), dehydratase (DH), enoylreductase (ER), and thioesterase (TE).
Second column: The predicted acyl monomers (PKS building blocks).
Third column: The expected type of reduction. Each line therefore represents one part of the structure, as depicted in the figure below.
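
The relation between the domains in the first column and the reduction type in the third column can be sketched with the textbook rule for modular PKS: the reducing domains KR, DH and ER act in sequence on the beta-keto group. The sketch below assumes that every listed domain is active; the function name and the state labels are hypothetical, and the actual prediction made by our algorithm may differ from this simplified rule.

```python
def reduction_state(domains):
    """Return the expected state of the beta-carbon for one module.

    Textbook simplification: KR reduces the keto group to a hydroxyl,
    DH dehydrates it to a double bond, ER reduces it fully.
    """
    present = set(domains)
    if "KR" not in present:
        return "keto"           # no reduction at all
    if "DH" not in present:
        return "hydroxyl"       # KR only
    if "ER" not in present:
        return "double bond"    # KR + DH
    return "fully reduced"      # KR + DH + ER


# Made-up module compositions for demonstration only.
for module in (["KS", "AT", "ACP"],
               ["KS", "AT", "KR", "ACP"],
               ["KS", "AT", "DH", "KR", "ACP"],
               ["KS", "AT", "DH", "ER", "KR", "ACP"]):
    print("-".join(module), "->", reduction_state(module))
```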

The top 10 matching paths are listed in the lower window, which is composed of three columns.

The first column gives basic information about the ranking. The rank is based on the total score; multiple compounds can share the same rank if their scores are identical. The individual scores are explained in detail on the background page. The name of the compound also links to the streptomeDB, where more detailed information can be found.
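
How identical scores lead to shared ranks can be sketched as follows. The example assumes that a higher total score is better and that the next distinct score receives the next consecutive rank; both are assumptions made for illustration, and the compound names and scores are invented.

```python
def rank_by_score(hits):
    """Assign ranks to (name, score) pairs; equal scores share a rank."""
    # sorted() is stable, so hits with equal scores keep their input order.
    ordered = sorted(hits, key=lambda hit: hit[1], reverse=True)
    ranked = []
    rank = 0
    previous = None
    for name, score in ordered:
        if score != previous:
            rank += 1          # a new, lower score opens the next rank
            previous = score
        ranked.append((rank, name, score))
    return ranked


# Invented hits for demonstration only.
hits = [("compound_A", 12.5), ("compound_B", 15.0), ("compound_C", 12.5)]
for rank, name, score in rank_by_score(hits):
    print(rank, name, score)
```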

The traffic light gives a basic indication of the significance of the results. Green means the chances are very good that the metabolite found is indeed strongly related to the predicted structure. Yellow can still be a good match, but the achievable scores are lower because the predicted path is shorter. Red can mean either that the predicted compound is very short and therefore difficult to match, or that the predicted structure is very uncommon in the streptomeDB; the latter could be a good indication of a so far unknown metabolite. Always keep in mind that longer structures can be matched better because more properties can be compared.

The second column shows the best-matching path as extracted from the original metabolite in the database. Aromatic bonds are represented as undefined bonds in the extracted path. For future versions, we are working on a tool that will provide a graphical comparison between the predicted and the compared path.

The third column lists the metabolites from which the paths were extracted. The path is colored green and the identified starter unit is depicted in blue.
