bioYALP's false-positive rate is high, but we hope to improve it in the future. You can help if you want :-).
The analysis is based on DOLOP (run ./bioYALP -z to list the references for the DOLOP papers). As DOLOP's algorithm is not open source, bioYALP cannot imitate DOLOP perfectly. Both tools have been benchmarked on a dataset of verified lipoproteins for Escherichia coli subsp. K12.
The EcoGene website provides a dataset of verified lipoproteins for Escherichia coli subsp. K12. Each sequence in this file has been extracted into another file, which we named dataset.sequences. Neither the tool used to convert the .xls file nor the final file (dataset.sequences) is available for download yet.
bioYALP has been run on the dataset with the following command line:
./bioYALP.pl -f dataset.sequences -c ,,1,
(one charge is permitted in the n-region, instead of the default of two).
The output is a .sequences file containing a list of identifiers for the sequences that have been predicted as lipoproteins (see below).
The DOLOP page that lists lipoproteins for Escherichia coli subsp. K12 has also been converted into a .sequences file. It is just a plain text file containing GI numbers; a GI number is a unique identifier for a sequence. See the NCBI Sequence Identifiers page for more information.
We compared the two .sequences files using another homemade script (not yet available for download) and obtained the following results:
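As an illustration of how such an identifier list could be loaded, here is a minimal Python sketch. It is not part of bioYALP: the function name read_identifiers is hypothetical, the sample identifiers are made up, and the layout of one identifier per line is an assumption about the .sequences format.

```python
import io


def read_identifiers(handle):
    """Return the set of identifiers found in a .sequences-style text stream.

    Assumption: one identifier (e.g. a GI number) per line.
    Blank lines and surrounding whitespace are ignored.
    """
    return {line.strip() for line in handle if line.strip()}


# Made-up identifiers, for illustration only.
sample = io.StringIO("1001\n1002\n\n1003\n")
ids = read_identifiers(sample)
print(sorted(ids))
```

With a real file, the same function could be called on an open file handle instead of the in-memory sample.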
Test run on EcoGene's dataset of 80 verified lipoproteins for Escherichia coli subsp. K12
1: Percentage of correctly predicted lipoproteins among the dataset
2: Percentage of incorrectly predicted lipoproteins (false positives)
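The two percentages can be computed with simple set operations on the identifier lists. The following Python sketch is not the homemade script mentioned above, which is not published. The function name compare_predictions, the sample identifiers, and the choice of denominators (all verified identifiers for the first percentage, all predicted identifiers for the second) are assumptions.

```python
def compare_predictions(verified, predicted):
    """Compare predicted identifiers against a verified reference set.

    Returns (percent_found, percent_false_positive), where
      percent_found          = verified identifiers that were predicted / all verified
      percent_false_positive = predicted identifiers absent from the verified set / all predicted
    Both denominators are assumptions made for this sketch.
    """
    verified = set(verified)
    predicted = set(predicted)
    found = verified & predicted            # correctly predicted
    false_positives = predicted - verified  # predicted but not verified
    percent_found = 100.0 * len(found) / len(verified) if verified else 0.0
    percent_fp = 100.0 * len(false_positives) / len(predicted) if predicted else 0.0
    return percent_found, percent_fp


# Made-up identifiers, for illustration only.
verified_ids = {"1001", "1002", "1003", "1004"}
predicted_ids = {"1001", "1002", "1003", "2001"}
print(compare_predictions(verified_ids, predicted_ids))  # (75.0, 25.0)
```

Sets are used so that duplicate identifiers in either list do not inflate the counts.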
As we do not know of any other list of verified lipoproteins, bioYALP and DOLOP have been tested on only one proteome. It is risky to draw broad conclusions from this, so we will just say that bioYALP has good predictive power, despite its high false-positive rate.
Copyright © 2007-2008 Pierre-Yves Beaudouin