bioYALP: for biologists, Yet Another Lipoprotein Predictor

How it works

Input files

bioYALP analyses FASTA files containing protein sequences. You can download bacterial proteomes in FASTA format from the NCBI website.

This is what a FASTA file looks like:

>protein 1
MLAGALCKQ
>protein 2
MMLSEGITARR
>protein 3
MTLPREE

Each sequence is preceded by a comment line (a line beginning with >) that describes the sequence below it. A sequence longer than 70 amino acids is split across several lines of 70 amino acids each, the last line holding the remainder. Here is another example:

>gi|16127995|ref|NP_414542.1| thr operon leader peptide [Escherichia coli K12]
MKRISTTITTTITITTGNGAG
>gi|16127997|ref|NP_414544.1| homoserine kinase [Escherichia coli K12]
MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEPRENIVYQCWE
RFCQELGKQIPVAMTLEKNMPIGSGLGSSACSVVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHY
DNVAPCFLGGMQLMIEENDIISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGF
IHACYSRQPELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETAQRVA
DWLGKNYLQNQEGFVHICRLDTAGARVLEN
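
To make the format concrete, here is a minimal Python sketch of how such a file can be read: each comment line starts a new record, and the lines that follow are joined back into one sequence. This is only an illustration of the format, not bioYALP's own code (bioYALP is a Perl script); the function name read_fasta is hypothetical.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each comment line (without the leading '>') to its full sequence.

    Hypothetical helper for illustration; not part of bioYALP.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A comment line opens a new record.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # A wrapped sequence continues on the next line: join the pieces.
            records[header] += line
    return records


example = """>protein 1
MLAGALCKQ
>protein 2
MMLSEGITARR
>protein 3
MTLPREE
"""

records = read_fasta(example.splitlines())
for header, sequence in records.items():
    print(header, sequence, len(sequence))
```

To read a real file, pass an open file handle instead of a list of lines, for example read_fasta(open("input/E_coli_K12.fasta")).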

In some cases, it may be helpful to analyse files containing one sequence per line. This feature is still in development and is not available yet.

Output files

Once you have run an analysis (see the Command lines section below), bioYALP writes its results to the output folder. More precisely, it writes the FASTA comments of the proteins that seem to be lipoproteins.

The syntax of the output filenames is: [name of input file]_predicted_lipoproteins.list

The content of an output file looks like this:
gi|90111299|ref|NP_416102.1| gi|90111309|ref|NP_416156.1| gi|49176129|ref|YP_025304.1|
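
If you want to process a result file with a script, the sketch below shows one possible way to do it in Python. It assumes that the entries are identifiers without spaces, separated by whitespace or line breaks, and that they follow the gi|...|ref|...| layout of the example above; both helper names are hypothetical.

```python
from typing import List


def parse_identifiers(text: str) -> List[str]:
    """Split the content of a result file into individual identifiers.

    Assumption: identifiers contain no spaces and are separated by
    whitespace (spaces or line breaks).
    """
    return text.split()


def refseq_accession(identifier: str) -> str:
    """Return the field that follows 'ref' in a gi|...|ref|...| identifier.

    If there is no 'ref' field, the identifier is returned unchanged.
    """
    fields = identifier.split("|")
    if "ref" in fields:
        position = fields.index("ref")
        if position + 1 < len(fields):
            return fields[position + 1]
    return identifier


# Content copied from the example above.
content = (
    "gi|90111299|ref|NP_416102.1| "
    "gi|90111309|ref|NP_416156.1| "
    "gi|49176129|ref|YP_025304.1|"
)

identifiers = parse_identifiers(content)
accessions = [refseq_accession(i) for i in identifiers]
print(accessions)
```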

Use bioYALP (users)

Command lines

Basic use

Open a terminal and go to the bioYALP folder (for instance, cd ~/bioYALP). Then run the bioYALP script by typing:
./bioYALP.pl -f [FASTA file you want to analyse] [options]

Replace [FASTA file you want to analyse] with the path to your FASTA file: for instance, input/E_coli_K12.fasta if you want to analyse E_coli_K12.fasta located in the input folder. You can also use absolute paths: -f /var/proteomes/E_coli_K12.fasta

If you do not want to use any options, leave out [options].

The analysis is finished when the script prints the number of predicted lipoproteins. You can then open the result files in the output folder. They are plain text, so any text editor will do.

Advanced use (options)

bioYALP looks for lipoproteins using the following criteria:

minimum length of the n-region: 5
maximum length of the n-region: 7
minimum number of charges in the n-region: 2
lipobox (regular expression): [LVI][ASTVI][ASG]C
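
To see what the default lipobox expression matches, here is a small Python sketch. It only illustrates the regular expression itself (character classes work the same way in Perl and in Python); it does not reproduce how bioYALP combines the lipobox with the n-region criteria. The function name and the first test fragment are made up for the example.

```python
import re

# Default lipobox from the list above: one of L/V/I, then one of A/S/T/V/I,
# then one of A/S/G, then a cysteine.
DEFAULT_LIPOBOX = re.compile(r"[LVI][ASTVI][ASG]C")


def find_lipobox(sequence: str):
    """Return (position, matched residues) of the first lipobox, or None.

    The position is 0-based. Hypothetical helper for illustration only.
    """
    match = DEFAULT_LIPOBOX.search(sequence)
    if match is None:
        return None
    return match.start(), match.group()


# Made-up fragment that contains the motif LAGC.
hit = find_lipobox("MKKTLLAGCSS")
# "protein 1" from the FASTA example: GALC does not match, because G is not
# allowed in the first position.
miss = find_lipobox("MLAGALCKQ")

print(hit)
print(miss)
```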

You can change these with the -c option. Its syntax is:
-c [n-region minimal length],[n-region maximal length],[n-region minimum charges],[lipobox]

If you do not want to change all of them, leave the corresponding fields blank but keep the commas. A blank field keeps its default value.
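
The following Python sketch shows one way a value in this syntax could be interpreted, with blank fields falling back to the defaults listed above. It is an illustration of the syntax under stated assumptions, not bioYALP's actual option handling; the names Criteria and parse_criteria are hypothetical.

```python
from typing import NamedTuple


class Criteria(NamedTuple):
    """Search criteria, initialised with the defaults listed above."""
    n_min: int = 5
    n_max: int = 7
    min_charges: int = 2
    lipobox: str = "[LVI][ASTVI][ASG]C"


def parse_criteria(value: str) -> Criteria:
    """Interpret a -c style value; blank fields keep their default.

    Hypothetical helper: bioYALP itself may parse the option differently.
    """
    defaults = Criteria()
    # Split on the first three commas only, so that a lipobox expression
    # containing a comma (for example a {1,2} quantifier) stays in one piece.
    fields = value.split(",", 3)
    if len(fields) != 4:
        raise ValueError("expected four comma-separated fields")
    n_min, n_max, min_charges, lipobox = (field.strip() for field in fields)
    return Criteria(
        n_min=int(n_min) if n_min else defaults.n_min,
        n_max=int(n_max) if n_max else defaults.n_max,
        min_charges=int(min_charges) if min_charges else defaults.min_charges,
        lipobox=lipobox if lipobox else defaults.lipobox,
    )


all_fields = parse_criteria("3,10,1,[LV][ASTVI][ASG]C")
lipobox_only = parse_criteria(",,,[LV][ASTVI][ASG]C")
print(all_fields)
print(lipobox_only)
```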

Example 1

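./bioYALP.pl -f file.fasta -c '3,10,1,[LV][ASTVI][ASG]C'

This looks for sequences whose n-region is between 3 and 10 amino acids long and carries at least one charge, and whose lipobox matches the regular expression [LV][ASTVI][ASG]C. The quotes keep the shell from interpreting the square brackets as a filename pattern.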

Example 2

./bioYALP.pl -f file.fasta -c ',,,[LV][ASTVI][ASG]C'

This keeps the default criteria for the n-region (see above) and looks for sequences whose lipobox matches the regular expression [LV][ASTVI][ASG]C.

For more information about tuning bioYALP, please read lipoprotein prediction.