VIRGO: Computational Prediction of Gene Functions

Help on how to use VIRGO

This web page documents how a biologist can use VIRGO. The biologist uploads a gene expression data set of interest to her on the VIRGO upload page. VIRGO invokes the GAIN system (publication describing GAIN, GAIN software package) to predict novel functions for genes assayed in the gene expression data set. When GAIN completes, VIRGO adds the predictions to its database and allows the biologist to query and browse the predictions. In more detail, the steps are the following:
  1. Store your gene expression data in a text file. Please ensure that the gene expression data set is tab-delimited and the first column contains an identifier for a gene. All other columns should contain expression values. The first row should contain the names of the samples. For S. cerevisiae, please ensure that a gene identifier is an ORF name (e.g., YFL039C). For H. sapiens, please ensure that a gene identifier is an Entrez-Gene identifier. If you simply want to try out VIRGO or want to see the format of such a data set, please take a look at one of the gene expression data sets we have collected.
  2. Visit the VIRGO upload page. All fields on this page marked with a red star are required.
    1. Select the correct species. If you use the wrong species, VIRGO will not be able to make any predictions since it will use incorrect information for molecular interactions and GO functional annotations.
    2. Select your gene expression file by browsing your file system.
    3. Enter a short description of the experiment you performed. This information is useful when VIRGO presents the results of queries you make on the predictions.
    4. Enter your email address. VIRGO needs a valid email address to communicate with you.
    5. If this data has been published, please enter the PubMed ID (PMID) of the paper that published the data. In the query results, VIRGO will hyperlink the description of the experiment to the PubMed entry for the paper.
    6. Select whether you want to keep the predictions based on your dataset private or public. VIRGO's default policy is to keep the predictions computed from your dataset private.
    7. Finally, press the "Analyze" button to upload your dataset.
  3. After you upload the dataset, VIRGO will give you a randomly-generated key. You need this key to query VIRGO for the predictions computed from your data set. If you wish, you can periodically visit the status web page and enter your key to check the status of your dataset. After VIRGO finishes processing your data, it will insert the predictions into its database and send you an email message to inform you that it has completed making predictions. The email message will contain the key. You can query your predictions as soon as this step completes.
  4. At this stage, if you enter your key on the status page, you have the option of performing leave-one-out cross validation on your dataset. If you choose this option, VIRGO will email you when it completes cross validation. However, you can query your predictions while VIRGO is performing cross validation. Precision and recall results will automatically appear in the query results when VIRGO completes cross validation.
  5. To query your predictions, visit the page for searching predictions. On this web page, select "Predictions" under search type. Enter your key. You need not select a species if you provide the key. However, if you do select a species, please make sure you select the right species! In addition, you can restrict your search by gene name, function name, or function id (in the Gene Ontology). You can also search for predictions with estimated confidence greater than a threshold. You can also specify lower bounds on precision and/or recall; VIRGO will return predictions only for those GO functions for which GAIN achieves cross validation performance satisfying your constraints.
  6. Browse the search results. We hope the predictions are useful and suggest further experiments you can perform and analyse using VIRGO. If you find VIRGO useful, please let us know.
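
The search constraints in step 5 can be pictured as a filter over prediction records. The Python sketch below is only an illustration: the record fields (`gene`, `go_id`, `confidence`, `precision`, `recall`) and the function name are hypothetical and do not describe VIRGO's actual database schema or query code. It assumes the confidence threshold is exclusive ("greater than") and that precision and recall are lower bounds, as the step describes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    """Hypothetical record; the field names are illustrative, not VIRGO's schema."""
    gene: str
    go_id: str
    confidence: float
    precision: Optional[float] = None  # None until cross validation has completed
    recall: Optional[float] = None


def filter_predictions(preds: List[Prediction],
                       min_confidence: float = 0.0,
                       min_precision: Optional[float] = None,
                       min_recall: Optional[float] = None) -> List[Prediction]:
    """Keep predictions that satisfy every constraint that was supplied."""
    kept = []
    for p in preds:
        # Confidence must be strictly greater than the threshold.
        if p.confidence <= min_confidence:
            continue
        # A precision bound excludes functions with no cross-validation result yet.
        if min_precision is not None and (p.precision is None or p.precision < min_precision):
            continue
        if min_recall is not None and (p.recall is None or p.recall < min_recall):
            continue
        kept.append(p)
    return kept
```

In this sketch, a prediction without cross-validation results is dropped whenever a precision or recall bound is given; whether VIRGO behaves the same way before cross validation completes is an assumption.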

Potentially Asked Questions

VIRGO's privacy policy

  1. What is VIRGO's privacy policy?
    Do no evil. Sorry, that is another organisation's policy. While VIRGO does not have any evil intentions, our default policy is that predictions stored in VIRGO based on a biologist's dataset are available only to that biologist (upon entering the right VIRGO key). On the data upload page, the biologist has the option of declaring that all computed predictions based on the uploaded gene expression dataset are public.
  2. Another user may discover my VIRGO key! Why don't you set up a login-based authentication system?
    Since VIRGO keys are long and we generate them at random, the probability that one user can accidentally access private predictions generated for another user is small. Our goal is to ensure that a user can use VIRGO without the hassle of having to register and log in. We believe the current design allows users to keep their data private. We are open to changing this design based on feedback we receive from users.
  3. I have changed my mind. How do I delete the predictions based on one of my datasets or change the privacy status of the predictions?
    Just send us email with the VIRGO key and the email address you entered when you uploaded the dataset. We need the original email address to verify that you are indeed the person who uploaded the dataset (just in case you are sending email from another address). If you ask us to delete the predictions, we are unable to guarantee that we will delete the predictions from any backups that may be stored on our servers. If the authorities ask us for copies of these backups ... sorry, we are again sounding like someone else.
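
To make the "long random key" argument in the second answer concrete, here is a rough Python sketch. It is not VIRGO's key generator: the key length (16 random bytes, shown as 32 hexadecimal characters), the function names, and the number of issued keys are all assumptions chosen for illustration.

```python
import secrets


def new_key(num_bytes: int = 16) -> str:
    """Hypothetical key generator; VIRGO's real key format is not documented here."""
    return secrets.token_hex(num_bytes)  # two hexadecimal characters per random byte


def guess_probability(num_bytes: int, keys_issued: int) -> float:
    """Chance that one uniformly random guess matches any of the issued keys."""
    return keys_issued / 2 ** (8 * num_bytes)


if __name__ == "__main__":
    print(new_key())
    # Assumed scenario: 1000 keys issued, each built from 16 random bytes.
    print(guess_probability(16, 1000))
```

Under these assumed numbers the chance of stumbling on someone else's key is far below one in a billion, which is the intuition behind relying on keys rather than logins.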

The GAIN system

  1. Where can I read about the GAIN algorithm?
    Our approach is described in detail in Whole Genome Annotation using Evidence Integration in Functional Linkage networks, Ulas Karaoz, T. M. Murali, Stan Letovsky, Yu Zheng, Chunming Ding, Charles R. Cantor, and Simon Kasif, Proceedings of the National Academy of Sciences, vol 101, pp.2888--2893, 2004. We will add pointers to papers describing some recent improvements to GAIN as soon as the papers are published.
  2. Where can I obtain the GAIN software?
    The software package implementing the GAIN algorithm, which is the function prediction engine underlying VIRGO, is available under the GNU General Public Licence. The current version is 1.6.

Data Formats

  1. What format should my gene expression dataset be in?
    VIRGO can analyse tab-delimited gene expression data sets. Please ensure that the first column contains an identifier for a gene. Your file may contain columns entitled "NAME" and "GWEIGHT"; VIRGO will ignore these columns. All other columns should contain expression values. The first row should contain the names of the samples. For S. cerevisiae, please ensure that a gene identifier is an ORF name (e.g., YFL039C). For H. sapiens, please ensure that a gene identifier is an Entrez-Gene identifier. We have collected several publicly-available gene expression data sets in this format.
  2. Do you have any test datasets I can try out VIRGO on?
    We have collected a number of publicly-available gene expression data sets. All these files are in the tab-delimited format described above.
  3. Will VIRGO support other formats such as MIAME, NCBI GEO's SOFT format, or Excel spreadsheets?
    We will add support for MIAME-compliant data and data in SOFT format in the future. We are unlikely to support Excel spreadsheets since the Excel format is proprietary.
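
Before uploading, it can help to check a file against the format rules given in the first answer of this section. The Python sketch below is one possible pre-upload check, not part of VIRGO: the function name is hypothetical, and the treatment of empty cells as missing values is an assumption. It does not validate ORF names or Entrez-Gene identifiers.

```python
import csv
import io

IGNORED_COLUMNS = {"NAME", "GWEIGHT"}  # columns the help text says VIRGO ignores


def check_expression_table(text):
    """Return a list of problems found in a tab-delimited expression table."""
    problems = []
    rows = list(csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE))
    if len(rows) < 2:
        return ["need a header row and at least one gene row"]
    header = rows[0]
    # Every column except the first (gene identifier) and the ignored ones holds values.
    value_cols = [i for i, name in enumerate(header)
                  if i > 0 and name.strip() not in IGNORED_COLUMNS]
    if not value_cols:
        problems.append("no sample columns in the header row")
    for lineno, row in enumerate(rows[1:], start=2):
        if not row or not row[0].strip():
            problems.append(f"line {lineno}: missing gene identifier")
            continue
        if len(row) != len(header):
            problems.append(f"line {lineno}: expected {len(header)} columns, found {len(row)}")
            continue
        for i in value_cols:
            cell = row[i].strip()
            if cell == "":
                continue  # assumption: an empty cell marks a missing value
            try:
                float(cell)
            except ValueError:
                problems.append(
                    f"line {lineno}: non-numeric value {cell!r} in column {header[i]!r}")
    return problems
```

Calling `check_expression_table(open("mydata.txt").read())` (with a file name of your own) returns an empty list when no problems are found.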

Controlling GAIN

  1. How long will VIRGO/GAIN take to analyse my dataset?
    The answer depends on the species and on the sizes of the functional annotation and molecular interaction datasets for that species. Currently, our yeast interaction network contains 4711 genes and 13453 interactions. The yeast functional annotations contain 71813 gene-function pairs and annotations for 1710 GO biological processes. The human interaction network contains 6274 genes and 34087 interactions. The human annotation dataset contains 131832 gene-function pairs and 2645 GO biological processes.

    GAIN processes each function in the Gene Ontology independently. It makes predictions at the rate of approximately 1.5 functions per second for yeast and roughly 0.6 functions per second for human. The prediction stage takes about 20-30 minutes for yeast and about an hour for human. In addition, we automatically lay out propagation diagrams using the Graphviz package. This step increases the running time considerably, by as much as a factor of 10. It is difficult to predict how long all the layout steps will take since the time depends on the number of nodes and edges in each propagation diagram.

  2. I am only interested in a subset of the functions in GO.
    We are working on adding a feature that will allow the user to select specific functions of interest. This feature has the potential to considerably speed up VIRGO.
  3. Can I define my own function?
    Not currently. In the future, we will allow the user to upload a text file containing functions and functional annotations defined by her in a simple text format.
  4. When will you support functional predictions for my favourite species?
    We anticipate a major revision of VIRGO in June 2006, which will support a number of other species including A. thaliana, C. elegans, D. melanogaster, and mouse.
  5. How can I input my own functional linkage network?
    Once again, we are working on this feature.
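
The running times quoted in the first answer of this section follow from dividing the number of GO biological processes by the prediction rate. The short Python sketch below reproduces that arithmetic. It assumes one prediction pass per GO biological process listed above and treats the factor of 10 for diagram layout as an upper bound; the function name is illustrative.

```python
def prediction_minutes(num_functions, functions_per_second):
    """Estimated prediction time in minutes, excluding diagram layout."""
    return num_functions / functions_per_second / 60.0


# Figures taken from the answer above.
yeast = prediction_minutes(1710, 1.5)
human = prediction_minutes(2645, 0.6)

for species, minutes in (("yeast", yeast), ("human", human)):
    # Layout of propagation diagrams may multiply the time by up to 10.
    print(f"{species}: about {minutes:.0f} min for prediction, "
          f"up to about {minutes * 10:.0f} min including layout")
```

This gives 19 minutes for yeast and a little over 73 minutes for human, in line with the "20-30 minutes" and "about an hour" quoted above.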

Interpreting VIRGO's results

  1. How do I interpret the confidence value VIRGO computes for each prediction?
    We suggest that you treat the confidence value as a relative measure of how sure VIRGO is that a prediction is correct. In other words, by VIRGO's estimation, a prediction with a confidence value of 0.9 is more plausible than a prediction with a confidence value of 0.5. Evaluating the prediction by also considering the associated propagation diagram may help you understand the rationale behind a prediction.
  2. How do I interpret a propagation diagram?

    The propagation diagram above supports the prediction that gene YNL016W (PUB1) is annotated with the biological process "RNA binding" (GO:0000023). Red rectangles denote genes annotated with this function. Blue diamonds represent genes annotated with a different function. Octagons represent genes that either have no known function or are annotated with a function that is an ancestor of "RNA binding." Of these, the red octagon is the gene of interest (YNL016W). Other octagons represent genes that are also predicted to have this function. Red edges are incident on annotated nodes and help to visualise the flow of information in this network. In addition, VIRGO's propagation diagrams display edge weights, which are computed as described in the supplementary material. Large edge weights indicate greater belief that the genes connected by the edge share the same function.
  3. Propagation diagrams are missing for some of my predictions?!
    We do not lay out a propagation diagram if it contains more than 50 nodes, since the Graphviz software tends to take a long time and use a lot of memory to lay out large graphs.
  4. Why are edge weights obscured on the propagation diagrams?
    Unfortunately, we have to live with this situation. VIRGO uses the excellent Graphviz software package to lay out propagation diagrams. However, laying out edge labels correctly is known to be a very hard problem. See the discussion in the FAQ for Graphviz under the question "Edge label placement in neato is bad." Any improvements implemented in Graphviz will automatically make their way into VIRGO's propagation diagrams.
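
The legend described under "How do I interpret a propagation diagram?" and the 50-node limit can be expressed with standard Graphviz DOT attributes (`shape`, `color`, `label`). The Python sketch below only builds DOT text; it is not VIRGO's layout code. The category names, the function name, and the choice to colour an edge red when it touches a gene annotated with the function in question are assumptions made for illustration.

```python
MAX_NODES = 50  # the help text says larger diagrams are not laid out

# Hypothetical category names mapped to DOT node attributes, following the legend.
STYLE = {
    "annotated": "shape=box, color=red",           # annotated with this function
    "other_function": "shape=diamond, color=blue",  # annotated with a different function
    "query": "shape=octagon, color=red",           # the gene of interest
    "predicted": "shape=octagon",                  # also predicted to have the function
}


def to_dot(nodes, edges):
    """nodes: {gene: category}; edges: [(gene_a, gene_b, weight)].

    Returns DOT text, or None when the diagram exceeds MAX_NODES.
    """
    if len(nodes) > MAX_NODES:
        return None
    lines = ["graph propagation {"]
    for gene, category in sorted(nodes.items()):
        lines.append(f'  "{gene}" [{STYLE[category]}];')
    for a, b, weight in edges:
        attrs = [f'label="{weight:.2f}"']  # edge weight shown as the edge label
        # Assumption: an edge is red when it touches a gene annotated with the function.
        if "annotated" in (nodes[a], nodes[b]):
            attrs.append("color=red")
        lines.append(f'  "{a}" -- "{b}" [{", ".join(attrs)}];')
    lines.append("}")
    return "\n".join(lines)
```

The returned text can be saved to a file and passed to a Graphviz layout program such as neato to see how edge labels end up placed.
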
If there are any questions that you do not see answered, please contact T. M. Murali (murali AT cs DOT vt DOT edu).
T. M. Murali
Last modified: Sat Mar 25 12:16:41 EST 2006