NSF Soybean Functional Genomics
Vodkin Laboratory, University of Illinois
NSF Soybean Functional Genomics Project

Statistical Analysis of Expression, page 1-4
Instructor: Robin Shealy

 

Statistics of Expression Analysis

Overview

The volume of data from a single microarray experiment can be enormous. A single yeast cell cycle experiment with 5 time points using the entire genome on a chip (around 6400 ORFs) yields 32,000 data values. Taking associated statistics and gene labels into account for each gene and each time point, the total number of data elements may be as high as 250,000.

In earlier studies of gene expression involving only a few genes, simple statistics and visual analysis were usually sufficient to harvest results from the data. However, more sophisticated statistical techniques will be required to analyze the responses of thousands of genes, representing the response of entire biochemical pathways to an environmental or biological stimulus, or the unfolding of a developmental process. Tractable problems in biology to this point have mostly involved only a few variables (e.g., fewer than 50 or 100) with many replications. Microarray data have thousands of variables, with only a few replications. When one is studying a complex process such as a cell's response to a pathogen, there is in many cases no obvious way to reduce the problem to a smaller set of variables by restricting attention to a subset of genes.

Two Statistical Approaches

There are two approaches that can be taken to study the collective behavior of genes, depending on the amount of foreknowledge about their function and action. These approaches are analogous to the two major types of classical statistical analysis, the observational study and the controlled experiment. Exploratory analysis, analogous to an observational study, is used in stimulus-response experiments when little or nothing is known about which genes may respond significantly, directly or indirectly, to the studied stimulus. In studies of collective gene behavior in a gene regulatory network implementing a developmental program, there may be no model that suggests itself; in this case, an exploratory approach is warranted. Exploratory methods are typically formulated as multivariate classification problems in statistics; for example, hierarchical cluster analysis (1) and self-organizing maps (2) have been successfully applied to group genes into functionally relevant classes using their expression patterns. These methods can even combine expression data from different experiments to gain more discriminating power. Exploratory methods are naturally most relevant in the initial stages of an investigation into an effect on a biological process at the genomic level.

In more focused studies, however, more controlled experiments can be done. The second approach uses models to study the behavior of reasonably well-characterized gene sets that are functionally related. An example of such an approach is the Genetics Institute's construction of a small array of 250 cytokine genes known to be involved in the inflammatory response in mammals. Focused studies can be made by probing this array with mRNA from a cell or tissue induced to express the response. The statistical analysis done for these experiments may be more standard, for example analysis of variance, regression, or control theory. However, new models that are directly relevant to and unique to biological processes will need to be developed to make full use of genomic-scale data.

Statistical analysis: the basic setup
Before spot intensity data can be analyzed, some statistical preparation is required. In the spot quantitation procedure, overall spot intensities were equalized. The data must now be normalized to create a more uniform measure that does not depend on the level of expression. The type of normalization most often used is normalization to a control using a ratio. In a control/treatment experiment, where the control mRNA is labelled with Cy3 and the treatment mRNA with Cy5, say, the ratio of treatment to control is computed for each gene. In a time or developmental course experiment, each nonzero time point's intensity (mRNA labelled with Cy5, say) is divided by the intensity of the time zero probe's labelled mRNA (Cy3). Typically, ratios less than ½ or greater than 2 are taken to indicate possible down- and up-regulated expression, respectively. Ratios in which both intensities are low are often discarded because of their high error.
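The ratio computation and the twofold rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example intensities, and the low-intensity cutoff of 100 are hypothetical and are not taken from the workshop materials.

```python
def expression_ratios(cy5, cy3, min_intensity=100.0):
    """Return treatment/control (Cy5/Cy3) ratios, one per gene.

    A ratio is replaced by None when both intensities fall below
    min_intensity (a hypothetical cutoff), since such ratios carry
    high error. A non-positive control is also discarded to avoid
    dividing by zero.
    """
    ratios = []
    for treatment, control in zip(cy5, cy3):
        both_low = treatment < min_intensity and control < min_intensity
        if control <= 0 or both_low:
            ratios.append(None)
        else:
            ratios.append(treatment / control)
    return ratios


def call_regulation(ratio, fold=2.0):
    """Label a ratio using the twofold rule: > 2 is up, < 1/2 is down."""
    if ratio is None:
        return "discarded"
    if ratio > fold:
        return "up"
    if ratio < 1.0 / fold:
        return "down"
    return "unchanged"


# Made-up intensities for four genes (control = Cy3, treatment = Cy5).
cy3 = [1000.0, 800.0, 50.0, 1200.0]
cy5 = [2500.0, 300.0, 40.0, 1300.0]

ratios = expression_ratios(cy5, cy3)
calls = [call_regulation(r) for r in ratios]
print(calls)
```

In a time course experiment the same functions would apply, with the time zero intensities taking the place of the control channel.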

Exploratory analysis using hierarchical clustering
Once ratios are calculated and spurious ones removed, they can be used in exploratory analysis. Hierarchical cluster analysis was the first large-scale exploratory method used to analyze microarray data (1). Ratios of expression data from eight yeast microarray experiments were combined into a single expression profile for every ORF in the yeast genome, and the profiles were clustered. Functionally relevant gene classes involved in the cell cycle, signal transduction, and the immediate-early transcription process were found.
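To make the idea concrete, the following is a minimal sketch of agglomerative (bottom-up) hierarchical clustering on expression profiles. It assumes average linkage with a distance of 1 minus the Pearson correlation, and it assumes the ratios have been log2-transformed; these choices, the gene names, and the profile values are illustrative assumptions, not a reproduction of the analysis in (1).

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Assumes neither profile is constant (which would divide by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    names = list(profiles)
    # Pairwise distance between genes: 1 - correlation.
    dist = {
        (a, b): 1.0 - pearson(profiles[a], profiles[b])
        for a in names for b in names if a != b
    }

    def linkage(c1, c2):
        # Average distance over all cross-cluster gene pairs.
        total = sum(dist[(a, b)] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical log2 ratios at four time points for four genes.
profiles = {
    "geneA": [0.1, 1.0, 2.0, 1.1],
    "geneB": [0.0, 0.9, 2.2, 1.0],
    "geneC": [0.2, -1.0, -2.1, -0.9],
    "geneD": [0.0, -1.2, -1.9, -1.1],
}

clusters = average_linkage(profiles, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Stopping at a fixed number of clusters is a simplification; a full hierarchical analysis would record every merge to build the tree. For data sets with thousands of genes, an optimized library implementation would be preferable to this quadratic-memory sketch.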

Inferences from microarray data
The extent to which inferences can be made from microarray data is not yet known. It is generally agreed that one cannot reconstruct a biological process from gene expression data alone. In the words of Dan Pinkel at UCSF, "The statistical problem of going backwards from just thousands or tens of thousands of measurements to fundamental biological processes is not a soluble mathematical problem". Making inferences also depends on the accuracy of the assumptions used in the interpretation of genomic-scale data. Inaccurate assumptions can lead to false interpretations and spurious results in these studies. The probability of false positives (gene activity change inferred from data when in reality there is none) due to wrong assumptions is magnified by the high number of variables in an experiment of this type.
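One simple way to see how a large number of variables magnifies false positives is to count how many genes would be flagged by chance alone. The sketch below assumes that each gene is tested independently with the same per-gene false positive rate; both the independence assumption and the rate of 0.01 are illustrative and do not come from the original text. The gene count of 6400 matches the yeast example in the overview.

```python
def expected_false_positives(n_genes, alpha):
    """Expected number of genes flagged when none truly changes."""
    return n_genes * alpha


def prob_at_least_one(n_genes, alpha):
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_genes


n_genes = 6400   # approximate number of yeast ORFs on the chip
alpha = 0.01     # hypothetical per-gene false positive rate

print(expected_false_positives(n_genes, alpha))
print(prob_at_least_one(n_genes, alpha))
print(prob_at_least_one(10, alpha))  # a small study, for comparison
```

With only a handful of genes, a false positive is unlikely; with thousands, at least one is nearly certain, so the per-gene error rate alone is a poor guide to the reliability of the whole experiment.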

Two fundamental assumptions, according to Patrick O. Brown of Stanford University, are: (1) There is an observable link between the expression of a gene and its function, and (2) Genes that resemble each other in expression are likely to participate in similar physiological processes. In this scenario, post-translational processes are assumed to have negligible effects relative to global gene expression. Testing and clarification of these assumptions, which is an ongoing process, is essential to determine the extent to which inferences can be made.

References

  1. Eisen, M. et al. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95: 14863-14868.

  2. Toronen, P. et al. (1999). Analysis of gene expression data using self-organizing maps. FEBS Letters 451: 142-146.

 

 

* Department of Crop Sciences
* College of Agricultural, Consumer, and Environmental Sciences
* University of Illinois at Urbana-Champaign


Copyright © 2000 University of Illinois at Urbana-Champaign