Overview
The volume of data from a single microarray experiment can be enormous.
A single yeast cell cycle experiment with 5 time points using the
entire genome on a chip (around 6400 ORFs) yields 32,000 data values.
Taking associated statistics and gene labels into account for each
gene and each time point, the total number of data elements may
be as high as 250,000.
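The arithmetic behind these figures can be checked with a short sketch. The counts come from the text above; the per-measurement figure is only a back-calculation from the quoted total, not a documented breakdown of which statistics and labels are stored.

```python
# Figures taken from the text: 5 time points, about 6400 ORFs on the chip.
TIME_POINTS = 5
ORFS = 6400
QUOTED_TOTAL = 250_000  # upper estimate quoted in the text


def raw_value_count(time_points: int, orfs: int) -> int:
    """One expression value per ORF per time point."""
    return time_points * orfs


raw = raw_value_count(TIME_POINTS, ORFS)

# Back-calculation (an assumption for illustration): how many stored
# elements per (gene, time point) pair would give the quoted total?
per_measurement = QUOTED_TOTAL / raw

print(raw)              # 32000
print(per_measurement)  # 7.8125
```

In other words, the quoted total corresponds to roughly eight stored elements (value, statistics, labels) for each gene at each time point.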
In earlier studies of gene expression involving only a few genes,
simple statistics and visual analysis were usually sufficient to
extract results from the data. However, more sophisticated statistical
techniques will be required to analyze responses of thousands of
genes, representing the response of entire biochemical pathways
to an environmental or biological stimulus, or the unfolding of
a developmental process. Tractable problems in biology to this point
have mostly involved only a few variables (e.g., fewer than 50 or
100) with many replications. Microarray data have thousands of variables
with only a few replications. When one is studying a complex process
such as a cell's response to a pathogen, there is often no obvious
way to reduce the problem to a smaller set of variables by
restricting attention to a subset of genes.
Two Statistical Approaches
There are two approaches that can be taken to study the collective
behavior of genes, depending on the amount of foreknowledge about
their function and action. These approaches are analogous to the
two major types of classical statistical analyses, the observational
study and the controlled experiment. Exploratory analysis, analogous
to an observational study, is used in stimulus-response experiments
when there is little or no knowledge about all of the genes that
may significantly respond, directly or indirectly, to the studied
stimulus. In studies of collective gene behavior in a gene regulatory
network implementing a developmental program, there may be no model
that suggests itself; in this case, an exploratory approach is warranted.
Exploratory methods typically are formulated as multivariate classification
problems in statistics; for example, hierarchical cluster analysis
(1) and self-organizing maps (2) have been successfully applied
to group genes into functionally relevant classes using their
expression patterns. These methods can also combine expression
data from different experiments to gain discriminating
power. Exploratory methods are naturally most relevant
in the initial stages of an investigation into an effect on a biological
process at the genomic level.
In more focused studies, however, more controlled experiments can
be done. The second approach uses models to study behavior of reasonably
well-characterized gene sets that are functionally related. An example
of such an approach is the Genetics Institute's construction of
a small array of 250 cytokine genes, known to be involved in the
inflammatory response in mammals. Focused studies can be made by
probing this array with mRNA from a cell or tissue induced to express
the response. The type of statistical analysis done for these experiments
may be more standard: analysis of variance, regression, control
theory, for example. However, there will be a need to develop new
models directly relevant to and unique to biological processes
in order to make full use of genomic-scale data.
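As an illustration of the more standard analyses mentioned above, the following is a minimal sketch of a one-way analysis of variance for a single gene measured under several conditions with replicates. The condition names and all data values are hypothetical, and the function one_way_anova_f is written here for illustration; it is not taken from any particular package.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of replicate groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of the replicates around their own group mean.
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical expression measurements for one gene, three replicates each.
untreated = [1.0, 1.2, 0.8]
low_dose = [2.0, 2.1, 1.9]
high_dose = [3.0, 3.3, 2.7]

f_stat, df_b, df_w = one_way_anova_f([untreated, low_dose, high_dose])
print(round(f_stat, 1), df_b, df_w)  # F is approximately 64.3 with (2, 6) df
```

A large F relative to the F distribution with the stated degrees of freedom suggests that the gene's mean expression differs between conditions. With many genes on an array, each such test would also need a correction for multiple comparisons.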
Statistical analysis: the basic setup
In order to do analysis on spot intensity data, some statistical
preparation is required. In the spot quantitation procedure, overall
spot intensities were equalized. Normalization must now be done
to create a more uniform measure that does not depend on the level
of expression. The type of normalization most often used is normalization
to a control using a ratio. In a control/treatment experiment, where
the control mRNA is labelled with Cy3 and the treatment mRNA with
Cy5, say, the ratio of treatment to control is computed for each
gene. In a time or developmental course experiment, each nonzero
time point's intensity (mRNA labelled with Cy5, say) is divided
by the intensity of the time zero probe's labelled mRNA (Cy3). Typically,
ratios less than 0.5 or greater than 2 are taken to indicate possible
down- and up-regulated expression, respectively. Ratios computed from
two low intensities are often discarded because their error is high.
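The ratio step described above can be sketched as follows. This is a minimal illustration, not a published procedure: the gene names, intensity values, and the min_intensity cutoff are assumptions chosen for the example, while the 0.5 and 2 thresholds are the ones given in the text.

```python
def classify_ratios(treatment, control, min_intensity=100.0,
                    low=0.5, high=2.0):
    """Label each gene by its treatment/control (Cy5/Cy3) ratio.

    min_intensity is an assumed cutoff: a spot is discarded when BOTH
    channels fall below it (or when the control is not positive).
    """
    calls = {}
    for gene in treatment:
        t, c = treatment[gene], control[gene]
        if (t < min_intensity and c < min_intensity) or c <= 0:
            calls[gene] = ("discarded", None)
            continue
        ratio = t / c
        if ratio > high:
            label = "up"
        elif ratio < low:
            label = "down"
        else:
            label = "unchanged"
        calls[gene] = (label, ratio)
    return calls


# Hypothetical spot intensities for four genes.
treatment = {"geneA": 4200.0, "geneB": 300.0, "geneC": 1500.0, "geneD": 40.0}
control = {"geneA": 1000.0, "geneB": 1200.0, "geneC": 1400.0, "geneD": 60.0}

calls = classify_ratios(treatment, control)
for gene, (label, ratio) in calls.items():
    print(gene, label)
```

Here geneA (ratio 4.2) is flagged as possibly up-regulated, geneB (ratio 0.25) as possibly down-regulated, geneC is unchanged, and geneD is discarded because both intensities are low.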
Exploratory analysis using hierarchical clustering
Once ratios are calculated and spurious ones removed, they can be
used in exploratory analysis. Hierarchical cluster analysis was
the first large-scale exploratory method used to analyze microarray
data (1). Ratios of expression data from eight yeast microarray
experiments were combined into a single expression profile for every
ORF in the yeast genome, and clustered. Functionally relevant gene
classes involved in the cell cycle, signal transduction, and immediate-early
transcription process were found.
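To make the idea concrete, here is a minimal sketch of agglomerative hierarchical clustering on expression profiles. It assumes average linkage and a distance of one minus the Pearson correlation; these choices, the gene names, and the profile values are assumptions for illustration and should not be read as the exact procedure of reference (1).

```python
import math
from itertools import combinations


def pearson_distance(x, y):
    """1 - Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def average_linkage(profiles, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    names = list(profiles)
    dist = {
        frozenset(p): pearson_distance(profiles[p[0]], profiles[p[1]])
        for p in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def cluster_distance(a, b):
        # Average linkage: mean distance over all cross-cluster gene pairs.
        total = sum(dist[frozenset((g, h))] for g in a for h in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical expression profiles over five time points.
profiles = {
    "geneA": [0.1, 1.0, 2.0, 1.1, 0.2],
    "geneB": [0.0, 0.9, 2.2, 1.0, 0.1],
    "geneC": [2.0, 1.0, 0.1, 1.1, 2.1],
    "geneD": [1.9, 1.2, 0.0, 0.9, 2.0],
}

result = average_linkage(profiles, n_clusters=2)
print(result)  # [['geneA', 'geneB'], ['geneC', 'geneD']]
```

Genes whose profiles rise and fall together end up in the same cluster regardless of absolute level, which is why a correlation-based distance is a natural choice for expression patterns. This pairwise implementation is only practical for small examples; a genome-scale analysis would use an optimized library.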
Inferences from microarray data
The extent to which inferences can be made from microarray data
is not yet known. It is generally agreed that one cannot reconstruct
a biological process from gene expression data alone. In the words
of Dan Pinkel at UCSF, "The statistical problem of going backwards
from just thousands or tens of thousands of measurements to fundamental
biological processes is not a soluble mathematical problem". Making
inferences also depends on the accuracy of the assumptions used
in the interpretation of genomic-scale data. Inaccurate assumptions
can lead to false interpretations and spurious results in these
studies. The probability of false positives (gene activity change
inferred from data when in reality there is none) due to wrong assumptions
is magnified by the high number of variables in an experiment of
this type.
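The effect of the number of variables can be illustrated with a small calculation. This sketch makes a simplifying assumption that is not in the text: each gene is tested independently and has the same per-gene false positive probability, called alpha here. It shows only how the count of genes magnifies errors, not how wrong modelling assumptions create them.

```python
def expected_false_positives(n_genes, alpha):
    """Expected number of genes flagged by chance alone."""
    return n_genes * alpha


def prob_at_least_one(n_genes, alpha):
    """Chance of one or more false positives, assuming independent genes."""
    return 1.0 - (1.0 - alpha) ** n_genes


ALPHA = 0.01  # assumed per-gene false positive probability

# A traditional study of about 50 variables versus a whole-genome yeast chip.
for n_genes in (50, 6400):
    print(
        n_genes,
        round(expected_false_positives(n_genes, ALPHA), 1),
        round(prob_at_least_one(n_genes, ALPHA), 3),
    )
```

Under these assumptions a 50-gene study expects about half a false positive, while a 6400-ORF array expects about 64, and at least one false positive is practically certain.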
Two fundamental assumptions, according to Patrick O. Brown of Stanford
University, are: (1) There is an observable link between the expression
of a gene and its function, and (2) Genes that resemble each other
in expression are likely to participate in similar physiological
processes. In this scenario, post-translational processes are assumed
to have negligible effects relative to global gene expression. Testing
and clarifying these assumptions is an ongoing process, and it
is essential for determining the extent to which inferences can be
made.
References
1. Eisen, M. et al. (1998). Cluster analysis and display of genome-wide
expression patterns. Proceedings of the National Academy of
Sciences 95: 14863-14868.
2. Toronen, P. et al. (1999). Analysis of gene expression data
using self-organizing maps. FEBS Letters 451: 142-146.