pects, while the truly significant genes slip
As results from huge new studies roll
in, these challenges are attracting more
attention. The National Institutes of
Health held a special meeting in March
to discuss how to translate genome-wide
association data into clinical research
and practice. And scientific journals are
publishing special papers instructing
researchers in the art of interpreting
genome-wide association studies.
“First you have to go through a sifting
process, filtering the true signals from the
false signals and making sure you don’t
miss any,” says epidemiologist Muin
Khoury of the Centers for Disease Control
and Prevention in Atlanta. “I think this is
as much of an art as a science right now.”
Deluge of data
For all the apparent variation among
people, the human genetic code is actually
99.5 percent identical from person to person. That remaining individualistic half
percent can help explain how diseases
develop in some people and not others.
Genome-wide association study researchers rely on results from two big projects to
guide them to these crucial areas.
The first project, the government-sponsored Human Genome Project,
analyzed a human genome archetype
and transcribed the 3.2 billion nucleotide
“genetic letters” that make up human
DNA. Using this framework, the nonprofit International HapMap Project is
pinpointing the 11 million specific sites
along the genome where genetic information differs by a single letter. About 4 million sites have been cataloged so far.
Usually these one-letter sites, called
single nucleotide polymorphisms, or SNPs
(pronounced “snips”), do not themselves
cause disease. But the SNPs often lie near
important genes that can. So the SNPs
serve as convenient signposts — pointing
researchers to important disease-related
genes in the neighborhood.
To find SNPs linked with a certain
disease, the simplest approach is to compare groups of volunteers side by side.
Researchers recruit a group of breast
cancer patients, say, and a group of similar people who are breast cancer-free. The
researchers use “SNP chips” — microchips
that test up to 1 million selected SNPs at
once — and record the versions of each
SNP that each person possesses.
Then researchers statistically compare SNPs in the groups. If most breast
cancer patients had two “T” versions of
the SNP known as ESR1002, for example,
and most disease-free volunteers had two
“G” versions, then researchers would flag
ESR1002 as a possible breast cancer suspect. Further investigation might then
point to an important gene nearby.
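The side-by-side comparison described above can be sketched in a few lines of code. This is only an illustration, not the method of any particular study: the genotype counts are invented, the helper names `allele_counts` and `chi_square_2x2` are hypothetical, and the choice of a simple Pearson chi-square test on allele counts is an assumption about how such a comparison might be run.

```python
import math

def allele_counts(tt, tg, gg):
    """Convert genotype counts (TT, TG, GG people) into (T, G) allele counts."""
    return 2 * tt + tg, 2 * gg + tg

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = 0.0
    for observed, row, col in ((a, row1, col1), (b, row1, col2),
                               (c, row2, col1), (d, row2, col2)):
        expected = row * col / n
        chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Invented counts for one SNP: mostly TT among patients, mostly GG among controls.
cases = allele_counts(tt=600, tg=300, gg=100)
controls = allele_counts(tt=100, tg=300, gg=600)

chi2, p_value = chi_square_2x2(*cases, *controls)
print(f"chi-square = {chi2:.1f}, p-value = {p_value:.3g}")
```

A very small p-value here would flag the SNP as a suspect. On a real chip the same comparison would be repeated for every SNP tested.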
Yet a million SNPs on a chip still
means a million potential suspects to sift
through — most of which are ultimately
not related to the disease. The flood of
information is potentially overwhelming.
As the title of a New England Journal of
Medicine editorial last summer described
it, genome-wide association studies are
like “drinking from the fire hose.”
In fact, most statistical methods were
built to deal with data scarcity, not to handle a data deluge. So when a genome scan
delivers its data — four to five thousand
times more information than in traditional epidemiology studies — standard
statistical methods can easily choke.
For example, a genome-wide association study of 1 million SNPs will flag about
50,000 SNPs as significant. But most will
be false alarms, indistinguishable from
real results. Worse yet, truly interesting SNPs may be ignored and never get
flagged in the first place.
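The 50,000 figure is consistent with simple arithmetic if each SNP is tested against a 5 percent significance cutoff. That cutoff is an assumption here, since the article does not state it. The sketch below works out the expected number of chance flags, then simulates one scan in which no SNP is truly linked to the disease.

```python
import random

N_SNPS = 1_000_000  # SNPs tested on one chip (figure from the article)
ALPHA = 0.05        # assumed per-test cutoff; the article does not state it

# Expected number of SNPs flagged by chance alone if none is truly associated.
expected_false_alarms = N_SNPS * ALPHA

# Simulation: with no real association, each test's p-value is uniform on [0, 1).
rng = random.Random(0)  # fixed seed so the run is repeatable
simulated_flags = sum(1 for _ in range(N_SNPS) if rng.random() < ALPHA)

print(f"expected by arithmetic: {expected_false_alarms:,.0f}")
print(f"flagged in one simulated null scan: {simulated_flags:,}")
```

Under these assumptions, tens of thousands of SNPs clear the hurdle even though none of them is a real signal.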
The problem lies in how results get
flagged. Statistical methods essentially
set a cutoff value that any result must
surmount before being flagged as significant — a statistical hurdle, in a sense.
Traditional hurdles do a good job of separating true results from bogus ones when
there aren’t many competitors in the race.
[Figure: Humans share the same genes, but those genes can carry slight variations. Labels: 23 chromosome pairs; SNP (single-letter difference in DNA code).]