
Thursday, May 17, 2007

Demystifying DNA

It's no secret that the misuse of statistics is a particular pet peeve of mine. I have often voiced my resentment that the general public's innumeracy is constantly exploited, and have expressed my desire to rearrange the high school math curriculum to focus more strongly on statistics, even at the loss of other disciplines. A particular issue where this constantly crops up is in DNA evidence.

Not long ago, a bill came before the Maine legislature that would have required DNA data from all sex offenders to be sent to the CODIS registry. I was asked by the Maine Civil Liberties Union to help write testimony against this bill, which I happily did. I raised several concerns about the accuracy of the CODIS statistics and the way in which they would be used in court. I have read several articles suggesting that attorneys for both prosecution and defense tend to eliminate highly numerate jurors through the selection process. This allows them to manipulate statistics through several time-tested methods. Humans have a natural tendency to process stats in terms of frequencies. We intuitively understand what is meant by 1/100, but have to deliberately process the same stat if given as a percentage. Skilled attorneys are aware of this, which is why they employ natural frequencies when presenting evidence as rock solid, and opt for percentages when attempting to cast doubt.

My second big concern is that random matches may be much more common than we have been led to believe. To support this, I must quickly sum up the CODIS process. When a sample is screened at the lab, the technicians aren't mapping the entire double helix. This would waste a lot of time, because even though there is a lot of variation between members of our species, there's still an awful lot of similarity. Natural selection sees to that quite nicely. However, as we have slowly cracked the genetic code, we have discovered elements of the genome that do not appear to do anything. We call it "junk DNA" and it appears to be the result of replication and transcription errors. Because these "genes" do not manifest themselves outwardly, they are not subject to selection pressures and are therefore highly variant. This makes them ideal for identification purposes.

Now I can get to my point.

When we are presented with the likelihood of a match randomly occurring in the population, we are often impressed by the enormity of the numbers. One in 15 quadrillion translates as "case closed and who's buying the beer?" There are two problems with this, the first of which I realized at the time I wrote the testimony and the second I just borrowed from Keith Devlin. Before you can judge guilt or innocence, you have to firmly establish which "population" you're talking about. The 1/15,000,000,000,000,000 stat refers to the probabilities of the entire human population. The stats change dramatically when you look at individual subsets. For example, while the 13 sites used in the CODIS registry vary dramatically as I said, they vary far less within particular ethnic groups. The closer the specimens are to one another on the family tree, the more likely a match becomes. A man with lineage to sub-Saharan Africa is much more likely to match someone of that descent than he would someone from Scandinavia. For close blood relatives the probability of a match sky-rockets.

The other point, which I failed to realize at the time, is the implied precision of the stats themselves. As Devlin points out in his excellent article, numbers like 1 in 15 quadrillion come from extremely naive applications of the product rule. It is true that if the probability of rolling a six with a fair die is 1/6, then the probability of doing it twice in a row is 1/6 x 1/6 = 1/36. The probability of doing it three times is 1/216, and so on. This holds true for any situation where the events are independent of one another. But this is a mathematical ideal. In practice, the actual outcome can be way off the prediction, although with enough attempts, the reality will move closer to the expected value. So even in this simplified case, empirical testing will yield far more "accurate" results than a simple application of the product rule. A number like 15 quadrillion implies a degree of precision that is so ridiculously beyond our abilities that, as Devlin says, it is laughable.
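The die arithmetic above can be checked with a short sketch. It assumes a fair die and independent rolls, and the function names are made up for illustration. The first function multiplies the probabilities exactly; the second estimates the same quantity by simulation, to show how far a finite run of trials can drift from the ideal figure.

```python
from fractions import Fraction
import random


def product_rule(p, k):
    """Probability that k independent events, each with probability p, all occur."""
    return p ** k


def simulate_repeated_sixes(k, trials, seed=0):
    """Estimate empirically how often k rolls of a fair die are all sixes."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    hits = sum(
        all(rng.randint(1, 6) == 6 for _ in range(k))
        for _ in range(trials)
    )
    return hits / trials


p_six = Fraction(1, 6)
for k in (1, 2, 3):
    exact = product_rule(p_six, k)
    estimate = simulate_repeated_sixes(k, trials=100_000)
    print(f"{k} six(es) in a row: exact {exact}, simulated {estimate:.5f}")
```

The exact values are 1/6, 1/36 and 1/216. The simulated values will sit near them without matching them, and the gap narrows as the number of trials grows.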

The most interesting figures in Devlin's article come out of empirical tests of the CODIS registry.

As far as I am aware, to date there has been only one attempt to do this [an empirical test of the registry], and the results obtained were both startling and worrying. A study of the Arizona CODIS database carried out in 2005 showed that approximately 1 in every 228 profiles in the database matched another profile in the database at nine or more loci, that approximately 1 in every 1,489 profiles matched at 10 loci, 1 in 16,374 profiles matched at 11 loci, and 1 in 32,747 matched at 12 loci.

How big a population does it take to produce so many matches that appear to contradict so dramatically the astronomical, theoretical figures given by the naive application of the product rule? The Arizona database contained at the time a mere 65,493 entries. Scary isn't it?

It is not much of a leap to estimate that the FBI's national CODIS database of 3,000,000 entries will contain not just one but several pairs that match on all 13 loci, contrary (and how!) to the prediction made by proponents of the currently much touted RMP that you can expect a single match only when you have on the order of 15 quadrillion profiles.
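For a sense of scale, here is a rough sketch of the arithmetic behind the database sizes quoted above. Searching a database against itself compares every profile with every other profile, so the number of comparisons is the number of pairs, not the number of entries. The sketch then applies the naive 1 in 15 quadrillion figure to those pairs, under the simplifying assumption (mine, not Devlin's) that every pair is an independent comparison with the same match probability. The constant and function names are hypothetical.

```python
from math import comb

ARIZONA_PROFILES = 65_493        # database size quoted for the 2005 Arizona study
NATIONAL_PROFILES = 3_000_000    # national database size quoted above
RMP_ONE_IN = 15 * 10**15         # the "1 in 15 quadrillion" figure


def pair_count(n):
    """Number of distinct pairs of profiles in a database of n entries."""
    return comb(n, 2)


def expected_matching_pairs(n, one_in):
    """Expected matching pairs if each pair matches with probability 1/one_in.

    Assumes every pair is an independent comparison with the same match
    probability, which is a simplification.
    """
    return pair_count(n) / one_in


print(f"Arizona pairs:  {pair_count(ARIZONA_PROFILES):,}")
print(f"National pairs: {pair_count(NATIONAL_PROFILES):,}")
print(f"Expected full matches nationally under the naive figure: "
      f"{expected_matching_pairs(NATIONAL_PROFILES, RMP_ONE_IN):.6f}")
```

The Arizona database holds about 2.1 billion pairs and the national one about 4.5 trillion. Note that the 1 in 15 quadrillion figure is the one quoted for a full 13-locus match, while the Arizona counts are for matches at 9 to 12 loci, so the two are not directly comparable.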

Some of you will have misread this entire post. You will accuse me of being a damn liberal, commie, lefty who wants child molesters to run candy stores. That could not be further from the truth. In fact, if someone lays one hand on a child I know, they had better hope that the authorities get to them before I do. But I simply cannot stomach an innocent person going to prison, especially if it is the result of a misapplication of mathematics. Not even Hardy would want to apologize for that.


Charles Brenner said...

Two points: Devlin has disavowed the column that you quoted extensively. In a later column ( he makes clear that he knows better, and claims that the earlier column was meant to represent the kind of misunderstanding that a lay person might have. My web page at is one of many analyses showing that there is, contrary to the apparent (but disavowed) message of Devlin's early column, nothing "scary" about the observed data on partial DNA matches in the Arizona CODIS database or anywhere else.

Second, while 15 quadrillion may be an unrealistic number in various respects, your criticism that it is contrived because it represents inter-ethnic matching chances is not correct. It represents the typical matching chance for two randomly selected unrelated persons of the same ethnicity. If we drop the artificial "unrelated", yes the chance increases but not notably unless the randomly selected people happen to be close relatives (nephew, son, brother).

Charles Brenner, PhD
Forensic mathematics

Tony said...

Thank you for such an informed critique. I will be posting a revision based on your article.