Monday, April 9, 2007

Hi Ho, Hi Ho, A-Data-Mining We Go

A gigantic study on astrology was just completed across the pond at the University of Manchester. The researchers were looking to see if there was any correlation between astrological signs and marital success. They looked at over 10 million couples. They didn't find anything.

Now obviously astrology is crap, and if you think otherwise, this isn't the blog for you. But I want to discuss the details of this study further, because its only saving grace was its sheer magnitude. (A more in-depth discussion appears at the Skeptics Guide.) They got the data from a 2001 census, and out of the 10 million couples, they found no statistically meaningful connections. This is the merit of using such a large data set: trivial coincidences are averaged out. Had they used a smaller set, they would quite likely have found a more "meaningful" result.
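If you want to see the averaging-out for yourself, here is a rough sketch in Python. It is not the Manchester method, just a toy under my own assumptions: every partner gets one of 12 signs at random, all signs are equally likely, and partners are chosen with no regard to sign. The function name and the sample sizes are made up for illustration. It reports how far the most lopsided sign pairing strays from its expected count.

```python
import random
from collections import Counter

SIGNS = 12  # zodiac signs, numbered 0..11 (assumed equally common)

def largest_relative_deviation(n_couples, rng):
    """Give both partners independent random signs, then report how far the
    most unusual of the 144 pairings strays from its expected count,
    as a fraction of that expected count."""
    counts = Counter(
        (rng.randrange(SIGNS), rng.randrange(SIGNS)) for _ in range(n_couples)
    )
    expected = n_couples / (SIGNS * SIGNS)
    return max(
        abs(counts[(a, b)] - expected) / expected
        for a in range(SIGNS)
        for b in range(SIGNS)
    )

rng = random.Random(2007)  # fixed seed so the run is repeatable
small = largest_relative_deviation(1_000, rng)
large = largest_relative_deviation(1_000_000, rng)
print(f"1,000 couples:     largest deviation {small:.0%}")
print(f"1,000,000 couples: largest deviation {large:.1%}")
```

There is no real effect anywhere in this data, yet the small sample will typically show some pairing that looks wildly over- or under-represented, while the large sample stays within a few percent of what chance predicts.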

You see, they weren't looking for a particular correlation. There was no hypothesis. They didn't say, "Astrology predicts that Cancers will be attracted to Capricorns," then look for that to bear out in the data. They were just looking for anything interesting. That's called data-mining. It's a common ploy of bad science. Let me give another example. Let's say you have a theory about family pets. You have noticed lately at the dog park that women seem to prefer smaller dogs, while men prefer larger ones. You call up the local vets and convince them to give you information on the gender of the owners versus the weight of their dogs. Maybe your idea is supported by the data; maybe it isn't. But you've at least conducted a reasonable study.

Now let's go data-mining. You gather a large batch of data about people and their dogs, and after poring over it you find something curious. Every Dalmatian is owned by a man named "Steve." Most German Shepherds are owned by people with French-sounding names. And more often than not, Chihuahua owners like to paint their fingernails some shade of pink. Can any meaningful conclusion be drawn from this? Of course not. Yet this is exactly the kind of thing that certain researchers try to do each year.
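Here is a small Python sketch of that kind of fishing trip. Everything in it is invented for illustration: 200 owner-and-dog records, 30 yes/no owner traits and 30 yes/no dog traits, all generated by coin flips, so nothing is truly related to anything. It then checks every owner trait against every dog trait with a plain 2x2 chi-square test and counts how many clear the usual 5% bar (a statistic above 3.841 for one degree of freedom).

```python
import random

def chi_square_2x2(xs, ys):
    """Chi-square statistic for two yes/no traits recorded on the same records."""
    n = len(xs)
    a = sum(1 for x, y in zip(xs, ys) if x and y)
    b = sum(1 for x, y in zip(xs, ys) if x and not y)
    c = sum(1 for x, y in zip(xs, ys) if not x and y)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # a trait that never varies cannot correlate with anything
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

rng = random.Random(42)        # fixed seed so the run is repeatable
N_RECORDS, N_TRAITS = 200, 30  # hypothetical sizes, chosen for illustration
CRITICAL_5_PERCENT = 3.841     # chi-square cutoff, 1 degree of freedom

def random_trait():
    """One made-up yes/no trait, decided by a fair coin flip per record."""
    return [rng.random() < 0.5 for _ in range(N_RECORDS)]

owner_traits = [random_trait() for _ in range(N_TRAITS)]
dog_traits = [random_trait() for _ in range(N_TRAITS)]

# Test every owner trait against every dog trait: 900 comparisons in all.
hits = [
    (i, j)
    for i, owner in enumerate(owner_traits)
    for j, dog in enumerate(dog_traits)
    if chi_square_2x2(owner, dog) > CRITICAL_5_PERCENT
]
print(f"{len(hits)} of {N_TRAITS * N_TRAITS} pairings look 'significant'")
```

At a 5% threshold you would expect somewhere around 45 of the 900 pairings to pass by luck alone, and each one is a Steve-and-his-Dalmatian waiting to be written up. Pick one pairing in advance, as in the dog park example, and you only get a single 5% shot at fooling yourself.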

If you go looking for an unspecific anything, you're bound to find something. It may sound obvious, but it's easy to fall victim to this kind of thinking. So be careful out there.

