Tuesday 18 May 2010

How to sample randomly

I was once told that 'duration of unemployment' figures were collected in the following way: people were telephoned at random during the day. If they answered, they were asked if they were unemployed. If they said yes, they were asked how long they had been unemployed. Before you read any further, can you see what is horribly, horribly wrong with this method of data collection? (There are several things wrong with it, but one of them renders it entirely useless.)

Before I give you the answer, a brief detour. When I was at school, I did a piece of statistics coursework (I think it was for GCSEs) in which I compared the average sentence length in a French text to that in an English text. I can't remember exactly which texts I chose (I think they were newspapers of 'equivalent' quality), but that's largely irrelevant. In order to estimate the average length of sentences in each text, I adopted the following method: pick a word uniformly at random from the text and count the number of words in the sentence containing it.

I had collected around 100 sentence lengths before I noticed the utter ridiculousness of this method. In case anyone hasn't spotted it yet, this 'random sampling' is guaranteed to massively overestimate the average sentence length in any given document, as the probability of any given sentence being chosen is in direct proportion to its length.

Consider the following passage:

"The quick brown fox jumps over the lazy dog whilst the five boxing wizards jump quickly over my lovely sphinx of quartz. Jesus wept"

If we pick a few random words from this and compute the 'average' length of the sentences that contain them, we're going to come up with something very close to 20 (if we pick every single word, we'll get (22 × 22 + 2 × 2) / 24 = 20.3333). The actual average sentence length is (22 + 2) / 2 = 12.
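For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The variable names are mine, and splitting sentences on full stops is a simplifying assumption that happens to work for this passage; it is not a general sentence splitter.

```python
passage = ("The quick brown fox jumps over the lazy dog whilst the five boxing "
           "wizards jump quickly over my lovely sphinx of quartz. Jesus wept")

# Assumption: sentences are separated by full stops, words by whitespace.
sentences = [s.split() for s in passage.split(".") if s.strip()]
lengths = [len(words) for words in sentences]

# The real average: every sentence counts once.
true_mean = sum(lengths) / len(lengths)

# Picking every word once: a sentence of n words is recorded n times,
# each time contributing the value n.
word_sampled_mean = sum(n * n for n in lengths) / sum(lengths)

print(lengths, true_mean, round(word_sampled_mean, 4))
```

The second figure is what the 'pick a word' method converges to as more words are sampled, which is why collecting more data does nothing to fix the problem.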

Now, perhaps you didn't immediately spot that this was the key problem with the method of collecting unemployment data I mentioned in the first paragraph: people with long spells of unemployment are more likely to be unemployed when the phone rings, just as long sentences are more likely to contain the chosen word. (There are problems with telephone polls in general, of course, but they are essentially insignificant compared to the problem with the sampling method.) If so, this should make you worry about how easy it is to slip *exceedingly* dodgy statistics past people who aren't paying attention. I'll post a few examples of my favourite 'correlated for spurious reasons' statistics in another post later this week.

As an aside - if you do actually collect the data in the way suggested, you can presumably still get some information about the distribution you're studying - what's your best estimator for the mean? And what assumptions do you have to make about how the data are distributed?
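One possible answer, offered as a sketch rather than the definitive one: if each item is picked with probability proportional to its length, and you observe its full length, then the harmonic mean of the sampled lengths is a consistent estimator of the true mean. The code below simulates this for the two-sentence passage. The function names, the seed, and the sample size are my own choices for illustration. Note the assumption about observing full lengths: it holds for sentences, but not as it stands for unemployment spells that are still in progress when the phone rings.

```python
import random

def length_biased_sample(lengths, k, rng):
    # Picking a word uniformly at random selects its sentence
    # with probability proportional to the sentence's length.
    return rng.choices(lengths, weights=lengths, k=k)

def harmonic_mean(values):
    # Reciprocal of the average reciprocal.
    return len(values) / sum(1.0 / v for v in values)

rng = random.Random(0)   # fixed seed so the run is repeatable
lengths = [22, 2]        # sentence lengths from the passage above

sample = length_biased_sample(lengths, 10000, rng)
naive = sum(sample) / len(sample)   # overshoots, tends towards 20.33
corrected = harmonic_mean(sample)   # tends towards the true mean, 12

print(naive, corrected)
```

The correction works because the expected value of 1/length under length-biased sampling is exactly 1 divided by the true mean. It can be noisy when short items are rare in the sample, since each one carries a lot of weight.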
