Organising Life: Modelling and statistics

This week we've been back in the wonderful world of statistics. I quite like this stuff. One particularly important concept that's really been driven home is that anything beyond the most basic of statistics is actually a type of modelling. But modelling involves models, which by definition aren't real, so how do they help to analyse real data?

The most basic: descriptive statistics
First, a quick reminder of the sorts of statistics most people are familiar with. These are called descriptive because they simply look at the data and tell you exactly what's there.

These describe the central tendency (the 'average'):

Mean ('average')
Median (middle value if you listed them all in order)
Mode (most common value)

These describe the spread (variability):

Range (difference between the largest and smallest values)
Variance (average squared distance from the mean)
Standard deviation (square root of the variance)

These might be simple, but they can be very helpful. I'm always finding the means and standard deviations of things. Their limitation is that they only describe the data you have, whereas in biology we usually want to take our data and make predictions from it. You might have studied the amount of cabbage eaten by caterpillars reared at 10°, 15° and 20°, but what about temperatures in between? It's sensible to use your data to predict that, but descriptive statistics won't help you do it.
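For anyone who wants to try these out, here's a minimal sketch using Python's built-in `statistics` module. The cabbage numbers are invented purely for illustration, and the variable names are my own.

```python
import statistics

# Hypothetical data, invented for illustration:
# grams of cabbage eaten by seven caterpillars.
cabbage_eaten = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 12.0]

# Central tendency
mean = statistics.mean(cabbage_eaten)
median = statistics.median(cabbage_eaten)
mode = statistics.mode(cabbage_eaten)

# Spread
value_range = max(cabbage_eaten) - min(cabbage_eaten)
variance = statistics.variance(cabbage_eaten)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(cabbage_eaten)      # square root of the sample variance

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, variance={variance:.2f}, sd={std_dev:.2f}")
```

Notice that the one greedy caterpillar pulls the mean above the median, which is part of why it's worth looking at more than one of these numbers. None of them, though, says anything about a caterpillar you didn't measure.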

The ones that use models: parametric statistics
Here's some fake data about a particularly difficult video game:

Sixteen people each played one of the game's eight levels, and the number of times each player's character died before finishing the level was recorded. It seems that, in general, the higher levels were more difficult.

Unfortunately, the researcher had left her glasses at home and didn't realise that only every other level was tested. What about the missing levels?

The statistically knowledgeable reader might suggest drawing a line of best fit:

Now we aren't limited to reading up from existing x values. We can read from anywhere, because the line of best fit should represent the general pattern of the data. These in-between values aren't real, but from what we already know we can make a reasonable guess of what they are. For example, it looks like someone playing level three can expect to die seven or eight times.

This is exactly what a model does: the line of best fit is a model of the data. I've been learning to use linear models, which are lines drawn to fit the data as closely as possible. The better the model fits the data, the more faithful its predictions will be, so the more we can trust them. Parametric statistics are so named because they assume the data can be described by a small set of parameters, like the mean and the variance.
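Here's a rough sketch of how a line of best fit can be worked out by ordinary least squares in plain Python. The numbers are invented for illustration and aren't the values from the plot. I've assumed the even-numbered levels were the ones tested, and `fit_line` and `predict` are names I've made up.

```python
# Hypothetical data, invented for illustration: four players on each tested level.
# Assumes the even-numbered levels were the ones tested.
deaths_by_level = {
    2: [3, 5, 6, 6],
    4: [7, 8, 10, 11],
    6: [14, 15, 17, 18],
    8: [17, 20, 21, 22],
}

# Flatten into paired lists of x (level) and y (deaths) values.
levels = [float(level) for level, ds in deaths_by_level.items() for _ in ds]
deaths = [float(d) for ds in deaths_by_level.values() for d in ds]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = fit_line(levels, deaths)


def predict(level):
    """Read a value off the fitted line, including for untested levels."""
    return intercept + slope * level


for level in (3, 5, 7):
    print(f"Level {level}: about {predict(level):.1f} deaths expected")
```

The whole model is just two parameters, the intercept and the slope. With these made-up numbers the line suggests a little over seven deaths for level three, even though nobody in the data played it.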

A line fits this data pretty well, but what about data that follows different shapes, like arcs? Mathematically, you describe an arc in a similar way to a straight line, but with an extra parameter that makes it a bit more complicated. To make a line with multiple bends you add more parameters, and so on. The video game data isn't a perfect straight line, so you might extend its model to end up with something like this:

Or even this:

The last one might describe the data perfectly. However, it's going to be extremely complicated and difficult to make predictions from. This model is as impenetrable as real life, which defeats the point of having a model.
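To get a feel for this, here's a small sketch comparing a straight line with a cubic curve forced through every point. To keep it short I've used one invented mean per tested level rather than all sixteen players, and again assumed the even-numbered levels were tested. The function names are my own, and the cubic is built with Lagrange interpolation, which is just one convenient way of getting a curve that hits every point.

```python
# Hypothetical mean deaths per tested level, invented for illustration.
levels = [2.0, 4.0, 6.0, 8.0]
mean_deaths = [5.0, 9.0, 16.0, 20.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (two parameters)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def exact_curve(xs, ys, x):
    """Evaluate the polynomial of degree len(xs) - 1 that passes through every
    point (Lagrange interpolation). With four points this is a cubic, which
    takes four parameters to write down."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


intercept, slope = fit_line(levels, mean_deaths)

# Sum of squared differences between each model and the data it was fitted to.
line_sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(levels, mean_deaths))
curve_sse = sum((exact_curve(levels, mean_deaths, x) - y) ** 2
                for x, y in zip(levels, mean_deaths))

print(f"Line misses the data by  {line_sse:.2f} (squared error)")
print(f"Cubic misses the data by {curve_sse:.2f} (squared error)")
print(f"Level 3, line:  {intercept + slope * 3:.2f}")
print(f"Level 3, cubic: {exact_curve(levels, mean_deaths, 3.0):.2f}")
```

The cubic hits every one of these made-up points exactly, while the line misses each by a little. But the cubic needs twice as many parameters to describe only four points, and the two models disagree about level three, so fitting the existing data perfectly doesn't settle which prediction to believe.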

So, when you make a model to pull out the gist of data, there's a trade-off between making it realistic enough for trustworthy predictions and simple enough to be useful. Statistics involves using numbers at exactly the right level of pretendness.

Wednesday, 18 January 2017
