Saturday, February 6, 2010

The Bell Curve 4 - Correlation and Regression

These terms may look formidable, but they're actually simple: common sense set to mathematics, if you will. In the diagram in part 3, we see that high school seniors' heights are distributed in a pattern - not a perfect one, but a pattern all the same.

Now, says the book, have the guys line up on the gym floor in columns by height, and in rows by weight, and if you were up in the rafters, you would see a pattern like this one. You'll see at once that there's a relationship between height and weight: the shorter guys tend to be lighter than the taller guys. In statistics, this is called correlation. It's a central concept in the field, and especially important for the purposes of the book, because the authors do a great deal of correlating various data with general intelligence.
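For anyone who wants to see the arithmetic behind the word, here is a minimal sketch of the usual correlation measure (the Pearson coefficient). The ten heights and weights are invented for illustration; they are not the book's data, and the function name is my own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the two means (how the variables move together).
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations for each variable on its own.
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(dev_x * dev_y)

# Invented data (assumption): heights in inches, weights in pounds, ten boys.
heights = [64, 66, 67, 68, 69, 70, 71, 72, 74, 75]
weights = [120, 135, 150, 140, 155, 160, 150, 175, 180, 200]

r = pearson_r(heights, weights)
print(f"correlation between height and weight: {r:.2f}")
```

A result near +1 means the taller boys are almost always the heavier ones, a result near 0 means no straight-line relationship, and a negative result would mean the taller boys tend to be lighter.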

(We could add a third measurement -- waist size -- to the two we have, and have a three-dimensional graph, where height is the first "input variable" x, weight is the second "input variable" y, and waist size is the "output variable" z. And, believe it or not, one can make a pretty good estimate of z, given x and y, IF one has a large enough sample. When doing things like this, the sample size matters a great deal.)
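One common way to make that kind of estimate (not necessarily the one the authors use) is multiple linear regression by least squares: find the flat plane z = a + b*x + c*y that comes closest to the data. The sketch below is only an illustration. The population is simulated, and every number in it (the means, the spreads, the "true" coefficients 0.25 and 0.07) is an assumption of mine, not a measurement.

```python
import random

def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest entry in this column, for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_plane(xs, ys, zs):
    """Least-squares fit of z = a + b*x + c*y; returns [a, b, c]."""
    rows = [(1.0, x, y) for x, y in zip(xs, ys)]
    # Normal equations: (A^T A) coeffs = A^T z
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atz = [sum(r[i] * z for r, z in zip(rows, zs)) for i in range(3)]
    return solve_linear_system(ata, atz)

# Simulated boys (all parameters are assumptions for illustration).
rng = random.Random(4)
heights, weights, waists = [], [], []
for _ in range(2000):
    h = rng.gauss(70, 3)                                # inches
    w = 150 + 5 * (h - 70) + rng.gauss(0, 20)           # pounds
    waist = 5 + 0.25 * h + 0.07 * w + rng.gauss(0, 1)   # inches
    heights.append(h)
    weights.append(w)
    waists.append(waist)

a, b, c = fit_plane(heights, weights, waists)
print(f"estimated waist = {a:.2f} + {b:.3f} * height + {c:.3f} * weight")
```

With 2,000 simulated boys the fitted b and c should land close to the values used to build the data. Rerun it with `range(20)` instead of `range(2000)` and the estimates wander much more, which is the point about sample size.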

But to go back to this graph: if one calculates the mean and standard deviation of both height and weight, and redraws the graph in terms of standard deviations from the mean (rather than the raw data), one gets a new picture of the data. On that picture one can draw what's called the regression line, or "best fit line," which is a picture of the mathematical relationship between height and weight, each of which, be it noted, is expressed in its own terms (remember the elephants and cats).
This means that if you look at the distribution of the guys' weights at the mean height (again, assuming a large sample), you can make some solid statements about how likely it is, for instance, that a guy of mean height will fall two standard deviations below the mean weight (in this sample, nobody does).
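The redrawing step can be sketched in a few lines: convert each measurement to "standard deviations from the mean," then fit a least-squares line through the converted points. I'm assuming "best fit" means ordinary least squares, which is the usual meaning. The data are invented, and the helper names are mine.

```python
import math

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    """Population standard deviation (divides by n, not n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def z_scores(values):
    """Re-express each value as standard deviations above or below the mean."""
    m, s = mean(values), std_dev(values)
    return [(v - m) / s for v in values]

def best_fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Invented data (assumption): heights in inches, weights in pounds.
heights = [64, 66, 67, 68, 69, 70, 71, 72, 74, 75]
weights = [120, 135, 150, 140, 155, 160, 150, 175, 180, 200]

slope_z, intercept_z = best_fit_line(z_scores(heights), z_scores(weights))

def predict_weight(height):
    """Predict weight in pounds by going through standardized units and back."""
    z_height = (height - mean(heights)) / std_dev(heights)
    z_weight = intercept_z + slope_z * z_height
    return mean(weights) + z_weight * std_dev(weights)

print(f"slope in standardized units: {slope_z:.3f}")
print(f"predicted weight at 72 inches: {predict_weight(72):.1f} lb")
```

In standardized units the slope of the line is the correlation coefficient itself, and the line passes through (0, 0), that is, through mean height and mean weight. That is why a boy of mean height is predicted to have mean weight.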
From here I let the authors speak. This is from pp. 586-587 of the book, Appendix I, "Statistics for People Who Are Sure They Can't Learn Statistics." I'm quoting them because they say things better than I can.
"1. Notice the many exceptions. There is a statistically substantial relationship between height and weight, but, visually, the exceptions seem to dominate. So too with virtually all statistical relationships in the social sciences, most of which are much weaker than this one.

"2. Linear relationships don't always seem to fit very well. The best-fit line looks as though it is too shallow. [my note: a horizontal best fit line means, mathematically, no correlation between x and y.] Look at the tall boys, and see how consistently it [the line] underpredicts how much they weigh. Given the information in the diagram, this might be an optical illusion -- many of the dots in the dense part of the range are on top of each other, as it were, and thus it is impossible to grasp visually how the errors are adding up -- but it could also be that the relationship between height and weight is not linear.
"3. Small samples have individual anomalies. Before we jump to the conclusion that the straight line is not a good representation of the relationship, remember that this sample consists of only 250 boys. An anomaly of this particular small sample is that one of the boys in the sample of 250 weight 250 pounds. Eighteen-year-old boys are very rarely that heavy, judging from the entire NLSY [explained later] sample, fewer than one per 1,000. And yet one of those rarities happened to be picked up in a sample of 250. That's the way samples work.
[My note: and one of the reasons people go to garage and estate sales and show up on "Antiques Roadshow."]
"4. But small samples are also surprisingly accurate, despite their individual anomalies. The relationship between height and weight shown by the sample of 250 18-year-old males is identical to the third decimal place with the relationship among all 6,068 males in the NLSY sample. This is closer than we have any right to expect, but other random samples of only 250 generally produce correlations that are within a few hundredths of the one produced by the larger sample. (There are mathematics for figuring out what "generally" and "within a few hundredths" mean, but we needn't worry about them here.)"
So anyway: The Bell Curve, subtitled "Intelligence and Class Structure in American Life," is built on a great deal of mathematical analysis of numerically measurable factors about people. I hope I have shown how some of that analysis works. I think one can appreciate the book much better with an understanding of the tools Herrnstein and Murray used to get their results and reach their conclusions.
All that said, I want to go on record that the numerically measurable factors about any human being are not, repeat not, the most important things about that person. Thinking they are is the fallacy that agnostics and atheists fall into. We believers know better. Science is great in its place, but it can't explain everything.
