Mother Jones blogger Kevin Drum has a new post entitled Here’s Why the Black-White IQ Gap is Almost Certainly Environmental. He offers eight reasons why he thinks the differences between Blacks and Whites in average IQ are environmental. None of the reasons he offers supports his case; all of them illustrate how little he understands this issue and how weak a critical thinker he must be. I will address each of these points.
Two Warnings
Drum begins by admitting that there is a Black-White IQ gap and that nobody thinks otherwise. This is the case. For as long as it has been recorded, the Black-White IQ gap has been roughly one standard deviation (Cohen’s d ≈ 1). The following graph from this preprint makes this clear:
Drum then mentions that the gap is not the result of test bias or construction and that the gap really does exist. This is an extremely important point which, unbeknownst to him, invalidates much of the rest of his argument. In order to even make this statement, we have to show that the tests are measurement invariant with respect to the construct they assess (which cannot be directly observed), meaning that the test assesses the same construct in both groups and that it does so equally well. This condition was stated in 1989 by Mellenbergh as that in which

$$f(Y \mid \eta, s) = f(Y \mid \eta) \tag{1}$$
where Y and η are the observed and factor scores, respectively, and s is the selection variable for group membership. In other words, given some level of a factor (or latent ability) score, η, the level of the observed score, Y, is not affected by one’s group, s: an IQ of 70 is an IQ of 70 in both Blacks and Whites, and so on. Another way to state this is in terms of a regression equation whereby
Note that the ε for the initial and group-specific equations is not set to be equivalent because a score may be unbiased even with residual differences. In the case where the residuals are equivalent, the error in measuring the factor scores or latent variables in each group is equivalent and a formulation of measurement invariance called strict factorial invariance is achieved. The theoretical implications of strict factorial invariance are substantial and will be touched on below. For now, note that Eq 2 implies that unbiased scores are a regression of observed on factor scores in which slopes (regression weights or factor loadings) and intercepts (levels) are equal in both groups. If the slopes of the regression of observed on latent scores are unequal, then we encounter nonuniform bias whereby different levels of the observed scores equate to different levels and perhaps types of the latent abilities by group and thus the IQ scores cannot be compared and the meaning of the tests cannot be assumed to be the same by group. Intercept bias is a form of uniform bias where the level alone cannot be interpreted the same way by group, so a score of, say, 70 in one group may be equal to 70 in the latent ability whereas in another it may be equivalent to 80. It is possible to simultaneously have both forms of bias, but, as Arthur Jensen noted in his 1980 Bias in Mental Testing, it is almost never the case that natives of the same country regardless of race have noninvariant (biased) test parameters. The difference in the means is just that — a difference in the means. Visually oriented individuals can play around with this shiny applet in order to better understand test bias.
A deeper theoretical understanding of test bias than this is necessary to fully comprehend why exactly this concept disqualifies so much of what Drum believes. As implied above, the factor model is a linear regression model relating observed scores on a given set of items or subscales to a smaller number of factors or latent variables, which are theoretical constructs constituted by the shared variance of a set of test items. If we have i = 1, …, I observed scores, Y, measuring l = 1, …, L factors and we suppose further that the total sample consists of j = 1, …, J subjects belonging to one of s = 1, …, S groups with I = 10, L = 2, J = 500, and S = 2, we have a test with 10 items/subscales, measuring two factors (latent variables), in a sample of 500 test takers split into two groups. The within-group model for person j on item i is

$$y_{ij} = \nu_i + \sum_{l=1}^{L} \lambda_{il}\,\eta_{lj} + \varepsilon_{ij} \tag{3}$$
The observed score y is the sum of the regression intercept, ν, the scores on the factors, η, multiplied by the corresponding slopes (again, factor loadings), λ, and a residual, ε. The intercepts and factor loadings in this model are the same for everyone but are able to differ across items, which is why the intercept and factor loading have no subscript j, instead having the item subscript i. The factor score (henceforth latent score) η is subject-specific and so carries the subscript j, with the subscript l indicating which of the L factors is being referred to. A latent score (for, say, mathematics or visuospatial ability) does not vary across items. The residual includes error specific to items and individuals and thus has the subscripts ij. The multiple group model is fitted to the means and covariances of the observed items instead of the raw scores; to obtain the mean of some item, Yᵢ, within some group, scores are averaged across the group’s subjects. With the residual mean assumed to be zero, the mean of item Yᵢ is
with ξ being the expected mean value. It is more common to denote observed mean item scores μ and factor means α. Hence,

$$\mu_i = \nu_i + \sum_{l=1}^{L} \lambda_{il}\,\alpha_l \tag{5}$$
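And for I items/subscales we have I equations

$$\begin{aligned} \mu_1 &= \nu_1 + \sum_{l=1}^{L} \lambda_{1l}\,\alpha_l \\ &\;\;\vdots \\ \mu_I &= \nu_I + \sum_{l=1}^{L} \lambda_{Il}\,\alpha_l \end{aligned} \tag{6}$$

described in matrix notation as

$$\mu_s = \nu_s + \Lambda_s \alpha_s \tag{7}$$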
where the subscript s indexes the groups, s = 1, …, S; μ and ν are I × 1 vectors, Λ is I × L, and α is L × 1. Assume residuals that are uncorrelated with the latent abilities and with one another, and intercepts that are constant and have zero covariance with the factors and residuals. Then the covariances of the observed scores, Y, equal the factor covariances pre- and post-multiplied by the corresponding factor loadings (slopes), plus the residual variances. Using factor notation for a covariance matrix of the items, Σ, a covariance matrix of latent ability scores, Ψ, and a diagonal matrix of residual variances, Θ, the equation for variances and covariances in an arbitrary group s is

$$\Sigma_s = \Lambda_s \Psi_s \Lambda_s^{\top} + \Theta_s \tag{8}$$
The equations for the means and covariances are both derived from Eq 3: Eq 7 describes the means of the item scores in terms of the latent variable means, whereas Eq 8 describes the covariances of the items in terms of the covariances of the latent abilities. The factor loadings denoted by Λ are identical in the models for means (Eq 7) and covariances (Eq 8). A multiple group model comprises both equations, modeling the regression intercepts, factor loadings, means, covariances, and residual variances in order to impose a specific structure on the means of the observed scores and their covariances; each parameter can be restricted to take the same value in each group (to be invariant) or to take a specific value. It’s through comparing the fits of models with more or fewer restrictions that we can assess whether these restrictions are tenable and the models fit comparably, meaning that the mean differences we observe are due to differences in the same constructs. When such a model is fitted to the means and covariances of several groups, the means (between-group differences) and covariances (within-group differences) can be compared in a single analysis.
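As a minimal sketch of how the mean and covariance equations share one loading matrix, the following pure-Python snippet computes model-implied moments for two groups. All of the numbers (three items, one factor, the intercepts, loadings, residual variances, and a one-unit factor mean difference) are hypothetical values chosen for illustration, not estimates from any dataset.

```python
# Model-implied means and covariances of a common factor model.
# All parameter values below are hypothetical, for illustration only.

def implied_means(nu, lam, alpha):
    """mu = nu + Lambda * alpha, with Lambda given as a list of rows."""
    return [nu[i] + sum(lam[i][l] * alpha[l] for l in range(len(alpha)))
            for i in range(len(nu))]

def implied_cov(lam, psi, theta):
    """Sigma = Lambda * Psi * Lambda' + diag(theta)."""
    n_items, n_factors = len(lam), len(psi)
    sigma = [[0.0] * n_items for _ in range(n_items)]
    for a in range(n_items):
        for b in range(n_items):
            sigma[a][b] = sum(lam[a][k] * psi[k][m] * lam[b][m]
                              for k in range(n_factors)
                              for m in range(n_factors))
            if a == b:
                sigma[a][b] += theta[a]  # residual variance on the diagonal
    return sigma

# Shared (invariant) measurement parameters: 3 items, 1 factor.
nu = [10.0, 12.0, 8.0]           # intercepts
lam = [[0.8], [0.6], [0.4]]      # factor loadings
theta = [0.36, 0.64, 0.84]       # residual variances
psi = [[1.0]]                    # factor variance

# Only the factor mean differs between the two hypothetical groups.
mu_a = implied_means(nu, lam, [0.0])
mu_b = implied_means(nu, lam, [-1.0])
mean_diff = [a - b for a, b in zip(mu_a, mu_b)]
sigma = implied_cov(lam, psi, theta)

print(mean_diff)
print(sigma)
```

Under these assumed values the implied item mean differences are proportional to the loadings, which is the constraint an invariant model places on between-group differences.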
Measurement invariance (the state in which items measure the same things in both groups equally well, i.e., unbiasedness) is a statement that, for some level of a given ability, the probability of answering a given question correctly does not depend on one’s group membership. There are a few steps, each nested within the preceding step and all clearly implied above, which are used to assess whether this condition is met. The first step is configural invariance, in which a model with the same pattern of indicator variables and of constrained and estimated parameters is fitted in both groups; the coefficients may differ in this model, but the price is that this level does not let us interpret group mean differences. The next step is weak or metric invariance, in which the factor loadings are constrained to equality across groups. This step only allows us to state that the slopes of the observed on the latent scores are the same by group and thus that the sampled populations attribute the same meanings to the latent constructs being studied. At this stage we may compare the latent variances and covariances, but still not the means. The next step is called strong or scalar invariance and adds a constraint on the intercepts whilst allowing the factor means to differ by group, finally allowing us to compare the group means in addition to the variances and covariances. The next level is known as strict factorial invariance (SFI) and adds that, for a given indicator, the residual variance is constrained to be equal across groups.
This level is not necessary to make group comparisons, but if it holds, it follows that the scale for the construct being measured is equally reliable in both groups, that there’s also measurement invariance with respect to unmeasured variables, the residual variances don’t mask differences in residual means, and, most importantly, that the sources of between-group variation in the constructs being measured are a subset of the within-group sources of variation. (An additional level rarely assessed, called homogeneity of latent variances, constrains the variances of the latent variables to be equal in both groups, assessing whether they used the same ranges of the latent constructs in question; this level is necessary to obviate any issues with the predictive bias of the constructs by group generated by specificity and predictive value being higher or lower due to group-specific variance limitations.) The measurement invariant model (MI with SFI) is represented as
In contrast to equations 7 and 8, equations 9–11 allow only the factor means and covariances to differ by group; the intercepts, loadings, and residual variances carry no group subscript. The lack of an effect of group is MI. As mentioned above, with MI, the factors affecting the tests are the same in both groups and the background factors can be interpreted in the same way. As such, we can interpret the effects of variables such as socioeconomic status, education, and discrimination the same way in both groups when MI is said to hold. Additionally, group-specific influences on test performance such as stereotype threat, unequal Flynn effects, race-related anxiety and nervousness from having an other-race invigilator, or racial discrimination will generate measurement non-invariance. When MI holds, there is no bias. Under MI, things like unequal access to knowledge are related to the level of ability, not group membership, so if a test score depends on knowing some factoid like vocabulary words or mathematical equations, this difference between groups is not due to unequal access to the words or equations as a result of being in a group, but due to those groups having different levels of ability which predispose different levels of learning; altering this level of knowledge would produce non-invariance in the scores or cause gains related to different latent abilities than in the initial arrangement. Another way to state this is that, when SFI holds, the groups compared cannot be thought of as two sets of identical seeds raised in pots of differing quality (bookish readers will recognize this as a reference to the thought experiment known as Lewontin’s Seeds or X-Factors). Lubke et al. write:
Suppose observed mean differences between groups are due to entirely different factors than those that account for the individual differences within a group. The notion of ‘‘different factors’’ as opposed to ‘‘same factors’’ implies that the relation of observed variables and underlying factors is different in the model for the means as compared with the model for the covariances, that is, the pattern of factor loadings is different for the two parts of the model. If the loadings were the same, the factors would have the same interpretation. In terms of the multigroup model, different loadings imply that the matrix Λ in [Eq 10] differs from the matrix Λ in [Eq 11] (or [Eqs 7 and 8]). However, this is not the case in the MI model. Mean differences are modeled with the same loadings as the covariances. Hence, this model is inconsistent with a situation in which between-group differences are due to entirely different factors than within-group differences. In practice, the MI model would not be expected to fit because the observed mean differences cannot be reproduced by the product of α and the matrix of loadings, which are used to model the observed covariances. Consider a variation of the widely cited thought experiment provided by Lewontin (1974), in which between-group differences are in fact due to entirely different factors than individual differences within a group. The experiment is set up as follows. Seeds that vary with respect to the genetic make-up responsible for plant growth are randomly divided into two parts. Hence, there are no mean differences with respect to the genetic quality between the two parts, but there are individual differences within each part. One part is then sown in soil of high quality, whereas the other seeds are grown under poor conditions. Differences in growth are measured with variables such as height, weight, etc. 
Differences between groups in these variables are due to soil quality, while within-group differences are due to differences in genes. If an MI model were fitted to data from such an experiment, it would be very likely rejected for the following reason. Consider between-group differences first. The outcome variables (e.g., height and weight of the plants, etc.) are related in a specific way to the soil quality, which causes the mean differences between the two parts. Say that soil quality is especially important for the height of the plant. In the model, this would correspond to a high factor loading. Now consider the within-group differences. The relation of the same outcome variables to an underlying genetic factor are very likely to be different. For instance, the genetic variation within each of the two parts may be especially pronounced with respect to weight-related genes, causing weight to be the observed variable that is most strongly related to the underlying factor. The point is that a soil quality factor would have different factor loadings than a genetic factor, which means that [Eqs 10 and 11] cannot hold simultaneously. The MI model would be rejected.
In the second scenario, the within-factors are a subset of the between-factors. For instance, a verbal test is taken in two groups from neighborhoods that differ with respect to SES. Suppose further that the observed mean differences are partially due to differences in SES. Within groups, SES does not play a role since each of the groups is homogeneous with respect to SES. Hence, in the model for the covariances, we have only a single factor, which is interpreted in terms of verbal ability. To explain the between-group differences, we would need two factors, verbal ability and SES. This is inconsistent with the MI model because, again, in that model the matrix of factor loadings has to be the same for the mean and the covariance model. This excludes a situation in which loadings are zero in the covariance model and nonzero in the mean model.
As a last example, consider the opposite case where the between-factors are a subset of the within-factors. For instance, an IQ test measuring three factors is administered in two groups and the groups differ only with respect to two of the factors. As mentioned above, this case is consistent with the MI model. The covariances within each group result in a three-factor model. As a consequence of fitting a three-factor model, the vector with factor means, α in [Eq 10], contains three elements. However, only two of the element corresponding to the factors with mean group differences are nonzero. The remaining element is zero. In practice, the hypothesis that an element of α is zero can be investigated by inspecting the associated standard error or by a likelihood ratio test
In summary, the MI model is a suitable tool to investigate whether within- and between-group differences are due to the same factors. The model is likely to be rejected if the two types of differences are due to entirely different factors or if there are additional factors affecting between-group differences. Testing the hypothesis that only some of the within factors explain all between differences is straightforward. Tenability of the MI model provides evidence that measurement bias is absent and that, consequently, within- and between-group differences are due to factors with the same conceptual interpretation.
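The seeds argument in the quoted passage can be illustrated with a rough numerical sketch. It assumes a single factor and uses unweighted least squares as a stand-in for the maximum likelihood fitting used in practice; the loadings and both mean-difference vectors are hypothetical.

```python
# Can a vector of item mean differences be reproduced as loadings * alpha?
# Single-factor case, unweighted least squares; all values are hypothetical.

def best_alpha(delta_mu, lam):
    """Least-squares factor mean difference for one factor."""
    return sum(d * l for d, l in zip(delta_mu, lam)) / sum(l * l for l in lam)

def misfit(delta_mu, lam):
    """Sum of squared mean differences left unexplained by the factor."""
    alpha = best_alpha(delta_mu, lam)
    return sum((d - alpha * l) ** 2 for d, l in zip(delta_mu, lam))

# Within-group loadings of three outcome variables on the within-group factor.
lam = [0.8, 0.6, 0.4]

# Case 1: the groups differ on that same factor (by one unit).
same_factor_diff = [1.0 * l for l in lam]

# Case 2: the groups differ because of a separate cause with its own pattern
# of effects (the "soil quality" situation in the thought experiment).
other_cause_diff = [0.2, 0.9, 0.1]

misfit_same = misfit(same_factor_diff, lam)
misfit_other = misfit(other_cause_diff, lam)
print(misfit_same, misfit_other)
```

In the first case the mean differences are reproduced exactly by the within-group loadings; in the second they are not, which is the kind of misfit that would lead to rejection of the invariant model.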
Just as in 1980, practically all modern assessments of MI find invariance with respect to race/ethnicity for natives within the same country, including, notably, two recent studies which assessed invariance with respect to latent variances (here and here). With this background out of the way, readers can better understand some of Drum’s errors. The following is a point-by-point response to Drum’s reasons to believe the Black-White IQ gap is environmental.
Reason One
Modern humans migrated into Europe about 40,000 years ago. That’s a very short time for selection pressures to produce a significant increase in a complex trait like intelligence, which we know to be controlled by hundreds of different genes. Even 100,000 years is a short time. It’s not impossible to see substantial genetic changes that fast, but it’s unlikely.
This is patently absurd. The fact that humans migrated into anywhere at any point in time is immaterial to the question of whether adaptation could occur within that amount of time (it doesn’t matter if his dates are wrong or right). Why we should assume 100,000 years is too little time for adaptive differences to emerge is unexplained and also untrue, given that directional selection is exponential with time-invariant effects of a given gene on fitness. That is, it only takes twice as long for a variant with a 5% fitness benefit to sweep through a population of 10,000 as a population of 100. Given that there is evidence of substantial historical selection for IQ-related variants (for example here) and more recently against them (for example here and here), the implication that selection is unlikely or must be small may be dismissed. Examples from other traits make this obvious as well. There is simply no basis for Drum’s statement.
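The population-size comparison above can be checked with a back-of-the-envelope calculation. This sketch assumes the standard deterministic (logistic) approximation for a beneficial allele under genic selection, rising from a frequency of 1/(2N) to 1 − 1/(2N); it ignores drift and the chance that a new variant is lost early, so the numbers are only indicative.

```python
import math

def sweep_generations(pop_size, s):
    """Approximate generations for an allele with selective advantage s to
    rise from frequency 1/(2N) to 1 - 1/(2N) under logistic growth of the
    odds p/(1-p), which increase by a factor of about exp(s) per generation."""
    copies = 2 * pop_size
    return 2.0 * math.log(copies - 1) / s

t_small = sweep_generations(100, 0.05)
t_large = sweep_generations(10_000, 0.05)
ratio = t_large / t_small

print(round(t_small), round(t_large), round(ratio, 2))
```

Because the sweep time grows with the logarithm of population size, a hundredfold increase in N less than doubles the time under these assumptions.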
An additional implication in this Reason is that large changes cannot occur because of the polygenicity of the trait in question. This is another patent absurdity with no substantive basis in any sort of fact. Selection in animal models has, for ages, been given by the Breeder’s equation

$$\Delta Z = h^2 S$$
where Z is the mean of the trait in the population, h² is the narrow-sense (additive) heritability, and S is the selection differential (incorporating epistasis and dominance can lead to incremental gains, but these are usually minor). Simple truncation and differential fertility (like that observed in Gregory Clark’s 2008 A Farewell to Alms or his forthcoming For Whom the Bell Curve Tolls and a litany of papers produced alongside Neil Cummins) would be enough to produce large population differences if selection pressures or variant availability (linked to the mutation rate, which is higher outside of Africa, and also to the population size, which increased faster outside of Africa due to the adoption of agriculture and productivity-enhancing innovations) differed between populations, even over a short period. The fact that polygenicity is not a limit to selection has been well understood since at least Student’s (William Sealy Gosset’s) 1933 essay on the implications of the Lanarkshire Milk Experiment. There is no reason to think polygenicity should limit selection, especially when, as implied by purely linear regression to the mean in human populations for this trait (intelligence), the variance is additive (or acts as if it is). Selection is on the trait and variation in the trait will be produced by selection on it; why variation should be limited by the simplicity or complexity of its architecture given its additivity is uncertain.
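As a worked illustration of the breeder’s equation, the sketch below accumulates the per-generation response ΔZ = h²S over several generations. The heritability, selection differential, and number of generations are hypothetical inputs, and holding h² and S constant across generations is a simplifying assumption rather than a claim about any real population.

```python
def response_per_generation(h2, selection_differential):
    """Breeder's equation: change in the trait mean per generation."""
    return h2 * selection_differential

def cumulative_response(h2, selection_differential, generations):
    """Total change in the mean, assuming h2 and S stay constant."""
    return generations * response_per_generation(h2, selection_differential)

# Hypothetical inputs: h2 = 0.5, S = 0.1 SD per generation, 20 generations.
per_gen = response_per_generation(0.5, 0.1)
total = cumulative_response(0.5, 0.1, 20)

print(per_gen, total)
```

Nothing in the calculation refers to the number of loci involved; the response depends only on the additive heritability and the selection differential.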
Even with this said, all of the differences between groups could be produced by (nearly-)neutral processes such as drift and founder effects. Given that we know selection has occurred in various populations, we may not yet say that this has been differential (without equal-validity polygenic scores for different populations; see here and here), but we can certainly say that selection has acted, acted in large amounts, and that there are cross-population differences in putatively causal variants for our trait of interest (see here).
This Reason cannot help Drum to justify a non-genetic gap by any means. It is confused in the first place and irrelevant in the second.
Reason Two
Speaking very generally, recent research suggests that intelligence is about two-thirds biological and one-third environmental. That amount of environmental influence is more than enough to account for the black-white IQ gap.
Here Drum makes several more errors. For one, the heritability and environmentality (which are the same by race) of the latent traits are substantially higher than the full-score heritabilities. This is almost tautological because these are contaminated by measurement error which attenuates the similarity of the scores by sibship. This is true not just for intelligence, but for all behavioural traits (which are actually represented by latent variables), such as, for example, attention problems (this problem also obviously applies to predictive validity). The heritability of general intelligence in proper latent variable models is routinely extremely high (86% per Panizzon et al. in 2014).
With that stated, the heritability of intelligence increases to an asymptote of around 80% by adulthood, a phenomenon known as the Wilson Effect. This leaves precious little environmental influence to explain the between-group differences. What’s more, it is routinely found that there is no shared environmental component to adult variation in intelligence (this phenomenon appears in virtual twins as well). Since shared environmentality is environmentality with a systematic effect and unshared or unique environmentality is environmentality with a non-systematic effect (random with respect to siblings), a heritability of any value and a lack of shared environmentality implies a between-group heritability of, in truth, 100%, unless this unshared environmentality is systematic with respect to group, but not to siblings, somehow. Effects of common socioeconomic status controls are noncausal with respect to ability or differential regression by race (see here and here for example) so these too may be discounted as explanatory within the small remaining environmental component of trait variation (usually believed to be mostly measurement error and stochastic variation, which is likewise not systematic with respect to race). What environmental effects there are must be somehow systematic with respect to race, unsystematic with respect to siblings, and random with respect to their effects in general (for example here).
It’s implicit in Drum’s statement that he understands trait values are not independent of variance components. This is true. In 1998, Jensen provided us with the following formula to estimate how large the group gap in environmental quality, X, must be in order to explain the differences, d = 1, with heritability, h², in terms only of environments

$$X = \frac{d}{\sqrt{1 - h^2}}$$
Given a heritability of 80%, the difference in environmental quality must be 2.24 d, or an overlap of 26.27%, in order to explain the group gaps. For socioeconomic status (which we must assume is causal for the same things as race-related cognitive gaps, though the evidence says otherwise), this is simply contrary to the evidence. For example, in the National Education Longitudinal Study of 1988 (NELS), the difference is 0.625 d, similar to the 0.771 in the Early Childhood Longitudinal Study-Kindergarten Class of 1998–99 (ECLS-K), 0.606 in the Education Longitudinal Study of 2002 (ELS), 0.604 in the High School Longitudinal Study of 2009 (HSLS) (Warne, personal communication, 2018), or similar figures in the NLSY79, NLSY97, and National Collaborative Perinatal Project (NCPP). Based on the formula provided by Schneider (here) we may also imagine the environmental deficit as multiple variables with smaller gaps that make up a cumulative deficit of the necessary size. We must remember, however, that to be consistent with MI, these influences must be common to both groups, so we cannot include influences such as discrimination which affect Blacks but not Whites.
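The two figures quoted above can be reproduced with a short calculation. This sketch assumes the formula takes the form X = d / √(1 − h²), which matches the 2.24 figure, and that “overlap” means the overlapping coefficient of two equal-variance normal distributions, 2Φ(−X/2); both readings are assumptions on my part.

```python
import math

def required_env_gap(d, h2):
    """Environmental gap (in SD units of the environmental component) needed
    to produce a phenotypic gap d if only environment differs between groups."""
    return d / math.sqrt(1.0 - h2)

def normal_overlap(gap):
    """Overlapping coefficient of two unit-variance normals whose means
    differ by `gap`: 2 * Phi(-gap / 2), written with erfc."""
    return math.erfc(gap / (2.0 * math.sqrt(2.0)))

gap = required_env_gap(1.0, 0.8)
overlap = normal_overlap(round(gap, 2))  # the text rounds the gap to 2.24

print(round(gap, 2), round(100 * overlap, 2))
```

The overlap is computed from the rounded gap of 2.24 because that is what reproduces the quoted percentage; using the unrounded gap gives a slightly larger overlap.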
If we assume that the unshared environment is random noise and the heritability is common, then the formula for the expected group environmental difference is given by
where C² is the variance component for shared environmentality, which we know to be essentially zero for intelligence in adulthood. With this, the required environmental differences are very large, perhaps implausibly so, especially considering the decline over time in socioeconomic differences (which index general material inequalities) in variables like education and income. Plotting education against cohort-invariant Wordsum scores from the GSS, we get
Another issue is whether environmental differences affect the same latent abilities which constitute racial differences. The Black-White differences in ability are concentrated on the general factor of tests, a phenomenon known as Spearman’s hypothesis (this has led some to comment that “the black-white factor is g”). It is commonly found that environmental influences are not related to this general factor, however. This includes education (see also here for schooling and here for literacy), socioeconomic status, lead and other neurotoxin exposure, and even adoption (see also here). The implied influences of variables such as family size, father’s occupation, physical possessions in the home, parental personality, and other similar influences on intelligence measures are practically zero. The question thus becomes: What environmental influences are “more than enough” to account for the Black-White IQ gap? At present, there are no known influences capable of doing this, nor is it particularly plausible that the gaps in the unshared environment (which must be systematic by race but not sibship and thus exist in a way inconsistent with familial environmental transmission) are large enough to explain the race gaps. Lackadaisically picking a representative study, we can assess the relationship between g and variance components, which is as follows:
The Jensen effect (when something relates to g) is modest to strongly positive in its relation to heritability, but for shared environmentality it was weakly to modestly positive, and for unshared environmentality it was strongly negative. For the total environmentality, it was of course the inverse of the heritability. This is a very common finding. How the racial differences could be accounted for by a component (unshared environment) when it doesn’t affect the racial differences is beyond me and I am guessing beyond Kevin Drum.
(One possible but thus far unevidenced explanation is that the Scarr-Rowe effect affects g and is large enough by race to explain the gaps. Given that it is seemingly nonexistent by race, and the form of it whereby it represents moderation of the unshared environment doesn’t seem to make this variance component relate to g, little confidence in this idea is deserved. There is a forthcoming presentation at this year’s ISIR conference by Michael Anthony Woodley of Menie Yr. which deals with the relationship between the Scarr-Rowe and Jensen effects; depending on whether this result is a positive, null, or negative relationship, this explanation may bear on the differences and the idea that socioeconomic status relates to them at all.)
On this point, Drum does not appear to have a legitimate reason to think the Black-White gap is environmental.
Reason Three
There’s a famous result in intelligence studies called the Flynn Effect. What it tells us is that average IQs rose about 3 points per decade throughout the 20th century. That’s roughly 20 points of IQ throughout the entire period, and it’s obvious that this couldn’t have been caused by genes. It’s 100 percent environmental. This is clear evidence that environmental factors are quite powerful and can easily account for very large IQ differences over a very short period of time.
Nothing that changes over a period of decades or centuries can be caused by changes in genes. At a minimum, it takes thousands of years for genetic changes to spread throughout a population.
On this point, Drum is confused about the Flynn effect. Drum effectively engages in a faulty syllogism; per Robert C. Nichols, this goes:
- We do not know what causes the test score changes over time.
- We do not know what causes racial differences in intelligence.
- Since both causes are unknown, they must, therefore, be the same.
- Since the unknown cause of changes over time cannot be genetic, it must be environmental.
- Therefore, racial differences in intelligence are environmental in origin.
The issue here is that the Flynn effect is simultaneously not constrained to the g factor like racial differences are (see here and here) and it is not measurement invariant (see here, for example). Wicherts et al. state:
More importantly, in both B–W studies, it is concluded that the measurement invariance between Blacks and Whites is tenable because the lowest AIC values are found with the factorial invariance models (Dolan, 2000; Dolan & Hamaker, 2001). This clearly contrasts with our current findings on the Flynn effect. It appears therefore that the nature of the Flynn effect is qualitatively different from the nature of B–W differences in the United States [emphasis mine]. Each comparison of groups should be investigated separately. IQ gaps between cohorts do not teach us anything about IQ gaps between contemporary groups, except that each IQ gap should not be confused with real (i.e., latent) differences in intelligence. Only after a proper analysis of measurement invariance of these IQ gaps is conducted can anything be concluded concerning true differences between groups.
As such, we have two realizations: The Flynn effect and racial differences are unrelated and the Flynn effect is not about an invariant change in latent abilities, but instead, about bias in test scores over time, whether this is due to new factors such as abstractness (i.e., test-taking ability independent of intelligence; see here and here) which change the slopes of items with respect to latent abilities or due to changes in the intercepts, which imply incommensurable levels. In general, it is found that the Flynn effect is related to bias in both the slopes and intercepts, so the levels and even the abilities assessed by the items in question between different eras cannot be interpreted, even if they can be interpreted commonly within each era. Applying the same method (calculating congruence coefficients; see here) to the Flynn effect gains reported by Must et al. as Gordon did to the Black-White differences in 1985, we arrive at a value of Tucker’s congruence coefficient of 0.75, which is considered dissimilar, and thus the Flynn effect gains cannot be thought of as analogous to g, as the racial differences can (this also applies to the comparisons in this study, where the congruence coefficients for Immigrant Mexican-Native White and Immigrant White-Native White differences are 0.97 and 0.88 respectively).
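For readers unfamiliar with the statistic, Tucker’s congruence coefficient between two vectors x and y is Σxy / √(Σx² · Σy²). The sketch below computes it for made-up vectors; the numbers are hypothetical and are not the subtest gains or loadings from the studies cited.

```python
import math

def congruence(x, y):
    """Tucker's congruence coefficient between two equal-length vectors."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return numerator / denominator

# Hypothetical subtest loadings on a general factor.
g_loadings = [0.8, 0.7, 0.6, 0.5]

# A difference vector proportional to the loadings is perfectly congruent.
proportional = [2.0 * v for v in g_loadings]

# A difference vector with the reverse ordering is less congruent.
reversed_pattern = [0.5, 0.6, 0.7, 0.8]

phi_same = congruence(g_loadings, proportional)
phi_reversed = congruence(g_loadings, reversed_pattern)
print(round(phi_same, 3), round(phi_reversed, 3))
```

Unlike a Pearson correlation, the coefficient is not mean-centred, so vectors of all-positive values tend to score high even when their patterns differ; this is why a value such as 0.75 is read as dissimilar.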
We can go further with this. The Flynn effect is of a practically invariant size by racial subgroup in the United States, so the racial differences have remained the same regardless of IQ gains. The constancy of the Flynn effect implies the environmental causes (if these are indeed the reason for the Flynn effect) are common by race, arguing again against Drum’s interpretation. What’s more, the Flynn effect affects all age groups from the young to the old, is present in pre-school children, and does not accelerate during the school-age years. This is consistent with a genetic perspective, in fact, given that the dominant perspective on the cause of the Flynn effect is life history theory and there appears to be selection in favour of slower life histories. The relationship of the Flynn effect to test bias is important, as a recent study of a Norwegian cohort claims that the Flynn effect can be reclaimed from within-family effects, but it never assessed whether this effect was truly invariant, which we should not expect it to be given the similar secular trends in variance components in the same cohorts.
With this Reason, Drum gains no ground, as the Flynn effect is unrelated to racial differences and thus irrelevant to the existence of the Black-White IQ gap.
Reason Four
The difference in average IQ recorded in different European countries is large: on the order of 10 points or more. The genetic background of all these countries is nearly identical, which means, again, that something related to culture, environment, and education is having a large effect.
Drum’s argument here rests on two assumptions, both of which are problematic:
- That genetic differences arbitrarily considered to be small imply that the phenotypic differences must likewise be small.
- That the test scores mentioned are really different, or are of the size he states, in his data.
The genetic backgrounds of these countries are not “nearly identical,” whatever that means; even if they are by some strange definition, they can all be differentiated. By similar logic, the genetic background of siblings is “nearly identical” (much more so than that of different countries) and as such, the reason for their differences cannot be genetics. But this is untrue (see here and here as examples), and the differences between siblings are typically larger than those seen between countries, despite the differences in environments and genes both being absolutely smaller and genes being the explanatory factor. By the same logic, we shouldn’t expect much of any within-family predictive validity for known genes, since they constitute much smaller differences than those seen between countries or families: eppur si muove. Finally, by this logic we should not expect differences between carriers and non-carriers of monogenic disorders like Huntington’s chorea to be very large, given that they are the result of only single genes and thus as small a genetic difference as one can obtain; yet the effect size is much larger than the between-country differences. Drum’s perspective here necessitates denial of all of these things and it requires general confusion about practically everything in genetics. This is called Lewontin’s fallacy.
The second point is that the test scores given are not known to be invariant. Often it is found that between-country scores are not comparable, even if scores within them are. For example, Täht & Must found metric, but not scalar, invariance for the PISA. Remember, this implies that the items relate to the same latent abilities, but the levels of the items are not immediately interpretable in the same fashion. Newer data shed much more light on this issue and reveal what are more plausibly invariant national differences in IQ, though of similar magnitudes to what Drum states. However, again, the size of the national IQ differences is not an argument against their existence and the contribution of genetics to them any more than the differences between siblings imply a lack of genetic influence.
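To make "metric but not scalar invariance" concrete, here is a minimal sketch built on the linear measurement model described earlier, in which the expected observed score is an intercept plus a loading times the latent score. Every number and name in it is a hypothetical assumption chosen for illustration; none comes from Täht & Must or from PISA.

```python
import math

def expected_observed(intercept, loading, latent):
    """Expected observed score under a linear measurement model."""
    return intercept + loading * latent

# Hypothetical parameters for one test item in two countries.
loading = 0.8                      # equal loadings: metric invariance holds
intercept_a, intercept_b = 10.0, 12.0   # unequal intercepts: scalar invariance fails
latent_mean = 0.0                  # assume identical latent means in both countries

observed_a = expected_observed(intercept_a, loading, latent_mean)
observed_b = expected_observed(intercept_b, loading, latent_mean)
observed_gap = observed_b - observed_a   # nonzero despite equal latent means

# The item responds to the latent ability in the same way in both countries.
slope_a = expected_observed(intercept_a, loading, 1.0) - observed_a
slope_b = expected_observed(intercept_b, loading, 1.0) - observed_b
same_slope = math.isclose(slope_a, slope_b)
```

With these made-up values, the observed means differ by two points even though the latent means are identical. This is why observed between-country levels cannot be compared directly when only metric invariance has been established.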
On this point, Drum is simply confused and inconsistent.
Reason Five
It is very common for marginalized groups to have low IQs. In the early years of the 20th century, for example, the recorded IQs of Italian-Americans, Irish-Americans, Polish-Americans and so forth were very low. This was the case even for IQ scores recorded from the children of immigrants, all of whom were born and educated in the US and were fluent English speakers. These IQ scores weren’t low because of test discrimination (at least not primarily because of that), they were low because marginalized groups often internalize the idea that they aren’t intelligent. However, over the decades, as these groups became accepted as “white,” their IQ scores rose to the average for white Americans.
There is absolutely no basis for Drum’s arguments here. The facts about these differences (similarly discussed by people like Ron Unz and Thomas Sowell) are so badly recorded that I am certain there are no estimates of latent differences between these groups which are invariant (recall that this is required to compare means). Recall also that many of these samples of immigrants were selected not for being representative of their populations, but for diagnostic purposes (see here and here), making them distinctly unrepresentative. What’s more, linguistic bias would have been rife, even during the assimilation process, as ethnolinguistic groupings (Chinatowns, Little Italy, etc.) were common barriers to English learning. Linguistic bias is an obvious source of bias, and even when it doesn’t totally invalidate a test (e.g., I could not take a test written in Hangul, so it would have no validity for me), it can do things like bias the intercepts. The poor quality of the samples, the measurements, and the number of these things make it nigh on impossible to discern putative trends towards convergence amongst past immigrants to the United States (especially if we ascribe a cause when the likeliest explanation is simply assimilation, genetic and otherwise, and especially linguistic). Believing in these things is reading tea leaves. See here for more discussion of this overstated argument.
If it is the case that a group’s test scores are affected by some internalized idea about themselves, the test score would be non-invariant. There is no basis whatever for this effect (internalized unintelligence) Drum discusses, but even if there were, it would constitute measurement bias, not something meaningful about the intelligence of members of the various groups. This is true regardless of whether we’re discussing stereotype threat or test anxiety because the direct effect of these things on test scores represents measurement bias due to their being a distinct entity from the construct the test is intended to measure.
Related to the issue of bias is the issue of whether the latent abilities which supposedly converged were even the same, assuming the scores were unbiased. The Black-White differences are principally a result of g, but whether the deficits of previous White and Asian immigrants were is simply not known. (This account is also of doubtful veracity; if the deficits were very common, Terman certainly would not have been prompted to claim that IQ-based immigration policy would be embarrassed by Asians.) It is more likely than not that the differences between Blacks and Whites were unlike the differences between immigrant Whites and native Whites in the past. This was certainly the case for crime, which is rather simpler to measure. For crime, something like Spearman’s hypothesis for cognitive tests (where the differences are relegated to certain parts of the tests) also held in the past. Moehling & Piehl present some data to this effect. While some European groups did have high rates of criminality, this was primarily due to minor crimes like public drunkenness. Group differences with respect to serious offenses, for Europeans regardless of nativity, were comparatively minor. The authors find that the opposite case is true for immigrants from Mexico. Assimilation, whether it’s for crime or intelligence, cannot be understood without reference to the specifics of what happens.
Drum’s Reason here cannot offer him anything but the most speculative support for his idea, and even then, the exact opposite inference can be made with as much or more validity given the other supporting lines of conjectural evidence and arguments from things like immigrant selection and bias in whatever their forms.
Reason Six
The same thing has happened elsewhere. In the middle part of the 20th century, the Irish famously had average IQ scores that were similar to those of American blacks — despite the fact that they’re genetically barely distinguishable from the British. However, as Ireland became richer and the Irish themselves became less marginalized, their IQ scores rose. Today their scores are pretty average.
This is similar to Reason Five and most of the argument against that line can be recycled for this. It is worth noting again that Drum resorts to Lewontin’s fallacy and even fails to understand the current or historical national IQs of the Irish, with the national IQ currently listed as 94–97. Stating that the IQs of the Irish have changed is entirely uninformative with respect to Black-White differences and even the sources of earlier differences between the Irish and British or any other group.
Drum’s Reason here says nothing about the Black-White differences and shows a clear ignorance of even the IQ scores of the Irish he discusses.
Reason Seven
In 1959, Klaus Eyferth performed a study of children in Germany whose fathers had been part of the occupation forces. Some had white fathers and some had black fathers. The IQ scores of the white children and the racially mixed children was virtually identical.
The results of this study have recently been replicated in Japan by Kirkegaard, Lasker & Kura. However, neither study presents anything like clear evidence against a genetic explanation for the differences between Blacks, Whites, or any other groups. Both of these involve old, small, and unrepresentative sampling (we don’t even know the ancestry of the fathers, and in the Eyferth case the result may be due to north African paternity; the effects of age are not assessed despite the expectation that the racial differences should grow with age if the typical gene-environment correlation in childhood is reversed, given the shared environmental effects known to fade) with no ability to assess test information or the nature of the between-group differences. In the former case, with Eyferth’s study, when all subtests are analyzed, the residual score differences are not concentrated on g, but even this tells us nothing particular. The majority of existing evidence (for example) confirms that mixed-race children score in-between their parent races and there is no reason to prefer the Eyferth or KLK study over these other ones, which are usually newer and higher-quality. Much more determinate studies using modern genomic measurements instead of self-reported, assumed, or phenotypically-inferred ancestry support a negative (positive) relationship between genetic ancestry from Africans (Europeans) and intelligence measures (see here and here as examples). Additionally, it does not appear that this relationship can be accounted for by phenotypic differences in skin colour, as expected given our understanding of measurement invariance.
This single low-quality study and its equally low-quality and uninformative replication, in a sea of others with the exact opposite or equally indeterminate conclusions, are not able to support an environmental explanation of the Black-White gap.
Reason Eight
Over the past few decades, the black-white IQ gap has narrowed. Roughly speaking, it was about 15 points in 1970 and it’s about 10 points now. This obviously has nothing to do with genes.
This is simply not true: The Black-White IQ gap has not narrowed (see this previous post). If we take the invariance of these purported changes seriously, then the gap has still not narrowed to 10 points. The claim that Blacks have gained 5.5 IQ points by Dickens & Flynn has been met with the observation that the gains on the Wonderlic Personnel Test were 2.4 points, the K-ABC showed losses of 1 point, the Woodcock-Johnson showed a zero gain, and the DAS showed a gain of only 1.83 points. Moreover, these authors simply projected to achieve a narrowing of 5.5 IQ points, when the simple arithmetic gain was actually just 3.4 IQ points. With the other tests mentioned included, this drops to 2.1 IQ points. Moreover, many other studies (included in the graph above) such as Murray’s, Roth et al.’s, Cucina et al.’s, and Kirkegaard et al.’s, have failed to support a general narrowing trend. In fact, the gains observed in Dickens & Flynn were found by them to be negatively associated with g (as, inappropriately, measured by PCA), indicating that the gains were in a different vector than the one assessed in the original tests, meaning they were certainly non-invariant. Even the trend in the NAEP stopped after the 1980s, at the same time as the narrowing in the socioeconomic status gap in that assessment, and in a way consistent with selection and perhaps even the Flynn effect eroding the quality of the tests. A more complete view of the Black-White IQ gap supports no narrowing at any level, the basis of La Griffe Du Lion’s Fundamental Law of Sociology.
Assessments of the latent abilities (instead of simply the observed scores, with all of their potential biases) involved in producing the Black-White differences overcome many of the deficits associated with drawing conclusions from assessments like the NAEP where the psychometric information is absent and the gains or losses are likelier to be a function of bias or selection over time (it is worth noting again that bias is usually absent within an era, but present between them). Most recently, Kane & Oakland, Frisby & Beaujean, and Hu et al. have all used proper psychometric methods and reported a 1 d (in K & O’s analysis, adjusted for selection for equal socioeconomic status in the Woodcock-Johnson standardization sample) difference between adult Blacks and Whites in g, in addition to support for Spearman’s hypothesis, all of which is inconsistent with a narrowing. I conjecture that the better the methods used, the smaller and less consistent the apparent narrowing and the less likely it is that it will turn out to be related to a genuine increase in the latent abilities the tests are presumed (but not always found) to assess (see here).
Drum has again overstated the evidence and ignored substantive issues and contrary findings in making this point. There is no psychometrically sound evidence in existence for his conjecture here, but there are many examples of evidence against this point.
Conclusion
I hope this makes sense. You can draw your own conclusions, but my take from all this is that (a) the short time since humans migrated to Europe doesn’t allow much scope for big genetic changes between Africans and Europeans, (b) it’s clear that environment can have a very large effect on IQ scores, and (c) anyone who thinks the marginalization of African Americans isn’t a big enough effect to account for 10–15 points of IQ is crazy. There are counterarguments to all my points, and none of this “proves” that there can’t possibly be genetic differences between blacks and whites that express themselves in noticeable differences in cognitive abilities. But I sure think it’s very unlikely.
Point A is unevidenced and inconsistent with the existence of the large differences between Europeans and Africans and the plausibility of the observed selection which has occurred in various populations being, ultimately, differential. Point B was not at all shown in his piece and it is unclear where the evidence for it will really be, especially insofar as putative environmental effects have a relation to the racial differences. Point C is perhaps his most tendentious point and there exists no evidence for it. Marginalization is not supposed to be something indefinite and unfalsifiable, so where is the evidence and how can it be made consistent with the existing empirical evidence and understanding of tests? I proffer that it cannot be and, while there is plenty of evidence for genetic involvement in the racial differences and the fact that the racial differences are one and the same with the individual differences within groups (which are dispositively known to be due to genetic differences), the evidence for systematic environmental effects between races is absent and, in most cases (e.g., discrimination, stereotype threat, a history of slavery) impossible as an explanation (if they were possible explanations, then we would have to say that the tests are horribly biased and find some way to prove this). One thing worth wondering is why these purported causes only affect IQ, and not, say, all of this:
What Drum says which is true — that tests are unbiased and IQ test scores are related to intuitive understandings of intelligence — is a good admission, since many still don’t understand that these facts are so well-supported they can’t be denied without burying your head in the sand. What Drum says which is not true is almost everything else that he said.