# Factor Analysis

From PCA to factor analysis by adding noise. The roots of factor analysis: Spearman's model of one general factor, and the tetrad equations. The problem with estimating factor models: more unknowns than equations. Solution 1, "principal factors": estimation through heroic feats of linear algebra. Solution 2, maximum likelihood: estimation by imposing distributional assumptions. The rotation problem: factor models are unidentifiable; the number of factors may be meaningful, but the individual factors are not.

Factor Analysis

36-350, Data Mining

23 September 2009

## Contents

1 From Principal Components to Factor Analysis
  1.1 Measurement Error
  1.2 Preserving correlations
  1.3 Roots of Factor Analysis in Causal Discovery
2 Preliminaries to Factor Estimation
3 Estimation by Linear Algebra
  3.1 A Clue from Spearman's One-Factor Model
  3.2 Estimating Factor Loadings and Specific Variances
4 Maximum Likelihood Estimation
  4.1 Estimating Factor Scores
5 The Rotation Problem

## 1 From Principal Components to Factor Analysis

There are two ways to go from principal components analysis to factor analysis: two motivating stories.

### 1.1 Measurement Error

Suppose that the numbers we write down as our observations aren't altogether accurate: our numbers are the true variables plus some measurement noise. (Or, if we're not making the measurements ourselves but just taking numbers from some database, whoever created the database wasn't able to measure things perfectly.) PCA doesn't care about this; it will try to reproduce true-value-plus-noise from a small number of components. But that's
kind of weird: why try to reproduce the noise?¹ Can we do something like PCA, where we reduce a large number of features to additive combinations of a smaller number of variables, but which allows for noise?

The simplest model, starting from PCA, would be something like this. Each object or record has p features, so X_ij is the value of feature j for object i. As before, we'll center all the observations (subtract off their means). We now postulate that there are q factor variables, and each observation is a linear combination of factor scores F_ir plus noise:

$$X_{ij} = \epsilon_{ij} + \sum_{r=1}^{q} F_{ir} w_{rj} \qquad (1)$$

The weights w_rj are called the factor loadings of the observable features; they say how much feature j changes, on average, in response to a one-unit change in factor score r. Notice that we are allowing each feature to go along with more than one factor (for a given j, w_rj can be non-zero for multiple r). This would correspond to our measurements running together what are really distinct variables. Here ε_ij is as usual the noise term for feature j on object i. We'll assume this has mean zero and variance ψ_j, i.e., different features have differently-sized noise terms. The ψ_j are known as the specific variances, because they're specific to individual features. We'll further assume that E[ε_ij ε_lm] = 0 unless i = l and j = m; that is, each object and each feature has uncorrelated noise.

We can also re-write the model in vector form,

$$\vec{X}_i = \vec{\epsilon}_i + \vec{F}_i \mathbf{w} \qquad (2)$$

with w being a q × p matrix. If we stack the vectors into a matrix, we get

$$\mathbf{X} = \boldsymbol{\epsilon} + \mathbf{F}\mathbf{w} \qquad (3)$$

This is the factor analysis model. The only (!) tasks are to estimate the factor loadings w, the factor scores F, and the specific variances ψ_j.

A common question at this point is, or should be: where does the model (1) come from? The answer is, we make it up. More formally, we posit it, and all the stuff about the distribution of the noise, etc., as a hypothesis.
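The posited model (1)-(3) is easy to simulate, which makes the hypothesis concrete. A minimal sketch with numpy; the sizes n, p, q and all parameter values here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 1000, 6, 2                    # objects, features, factors (made up)

w = rng.normal(size=(q, p))             # factor loadings, a q x p matrix
psi = rng.uniform(0.5, 1.5, size=p)     # specific variances, one per feature

F = rng.normal(size=(n, q))             # factor scores: mean 0, variance 1
eps = rng.normal(scale=np.sqrt(psi), size=(n, p))  # uncorrelated noise
X = eps + F @ w                         # the model in matrix form, Eq. (3)
X = X - X.mean(axis=0)                  # center the observations, as in the text
```

Fitting a factor model then means recovering w and the ψ_j from X alone, which is the task of the rest of these notes.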
All the rest of our reasoning is conditional, premised on the assumption that the posited hypothesis is in fact true. It is unfortunately too common to find people who just state the hypothesis in a semi-ritual manner and go on. What we should really do is try to test the hypothesis, i.e., to check whether it's actually right. We will come back to this.

### 1.2 Preserving correlations

PCA aims to preserve variance, or (what comes to the same thing) minimize mean-squared residuals (reconstruction error). But it doesn't preserve correlations. That is, the correlations of the features of the image vectors are not, in general, the same as the correlations among the original features.

¹ One reason would be if we're not sure what's noise, or if what seems to be noise for one purpose is signal for something else. But let's press onward.

### 1.3 Roots of Factor Analysis in Causal Discovery

In Spearman's one-factor model, each feature is a multiple of a single common factor plus noise, and the correlation between any two features is then the product of their factor weights (Eq. 5; proving this is Exercise 1, and Eq. 18 below gives the covariance version). Take any four features, j, l, r, s. Then, if the model (4) is true,

$$\frac{\rho_{jr}/\rho_{lr}}{\rho_{js}/\rho_{ls}} = \frac{w_j w_r / w_l w_r}{w_j w_s / w_l w_s} \qquad (6)$$

$$= \frac{w_j / w_l}{w_j / w_l} \qquad (7)$$

$$= 1 \qquad (8)$$

The relationship

$$\rho_{jr}\rho_{ls} = \rho_{js}\rho_{lr} \qquad (9)$$

is called the "tetrad equation", and we will meet it again later when we consider methods for causal discovery. Spearman found that the tetrad equation held in his data on school grades (to a good approximation), and concluded that a single general factor of intelligence must exist. This was, of course, logically fallacious.

Later work, using large batteries of different kinds of intelligence tests, showed that the tetrad equation does not hold in general, or, more exactly, that departures from it are too big to explain away as sampling noise. (Recall that the equations are about the true correlations between the variables, but we only get to see sample correlations, which are always a little off.) The response, done in an ad hoc way by Spearman and his followers, and then more systematically by Thurstone, was to introduce multiple factors. This breaks the tetrad equation, but still accounts for the correlations among features by saying that features are really directly correlated with factors, and uncorrelated conditional on the factor scores.³ Thurstone's form of factor analysis is basically the one people still use; there have been refinements, of course, but it's mostly still his method.

³ You can (and should!) read the classic "The Vectors of Mind" paper (Thurstone, 1934) online.

## 2 Preliminaries to Factor Estimation

Assume all the factor scores are uncorrelated with each other and have variance 1; also assume that they are uncorrelated with the noise terms. We'll solve the estimation problem for factor analysis by reducing it to an eigenvalue problem again. Start from the matrix form of the model, Eq. 3, which you'll recall was

$$\mathbf{X} = \boldsymbol{\epsilon} + \mathbf{F}\mathbf{w} \qquad (10)$$

We know that X^T X is a p × p matrix; in fact, it is n times the sample covariance
matrix V. So

$$n\mathbf{V} = \mathbf{X}^T\mathbf{X} \qquad (11)$$

$$= (\boldsymbol{\epsilon} + \mathbf{F}\mathbf{w})^T (\boldsymbol{\epsilon} + \mathbf{F}\mathbf{w}) \qquad (12)$$

$$= (\boldsymbol{\epsilon}^T + \mathbf{w}^T\mathbf{F}^T)(\boldsymbol{\epsilon} + \mathbf{F}\mathbf{w}) \qquad (13)$$

$$= \boldsymbol{\epsilon}^T\boldsymbol{\epsilon} + \boldsymbol{\epsilon}^T\mathbf{F}\mathbf{w} + \mathbf{w}^T\mathbf{F}^T\boldsymbol{\epsilon} + \mathbf{w}^T\mathbf{F}^T\mathbf{F}\mathbf{w} \qquad (14)$$

$$= n\boldsymbol{\Psi} + 0 + 0 + n\mathbf{w}^T\mathbf{I}\mathbf{w} \qquad (15)$$

$$= n\boldsymbol{\Psi} + n\mathbf{w}^T\mathbf{w} \qquad (16)$$

$$\mathbf{V} = \boldsymbol{\Psi} + \mathbf{w}^T\mathbf{w} \qquad (17)$$

where Ψ is the diagonal matrix whose entries are the ψ_j. The cross-terms cancel because the factor scores are uncorrelated with the noise, and the F^T F term is just n times the covariance matrix of the factor scores, which by assumption is the identity matrix.

At this point, the actual factor scores have dropped out of the problem, and all we are left with are the more "structural" parameters, namely the factor loadings w and the specific variances ψ_j. We know, or rather can easily estimate, the covariance matrix V, so we want to solve Eq. 17 for these unknown parameters. The problem is that we want q < p, but on its face (17) gives us p² equations, one for each entry of V, and only p + pq unknowns (the diagonal elements of Ψ, plus the elements of w). Systems with more equations than unknowns generally cannot be solved. This makes it sound like it's actually impossible to estimate the factor analysis model!⁴

## 3 Estimation by Linear Algebra

The means of escape is linear algebra.

### 3.1 A Clue from Spearman's One-Factor Model

Remember that in Spearman's model with a single general factor, the covariance between features a and b in that model is the product of their factor weightings:

$$V_{ab} = w_a w_b \qquad (18)$$

⁴ Actually, the book-keeping for the number of degrees of freedom is a little more complicated, though the point is sound. First of all, there are not p² independent equations but only p(p+1)/2 of them, because V is a symmetric matrix. (Since Ψ is diagonal, Ψ + w^T w is automatically symmetric.) On the other hand, each of the q rows of w must be orthogonal to all the others, which gives q(q−1)/2 constraints on the unknowns.
So the number of degrees of freedom for V is p(p+1)/2, and the number of degrees of freedom for the unknown parameters is p + pq − q(q−1)/2. If the former exceeds the latter, there are degrees of freedom left over to estimate the parameters, though there may be no exact solution. If, on the other hand, the parameters have more degrees of freedom than V does, then there cannot possibly be a unique solution, and the model is hopelessly unidentifiable no matter how much data we have. Most software, including R's default factor analysis function, will simply refuse to work with such a model.
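Both the covariance identity (17) and this degrees-of-freedom bookkeeping are easy to check numerically. A sketch with numpy; the helper's name and all parameter values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 100_000, 5, 2                  # large n, so sampling noise is small

w = rng.normal(size=(q, p))              # factor loadings
psi = rng.uniform(0.5, 1.5, size=p)      # specific variances

F = rng.normal(size=(n, q))              # unit-variance, uncorrelated scores
X = F @ w + rng.normal(scale=np.sqrt(psi), size=(n, p))
X = X - X.mean(axis=0)

V_sample = X.T @ X / n                   # nV = X^T X, as in Eq. (11)
V_model = np.diag(psi) + w.T @ w         # Eq. (17): V = Psi + w^T w
print(np.abs(V_sample - V_model).max())  # small, and shrinks as n grows

def dof_surplus(p, q):
    """Independent equations supplied by V minus free parameters, per the
    footnote: p(p+1)/2 equations; p + pq unknowns, less q(q-1)/2
    orthogonality constraints on the rows of w."""
    return p * (p + 1) // 2 - (p + p * q - q * (q - 1) // 2)

print(dof_surplus(5, 2))   # 1: estimable, though perhaps with no exact solution
print(dof_surplus(3, 2))   # -2: hopelessly unidentifiable; software refuses it
```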

The exception is that V_aa = w_a² + ψ_a, rather than w_a². However, if we look at U = V − Ψ, that's the same as V off the diagonal, and a little algebra shows that its diagonal entries are, in fact, just w_a². So if we look at any two rows of U, they're proportional to each other:

$$U_{a\cdot} = \frac{w_a}{w_b} U_{b\cdot} \qquad (19)$$

This means that, when Spearman's model holds true, there is actually only one linearly-independent row in U. Rather than having p² equations, we've only got p independent equations.⁵

Recall from linear algebra that the rank of a matrix is how many linearly independent rows it has.⁶ Ordinarily, a matrix is of full rank, meaning all the rows are linearly independent. What we have just seen is that when Spearman's model holds, the matrix U is not of full rank, but rather of rank 1. More generally, when the factor analysis model holds with q factors, the matrix U has rank q.

### 3.2 Estimating Factor Loadings and Specific Variances

We are now in a position to set up the classic method for estimating the factor model. As above, define U = V − Ψ. This is the reduced or adjusted covariance matrix. Its diagonal entries are no longer the variances of the features, but the variances minus the specific variances. These common variances, or commonalities, show how much of the variance in each feature is associated with the variances of the latent factors. U is still, like V, a positive symmetric matrix.

We can't actually calculate U until we know, or have a guess as to, Ψ. A reasonable and common starting-point is to do a linear regression of each feature j on all the other features, and then set ψ_j to the mean squared error for that regression.

Because U is a positive symmetric matrix, we know from linear algebra that it can be written as

$$\mathbf{U} = \mathbf{C}\mathbf{D}\mathbf{C}^T \qquad (20)$$

where C is the matrix whose columns are the eigenvectors of U, and D is the diagonal matrix whose entries are the eigenvalues.
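The regression-based starting guess for Ψ and the eigendecomposition (20) of U can be sketched together; the function name is mine, and the regressions use ordinary least squares on centered data:

```python
import numpy as np

def initial_psi(X):
    """Starting guess for the specific variances: regress each centered
    feature on all the others, and use the mean squared residual."""
    X = X - X.mean(axis=0)
    n, p = X.shape
    psi = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        psi[j] = np.mean((X[:, j] - others @ beta) ** 2)
    return psi

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # arbitrary correlated data
V = (X - X.mean(0)).T @ (X - X.mean(0)) / X.shape[0]

U = V - np.diag(initial_psi(X))                  # the reduced covariance matrix
vals, C = np.linalg.eigh(U)                      # eigenvalues D, eigenvectors C
print(np.allclose(C @ np.diag(vals) @ C.T, U))   # True: U = C D C^T, Eq. (20)
```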
That is, if we use all p eigenvectors, we can reproduce the covariance matrix exactly. Suppose we instead use C_q, the p × q matrix whose columns are the eigenvectors going with the q largest eigenvalues, and likewise make D_q the diagonal matrix of those eigenvalues. Then C_q D_q C_q^T will be a symmetric positive p × p matrix. It won't quite equal U, but it will come closer as we let q grow towards p, and at any given q, this matrix comes closer to being U than any other we could put together which had rank q.

⁵ This creates its own problems when we try to estimate the factor scores, as we'll see.

⁶ We could also talk about the columns; it wouldn't make any difference.
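This best-rank-q property is easy to see numerically. A sketch, with an arbitrary positive symmetric matrix standing in for U:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
U = A @ A.T                              # a positive symmetric stand-in for U

vals, C = np.linalg.eigh(U)              # eigh returns eigenvalues in ascending order
q = 2
Cq, Dq = C[:, -q:], np.diag(vals[-q:])   # eigenvectors of the q largest eigenvalues
U_q = Cq @ Dq @ Cq.T                     # the truncated reconstruction

print(np.linalg.matrix_rank(U_q))        # the approximation has rank q

# The error shrinks monotonically as q grows towards p:
errors = []
for k in range(1, 6):
    Uk = C[:, -k:] @ np.diag(vals[-k:]) @ C[:, -k:].T
    errors.append(np.linalg.norm(U - Uk))
print(all(a >= b for a, b in zip(errors, errors[1:])))  # True
```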

Now define D_q^{1/2} as the q × q diagonal matrix of the square roots of the eigenvalues. Clearly D_q = D_q^{1/2} D_q^{1/2}, so

$$\mathbf{C}_q \mathbf{D}_q \mathbf{C}_q^T = \mathbf{C}_q \mathbf{D}_q^{1/2} \mathbf{D}_q^{1/2} \mathbf{C}_q^T = \left(\mathbf{C}_q \mathbf{D}_q^{1/2}\right)\left(\mathbf{C}_q \mathbf{D}_q^{1/2}\right)^T \qquad (21)$$

So we have

$$\mathbf{U} \approx \left(\mathbf{C}_q \mathbf{D}_q^{1/2}\right)\left(\mathbf{C}_q \mathbf{D}_q^{1/2}\right)^T \qquad (22)$$

but at the same time we know that U = w^T w. So first we identify w with (C_q D_q^{1/2})^T:

$$\mathbf{w} = \left(\mathbf{C}_q \mathbf{D}_q^{1/2}\right)^T \qquad (23)$$

Now we use w to re-set Ψ, so as to fix the diagonal entries of the covariance matrix:

$$\mathbf{w} = \left(\mathbf{C}_q \mathbf{D}_q^{1/2}\right)^T \qquad (24)$$

$$\psi_j = V_{jj} - \sum_{r=1}^{q} w_{rj}^2 \qquad (25)$$

$$\mathbf{V} \approx \widehat{\mathbf{V}} \equiv \boldsymbol{\Psi} + \mathbf{w}^T\mathbf{w} \qquad (26)$$

The "predicted" covariance matrix V̂ in the last line is exactly right on the diagonal (by construction), and should be closer off-diagonal than anything else we could do with the same number of factors, i.e., the same rank for the U matrix. However, our estimate of U itself has in general changed, so we can try iterating this (i.e., re-calculating C_q and D_q) until nothing changes.

Let's think a bit more about how well we're approximating V. The approximation will always be exact when q = p, so that there is one factor for each feature (in which case Ψ = 0 always). Then all factor analysis does for us is to rotate the coordinate axes in feature space, so that the new coordinates are uncorrelated. (This is the same as what PCA does with p components.) The approximation can also be exact with fewer factors than features if the reduced covariance matrix is of less than full rank, and we use at least as many factors as the rank.

## 4 Maximum Likelihood Estimation

It has probably not escaped your notice that the estimation procedure above requires a starting guess as to Ψ. This makes its consistency somewhat shaky. (If we continually put in ridiculous values for Ψ, there's no reason to expect that ŵ → w, even with immensely large samples.) On the other hand, we know from our elementary statistics courses that maximum likelihood estimates are generally consistent, unless we choose a spectacularly bad model. Can we use that here?
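The whole iteration just described, guess Ψ, eigendecompose U = V − Ψ, set w by Eq. (23), reset Ψ by Eq. (25), and repeat, can be sketched as follows. The function name is mine, and Heywood cases, where some ψ_j comes out negative, are ignored:

```python
import numpy as np

def principal_factors(V, q, psi0, n_iter=100):
    """'Principal factors' estimation: iterate truncated eigendecompositions
    of the reduced covariance matrix U = V - Psi (a sketch)."""
    for _ in range(n_iter):
        U = V - np.diag(psi0)
        vals, C = np.linalg.eigh(U)                        # ascending eigenvalues
        Cq = C[:, -q:]
        dq_half = np.sqrt(np.clip(vals[-q:], 0.0, None))   # sqrt of q largest
        w = (Cq * dq_half).T                               # w = (C_q D_q^{1/2})^T
        psi0 = np.diag(V) - (w ** 2).sum(axis=0)           # Eq. (25)
    return w, psi0

# Check on a covariance matrix that exactly obeys the factor model:
rng = np.random.default_rng(4)
w_true = rng.normal(size=(2, 6))
psi_true = rng.uniform(0.5, 1.5, size=6)
V = np.diag(psi_true) + w_true.T @ w_true

w_hat, psi_hat = principal_factors(V, q=2, psi0=np.full(6, 1.0))
V_hat = np.diag(psi_hat) + w_hat.T @ w_hat
print(np.allclose(np.diag(V_hat), np.diag(V)))   # True: diagonal exact by construction
```

Note that even when V̂ matches V well, ŵ need not match w_true entry-by-entry: as the rotation problem below shows, the loadings are only determined up to an orthogonal transformation.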

We can, but at a cost. We have so far got away with just making assumptions about the means and covariances of the factor scores F. To get an actual likelihood, we need to assume something about their distribution as well. The usual assumption is that F_ir ∼ N(0, 1), and that the factor scores are independent across factors r = 1, ..., q and individuals i = 1, ..., n. With this assumption, the features have a multivariate normal distribution X_i ∼ N(0, Ψ + w^T w). This means that the log-likelihood is

$$L = -\frac{np}{2}\log 2\pi - \frac{n}{2}\log\left|\boldsymbol{\Psi} + \mathbf{w}^T\mathbf{w}\right| - \frac{n}{2}\operatorname{tr}\left[\left(\boldsymbol{\Psi} + \mathbf{w}^T\mathbf{w}\right)^{-1}\mathbf{V}\right] \qquad (27)$$

where tr A is the trace of the matrix A, the sum of its diagonal elements.

One can either try direct numerical maximization, or use a two-stage procedure. Starting, once again, with a guess as to Ψ, one finds that the optimal choice of Ψ^{1/2}w^T is given by the matrix whose columns are the q leading eigenvectors of Ψ^{1/2}VΨ^{1/2}. Starting from a guess as to w, the optimal choice of Ψ is given by the diagonal entries of V − w^T w. So again one starts with a guess about the unique variances (e.g., the residuals of the regressions) and iterates to convergence.⁷

The differences between the maximum likelihood estimates and the "principal factors" approach can be substantial. If the data appear to be normally distributed (as shown by the usual tests), then the additional efficiency of maximum likelihood estimation is highly worthwhile. Also, as we'll see next time, it is a lot easier to test the model assumptions if one uses the MLE.

### 4.1 Estimating Factor Scores

Probably the best method for estimating the factor scores is the "regression" or "Thomson" method, which says

$$\widehat{F}_{ir} = \sum_{j} X_{ij} b_{jr} \qquad (28)$$

and seeks the weights b_jr which will minimize the mean squared error, E[(F̂_ir − F_ir)²]. You will see how this works in a homework problem.

## 5 The Rotation Problem

Recall from linear algebra that a matrix O is orthogonal if its inverse is the same as its transpose, O^T O = I.
The classic examples are rotation matrices. For instance, to rotate a two-dimensional vector through an angle α, we multiply it by

$$\mathbf{R}_\alpha = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix} \qquad (29)$$

⁷ The algebra is tedious. See section 3.2 in Bartholomew (1987) if you really want it. (Note that Bartholomew has a sign error in his equation 3.16.)
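Numerically, R_α is indeed orthogonal, and, anticipating the argument that follows, multiplying the loadings by an orthogonal matrix changes neither w^T w nor the log-likelihood of Eq. (27). A sketch with made-up loadings; the helper's name is mine:

```python
import numpy as np

def fa_loglik(V, w, psi, n):
    """Gaussian factor-model log-likelihood, Eq. (27), for centered data
    with sample covariance matrix V."""
    p = V.shape[0]
    Sigma = np.diag(psi) + w.T @ w
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * n * p * np.log(2 * np.pi) - 0.5 * n * logdet
            - 0.5 * n * np.trace(np.linalg.solve(Sigma, V)))

alpha = 0.7                                        # an arbitrary angle
O = np.array([[np.cos(alpha), -np.sin(alpha)],
              [np.sin(alpha),  np.cos(alpha)]])    # R_alpha, Eq. (29)
print(np.allclose(O.T @ O, np.eye(2)))             # True: orthogonal, O^T = O^{-1}

rng = np.random.default_rng(5)
w = rng.normal(size=(2, 5))                        # some q x p factor loadings
psi = rng.uniform(0.5, 1.5, size=5)                # specific variances
u = O.T @ w                                        # rotated loadings

V = np.diag(psi) + w.T @ w                         # take V to be the true covariance
print(np.allclose(w.T @ w, u.T @ u))               # True: observables unchanged
print(np.isclose(fa_loglik(V, w, psi, 100),
                 fa_loglik(V, u, psi, 100)))       # True: likelihood can't tell
```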

The inverse to this matrix must be the one which rotates through the angle −α, so R_α^{−1} = R_{−α}; but trigonometry tells us that R_{−α} = R_α^T.

To see why this matters to us, go back to the matrix form of the factor model, and insert an orthogonal q × q matrix and its transpose:

$$\mathbf{X} = \boldsymbol{\epsilon} + \mathbf{F}\mathbf{w} \qquad (30)$$

$$= \boldsymbol{\epsilon} + \mathbf{F}\mathbf{O}\mathbf{O}^T\mathbf{w} \qquad (31)$$

$$= \boldsymbol{\epsilon} + \mathbf{G}\mathbf{u} \qquad (32)$$

We've changed the factor scores to G = FO, and we've changed the factor loadings to u = O^T w, but nothing about the features has changed at all. We can do as many orthogonal transformations of the factors as we like, with no observable consequences whatsoever.⁸

Statistically, the fact that different parameter settings give us the same observational consequences means that the parameters of the factor model are unidentifiable. The rotation problem is, as it were, the revenant of having an ill-posed problem: we thought we'd slain it through heroic feats of linear algebra, but it's still around and determined to have its revenge.

Mathematically, this should not be surprising at all. The factors live in a q-dimensional vector space of their own. We should be free to set up any coordinate system we feel like on that space. Changing coordinates in factor space will, however, require a compensating change in how factor-space coordinates relate to feature space (the factor loadings matrix w). That's all we've done here with our orthogonal transformation.

Substantively, this should be rather troubling. If we can rotate the factors as much as we like without consequences, how on Earth can we interpret them?

## Exercises

1. Prove Eq. 5.
2. Why is it fallacious to go from "the data have the kind of correlations predicted by a one-factor model" to "the data were generated by a one-factor model"?
3. Show that the correlation between the jth feature and G, in the one-factor model, is w_j.
4. Show that the diagonal entries of U = V − Ψ are given by w_a².

## References

Bartholomew, David J. (1987).
Latent Variable Models and Factor Analysis. New York: Oxford University Press.

⁸ Notice that the log-likelihood only involves w^T w, which is equal to w^T O O^T w = u^T u, so even assuming Gaussian distributions doesn't change things.

Spearman, Charles (1904). "'General Intelligence,' Objectively Determined and Measured." American Journal of Psychology, 15: 201–293. URL http://psychclassics.yorku.ca/Spearman/.

Thurstone, L. L. (1934). "The Vectors of Mind." Psychological Review, 41: 1–32. URL http://psychclassics.yorku.ca/Thurstone/.
