
I am a geologist attempting to apply discriminant function analysis (DFA) to surface features I have mapped in ArcGIS. At the moment I have 4 dimensionless sorting variables calculated for each feature, and about 85% of my data points have been classified into one of 4 group types. I didn't check whether the data were normally distributed before I ran the first analysis. The DFA successfully cross-validated 70.7% of all the data points, and one group was successfully cross-validated 91.8% of the time (very good). The problem is that the p-value for Box's M test is 0.000 at the p = 0.05 level (the default in SPSS), and because that technically means the results are not robust, the analysis is probably useless.

I went back and checked the data for normality in MATLAB. I was able to make the pooled data more normal by applying a Box-Cox transform and then running MATLAB's kstest function on the transformed data; however, the distribution of data within the individual groups (as opposed to the pooled data from all groups, which is what I tested first) is still non-normal, and I cannot justify applying different transformations to different groups.
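For reference, the transform-then-test workflow above can be sketched in Python with SciPy (the data here are simulated stand-ins, not the original sorting variables):

```python
# Sketch of the Box-Cox + KS-test workflow (Python equivalent of the
# MATLAB steps described above). The data are illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.7, size=200)  # stand-in for one sorting variable

x_bc, lam = stats.boxcox(x)          # Box-Cox requires strictly positive data
z = (x_bc - x_bc.mean()) / x_bc.std(ddof=1)
stat, p = stats.kstest(z, 'norm')    # KS test against the standard normal

print(f"lambda = {lam:.3f}, KS p-value = {p:.3f}")
```

One caveat: when the normal's parameters are estimated from the same data, the plain KS p-value is too lenient; a Lilliefors-corrected test is the stricter check.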

My questions are the following:

  1. Given how well the DFA cross-validated despite the Box's M problem, are the results okay?
  2. Is there anything I can do to salvage the analysis if the results are not okay?

I think I know the answer to both of the questions, but I want to make sure I can't do anything before I give up entirely and descope to a linear or log regression analysis. Thanks in advance for the answers!

Here is a picture of the SPSS DFA graph, if it helps:

[Image: Discriminant Function Analysis of the surface features]

Jeremy Miles
  • From the picture, one might recommend somewhat lessening the right skew of Function 2, that is, of the variable(s) associated most strongly with that dimension. As for Function 1, the data along it seem roughly symmetric in every group, so that's OK. – ttnphns Aug 18 '23 at 18:45

2 Answers


Box's M test checks whether the groups' covariance matrices are the same in the population; this homogeneity of covariances is an assumption of linear discriminant analysis (LDA). The test is quite sensitive to violations of multivariate normality.

On the other hand, LDA itself is relatively robust to both violations - of homogeneity and of normality. Unless Box's p-value is extremely low, one may be justified in practice in not paying much attention to the test. Explore the "Log Determinants" table associated with the test: it helps to identify which groups are similar and which stand apart with respect to homogeneity. Maybe you decide to exclude a group.
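The "Log Determinants" comparison can be reproduced outside SPSS; a group whose log determinant differs markedly from the others is the main driver of covariance heterogeneity. A minimal sketch with made-up data (group "D" is deliberately given inflated variance):

```python
# Sketch: per-group log determinants of the covariance matrices,
# mirroring SPSS's "Log Determinants" table. Data are hypothetical:
# 4 groups, 4 sorting variables, with group "D" given inflated variance.
import numpy as np

rng = np.random.default_rng(1)
groups = {g: rng.normal(size=(40, 4)) for g in "ABC"}
groups["D"] = rng.normal(scale=3.0, size=(40, 4))

for g, X in groups.items():
    cov = np.cov(X, rowvar=False)            # group covariance matrix
    sign, logdet = np.linalg.slogdet(cov)    # stable log-determinant
    print(f"group {g}: log determinant = {logdet:.2f}")
```

Group "D" stands out with a much larger log determinant, flagging it as the group to consider excluding or modeling separately.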

Don't forget about the option in SPSS's LDA: "Classify - Use covariance matrix - Separate-groups". As I've mentioned here:

QDA (quadratic discriminant analysis) would then [i.e. under gross nonhomogeneity of the covariance matrices] be a better approximation than LDA. A practical approach halfway between LDA and QDA is to use the LDA discriminants but classify using the observed separate-class covariance matrices instead of the pooled matrix.

ttnphns

I also face this problem with my data. I am a biomedical scientist and my data are rarely normally distributed. However, I have had successful classification using DA as you describe, sometimes over 90% RCC. I don't know if I have the best answers to your questions, but I can tell you that Box's M is not robust with non-normal data; use the non-parametric Levene's test instead.

Another option is to use multinomial logistic regression (binary when you have only two groups), which is not sensitive to deviations from normality the way DA is. Comparing the results from discriminant analysis and logistic regression can add credibility to your conclusions whenever both techniques give the same take-home message. I would not describe the analysis as worthless just because of Box's M or the lack of normality. After all, RCC is often used to evaluate the performance of techniques such as DA and logistic regression; if you're getting a high RCC with large enough sample sizes, I would believe the results. Hope this helps.
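The cross-check suggested above can be sketched as follows, using scikit-learn on deliberately skewed (gamma-distributed) synthetic groups; names and data are illustrative, not from either poster's analysis:

```python
# Sketch: cross-checking LDA against multinomial logistic regression on
# skewed, non-normal data. If both give a similar cross-validated RCC,
# the classification result is more credible.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
# Three right-skewed groups with increasing scale (hence non-normal)
X = np.vstack([rng.gamma(shape=2.0, scale=s, size=(50, 4))
               for s in (1.0, 2.0, 3.0)])
y = np.repeat([0, 1, 2], 50)

lda_acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
log_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
print(f"LDA RCC: {lda_acc:.3f}, multinomial logistic RCC: {log_acc:.3f}")
```

When the two cross-validated rates agree, the conclusion does not hinge on the normality assumption that only LDA makes.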