
My raw data in R, as revealed by str(), shows that variable A is of type character (chr) and variable B is of type numeric (num).

Variable A has 4 levels representing solutions A, B, C, and D. I used as.factor() to convert A into a factor.

Variable B has 3 levels representing concentrations of 25, 50, and 75 mmol/L. I used as.factor() to convert B into a factor as well.

The data are normally distributed with homogeneity of variances, so I used a two-way ANOVA to assess the influence of variables A and B on the dependent variable y.

However, both the Df and the significance changed strangely after converting the variables to factors:

Here is the code I used in R.

First I checked whether the variables were factors (FALSE) and ran the two-way ANOVA: no significant result.

Second, I converted them to factors and checked again (TRUE), then ran the same calculation: p < 0.05! The Df also changed.

is.factor(sheet10$Concentration) # FALSE
is.factor(sheet10$Solution)      # FALSE
a1 = aov(AD1 ~ Solution * Concentration, data = sheet10)
summary(a1)
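
The second step, converting both columns and refitting, used the same model call (reconstructed here from the steps described above; only the as.factor() conversion is new):

sheet10$Solution <- as.factor(sheet10$Solution)
sheet10$Concentration <- as.factor(sheet10$Concentration)
is.factor(sheet10$Concentration) # TRUE
is.factor(sheet10$Solution)      # TRUE
a2 <- aov(AD1 ~ Solution * Concentration, data = sheet10)
summary(a2) # now p < 0.05, and the Df changed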

[screenshot: ANOVA summary output]

This has left me confused about which analysis approach to trust. Could you please help me understand what is happening?

Supplementary details on the experiment and data:

  • The experiment involved an adsorption test of 5 different samples (S1, S2, S3, S4, S5), each representing a different material, across 4 solutions: A, B, C, and D.

  • Each solution was tested at 3 concentrations: 25, 50, and 75 mmol/L.

  • Each condition was replicated twice (n = 2), resulting in a total of 5 × 4 × 3 × 2 = 120 data points.

The design of my experiment: each sample follows the same design, so there are 120 data points in total. [screenshot: experimental design table]
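
As a sanity check, the full factorial layout can be reconstructed in R (a minimal sketch; the column names are illustrative, not the ones in my file):

design <- expand.grid(Sample = paste0("S", 1:5),
                      Solution = c("A", "B", "C", "D"),
                      Concentration = c(25, 50, 75),
                      Replicate = 1:2)
nrow(design) # 120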

How I laid out my data and imported it into R: [screenshot: raw data table as imported into R]

Fay
  • Have a look at plots of all observations by solution and concentration (see the plotting sketch after these comments). One possible explanation is that concentration has a clear effect, but it isn't monotonic (for example, outcome values at 50 may be larger than at both 25 and 75). In that case concentration as a factor should be significant, but as a numerical variable the ANOVA tries to fit a line and tests the slope, and may not reject a zero slope. – Christian Hennig Jan 12 '24 at 11:24
  • I think @ChristianHennig is likely correct (I had the same thought). I'm not sure why this change strikes you as so strange. You changed the model -- that is, you asked a different question -- and got a different result -- that is, a different answer. This also shows a good reason to look at plots. – Peter Flom Jan 12 '24 at 11:53
  • @PeterFlom lol, that's because of my bad stats. I did not realize that converting to factors changed the model until you and ChristianHennig pointed it out. – Fay Jan 12 '24 at 11:57
  • @ChristianHennig thank you so much. In the original data, concentration indeed has an impact on the results. Another issue I face is that the data for S1, S2, S3, S4, and S5 do not always meet the normality assumption (checked via ANOVA residuals) and homogeneity of variances for every group. In this situation, should I use only non-parametric analyses for all groups, or is it advisable to use parametric analyses where normality and homogeneity hold and non-parametric analyses where they do not? thanks again – Fay Jan 12 '24 at 11:59
  • @Fay One clue that you did notice is that the df changes. – Peter Flom Jan 12 '24 at 12:00
  • @Fay: The issue of model assumptions is not so easy (deviations from assumptions can sometimes be tolerated but not always), and I won't comment without knowing the data. I also don't see how you model that you have five different samples and 2 observations per sample and cell (?). Do you mean you run five different separate analyses? 2 observations per cell can't be used to diagnose normality. – Christian Hennig Jan 12 '24 at 12:54
  • @Christian with enough cells and an assumption of homoscedasticity, two observations per cell works quite well for evaluating the error distribution. – whuber Jan 12 '24 at 13:30
  • @ChristianHennig thank you. Could you please have a look at the latest pictures? They show how I designed the experiment and the data used in the R model. – Fay Jan 12 '24 at 13:41
  • @whuber thank you for your comment. I re-edited this question with the experiment design and my raw data. – Fay Jan 12 '24 at 13:43
  • @whuber The distributional assumption is within cell, and from 2 observations you can't say anything about the distributional shape. You can see violations of homoscedasticity though, which may in turn make normality checks fail. Assuming homoscedasticity to diagnose normality doesn't seem very reasonable to me as the difference between two observations tells us something about variance, but nothing about distributional shape. – Christian Hennig Jan 12 '24 at 14:36
  • Including data as images isn't very helpful (though better than no data at all). The easiest way to add data to a question, if using R, is dput(). See also this SO thread. – dipetkov Jan 12 '24 at 14:48
  • @dipetkov lol, thanks for the tip. I do learn a lot from these amazing comments! :) – Fay Jan 12 '24 at 16:10
  • @Christian But you have far more than two observations: you have as many pairs of observations as you have cells. – whuber Jan 12 '24 at 17:21
  • @whuber But as I wrote before, the normality assumption is within cells. Any set of pairs within cells is compatible with normality, unless you require all distributions the same, which implies homoskedasticity. On what basis can you justify assuming homoskedasticity? Well, you could check whether all differences of within-cell pairs are about the same. But in that case, if anything, the distribution of residuals will have lighter tails than the normal, and any violation of normality of that kind is harmless (you will automatically have symmetry, all considering a saturated model). – Christian Hennig Jan 12 '24 at 17:51
  • Homoscedasticity is standard and can be tested, as you know. Checking for similar differences would be incorrect. You need to examine the entire distribution of absolute values of the differences of pairs within the cells. – whuber Jan 12 '24 at 17:52
  • @whuber So even though I admit that you could observe a very specific deviation from normality in this way, I don't think this is of relevance, whereas you actually can observe relevant deviations from homoskedasticity. – Christian Hennig Jan 12 '24 at 17:55
  • @whuber My point is that any relevant deviation from normality looking at the distribution of residuals/absolute differences between pairs in fact comes from too heavy tails, and will be well compatible with normal distributions everywhere, but heteroskedasticity. I.e., looking for relevant violations of normality doesn't add anything of interest to looking for violations of homoskedasticity. – Christian Hennig Jan 12 '24 at 17:58
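
A minimal sketch of the plot suggested in the comments above, assuming the column names from the question's aov() call (Concentration, Solution, AD1):

with(sheet10, interaction.plot(Concentration, Solution, AD1,
                               xlab = "Concentration (mmol/L)",
                               ylab = "mean AD1"))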

1 Answer


The first model, with only 1 df devoted to concentration, implicitly assumes that the outcome is linearly related to Concentration. That's how these functions interpret a numerical predictor, unless you specify otherwise. Furthermore, even though you didn't specify Solution as a factor before building that model, R will interpret a character-valued predictor as a factor if it can. It did.

A strict linear association between a continuous predictor and outcome is seldom going to hold. See Chapter 2 of Frank Harrell's Regression Modeling Strategies for flexible ways to evaluate associations of continuous predictors with outcome in the general case.

The second model, with Concentration treated as a factor, makes no assumptions about linearity between outcome and Concentration. Each Concentration level is allowed to have its own association with outcome and its own interaction with Solution. As it removes the highly restrictive linearity assumption of the first model, this second model makes a lot more sense. (For future reference, if there were more than 3 levels of Concentration you might consider a different approach from treating it as a multi-level factor, as discussed in the Harrell reference.)
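
A small simulated illustration of the difference (all numbers hypothetical): with a non-monotonic effect that peaks at the middle concentration, the numeric coding tests a slope of roughly zero while the factor coding detects the level differences.

set.seed(1)
d <- expand.grid(Solution = c("A", "B", "C", "D"),
                 Concentration = c(25, 50, 75),
                 rep = 1:2)
d$y <- ifelse(d$Concentration == 50, 10, 5) + rnorm(nrow(d)) # peak at 50: non-monotonic
summary(aov(y ~ Solution * Concentration, data = d))         # Concentration: 1 Df, tests a line's slope
summary(aov(y ~ Solution * factor(Concentration), data = d)) # Concentration: 2 Df, tests level means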

A few warnings, however.

First, although the basic R aov() works OK with a perfectly balanced design like yours, it can lead to problems when there isn't balance. Learn about other ways to work with more general data sets.
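
One common alternative, sketched here with the car package (assumed installed, and assuming both predictors have already been converted to factors):

library(car)
fit <- lm(AD1 ~ Solution * Concentration, data = sheet10)
Anova(fit, type = 2) # Type II sums of squares; unlike aov()'s sequential (Type I) tests, these don't depend on term order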

Second, your models don't take into account that there only seem to be 5 separate samples, each tested in duplicate under all the conditions. The multiple observations on the same sample are likely to be correlated in some way (e.g., much lower overall in S5), so your standard-error estimates will be incorrect unless you use a modeling method that accounts for the correlations (e.g., include sample as another fixed effect in your ANOVA, or as a random effect in a mixed model).
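
For example (a sketch, not the poster's actual data layout: 'all_samples' is a hypothetical data frame stacking S1 through S5 with a Sample column):

# sample as another fixed effect in the ANOVA:
summary(aov(AD1 ~ Sample + Solution * Concentration, data = all_samples))

# or sample as a random intercept in a mixed model (lme4 package assumed installed):
library(lme4)
fit <- lmer(AD1 ~ Solution * Concentration + (1 | Sample), data = all_samples)
summary(fit)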

Third, it looks like your outcome values are restricted to values under 100; perhaps they are something like a percentage adsorption. In that case a simple ANOVA might not be the best choice, although sometimes it can work OK if the residuals are close enough to normally distributed.

EdM
  • Thank you so much @EdM.

    The data for S1 to S5 are the results of adsorption on five materials, which are completely independent of each other. Therefore, the "much lower overall in S5" you mentioned is plausible (indeed the desired outcome).

    I performed five separate analyses, one per sample: S ~ Concentration * Solution.

    Last but not least, I am aware that my dataset is small (duplicates only), and I have read opinions suggesting that a two-way ANOVA with n = 2 may not be meaningful.

    Therefore, I would like to inquire if you have any recommended models for such a small amount of data.

    – Fay Jan 22 '24 at 10:12
  • @Fay it depends on what you mean by "meaningful." In principle you can get by with n=2 per cell, but you run the risk of getting imprecise estimates or being thrown off by an outlying measurement error. I'm more worried about whether your simple linear models for materials S1 through S4, with values close to the fixed upper limit of 100, meet the assumptions needed for getting correct p-values. See this page for the assumptions and this page for plots that help evaluate whether they were met. – EdM Jan 22 '24 at 14:07
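
For reference, the diagnostic plots mentioned in the comment above can be produced with base R's plot method for fitted models (a sketch, reusing the factor-coded model from the question):

a2 <- aov(AD1 ~ Solution * Concentration, data = sheet10) # predictors already converted to factors
par(mfrow = c(2, 2))
plot(a2) # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage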