I would encourage you to consider reframing your research question. Statistically speaking, you cannot prove the absence of a difference unless you plan on using Bayesian methods. I'll explain what I mean using a simple example of how you might use your data.
Say that you decide to take the mean rating of items 1-6 and items 7-12. To do this, you might code a rating of "Strongly Disagree" as 1, "Disagree" as 2, "Neutral" as 3, and so on. You could then average the two halves of your scale so that each person has an average rating of perceived age bias and an average rating of personal age bias. You might then use a paired-samples t-test to "test" whether the average ratings of these constructs differ within individuals. Under standard null hypothesis testing, the following would be your null and alternative hypotheses:
$H_0: \mu_d = 0$
$H_a: \mu_d \ne 0$
Just for clarity, $\mu_d$ is the mean difference (i.e., the average within-person difference between perceived age bias and personal age bias). This setup should illustrate that the null hypothesis (that there is no average difference) is your research hypothesis. At first glance, that should mean your question is answerable, since you could test that hypothesis; however, the issue with frequentist methods is that we only ever test whether we can reject the null hypothesis. Frequentist null hypothesis tests have exactly two possible outcomes: reject the null hypothesis (the result is statistically significant) or fail to reject the null hypothesis (the result is not statistically significant). Failing to reject the null is not the same as showing that the null is true.
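To make that concrete, here is a minimal sketch of the scoring-and-testing workflow in Python, using simulated responses and hypothetical column names (`q1`-`q12`); it is not your data or a definitive analysis plan:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 12 Likert items already coded 1-5
# (Strongly Disagree = 1 ... Strongly Agree = 5).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(1, 6, size=(100, 12)),
    columns=[f"q{i}" for i in range(1, 13)],
)

# Subscale means: items 1-6 = perceived age bias, items 7-12 = personal age bias.
perceived = df[[f"q{i}" for i in range(1, 7)]].mean(axis=1)
personal = df[[f"q{i}" for i in range(7, 13)]].mean(axis=1)

# Paired-samples t-test of the within-person difference.
t_stat, p_value = stats.ttest_rel(perceived, personal)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```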
So, depending on your statistical orientation, framing your question as trying to show the absence of an effect is not possible with standard significance testing. From a scientific integrity perspective, your research question and hypothesis shouldn't be trying to prove a negative either.
Not all hope is lost, however. As I mentioned, you could use Bayesian methods to estimate the evidence for the null relative to the alternative hypothesis. Or, you could reframe your statistical hypothesis: instead of claiming that there is no difference between the two subscales, you could test whether the difference is negligibly different from zero (i.e., smaller than some substantively trivial bound). This framework is equivalence testing via two one-sided t-tests (TOST).
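If you go the TOST route, a minimal sketch using the paired equivalence test in `statsmodels` might look like the following; the ±0.5 equivalence bounds and the simulated subscale scores are purely illustrative, and the bounds should be chosen as the smallest difference you would consider meaningful on your 1-5 scale:

```python
import numpy as np
from statsmodels.stats.weightstats import ttost_paired

rng = np.random.default_rng(0)
# Hypothetical subscale means for the same 100 respondents.
perceived = rng.normal(3.0, 0.6, size=100)
personal = rng.normal(3.1, 0.6, size=100)

# TOST: test whether the mean paired difference lies within (-0.5, 0.5).
# A small overall p-value supports "negligibly different from zero."
p_overall, lower_test, upper_test = ttost_paired(perceived, personal, low=-0.5, upp=0.5)
print(f"TOST p = {p_overall:.3f}")
```

If you prefer the Bayesian route instead, packages such as `pingouin` report a Bayes factor alongside the paired t-test, though the conclusion will depend on the prior you place on the effect size.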
As far as strategies for answering whether people respond to the items similarly or differently, I'm going to assume that items are repeated across the two subscales; in other words, the same items are given on the perceived-bias and personal-bias scales (e.g., Q1 = Q7, Q2 = Q8, etc.). The Item Response Theory (IRT) framework gives you a few options here. You could test directly whether there is differential item functioning depending on whether an item asks about personal bias or perceived bias, but this assumes it is reasonable to model a single latent variable (i.e., bias) rather than two separate latent variables (i.e., personal bias and perceived bias). IRT also generally requires a fairly large sample (e.g., 250-500 people).

Alternatively, you could run a chi-square test for each item pair, checking whether the counts of endorsements for the different Likert ratings depend on whether the item is about perceived or personal bias, but you would need to adjust your p-values for multiple comparisons (see the sketch below). The same issue arises here, though: a non-significant result cannot be taken to mean that there is no difference.
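For the per-item chi-square idea, a sketch might look like this, again with simulated data; the pairing of columns (item i with item i + 6) and the Holm correction are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(1, 6, size=(200, 12)),
    columns=[f"q{i}" for i in range(1, 13)],
)

# Assumed pairing: perceived-bias item i matches personal-bias item i + 6.
p_values = []
for i in range(1, 7):
    perceived_counts = df[f"q{i}"].value_counts().reindex(range(1, 6), fill_value=0)
    personal_counts = df[f"q{i + 6}"].value_counts().reindex(range(1, 6), fill_value=0)
    table = np.vstack([perceived_counts, personal_counts])  # 2 x 5 contingency table
    chi2, p, dof, expected = chi2_contingency(table)
    p_values.append(p)

# Adjust the six per-item p-values for multiple comparisons (Holm step-down).
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(p_adjusted)
```

Note that this treats each item's ratings as independent counts and ignores the within-person pairing, which is another reason to view it as a rough screen rather than a confirmatory test.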