
I have a dataset that measures students' time spent working on a set of mathematics questions. My dataframe looks a little something like this:

Participant ID   Question 1   Question 2   Question 3
1107             54.2         48.9         45.0
4208             53.1         45.6         40.6

I have times for 20 questions for about 200 students. I have observed an overall decrease in time spent per question, as is shown in the figure below:

[Figure: time series of average time spent per question]

I would like to accompany this graph with a statistical measure of negative tendency.

I don't think I should use a correlation statistic as the question number is a categorical variable.

I could maybe do an OLS regression, with X being the question number and y being the time spent per question, but I am not sure how to interpret the result.

What else could I try?


Edit

Since a few people have been asking about the context in which this data was collected, you can read all about it in the study pre-registration https://osf.io/f7zgd

jda5
  • OLS is a simple way to summarise the trend. Remember in OLS the slope $\beta_1$ is the average change in $E(y)$ for a unit change in $x$ ($x \to x + 1 \implies E(y) \to E(y) + \beta_1$). – jcken Mar 30 '22 at 12:21
    I'm not sure it would be your best choice, but I would probably do a Mann-Kendall trend test. – sceptic Mar 30 '22 at 13:02
  • Can you say more about the design of your experiment as it's not clear whether you are measuring a question effect (ie. time spent on these particular 20 questions) or an order effect (ie. time spent on questions at the start vs the end of the session)? – dipetkov Mar 30 '22 at 21:34
  • @dipetkov In the question above I think I was alluding to an order effect, though I would also be interested in a question effect. The data is taken from a two-group, between subjects, RCT design with one factor and three measures (pre-test, post-test, and delayed post-test). Participants completed questions on one of two apps, though the data above only considers the experimental group. The decreasing time per question I believe to be an indication of improvement and efficacy of the app. Thus both the time spent on the app (aka the 20 questions) and the order effect are relevant! – jda5 Mar 31 '22 at 08:12
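The slope interpretation given in the comments can be checked numerically. Below is a minimal sketch (in Python with invented, exponentially decaying times, not the study data; the rest of the thread uses R):

```python
import numpy as np

# Invented per-question average times (seconds): a 6% decay per
# question, noise omitted so the numbers are easy to follow.
question = np.arange(1, 21)
time = 40 * 0.94 ** (question - 1)

# OLS of time on question number: the slope is the average change
# in E(y) for a one-question increase in x.
slope, intercept = np.polyfit(question, time, 1)
print(f"average change per question: {slope:.2f} seconds")  # about -1.4
```

On these invented data the fitted slope is about -1.4 seconds per question; on the real data it would be read the same way, as the average change in expected time per successive question.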

4 Answers


The plot itself is perhaps the best way to present the tendency.

Consider supplementing it with a robust visual indication of trend, such as a lightly colored line or curve. Building on psychometric principles (lightly and with some diffidence), I would favor an exponential curve determined by, say, the median values of the first third of the questions and the median values of the last third of the questions.

An equivalent description is to fit a straight line on a log-linear plot, as shown here.

[Figure: loglinear plot of time per question with fitted trend line]

This visualization has been engineered to support the apparent objectives of the question:

  • A title tells the reader what you want them to know.

  • The connecting line segments are visually suppressed because they are not the message.

  • The fitted line is made most prominent visually because it is the basic statistical summary -- it is the message.

  • Points that are significantly beyond the values of the fitted line (with a Bonferroni adjustment for 20 comparisons) are highlighted by making them brighter and coloring them prominently. (This assumes the vertical error bars are two-sided confidence intervals for a confidence level near 95%.)

  • The line is summarized by a single statistical measure of trend, displayed in the subtitle at the bottom: it represents an average 6.2% decrease in working time for each successive question.

This line passes through the median of the first five answer times (horizontally located at the median of the corresponding question numbers 1-5) and the median of the last five answer times (horizontally located at the median of question numbers 16-20). This technique of using medians of the data at either extreme is advocated by John Tukey in his book EDA (Addison-Wesley 1977).
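A minimal numeric sketch of this construction (Python rather than the R used elsewhere in the thread, and with invented times that decay exactly 6% per question, so the recovered rate is known in advance):

```python
import numpy as np

# Invented average times (seconds) for questions 1..20,
# decaying exactly 6% per question.
q = np.arange(1, 21)
t = 40 * 0.94 ** (q - 1)

# Median question number and median log-time of the first five
# and last five questions -- the two anchor points of the line.
x1, y1 = np.median(q[:5]), np.median(np.log(t[:5]))
x2, y2 = np.median(q[-5:]), np.median(np.log(t[-5:]))

# The slope on the log scale converts to a multiplicative
# change per question.
slope = (y2 - y1) / (x2 - x1)
pct_change = 100 * (np.exp(slope) - 1)
print(f"{pct_change:.1f}% change per question")  # -6.0% by construction
```

Because medians ignore the extremes within each group, this fit resists the influence of one or two unusual questions, which is the point of the technique.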

Some judgment is needed. Tukey often used the first third and last third of the data when making such exploratory fits. When I do that here, the left part of the line barely changes (it should not, since the data are consistent in that part of the plot) while the right part changes appreciably, reflecting both the greater variation in times and the greater standard errors there:

[Figure: loglinear plot with line fit through medians of the first and last thirds]

This time, however, (a) there are more badly fit points and (b) they consistently fall below the line. This suggests this fit does not have a sufficiently negative slope. Thus, we can have confidence that the initial exploratory estimate of $-6\%$ (or so) is one of the best possible descriptions of the trend.

whuber
    Thank you for what is an outstanding answer! I am definitely going to draw this plot on my data! – jda5 Mar 30 '22 at 17:51
  • Could I maybe instead use linear regression to draw my line of best fit? – jda5 Mar 30 '22 at 17:55
    Linear regression of the logarithms would produce a similar fit. Just don't rely on its estimates of standard errors or p-values, because you haven't accounted for the obvious heteroscedasticity (strong changes in variability of response with question number). You literally could just pencil in a line by eye over the loglinear graph and accomplish everything you want. – whuber Mar 30 '22 at 17:58
  • Great, thanks for your help. – jda5 Mar 30 '22 at 18:10
  • While a straight line does give the OP what they asked for (one number summary of negative tendency), what I see in the original plot is a strong linear trend for questions 1-10 and then a leveling-off after question 11. Adding the line to some extent tricks the visual system to see a negative linear relationship for the entire range of questions. – dipetkov Mar 30 '22 at 23:47
  • @dipetkov Those are interesting observations. But you propose a model that requires at least four parameters instead of two. It therefore ought to be considered a refinement of the (admittedly) simplistic characterization of a single trend. One reason why presenting the graphic is so important is that it permits all viewers to evaluate such alternative models and draw their own conclusions. – whuber Mar 31 '22 at 00:43
    @whuber I agree! It's always better to show the data. And it was your plot that made me realize that I don't quite see a straight line for the full set of questions. – dipetkov Mar 31 '22 at 01:05
    @dipetkov That's fine: plotting the points has thereby served its purpose. But what you do see is necessarily more complicated than a linear trend, raising questions about whether your assessment is real or due to chance variation that merely causes changes attracting your attention. Deciding among these possibilities is (as I'm sure you well know) such a thorny problem that usually, if we have the chance, we obtain additional independent data to test such novel hypotheses. – whuber Dec 23 '22 at 14:03

While your question variables are categorical, they could also be treated as ordinal, since they are done in sequence, so there is a natural ordering of the questions. In that case something like Spearman's rho correlation coefficient would be... ok.
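For instance (a Python sketch with scipy; the per-question averages below are invented, read off a plot like the one in the question, not the actual data):

```python
import numpy as np
from scipy.stats import spearmanr

# Invented average time (seconds) per question, roughly decreasing.
question = np.arange(1, 21)
avg_time = np.array([40, 31, 27, 24, 27, 28, 26, 24, 21, 24,
                     13, 10, 13, 24, 17, 9, 16, 16, 10, 11])

# Spearman's rho is the Pearson correlation of the ranks, so it
# only assumes a monotone (ordinal) relationship.
rho, p_value = spearmanr(question, avg_time)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.1e}")
```

A strongly negative rho here summarizes a monotone decrease without treating the question number as a numeric scale.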

user2974951

Caveat: The OP presents an interesting experiment that produced (up to) 200 x 20 = 4000 measurements. It's best to analyze the data at the student level, not the 20 per-question averages, using for example spline regression, since the averages don't follow a simple trend and the variances don't look constant either. That being said, the actual question is how to summarize the trend in the averages and what that trend is.

As @whuber illustrates, a plot is a great way to present 20 numbers (at least in two dimensions). If the numbers follow a pattern, this pattern can be highlighted with a judicious use of color and superimposing a line or a curve to emphasize the apparent relationship between the x and y variables.

These are powerful tools because the human visual system is very strong at seeing patterns. Sometimes however the additional graphical elements may suggest a relationship that's not strongly supported by the data.

I, for example, see a negative correlation between question order and time to answer only for questions 1 to 10. For questions 11 to 20, I see a leveling-off of average time to answer and an increase in variance (except for question 18).

Qualitatively, the data is consistent with both the "constantly negative" and the "first-half negative, second-half level" relationship between question and time.

To demonstrate, I plot the data twice, superimposing one of the two relationship patterns in each plot. I also add some deliberately provocative commentary in the title to stress the point I want to make about the functional relationship, in front of my imaginary audience. [For completeness, I include my "data" below.]

Neither alternative shown below is convincing (though the constant trend is less unconvincing?). It's better to fit a model that doesn't make strong assumptions about the relationship between question order and time to the full dataset of observations, without averaging first. Perhaps spline regression, as recommended in Regression Modeling Strategies, is a place to start. See for example here.

Take 1: Time to answer levels off in the second half of exam

And the alternative:

Take 2: Time to answer improves continuously throughout the exam

My fake data. I took the average response time from the original plot by eye. I assume time is measured in seconds because the total time is about 400 units and 400 minutes of math is a lot.

library(tibble)  # tribble() comes from the tibble package

math <- tribble(
  ~question, ~time, ~sd,
  1, 40, 4,
  2, 31, 3,
  3, 27, 3,
  4, 24, 2,
  5, 27, 4,
  6, 28, 4,
  7, 26, 2,
  8, 24, 3,
  9, 21, 4,
  10, 24, 5,
  11, 13, 3,
  12, 10, 2,
  13, 13, 3,
  14, 24, 7,
  15, 17, 8,
  16, 9, 1,
  17, 16, 6,
  18, 16, 0.5,
  19, 10, 5,
  20, 11, 6
)
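As a rough illustration of the flexible-fit idea, here is a smoothing-spline sketch in Python (scipy's UnivariateSpline standing in for R's spline tools, and fit only to the 20 invented averages above, whereas the recommendation is to fit the full student-level data):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# The invented per-question averages (seconds) from above.
question = np.arange(1, 21)
time = np.array([40, 31, 27, 24, 27, 28, 26, 24, 21, 24,
                 13, 10, 13, 24, 17, 9, 16, 16, 10, 11], dtype=float)

# Smoothing spline: s bounds the residual sum of squares, so a
# larger s gives a smoother (less wiggly) fitted curve.
fit = UnivariateSpline(question, time, k=3, s=200)
smooth = fit(question)
print(np.round(smooth, 1))
```

Varying s shows how sensitive the "leveling-off after question 11" impression is to the amount of smoothing.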
dipetkov
    I recommend studying the residuals. The residuals for your more complicated model aren't appreciably better than for the simpler exponential trend: the same number of observations lie significantly beyond the fit. There's also the (very relevant) question of scientific plausibility: do you really believe students suddenly experience a drop-off in answering time after the 11th question?? – whuber Mar 31 '22 at 03:08
    @whuber I don't believe either and I wouldn't be fitting a trend to the 20 averages. If that's not clear then I didn't make my point clear. I would analyze the full data at the student level (all 200 x 20 = 4000 points). Actually, I was interested in this question because I thought the experimental design is interesting: Are the questions given in the same order and how is the order enforced? Do the question have a (known) level of difficulty? Do students always answer or can they skip a question? Why analyze time to answer on its own when correctness of answer seems relevant too?.... – dipetkov Mar 31 '22 at 04:43
  • @dipetkov First up, I would just like to thank you for your incredibly thoughtful answer! Ok, now on to your questions... – jda5 Mar 31 '22 at 08:16
    The questions are given to students on an app and the order is enforced. The questions all pertain to the same topic and each ask the students to do pretty much the same thing. I cannot objectively claim anything about the difficulty of the questions but subjectively, drawing on my experiences as an educator, I can say that the questions get progressively harder (though jumps in difficulty may not be consistent across all questions). The student can skip a question and a few actually did (~30) - I have removed these from the dataset. – jda5 Mar 31 '22 at 08:23
    Since I collected data on an app there are loads of measures! This includes (per question): total time spent on the question, time spent submitting an answer, time spent looking at feedback the app provides, time spent looking at a model solution, number of correct answers, number of correct working out steps, text the student submitted, number of times the student asked for help (from the app), their score (the app gives them a score), the number of times students viewed their feedback and the number of times students viewed their data (the app shows the graphs on their performance). – jda5 Mar 31 '22 at 08:28
  • Some of these are definitely more interesting measures of ability over time than others, but since they all could be expressed as a series, many of the above answers apply to my other measures! I have pre-registered this study, so if you are interested, or would like to know anything else in more detail (sampling, data collection procedures, etc.), you can read about it here: https://osf.io/f7zgd – jda5 Mar 31 '22 at 08:33
  • +1 This post deserves recognition for its thoughtful, expansive approach to the underlying problem. – whuber Mar 31 '22 at 13:28
  • Jacob, your study is interesting but it's up to you to summarize it for people on CV. I have three suggestions. 1. Plot all the data. 2. Ask a better question. 3. Aggregate the data differently (for an initial analysis.) – dipetkov Mar 31 '22 at 18:32
  • There are 2 groups and 3 stages (pre, post, and delayed post). I suggest making 6 "spaghetti plots", one for each group/stage combination. A spaghetti plot will show the response time for each student across the 20 questions. See the first plot in the Many models section of R for Data Science. Let's see if visualizing all the data clarifies whether there is a time trend and what its shape is. Sample 50 or 100 students if there is too much overplotting. – dipetkov Mar 31 '22 at 18:33
  • Consider writing a new question. Writing good questions on CV is hard. You should present your study & data succinctly (a spaghetti plot will come in handy!) and separately from your analyses of it. Be as specific as possible about the two/three most important questions you want to answer. Describe your attempts to answer them. – dipetkov Mar 31 '22 at 18:33
  • I suggest the following analysis: For each student and session, let Y be the number of questions they answer correctly out of 20. Then you have Y_pre (predictor), Y_post & Y_post2 (responses) and any other covariates you've collected about the students. You can model Y_post/Y_post2 with a binomial regression on Y_pre + treatment + covariates.... – dipetkov Mar 31 '22 at 18:33