
I am testing an R script to work with the PISA data for Spain. I calculate the average maths score by computing the weighted average (using the weights w_fstuwt) of each of the 5 plausible values, and then taking the average of those 5 values. Compared with the official PISA results (https://pisadataexplorer.oecd.org/ide/idepisa), I get the same result for Spain as a whole, but when I calculate it by region (Andalusia, Aragon, etc.) the figures do not match. For example, for the average Mathematics score in Andalusia I get 468.43 in R, and SPSS returns the same, while the PISA report gives 467.40. Some regions come out too high and others too low. If I do the calculation with the intsvy package, the average for Andalusia is 471.92, which does not match the PISA report either. I'm a bit lost here. Could it be due to the weights?

1 Answer


There are 10 plausible values for mathematics ("PV1MATH" ... "PV10MATH"), not 5. Additionally, Andalusia is divided into four different strata in PISA. This is the case for all regions of Spain. If you missed that, this is probably why you get different results from the PISA report tool (which is hopefully correct; the R code further down reproduces its result).

There's a PISA codebook (xlsx file) where you can check that there are 10 plausible values for mathematics, and that the strata in Spain are not based on the region alone. Open the spreadsheet "CY07_MSU_SCH_QQQ" and search for "Plausible Value 1 in Mathematics", "Plausible Value 10 in Mathematics", "Andalusia", etc.
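The same two checks can also be made on the data itself rather than in the codebook. Here is a minimal sketch on a toy data frame: the values and the stratum "ESP0201" are made up for illustration, and only the column naming (PVnMATH, STRATUM, W_FSTUWT) follows the student file used in this answer.

```r
# Toy stand-in for the student file. All values are invented; only the
# column names follow the PISA naming convention.
toy <- data.frame(
  STRATUM  = c("ESP0101", "ESP0102", "ESP9001", "ESP0201"),
  W_FSTUWT = c(10, 20, 30, 40)
)
for (i in 1:10) toy[[paste0("PV", i, "MATH")]] <- 450 + i

# Count the mathematics plausible-value columns by name pattern.
pv_math <- grep("^PV[0-9]+MATH$", names(toy), value = TRUE)
n_pv <- length(pv_math)
print(n_pv)

# List the distinct stratum codes present in the data.
strata <- sort(unique(as.character(toy$STRATUM)))
print(strata)
```

On the real file you would run the same two steps on the data frame returned by read_sav() instead of on toy.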

It's generally a good idea to check the documentation of complex data thoroughly, even if it's unfortunately often a bore and, in this specific case, not very reader-friendly.

Here is a reproducible example in R that gets the same result as the PISA report tool for Andalusia, using the 2018 student data file (CY07_MSU_STU_QQQ.sav, downloadable from the OECD PISA website):

library("haven")
# read the PISA 2018 student file (SPSS format)
data <- read_sav("CY07_MSU_STU_QQQ.sav")
# keep only the four strata that make up Andalusia
andalusia <- subset(data, STRATUM %in% c("ESP0101", "ESP0102", "ESP9001", "ESP9002"))
# names of the 10 plausible-value columns: PV1MATH ... PV10MATH
vmath <- paste0("PV", 1:10, "MATH")
# average the 10 scores for each student
andalusia$avg.math <- rowMeans(andalusia[, vmath], na.rm = TRUE)
# weighted mean, using the final student weight
weighted.mean(andalusia$avg.math, andalusia$W_FSTUWT)
# 467.4082
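A side note on the order of operations: the question computes a weighted mean per plausible value and then averages those, while the code above averages the plausible values per student and then takes one weighted mean. With no missing values the two should give the same number, because both are linear in the scores. A small sketch on made-up data (random weights and scores, not PISA values):

```r
set.seed(1)
n  <- 50
w  <- runif(n, 1, 10)                         # made-up student weights
pv <- matrix(rnorm(n * 10, 470, 90), n, 10)   # made-up plausible values

# Order used in the question: weighted mean of each plausible value,
# then the plain average of those 10 means.
per_pv <- apply(pv, 2, weighted.mean, w = w)
m1 <- mean(per_pv)

# Order used in the answer: average the 10 values for each student,
# then a single weighted mean.
m2 <- weighted.mean(rowMeans(pv), w)

print(all.equal(m1, m2))
```

So the discrepancy in the question should not come from the order of averaging, but from which plausible values and which strata are included.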
J-J-J
  • Oh my goodness, 10 plausible values. I spent several hours studying the PISA Data Analysis Manual and the codebooks you mention, but it didn't even cross my mind that the 2018 edition had 10 PVs instead of 5 (the PISA Data Analysis Manual mentions 5; embarrassing). Now it all makes sense, thank you very much! – David Gutiérrez Rubio Jun 07 '23 at 16:26
  • @DavidGutiérrezRubio You're welcome! That's an understandable mistake. I guess you could write to them to let them know their manual is not up to date. It's not the first problem I've noticed with the OECD website (even if it's generally OK), so I guess they have quite a small team working on updating the documentation, or maybe their priorities are elsewhere. I have yet to find a public data website with good, up-to-date and user-friendly documentation. – J-J-J Jun 07 '23 at 16:36