How to tell which variable is more meaningful when modeling the relationship between several predictors and outcome variable?

Question

I'm facing a problem in which I need to figure out two things:

which predictor, out of several relevant ones, is the most meaningful one in its effect/predictive power over a predicted variable.
the order of meaningfulness (from most meaningful to least) across those different predictors.

As I do not have an a priori hypothesis for this investigation, I thought I should do some sort of multiple regression analysis. Then, perhaps, I should extract the terms from the model and see which one is the most meaningful. I already know that going by p-value isn't the right way. Then what is?

Example

Let's say that I want to investigate which factors affect the well-being of city residents. I sample people (residents) from both New York City and San Francisco, and ask them to rate:

Their general satisfaction with their city.
How clean their city is (in their opinion).
How good the level of education in schools is.
How good the public transportation is.

The way I see this, there are 3 relevant variables here (cleanliness, education, and transportation) that may be related to overall satisfaction. I want to model this relationship and conclude how NYC differs from SF. For example, when it comes to overall satisfaction, is the impact of education > transportation > cleanliness in NYC, whereas in SF it is transportation > education > cleanliness?

Here's some toy data to demonstrate.

my_df <- structure(list(location = c("sf", "nyc", "nyc", "sf", "nyc", 
                                  "nyc", "nyc", "nyc", "nyc", "sf", "nyc", "sf", "sf", "sf", "nyc", 
                                  "sf", "sf", "nyc", "nyc", "nyc", "sf", "sf", "sf", "sf", "sf", 
                                  "nyc", "sf", "sf", "nyc", "sf", "nyc", "nyc", "nyc", "sf", "nyc", 
                                  "nyc", "nyc", "sf", "nyc", "sf", "sf", "nyc", "nyc", "nyc", "nyc", 
                                  "nyc", "nyc", "nyc", "nyc", "sf", "nyc", "nyc", "sf", "sf", "nyc", 
                                  "nyc", "nyc", "nyc", "sf", "sf", "nyc", "sf", "nyc", "nyc", "sf", 
                                  "nyc", "sf", "sf", "nyc", "nyc", "nyc", "nyc", "sf", "nyc", "nyc", 
                                  "nyc", "sf", "sf", "nyc", "nyc", "nyc", "nyc", "sf", "sf", "nyc", 
                                  "nyc", "nyc", "sf", "sf", "sf", "nyc", "sf", "sf", "sf", "nyc", 
                                  "nyc", "nyc", "nyc", "nyc", "nyc"), 
                     satisfied = c(5, 1, 7, 5, 
                                   7, 1, 5, 5, 5, 7, 7, 4, 1, 3, 5, 6, 7, 7, 6, 4, 4, 5, 6, 5, 5, 
                                   7, 5, 6, 5, 4, 7, 7, 5, 5, 4, 7, 7, 5, 6, 6, 3, 6, 5, 7, 5, 7, 
                                   6, 5, 4, 3, 6, 5, 7, 3, 5, 5, 7, 5, 6, 7, 7, 7, 7, 5, 4, 7, 6, 
                                   7, 7, 6, 6, 6, 5, 7, 5, 4, 6, 4, 7, 5, 6, 6, 5, 5, 6, 7, 6, 5, 
                                   1, 5, 2, 7, 7, 7, 7, 7, 1, 3, 7, 7), 
                     clean = c(4, 1, 
                                         7, 3, 4, 1, 6, 6, 7, 4, 5, 1, 1, 1, 4, 6, 6, 1, 6, 1, 4, 2, 2, 
                                         7, 3, 5, 2, 4, 1, 1, 4, 6, 3, 5, 1, 4, 5, 2, 5, 5, 4, 5, 4, 7, 
                                         3, 6, 5, 4, 5, 4, 5, 4, 5, 1, 5, 2, 6, 5, 7, 6, 3, 7, 5, 6, 4, 
                                         6, 6, 5, 5, 5, 4, 1, 4, 4, 5, 1, 3, 1, 2, 2, 6, 4, 3, 6, 7, 7, 
                                         5, 2, 4, 1, 3, 1, 5, 3, 5, 5, 1, 1, 5, 6), 
                     edu = c(5, 
                                             1, 7, 4, 4, 1, 6, 6, 6, 4, 5, 4, 4, 3, 3, 5, 4, 1, 1, 3, 5, 6, 
                                             5, 5, 3, 6, 2, 4, 4, 4, 6, 3, 4, 7, 1, 4, 7, 5, 6, 5, 5, 5, 4, 
                                             7, 3, 7, 6, 5, 5, 4, 5, 3, 4, 4, 4, 7, 5, 4, 6, 6, 4, 7, 4, 2, 
                                             5, 6, 6, 7, 6, 7, 3, 3, 2, 6, 6, 2, 5, 3, 6, 5, 6, 4, 4, 5, 6, 
                                             7, 3, 3, 4, 5, 4, 1, 3, 4, 4, 6, 5, 1, 4, 6), 
                     transportation = c(1, 
                                              1, 7, 5, 7, 1, 4, 6, 6, 6, 6, 1, 1, 1, 5, 5, 4, 7, 6, 6, 7, 5, 
                                              2, 7, 3, 6, 1, 4, 7, 5, 6, 7, 4, 3, 2, 6, 4, 2, 6, 5, 4, 7, 6, 
                                              7, 3, 7, 4, 4, 5, 4, 6, 3, 5, 2, 7, 3, 7, 7, 7, 6, 7, 7, 7, 5, 
                                              3, 5, 4, 7, 6, 6, 4, 2, 4, 4, 5, 6, 5, 2, 6, 2, 6, 6, 3, 4, 7, 
                                              7, 7, 4, 5, 4, 5, 3, 7, 5, 7, 7, 7, 1, 6, 6)), 
                row.names = c(NA, -100L), 
                class = c("tbl_df", "tbl", "data.frame"))
library(magrittr)
library(effectsize)
my_df %>%
  lm(satisfied ~ clean*location + edu*location + transportation*location, data = .) %>%
  effectsize() %>%
  plot()
#> Warning: It is deprecated to specify guide = FALSE to remove a guide. Please
#> use guide = "none" instead.

Created on 2021-08-15 by the reprex package (v2.0.0)

To compare between cities, I added interaction terms between each edu/clean/transportation variable and location variable. Then I used effectsize::effectsize() to get the estimates from the model. But what can I conclude from those estimates?

If I completely got this wrong, please advise what other path I should take for tackling this problem.

Thanks!

@JTH, thanks. Could you please hint how you would then conclude about the meaningfulness of each predictor? — Emman, Aug 16 '21 at 12:27
I think looking at variation (chi^2 statistic) and penalizing by degrees of freedom looks to be the best procedure. But this is subjective. see the anova function in the rms package. — JTH, Aug 16 '21 at 12:32
@JTH, thank you for recommending the {rms} package, I didn't know about it. I've tried wrapping my head around its functions, but it's all very new to me. I'll be thankful if you could provide an example of how you would utilize rms functions to do ANOVA over the data I showed in the post. I could then pick up from such an example. — Emman, Aug 16 '21 at 13:09
@JTH, I took a stab at rms. Please see my answer below. Could you provide feedback please? I also didn't find how to follow your suggestion with chi-square and df. — Emman, Aug 16 '21 at 13:39
@Emman I would suggest looking at the residuals of the model. The residuals can tell you a lot whether the model is effective or not. — mnm, Aug 19 '21 at 03:57

COOLSerdash · Accepted Answer · 2022-05-11T07:38:58.323

I believe you're looking for a metric of (relative) variable importance (see also this thread). Many available methods rely on a decomposition of the $R^2$ to assign ranks or relative importance to each predictor in a multiple linear regression model. One approach in this family is better known as "dominance analysis" (see Azen and Budescu 2003). Azen and Budescu (2003) also discuss other measures of importance, such as importance based on regression coefficients, on correlations, or on a combination of coefficients and correlations. A good general overview of techniques based on variance decomposition can be found in the paper by Grömping (2012). These techniques are implemented in the R packages relaimpo, domir and yhat.
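To make the idea of decomposing $R^2$ concrete, here is a minimal base R sketch of the order-averaged decomposition (often called LMG, which as far as I know coincides with the general dominance weights). The simulated data, the helper `r2()` and all variable names are assumptions for illustration only; in practice the packages above do this for you.

```r
# Simulated data (hypothetical; not the my_df from the question)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)   # deliberately correlated with x1
x3 <- rnorm(n)
y  <- 1.0 * x1 + 0.5 * x2 + 0.2 * x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# R^2 of the model that uses a given subset of predictors (empty subset -> 0)
r2 <- function(vars) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, response = "y"), data = dat))$r.squared
}

preds  <- c("x1", "x2", "x3")
# All 3! = 6 orders in which the predictors can enter the model
orders <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
               c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

# For each order, credit each predictor with the R^2 gained when it enters
lmg <- setNames(numeric(length(preds)), preds)
for (o in orders) {
  entered <- character(0)
  for (v in preds[o]) {
    lmg[v]  <- lmg[v] + r2(c(entered, v)) - r2(entered)
    entered <- c(entered, v)
  }
}
lmg <- lmg / length(orders)   # average over all orders

lmg                 # one share of R^2 per predictor
sum(lmg)            # the shares add up to the full-model R^2
r2(preds)
```

Because the increments telescope within each order, the shares always sum to the full-model $R^2$, which is what makes this family of measures attractive for ranking correlated predictors.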

Here, I'm going to illustrate a method that is model-agnostic (i.e. it can be applied to a variety of model types) and has intuitive appeal: Variable importance based on permutation. The idea is very simple:

1. Decide on a performance metric that is important to you. Examples include: root mean square error (RMSE), mean absolute error (MAE), $R^2$ etc. This choice also depends somewhat on the model type.
2. Calculate the metric on your dataset, $M_{orig}$. This serves as the baseline performance metric.
3. For $i = 1, 2, \ldots, j$:
   (a) Permute the values of the predictor $X_i$ in the data set.
   (b) Recompute the metric on the permuted data and call it $M_{perm}$.
   (c) Record the difference from baseline using $imp(X_i)=M_{perm} - M_{orig}$.

Do this repeatedly, say 1000 times, and take the average of the importance values. Intuitively, the permutations break the relationship between the predictor $X_i$ and the outcome. The larger the change in the performance metric, the higher the predictor's importance. More information can be found in this chapter of an online book by Christoph Molnar.
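The steps above can be sketched by hand in base R. This is only an illustration under stated assumptions: the data are simulated, RMSE is the chosen metric, and the helper names `rmse()` and `perm_importance()` are made up for this example (they are not functions of any package).

```r
# Simulated data (hypothetical): x1 matters more than x2 by construction
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 * x1 + 0.5 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = dat)

# Step 1: performance metric
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Step 2: baseline metric on the original data
m_orig <- rmse(dat$y, predict(fit, newdata = dat))

# Step 3: permute one predictor, recompute the metric, record the difference;
# repeat nsim times and average
perm_importance <- function(var, nsim = 200) {
  diffs <- replicate(nsim, {
    shuffled <- dat
    shuffled[[var]] <- sample(shuffled[[var]])   # breaks the link between var and y
    rmse(dat$y, predict(fit, newdata = shuffled)) - m_orig
  })
  mean(diffs)
}

imp <- sapply(c("x1", "x2"), perm_importance)
sort(imp, decreasing = TRUE)   # larger increase in RMSE = more important
```

Note that the model is fitted once and never refitted; only the predictions on the shuffled data change.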

The R package vip implements this procedure (see the documentation (PDF) for more information). The following code applies the idea to your dataset. I chose the $R^2$ and the mean absolute error (MAE) as performance metrics and permute each predictor 1000 times:

library(vip)

# The model
mod <- lm(satisfied ~ clean*location + edu*location + transportation*location, data = my_df)

# Calculate permutation-based importance with r-squared as metric
set.seed(142857) # For reproducibility
p_r2 <- vip::vi(mod, method = "permute", target = "satisfied", metric = "rsquared", pred_wrapper = predict, nsim = 1000)
p_r2
#>   Variable       Importance  StDev
#>   <chr>               <dbl>  <dbl>
#> 1 transportation     0.198  0.0492
#> 2 clean              0.177  0.0465
#> 3 edu                0.0462 0.0237
#> 4 location           0.0449 0.0250

# Calculate permutation-based importance with mae as metric
p_mae <- vip::vi(mod, method = "permute", target = "satisfied", metric = "mae", pred_wrapper = predict, nsim = 1000)
p_mae
#>   Variable       Importance  StDev
#>   <chr>               <dbl>  <dbl>
#> 1 transportation     0.166  0.0413
#> 2 clean              0.144  0.0400
#> 3 location           0.0396 0.0214
#> 4 edu                0.0368 0.0219

Permuting transportation leads to the largest change in $R^2$, followed by clean. The mean absolute error gives a similar ordering: transportation and clean are the most important predictors, while location and edu are the least important.

References

Azen R, Budescu DV (2003): The Dominance Analysis Approach for Comparing Predictors in Multiple Regression. Psychological Methods 8:2, 129-148.

Grömping U (2012): Estimators of relative importance in linear regression based on variance decomposition. Am Stat 61:2, 139-147.

Thanks, this is new to me and looks interesting and relevant, and well explained. Is there a way to compare between cities (levels of location) within this framework? Essentially, I want to know if the order of Importance values in p_r2/p_mae remains the same or changes when we examine each city on its own. Is there any way to do that? — Emman, Aug 19 '21 at 11:25
Great overview, thanks! Whilst the permutation method is attractive due to its universality, I wonder whether it can lead to misleading results because it breaks the dependency structure among the predictors. I have tried it on http://users.stat.ufl.edu/~winner/data/worldagprod.dat with the model Output~Population.in.agriculture+Workstock+Land.equivalent+Fertilizer.consumption. Interestingly, the permutation method yields results very similar to standardized coefficients (betasq metric in relaimpo), but differs in rank order and relative difference from the $R^2$ decomposition methods. — cdalitz, Dec 06 '21 at 09:01

Emman · Answer 2 · 2021-08-16T15:19:11.653

The following answer is a novice attempt to use the {rms} package, following @JTH's suggestion. I have to say that this is the first time I'm using this package, and I have a very minimal understanding of what I'm doing. Hence, I ask that anybody who can please provide feedback!

I've followed the procedure described in this chapter.

my_df <- structure(list(location = c("sf", "nyc", "nyc", "sf", "nyc", 
                                     "nyc", "nyc", "nyc", "nyc", "sf", "nyc", "sf", "sf", "sf", "nyc", 
                                     "sf", "sf", "nyc", "nyc", "nyc", "sf", "sf", "sf", "sf", "sf", 
                                     "nyc", "sf", "sf", "nyc", "sf", "nyc", "nyc", "nyc", "sf", "nyc", 
                                     "nyc", "nyc", "sf", "nyc", "sf", "sf", "nyc", "nyc", "nyc", "nyc", 
                                     "nyc", "nyc", "nyc", "nyc", "sf", "nyc", "nyc", "sf", "sf", "nyc", 
                                     "nyc", "nyc", "nyc", "sf", "sf", "nyc", "sf", "nyc", "nyc", "sf", 
                                     "nyc", "sf", "sf", "nyc", "nyc", "nyc", "nyc", "sf", "nyc", "nyc", 
                                     "nyc", "sf", "sf", "nyc", "nyc", "nyc", "nyc", "sf", "sf", "nyc", 
                                     "nyc", "nyc", "sf", "sf", "sf", "nyc", "sf", "sf", "sf", "nyc", 
                                     "nyc", "nyc", "nyc", "nyc", "nyc"), 
                        satisfied = c(5, 1, 7, 5, 
                                      7, 1, 5, 5, 5, 7, 7, 4, 1, 3, 5, 6, 7, 7, 6, 4, 4, 5, 6, 5, 5, 
                                      7, 5, 6, 5, 4, 7, 7, 5, 5, 4, 7, 7, 5, 6, 6, 3, 6, 5, 7, 5, 7, 
                                      6, 5, 4, 3, 6, 5, 7, 3, 5, 5, 7, 5, 6, 7, 7, 7, 7, 5, 4, 7, 6, 
                                      7, 7, 6, 6, 6, 5, 7, 5, 4, 6, 4, 7, 5, 6, 6, 5, 5, 6, 7, 6, 5, 
                                      1, 5, 2, 7, 7, 7, 7, 7, 1, 3, 7, 7), 
                        clean = c(4, 1, 
                                  7, 3, 4, 1, 6, 6, 7, 4, 5, 1, 1, 1, 4, 6, 6, 1, 6, 1, 4, 2, 2, 
                                  7, 3, 5, 2, 4, 1, 1, 4, 6, 3, 5, 1, 4, 5, 2, 5, 5, 4, 5, 4, 7, 
                                  3, 6, 5, 4, 5, 4, 5, 4, 5, 1, 5, 2, 6, 5, 7, 6, 3, 7, 5, 6, 4, 
                                  6, 6, 5, 5, 5, 4, 1, 4, 4, 5, 1, 3, 1, 2, 2, 6, 4, 3, 6, 7, 7, 
                                  5, 2, 4, 1, 3, 1, 5, 3, 5, 5, 1, 1, 5, 6), 
                        edu = c(5, 
                                1, 7, 4, 4, 1, 6, 6, 6, 4, 5, 4, 4, 3, 3, 5, 4, 1, 1, 3, 5, 6, 
                                5, 5, 3, 6, 2, 4, 4, 4, 6, 3, 4, 7, 1, 4, 7, 5, 6, 5, 5, 5, 4, 
                                7, 3, 7, 6, 5, 5, 4, 5, 3, 4, 4, 4, 7, 5, 4, 6, 6, 4, 7, 4, 2, 
                                5, 6, 6, 7, 6, 7, 3, 3, 2, 6, 6, 2, 5, 3, 6, 5, 6, 4, 4, 5, 6, 
                                7, 3, 3, 4, 5, 4, 1, 3, 4, 4, 6, 5, 1, 4, 6), 
                        transportation = c(1, 
                                           1, 7, 5, 7, 1, 4, 6, 6, 6, 6, 1, 1, 1, 5, 5, 4, 7, 6, 6, 7, 5, 
                                           2, 7, 3, 6, 1, 4, 7, 5, 6, 7, 4, 3, 2, 6, 4, 2, 6, 5, 4, 7, 6, 
                                           7, 3, 7, 4, 4, 5, 4, 6, 3, 5, 2, 7, 3, 7, 7, 7, 6, 7, 7, 7, 5, 
                                           3, 5, 4, 7, 6, 6, 4, 2, 4, 4, 5, 6, 5, 2, 6, 2, 6, 6, 3, 4, 7, 
                                           7, 7, 4, 5, 4, 5, 3, 7, 5, 7, 7, 7, 1, 6, 6)), 
                   row.names = c(NA, -100L), 
                   class = c("tbl_df", "tbl", "data.frame"))
library(rms, warn.conflicts = FALSE)
#> Loading required package: Hmisc
#> Loading required package: lattice
#> Loading required package: survival
#> Loading required package: Formula
#> Loading required package: ggplot2
#> 
#> Attaching package: 'Hmisc'
#> The following objects are masked from 'package:base':
#> 
#>     format.pval, units
#> Loading required package: SparseM
#> 
#> Attaching package: 'SparseM'
#> The following object is masked from 'package:base':
#> 
#>     backsolve
model_fit <- rms::ols(satisfied ~ clean*location + edu*location + transportation*location, data = my_df)
my_datadist <- rms::datadist(my_df)     ## apparently, we need these two lines 
options(datadist = "my_datadist")  ## otherwise we get an error with summary(model_fit)
                                   ## I learned it from here: https://stackoverflow.com/a/41378930/6105259
plot(summary(model_fit))

Created on 2021-08-16 by the reprex package (v2.0.0)

As far as I understand this output, we see the effect each predictor carries on the outcome variable, with 95% CI. Thus, for example, we can conclude that clean has a greater effect over satisfied than edu has.

Does this make sense?
Originally, I was interested in the interaction between each edu/clean/transportation and location, as I want to learn how the relationships between predictors and outcome change between cities. But here, as far as I can see from this output, the interaction isn't reflected in terms of effect on satisfied.
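For reference, here is a small sketch of how an interaction term encodes a per-city effect in an ordinary lm() fit. It uses simulated data (the data frame `toy` and its columns are hypothetical, not my_df), so it only illustrates the mechanics: the slope for the reference city is the main-effect coefficient, and the slope for the other city is that coefficient plus the interaction coefficient.

```r
# Simulated data (hypothetical): the effect of x on y differs between cities
set.seed(7)
n     <- 100
city  <- rep(c("nyc", "sf"), each = n / 2)
x     <- runif(n, 1, 7)
slope <- ifelse(city == "nyc", 0.8, 0.2)
y     <- 1 + slope * x + rnorm(n, sd = 0.5)
toy   <- data.frame(y, x, city)

# Model with interaction; "nyc" is the reference level (alphabetical order)
fit <- lm(y ~ x * city, data = toy)
b   <- coef(fit)

slope_nyc <- unname(b["x"])                   # slope in the reference city
slope_sf  <- unname(b["x"] + b["x:citysf"])   # reference slope + interaction

# The same slopes come out of fitting each city separately
sep_nyc <- unname(coef(lm(y ~ x, data = toy[toy$city == "nyc", ]))["x"])
sep_sf  <- unname(coef(lm(y ~ x, data = toy[toy$city == "sf", ]))["x"])

c(slope_nyc = slope_nyc, sep_nyc = sep_nyc)
c(slope_sf = slope_sf, sep_sf = sep_sf)
```

So the interaction coefficient itself is the between-city difference in slope; whether that difference is reliably non-zero is a separate question.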

UPDATE

Following @JTH's comment, I'm adding another plot:

plot(anova(model_fit))

The info underlying this plot is here:

library(dplyr, warn.conflicts = FALSE)
library(tibble)
anova(model_fit) %>% 
  as_tibble(rownames = "factor") %>%
  mutate(across(3:6, round, 4))
#> # A tibble: 14 x 6
#>    factor                          d.f.     Partial SS MS      F       P
#>    <chr>                           <anov.r> <anov.rms>   <anov.> <anov.> <anov.>
#>  1 "clean  (Factor+Higher Order F~  2        11.8209      5.9105 3.4587  0.0356 
#>  2 " All Interactions"              1         0.0000      0.0000 0.0000  0.9969 
#>  3 "location  (Factor+Higher Orde~  4         4.8955      1.2239 0.7162  0.5830 
#>  4 " All Interactions"              3         4.8821      1.6274 0.9523  0.4188 
#>  5 "edu  (Factor+Higher Order Fac~  2         4.0844      2.0422 1.1951  0.3073 
#>  6 " All Interactions"              1         3.1038      3.1038 1.8163  0.1811 
#>  7 "transportation  (Factor+Highe~  2        15.3207      7.6604 4.4828  0.0139 
#>  8 " All Interactions"              1         0.0476      0.0476 0.0279  0.8678 
#>  9 "clean * location  (Factor+Hig~  1         0.0000      0.0000 0.0000  0.9969 
#> 10 "location * edu  (Factor+Highe~  1         3.1038      3.1038 1.8163  0.1811 
#> 11 "location * transportation  (F~  1         0.0476      0.0476 0.0279  0.8678 
#> 12 "TOTAL INTERACTION"              3         4.8821      1.6274 0.9523  0.4188 
#> 13 "TOTAL"                          7        90.9761     12.9966 7.6055  0.0000 
#> 14 "ERROR"                         92       157.2139      1.7088     NA      NA

UPDATE 2

Addressing @EdM's comment, here is a print of anova(model_fit) without converting it to a tibble.

anova(model_fit) %>% 
  round(., 4)
#>                 Analysis of Variance          Response: satisfied 
#> 
#>  Factor                                                   d.f. Partial SS
#>  clean  (Factor+Higher Order Factors)                      2    11.8209  
#>   All Interactions                                         1     0.0000  
#>  location  (Factor+Higher Order Factors)                   4     4.8955  
#>   All Interactions                                         3     4.8821  
#>  edu  (Factor+Higher Order Factors)                        2     4.0844  
#>   All Interactions                                         1     3.1038  
#>  transportation  (Factor+Higher Order Factors)             2    15.3207  
#>   All Interactions                                         1     0.0476  
#>  clean * location  (Factor+Higher Order Factors)           1     0.0000  
#>  location * edu  (Factor+Higher Order Factors)             1     3.1038  
#>  location * transportation  (Factor+Higher Order Factors)  1     0.0476  
#>  TOTAL INTERACTION                                         3     4.8821  
#>  REGRESSION                                                7    90.9761  
#>  ERROR                                                    92   157.2139  
#>  MS      F    P     
#>   5.9105 3.46 0.0356
#>   0.0000 0.00 0.9969
#>   1.2239 0.72 0.5830
#>   1.6274 0.95 0.4188
#>   2.0422 1.20 0.3073
#>   3.1038 1.82 0.1811
#>   7.6604 4.48 0.0139
#>   0.0476 0.03 0.8678
#>   0.0000 0.00 0.9969
#>   3.1038 1.82 0.1811
#>   0.0476 0.03 0.8678
#>   1.6274 0.95 0.4188
#>  12.9966 7.61 <.0001
#>   1.7088
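As a rough check on the ($\chi^2$ - df) measure discussed in the comments below, the following sketch recomputes it by hand from the printed table. It assumes, as described in those comments, that for OLS each $\chi^2$ is the partial F statistic times its df; the data frame `anova_rows` is just a manual copy of the four "Factor+Higher Order Factors" rows above, not an object returned by rms.

```r
# d.f. and F copied from the anova(model_fit) printout above
anova_rows <- data.frame(
  factor = c("clean", "location", "edu", "transportation"),
  df     = c(2, 4, 2, 2),
  F_stat = c(3.4587, 0.7162, 1.1951, 4.4828)
)

# Assumption (from the comments): chi-square = F * df for an OLS fit
anova_rows$chisq          <- anova_rows$F_stat * anova_rows$df
anova_rows$chisq_minus_df <- anova_rows$chisq - anova_rows$df

# Rank predictors by the penalized measure, largest first
anova_rows[order(-anova_rows$chisq_minus_df), ]
```

With these numbers transportation ranks first, followed by clean, edu and location, and the value for location is negative because its $\chi^2$ is smaller than its 4 df.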

Try plot(anova(model_fit)), which graphically shows the technique I mentioned in my other comment. — JTH, Aug 16 '21 at 14:12
@JTH, thanks. I updated my answer. Looking at the plot (the x axis), we could have thought that some values are negative (to the left of 0). However, I think that this is just an issue with the graphics. Still, should we go by the Partial SS or the MS column? For the interaction terms, both columns hold the same values, but this isn't the case for the other terms. I'm a bit confused... — Emman, Aug 16 '21 at 14:30
Again, this is subjective, but my preference is for Chi^2 - DF, which is the x-axis of plot(anova(model_fit)). The reason the penalization is important is because you can end up in situations where the # categories determines the factor relevance. — JTH, Aug 16 '21 at 14:46
It might help for you to get the course notes and book associated with the rms package. Besides showing examples of how to do this evaluation of variable importance, they also show the dangers of putting too much trust in that evaluation. See the sections on "Bootstrapping Ranks of Predictors" in Chapter 5 of both references. Chapter 4 (in both) is a superb summary of strategies for regression modeling, which I recommend that you read carefully and keep as a guide to doing this type of work. — EdM, Aug 16 '21 at 14:58
The way you printed the anova.rms object, you lost some text. For each of the individual variable names (clean, location, edu, transportation) the corresponding row in the printout includes calculations for the "Factor + Higher Order Factors"; in this case, the variable plus all interactions involving it. That's why the Partial SS and MS values don't agree in those rows. The p-value for each of those rows agrees with the p-value in the corresponding row of the plot, which also includes the interactions. Negative plot values mean chi-square is less than the corresponding df. — EdM, Aug 16 '21 at 15:09
@EdM, thank you! your comments are very helpful. I've updated the post and added the raw print of anova(model_fit) (with some rounding). — Emman, Aug 16 '21 at 15:20
I agree with @JTH that ($\chi^2$ - df) is a reasonable measure here. Packages like rms can do different processing on an object returned by a regression depending on whether they perform print() or plot() or summary() on it. In this case, ($\chi^2$ - df) is most easily seen in the plot; you would need to do more calculations to get it from the printout. — EdM, Aug 16 '21 at 15:29
@EdM, thanks! One thing I don't understand though -- part of my original answer was to see how this set of predictors changes when we look at one city (nyc/sf) versus the other. But when I plot with plot(anova(model_fit)), what do I see here? Is it supposed to be according to location's reference level? That is, if we tweak using relevel(location, ref = ...) then the plot should be somewhat different? I can say that I tried releveling but the plot remains the same. — Emman, Aug 16 '21 at 15:43
@EdM Do these values measure effect size or statistical significance? Or is there, in this particular use case, a monotonic relationship between the ANOVA p-values and Cohen's $f^2$, which measures how strongly each variable improves the fit? — cdalitz, Aug 16 '21 at 15:54
"To see how this set of predictors changes when we look at one city ... versus the other" you need to look at the interaction terms. In this case, none of them is significant, so you can't reliably say that cities differ. An advantage of that plot is precisely that it doesn't depend on choice of reference level: it's an overall measure for that predictor or interaction term. If interactions are significant, you can use Predict() and contrast.rms() to get into details of how the cities differ. — EdM, Aug 16 '21 at 16:08
@cdalitz with OLS, each $\chi^2$ value in the plot is the product of the corresponding partial F statistic times its df. Subtracting the df penalizes for the number $p$ of fitted predictors included in that estimate, as $p$ is the expected value of $\chi^2$ under the null. So I suppose you could consider it analogous to Cohen's $f^2$ but with penalization. This nicely generalizes to regressions other than OLS, with $\chi^2$ values from a Wald statistic on each set of coefficients and covariances. — EdM, Aug 16 '21 at 17:20

cdalitz · Answer 3 · 2021-08-16T14:12:21.097

As all your variables are numeric and limited to the same range of values, directly comparing the absolute values of the coefficients (as you did in your answer) is one possible way to estimate effect size.

It might be, however, that the answers for one variable are only in the range 1-3, while those for another variable are in the range 1-5. If both variables are equally important, the second will have a smaller coefficient due to its wider range. To overcome this problem, you could compute standardized coefficients, aka "beta coefficients". There are convenience functions for their computation in add-on packages, but with base R you can achieve the same by standardizing the numeric columns of your data before doing the model fit:

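num_cols <- c("satisfied", "clean", "edu", "transportation")  # scale() fails on the character column location
my_df.scaled <- replace(my_df, num_cols, lapply(my_df[num_cols], function(x) as.numeric(scale(x))))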
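To show what standardizing does, here is a minimal sketch on simulated data (the variables `x_narrow` and `x_wide` are hypothetical stand-ins for a 1-3 and a 1-5 rating). It checks that fitting on scaled data gives the same result as rescaling each raw coefficient by sd(x)/sd(y).

```r
# Simulated ratings (hypothetical): one predictor on 1-3, one on 1-5
set.seed(3)
n        <- 300
x_narrow <- sample(1:3, n, replace = TRUE)
x_wide   <- sample(1:5, n, replace = TRUE)
y        <- 1.0 * x_narrow + 0.5 * x_wide + rnorm(n)
toy      <- data.frame(y, x_narrow, x_wide)

# Raw coefficients depend on the range of each predictor
raw_fit <- lm(y ~ x_narrow + x_wide, data = toy)

# Standardized ("beta") coefficients: fit on z-scored outcome and predictors
std_fit <- lm(y ~ x_narrow + x_wide, data = as.data.frame(scale(toy)))

b_raw <- coef(raw_fit)[-1]   # drop the intercept
b_std <- coef(std_fit)[-1]

# Equivalent shortcut: rescale each raw coefficient by sd(x) / sd(y)
b_manual <- b_raw * sapply(toy[names(b_raw)], sd) / sd(toy$y)

rbind(raw = b_raw, standardized = b_std, manual = b_manual)
```

The standardized coefficients are expressed in standard deviations of the outcome per standard deviation of the predictor, which puts predictors with different ranges on a common scale.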

In addition, you should also check whether all coefficients are significantly different from zero.

mnm · Answer 4 · 2021-08-19T07:34:22.630

I'd look at the residuals for determining the effectiveness of a model. Plots of residuals versus other quantities are used to find failures of assumptions. The most common plot, especially useful in simple regression, is the plot of residuals versus the fitted values. A null plot would indicate no failure of assumptions. Curvature might indicate the fitted mean function is incorrect. Residuals that seem to increase or decrease in average magnitude with the fitted values might indicate nonconstant residual variance. A few relatively large residuals may be indicative of outliers, cases for which the model is somehow inappropriate.

Assumptions of a linear model are:

Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y.
Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals.
Homoscedasticity: The residuals have constant variance at every level of x.
Normality: The residuals of the model are normally distributed.

If one or more of these assumptions are violated, then the results of our linear regression may be unreliable or even misleading.

Using the given data, I build a simple linear regression model given below.

# required libraries
library(caret)
library(ggplot2)
library(magrittr)

# split the dataset into train and test set
set.seed(2021)
index <- createDataPartition(my_df$satisfied, p = 0.7, list = FALSE)
df_train <- my_df[index, ]
df_test  <- my_df[-index, ]

# Model building
# NOTE: the model is fitted on all of my_df, so the test-set metrics below are
# optimistic; use data = df_train for a true hold-out evaluation.
lm_model <- lm(satisfied ~ ., data = my_df)

# Make predictions and compute the R2, RMSE and MAE
predictions <- lm_model %>% predict(df_test)
data.frame( R2 = R2(predictions, df_test$satisfied),
            RMSE = RMSE(predictions, df_test$satisfied),
            MAE = MAE(predictions, df_test$satisfied))
#>          R2      RMSE       MAE
#> 1 0.4513587 0.9956456 0.8224509

# Residuals = Observed - Predicted
# compute residuals
residualVals <- df_test$satisfied - predictions
df.1 <- data.frame(df_test$satisfied, predictions, 
                   residualVals)
colnames(df.1)<- c("observed","predicted","residuals")
head(df.1)
#>   observed predicted   residuals
#> 1        5  4.443428  0.55657221
#> 2        7  6.930498  0.06950225
#> 3        4  3.626674  0.37332568
#> 4        3  3.538958 -0.53895795
#> 5        5  5.083349 -0.08334872
#> 6        7  6.097199  0.90280068

# residuals versus predicted values
ggplot(data = df.1, aes(x=predicted, y=residuals))+
  geom_point()+
  xlab("Predicted values for satisfied")+
  ylab("Residuals")+
  ggtitle("Residual plot")+
  theme_bw()

Discussion

To understand the strength/weakness of a model, relying on a single metric is problematic. Visualization of model fit, particularly residual plots in the context of linear regression model, are critical to understanding whether the model is fit for purpose.

When the outcome is a number, the most common method for characterizing a model's predictive capabilities is to use the root mean squared error (RMSE). This metric is a function of the model residuals, which are the observed values minus the model predictions. The mean squared error (MSE) is calculated by squaring the residuals and averaging them. The RMSE is then calculated by taking the square root of the MSE so that it is in the same units as the original data. The value is usually interpreted as either how far (on average) the residuals are from zero or as the average distance between the observed values and the model predictions. Another common metric is the coefficient of determination, commonly written as $R^2$. This value can be interpreted as the proportion of the variation in the data that is explained by the model. Thus, an $R^2$ value of 0.45 in the above model implies that the model explains less than half of the variation in the outcome variable, satisfied. Simply put, the above model is not good. It should be noted that $R^2$ is a measure of correlation and not accuracy.
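As a small worked example of these definitions, the sketch below recomputes the metrics by hand in base R from the six observed/predicted pairs printed by head(df.1) above. Six rows are far too few to say anything about the model; the point is only to show the arithmetic. The squared-correlation form of $R^2$ is used on the assumption that it matches the default of caret's R2().

```r
# The six observed/predicted pairs shown by head(df.1) above
observed  <- c(5, 7, 4, 3, 5, 7)
predicted <- c(4.443428, 6.930498, 3.626674, 3.538958, 5.083349, 6.097199)

res  <- observed - predicted        # residuals = observed - predicted
mse  <- mean(res^2)                 # mean squared error
rmse <- sqrt(mse)                   # back on the scale of the outcome
mae  <- mean(abs(res))              # mean absolute error
r2   <- cor(observed, predicted)^2  # squared correlation (assumed caret default)

c(MSE = mse, RMSE = rmse, MAE = mae, R2 = r2)
```

Because $R^2$ computed this way is a squared correlation, it can be high even when predictions are systematically off, which is why it is worth reporting alongside RMSE or MAE.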

Thanks. Could you explain further how this method helps you to learn about which predictor is more meaningful compared to the others? — Emman, Aug 19 '21 at 06:38
@Emman I've added a discussion. Let me know if it helps or not. If not, be descriptive on what you would like to discuss. — mnm, Aug 19 '21 at 07:25
thanks for adding the discussion, While it is informative and generally relevant, it doesn't address the initial question that led to this thread: how can we compare between individual predictors? Your answer discusses the model as a whole, but this is not what I seek to understand. I simply want to know whether in a given city, residents value education more than transportation, and whether in a different city it might be the other way around. It is my assumption that predicting satisfaction can lead to such conclusions. How can we utilize your method to arriving at such bottom line(s)? — Emman, Aug 19 '21 at 08:04
@Emman sadly, you'll have to figure out the relationship between predictors on your own. My answer and the others, which offer varied options, should be seen as pointers/directions that can be used to make sense of the relationships you want to understand/comprehend/prove. — mnm, Aug 19 '21 at 08:31
