Units are often understood to be part of a random variable. If someone writes that "$h_i$ is the height of the $i$th patient", you're meant to assume that every $i$ uses the same unit and that it's a sensible one given the context (say, meters if $i$ indexes over adult humans). This is often omitted for the sake of brevity, but it would be clearer if people explicitly wrote out things like "$h_i$ is the height of the $i$th patient in meters."
Units restrict how you can manipulate variables (and constants, too!). You can only directly add or subtract variables with the same units, which yields a result with the same units as the originals. Variables with different units can be multiplied or divided, with the result having compound units. If $X$ is in meters and $Y$ in seconds, $\frac{X}{Y}$ has units of $\frac{\text{m}}{\text{s}}$ and $XY$ has units of $\text{m}\cdot \text{s}$. You can also multiply and divide by unitless values, like the number of data points, which does not change the result's units. Thus, the mean has the same units as the data from which it is calculated, as does the standard deviation. Variance, however, has squared units: if $X$ is in meters, $V(X)$ has units of $\text{m}^2$.
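You can check this numerically: a minimal sketch using Python's standard `statistics` module and made-up heights. Converting the data from meters to centimeters scales the mean and standard deviation by 100, but the variance by $100^2$, exactly as the units predict.

```python
import math
import statistics

# Made-up heights in meters; converting to centimeters multiplies each value by 100.
heights_m = [1.62, 1.75, 1.80, 1.68, 1.91]
heights_cm = [h * 100 for h in heights_m]

mean_m = statistics.mean(heights_m)
sd_m = statistics.stdev(heights_m)
var_m = statistics.variance(heights_m)

# Mean and standard deviation scale linearly (same units as the data)...
assert math.isclose(statistics.mean(heights_cm), 100 * mean_m)
assert math.isclose(statistics.stdev(heights_cm), 100 * sd_m)

# ...but variance picks up the square of the conversion factor,
# because its units are squared (m^2 -> cm^2 is a factor of 100**2).
assert math.isclose(statistics.variance(heights_cm), 100**2 * var_m)
```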
Some statistical operations remove the units. Suppose you have a set of heights $h_i$ measured in meters. If you plug these into the formula for $Z$-scores
$$z_i = \frac{h_i - \bar{h}}{\sigma_h}$$
you'll notice that the units cancel:
$\frac{\text{m} - \text{m}}{\text{m}} = \frac{\text{m}}{\text{m}} = 1$, so $z_i$ is unitless. In a sense, this is the whole point: the $Z$-score is used to make variables, potentially measured on different scales and with different units, comparable. Covariance and correlation also make a useful pair of examples. If you grind through the formula, you'll find that the covariance of $X$ and $Y$ has the units of $X$ times the units of $Y$, and a scale that depends on the original variables. Correlation includes a "normalization" term that rescales this to between $-1$ and $+1$ while removing the units. This too is intentional: it lets you use correlation to compare relationships between disparate quantities.
Others find conversions between units. You can interpret regression as finding the "best" way to convert between the dependent and independent variables. Suppose you want to crudely predict people's heights. Using a set of heights (in meters), weights (in kilograms), and sex, you might fit a model like:
$$ \text{height} = \beta_0 + \beta_1 \text{weight} + \beta_2 \text{sex}$$
The coefficient $\beta_1$ needs to have units of $\frac{\text{m}}{\text{kg}}$ for the equation to balance, and that's exactly how you should interpret it. Holding everything else at its baseline level, a $1 \text{ kg}$ increase in weight corresponds to a $\beta_1 \text{ m}$ increase in predicted height! Categorical variables are conceptually similar. We usually encode them as (unitless) numbers, such as male=0, female=1. $\beta_2$ is therefore in units of meters. You can also imagine that it's in the fictitious units of meters per sex-code or something if you want to move the units through the encoding process instead.
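You can see the $\frac{\text{m}}{\text{kg}}$ interpretation directly by refitting in different units. A minimal sketch with synthetic data (the coefficients and NumPy least-squares fit are illustrative, not a recommendation for how to fit real models): expressing weight in grams instead of kilograms shrinks the slope by a factor of 1000, because its units change from $\frac{\text{m}}{\text{kg}}$ to $\frac{\text{m}}{\text{g}}$, while the predictions stay identical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data: height (m) driven by weight (kg) and a 0/1 sex code, plus noise.
weight_kg = rng.uniform(50, 100, n)
sex = rng.integers(0, 2, n).astype(float)  # unitless encoding
height_m = 1.0 + 0.008 * weight_kg + 0.12 * sex + rng.normal(0, 0.02, n)

# Fit height = b0 + b1 * weight + b2 * sex by least squares.
X = np.column_stack([np.ones(n), weight_kg, sex])
beta, *_ = np.linalg.lstsq(X, height_m, rcond=None)

# Refit with weight in grams: the slope's units become m/g, so it is 1000x smaller.
Xg = np.column_stack([np.ones(n), weight_kg * 1000, sex])
beta_g, *_ = np.linalg.lstsq(Xg, height_m, rcond=None)

assert np.isclose(beta_g[1], beta[1] / 1000)   # same quantity, different units
assert np.allclose(X @ beta, Xg @ beta_g)      # predictions are unchanged
```

The model hasn't changed at all; only the bookkeeping has. This is a handy sanity check when comparing coefficients across papers that report in different units.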
This holds for linear models, but you can apply similar logic to other families. A logistic regression, for example, predicts probabilities and finds coefficients that can be interpreted as mapping changes in the predictors to changes in log odds.
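Concretely, a logistic coefficient on weight has units of log-odds per kilogram. A minimal sketch (the coefficient value here is entirely hypothetical, chosen just to show the arithmetic): a 1 kg increase adds $\beta_1$ to the log odds, which multiplies the odds by $e^{\beta_1}$.

```python
import math

# Hypothetical fitted coefficient: log-odds change per 1 kg increase in weight.
beta1 = 0.05

# Adding beta1 to the log odds multiplies the odds by exp(beta1).
odds_ratio = math.exp(beta1)

def prob_after_increase(p, delta_kg=1.0):
    """Update a predicted probability after a delta_kg change, via log odds."""
    log_odds = math.log(p / (1 - p)) + beta1 * delta_kg
    return 1 / (1 + math.exp(-log_odds))

# Starting from p = 0.5 (log odds of 0), a 1 kg increase moves the
# prediction to 1 / (1 + exp(-beta1)), slightly above 0.5.
assert prob_after_increase(0.5) > 0.5
```

Note that the change in *probability* depends on where you start, which is why logistic coefficients are usually reported as odds ratios rather than probability differences.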
Want to learn more? If you're familiar with dimensional analysis from a natural science class, you may enjoy this article Nick Cox found.
Finney, D. J. (1977). Dimensions of Statistics. Journal of the Royal Statistical Society. Series C (Applied Statistics), 26(3), 285–289. https://doi.org/10.2307/2346969
If you want more practice with dimensional analysis, check out the first chapter of Sanjoy Mahajan's Street-Fighting Mathematics, available open access here.