Why are some datasets denoted as D = {X,y} instead of D={x,y}?

Question

In the question "On the importance of the i.i.d. assumption in statistical learning", the dataset is denoted as D={X,y}. In statistics, capital letters usually refer to random variables, but how does that explain why y is lowercase? I.e., why is it not D={X,Y}?

Welcome to CV, StackExchanger :-). If you take a closer look at the formula $\mathcal{D} = \{ \mathbf{X}, \mathbf{y} \}$, you can see that the X is even in boldface. This notation has very likely been chosen because X is a matrix here. - Ute, Jul 28 '23 at 21:57

score 1 · Answer 1 · answered Jul 28 '23 at 22:16

As noted in the comment, $\mathcal{D} = \{ \mathbf{X}, \mathbf{y} \}$ seems to suggest that $\mathbf{X}$ is a matrix and $\mathbf{y}$ is a vector. There may be cases, though, where both or neither of the terms are matrices. Notation conventions differ: an uppercase letter need not denote a matrix, and a matrix may be marked some other way (e.g. by bold font alone). So don't take the symbols for granted, and always check how the source you are reading defines them.
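
To make the matrix/vector reading concrete, here is one common convention written out. It is only a sketch: it assumes $n$ observations with $p$ features each, and the symbols $n$, $p$, $x_{ij}$ and $y_i$ are chosen here for illustration, not taken from the linked question.

$$
\mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} \in \mathbb{R}^{n \times p},
\qquad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^{n}
$$

Under this reading, row $i$ of $\mathbf{X}$ holds the features of the $i$-th observation and $y_i$ is its target. The capital letter then marks a matrix and the lowercase letter a vector, rather than distinguishing a random variable from its realization.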
