An underlying idea in statistical learning is that you can learn by repeating an experiment. For example, we can keep flipping a thumbtack to learn the probability that a thumbtack lands on its head.
In the time-series context, we observe a single run of a stochastic process rather than repeated runs of the stochastic process. We observe one long experiment rather than many independent experiments.
We need stationarity and ergodicity so that observing a long run of a stochastic process is similar to observing many independent runs of a stochastic process.
Some (imprecise) definitions
Let $\Omega$ be a sample space. A stochastic process $\{Y_t\}$ is a function of both time $t \in \{1, 2, 3, \ldots\}$ and outcome $\omega \in \Omega$.
- For any time $t$, $Y_t$ is a random variable (i.e. a function from $\Omega$ to some space such as the space of real numbers).
- For any outcome $\omega$, the series $Y(\omega)$ is a time series of real numbers: $\{Y_1(\omega), Y_2(\omega), Y_3(\omega), \ldots\}$
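One way to make the two-argument view concrete is a small simulation. The sketch below (plain Python; the random walk is a hypothetical process chosen only for illustration, and the seed plays the role of the outcome $\omega$) shows that fixing $\omega$ gives one whole path, while fixing $t$ and varying $\omega$ gives the random variable $Y_t$:

```python
import random

# A stochastic process as a function of (outcome, time): drawing an outcome
# omega fixes an entire path {Y_1(omega), Y_2(omega), ...}, while fixing a
# time t and varying omega gives the random variable Y_t.
def path(omega_seed, T):
    rng = random.Random(omega_seed)  # the seed stands in for omega
    y, out = 0.0, []
    for _ in range(T):
        y += rng.gauss(0.0, 1.0)  # hypothetical random-walk dynamics
        out.append(y)
    return out

paths = {omega: path(omega, 5) for omega in range(3)}  # three outcomes
# Row omega is the time series Y(omega); column t is the random variable Y_t.
```

In this picture, classical statistics samples down a column (many $\omega$, one $t$), while time-series analysis observes a single row (one $\omega$, many $t$).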
A fundamental issue in time series
In Statistics 101, we're taught about a sequence of independent and identically distributed random variables $X_1, X_2, X_3, \ldots$. We observe multiple, identical experiments $i = 1, \ldots, n$, where an $\omega_i \in \Omega$ is randomly chosen for each, and this allows us to learn about the random variable $X$. By the Law of Large Numbers, $\frac{1}{n} \sum_{i=1}^n X_i$ converges almost surely to $\operatorname{E}[X]$.
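A quick sketch of the classical setting (plain Python; the fair coin and sample size are illustrative choices): averaging many independent draws recovers $\operatorname{E}[X]$.

```python
import random

random.seed(0)

# n independent, identically distributed coin flips: X_i = 1 with
# probability 1/2. The Law of Large Numbers says the sample mean of
# independent draws approaches E[X] = 0.5.
n = 100_000
flips = [1 if random.random() < 0.5 else 0 for _ in range(n)]
sample_mean = sum(flips) / n
print(sample_mean)  # close to 0.5
```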
A fundamental difference in the time-series setting is that we're observing multiple observations over time $t$ rather than multiple draws from $\Omega$.
In the general case, the sample mean of a stochastic process $\frac{1}{T} \sum_{t=1}^T Y_t$ may not converge to anything at all!
For multiple observations over time to accomplish a similar task as multiple draws from the sample space, we need stationarity and ergodicity.
If an unconditional mean $\operatorname{E}[Y]$ exists and the conditions for the ergodic theorem are satisfied, the time-series sample mean $\frac{1}{T}\sum_{t=1}^T Y_t$ will converge to the unconditional mean $\operatorname{E}[Y]$.
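To see the ergodic theorem at work, here is a sketch using a stationary, ergodic AR(1) process (an illustrative choice; the coefficient $\phi = 0.5$ and the Gaussian shocks are assumptions, not anything from the text above): a single long run's time average approaches the unconditional mean.

```python
import random

random.seed(1)

# One long run of a stationary, ergodic AR(1) process:
#   Y_t = phi * Y_{t-1} + e_t,  e_t ~ N(0, 1),  |phi| < 1
# Its unconditional mean E[Y] is 0, so the time-series sample mean
# of this single run should settle near 0 as T grows.
phi, T = 0.5, 100_000
y, total = 0.0, 0.0
for _ in range(T):
    y = phi * y + random.gauss(0.0, 1.0)
    total += y
time_avg = total / T
print(time_avg)  # close to E[Y] = 0
```

Contrast this with the two failure examples that follow, where the time average either diverges or converges to the wrong thing.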
Example 1: failure of stationarity
Let $\{Y_t\}$ be the degenerate process $Y_t = t$. We can see that $\{Y_t\}$ is not a stationary (the joint distribution is not time-invariant).
Let $S_t = \frac{1}{t} \sum_{i=1}^t Y_i$ be the time-series sample mean. Then $S_1 = 1$, $S_2 = \frac{3}{2}$, $S_3 = 2$, and in general $S_t = \frac{t+1}{2}$, which is unbounded as $t \rightarrow \infty$. A time-invariant mean of $Y_t$ doesn't exist, and $S_t$ doesn't converge to anything.
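The divergence is easy to check numerically; this short sketch computes $S_t$ for the degenerate process directly from its definition:

```python
# Degenerate process Y_t = t: the time-series sample mean
# S_t = (1/t) * sum_{i=1}^t i = (t+1)/2 grows without bound.
def S(t):
    return sum(range(1, t + 1)) / t

print([S(t) for t in (1, 2, 3, 10, 100)])  # [1.0, 1.5, 2.0, 5.5, 50.5]
```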
Example 2: failure of ergodicity
Let $X \in \{0, 1\}$ be the result of a single fair coin flip. Let $Y_t = X$ for all $t$; that is, either $\{Y_t\} = (0, 0, 0, 0, 0, 0, 0, \ldots)$ or $\{Y_t\} = (1, 1, 1, 1, 1, 1, 1, \ldots)$.
Even though $\operatorname{E}[Y_t] = \frac{1}{2}$, the time-series sample mean $S_t = \frac{1}{t} \sum_{i = 1}^t Y_i$ equals $X$ for every $t$, so within any single run it converges to $0$ or $1$, never to $\operatorname{E}[Y_t] = \frac{1}{2}$.
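This failure also shows up in simulation. The sketch below draws several independent runs of the process; within each run the time average is stuck at that run's value of $X$, no matter how long we observe:

```python
import random

random.seed(2)

# A single coin flip X fixes the entire path Y_t = X. Within one run,
# the time-series sample mean equals X for every t, so it can never
# approach the ensemble mean E[Y_t] = 1/2.
def run_sample_mean(T):
    x = 1 if random.random() < 0.5 else 0  # one flip per run
    return sum(x for _ in range(T)) / T    # = x, regardless of T

means = [run_sample_mean(10_000) for _ in range(8)]
print(means)  # each entry is exactly 0.0 or 1.0, never 0.5
```

Averaging *across* runs would recover $\frac{1}{2}$; averaging *within* a run never does. That is precisely the gap ergodicity is meant to close.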