How to measure variability of categorical time series?

Question

I have a list of categorical time series. For example:

A, A, A, A, A, A, A, A, A
A, B, C, A, B, C, C, C, A
A, Z, D, X, Y, T, A, A, F
C, D, C, D, C, D, D, C, C
A, A, A, A, B, A, A, A, A
A, A, A, A, A, B, B, B, B

I'd like to capture the jitteriness or the unpredictability of the time series and to order the list by that measure.

Are there any good measures for this?

What matters to you? Changes from one letter to another? Variety of different letters? // Consider number and lengths of runs, 'similarity indexes', // Possibly relevant Q&A. — BruceET, Jan 05 '21 at 09:16
https://stats.stackexchange.com/questions/438279/autocorrelation-for-a-categorical-time-series — CheeseBurger, Jan 05 '21 at 09:19
@BruceET, I think what's more important for me is the change itself (i.e. #changes) — Elimination, Jan 05 '21 at 11:41
@GuneykanOzkaya, how do you autocorrelate a categorical variable? — Elimination, Jan 05 '21 at 12:08
"Jitteriness" and unpredictability are different concepts. E.g. consider the distinction between AAAABBBB and ABABABAB. The second is more "jittery" than the first, because the symbols flip back and forth. But, they do so in a very regular way, so both sequences are highly predictable. Could you edit the question to be more specific about the properties you're interested in? — user20160, Jan 05 '21 at 19:32
@user20160: "Predictability" is hard to define or assess in a string of nine. For AAAABBBBB you have only two categories and only one 'change' (2 runs). That's why I asked what "important" to OP. — BruceET, Jan 05 '21 at 19:39
@BruceET Yes, I agree with your request for clarification. My comment is a followup to that, because the OP replied that number of changes is of interest. But, the question still mentions an interest in 'unpredictability', so I just want to emphasize that changes don't necessarily imply unpredictability. Short strings of characters were just a cartoonish way of pointing that out. Formalizing predictability certainly requires more nuance. — user20160, Jan 05 '21 at 19:51
OP mentions "jitteriness or unpredictability" which leaves lots of room to speculate about criteria and goals. — BruceET, Jan 05 '21 at 20:05

score 3 · Answer 1 · answered Jan 06 '21 at 10:14

I note from the comments that you are primarily interested in changes in state. One way to look at these discrete time-series is to model them as a Markov chain, in which the probability distribution for the outcome depends only on the previous state. Transition probabilities for a Markov chain can be estimated empirically from their sample proportions, so the probability of transitioning into a different state than the present state can also be estimated in this manner. Thus, in a discrete-state time-series $\mathbf{x} = (x_1,...,x_n)$ the estimator for the probability of a change-of-state is:

$$\hat{p} = \frac{1}{n-1} \sum_{t=2}^n \mathbb{I}(x_t \neq x_{t-1}).$$

For the time-series vectors in your question you have:

$$\begin{matrix} \text{A, A, A, A, A, A, A, A, A} & & & \hat{p} = 0, \\[6pt] \text{A, B, C, A, B, C, C, C, A} & & & \hat{p} = \tfrac{3}{4}, \\[6pt] \text{A, Z, D, X, Y, T, A, A, F} & & & \hat{p} = \tfrac{7}{8}, \\[6pt] \text{C, D, C, D, C, D, D, C, C} & & & \hat{p} = \tfrac{3}{4}, \\[6pt] \text{A, A, A, A, B, A, A, A, A} & & & \hat{p} = \tfrac{1}{4}, \\[6pt] \text{A, A, A, A, A, B, B, B, B} & & & \hat{p} = \tfrac{1}{8}. \\[6pt] \end{matrix}$$
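As a rough check, the estimator above can be computed in a few lines of R. This is only a sketch: the helper names `change_rate` and `to_series` and the labels `s1` to `s6` are made up for illustration, and the series are typed in as strings purely for convenience.

```r
# Share of consecutive pairs (x[t-1], x[t]) whose values differ.
change_rate <- function(x) {
  n <- length(x)
  if (n < 2) return(NA_real_)  # undefined with fewer than two observations
  mean(x[-1] != x[-n])         # compares x[2..n] with x[1..n-1]
}

# Hypothetical helper: split a string such as "AAB" into c("A", "A", "B").
to_series <- function(s) strsplit(s, "")[[1]]

# The six series from the question, labelled s1..s6 (labels are assumptions).
series <- lapply(
  c(s1 = "AAAAAAAAA", s2 = "ABCABCCCA", s3 = "AZDXYTAAF",
    s4 = "CDCDCDDCC", s5 = "AAAABAAAA", s6 = "AAAAABBBB"),
  to_series
)

p_hat <- sapply(series, change_rate)
p_hat
#>    s1    s2    s3    s4    s5    s6
#> 0.000 0.750 0.875 0.750 0.250 0.125
```

Sorting `p_hat` (for example with `sort(p_hat)`) then orders the series by estimated change probability.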

BruceET · Answer 2 · 2021-01-06T09:29:55.930

Here are a couple of simple ideas, both easily implemented in R, that may be of use.

Number of runs. In statistics, a run is a maximal sequence of consecutive identical values. R will count runs in a sequence. In R, it can be a little easier to use numbers as category labels than letters. So I'll "translate" your A, B, D, etc. to the numbers 1, 2, 4, etc., still taking them to be nominal categorical. [Spaces in input are just for ease of reading.]

x1 = c(1,1,1, 1,1,1, 1,1,1)
x2 = c(1,2,3, 1,2,3, 3,3,1)
x3 = c(1,9,4, 8,7,6, 1,1,5)
x4 = c(3,4,3, 4,3,4, 4,3,3)
x5 = c(1,1,1, 1,2,1, 1,1,1)
x6 = c(1,1,1, 1,1,2, 2,2,2)

In R, there is a procedure called rle (for "run length encoding"). Here is how it works for x5:

x5
[1] 1 1 1 1 2 1 1 1 1
rle(x5)
Run Length Encoding
  lengths: int [1:3] 4 1 4
  values : num [1:3] 1 2 1

So there are three runs: specifically a run of four 1s, a run of one 2, and a run of four 1s.

We can capture just the number of runs (three) using $-notation, as follows. Here $val works because R partially matches it to the values component:

rle(x5)$val
[1] 1 2 1
length(rle(x5)$val)
[1] 3

Here are the numbers of runs in your examples:

length(rle(x1)$val)
[1] 1
length(rle(x2)$val)
[1] 7
length(rle(x3)$val)
[1] 8
length(rle(x4)$val)
[1] 7
length(rle(x5)$val)
[1] 3

We might write x1 < x5 < x2 = x4 < x3 to put the first five of your sequences in order of increasing 'variability' or 'complexity' according to their increasing numbers of runs. (The sixth sequence, x6, has only two runs, so it would fall between x1 and x5.)

This method using runs does not account for the number of different categories or for their frequencies.
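If it helps, the run counts for all six sequences can be collected in one step. This is a minimal sketch; the names `seqs` and `n_runs` are only illustrative. Note that the number of runs is always the number of changes plus one, so for sequences of equal length this ordering agrees with one based on counting changes.

```r
x1 <- c(1,1,1, 1,1,1, 1,1,1)
x2 <- c(1,2,3, 1,2,3, 3,3,1)
x3 <- c(1,9,4, 8,7,6, 1,1,5)
x4 <- c(3,4,3, 4,3,4, 4,3,3)
x5 <- c(1,1,1, 1,2,1, 1,1,1)
x6 <- c(1,1,1, 1,1,2, 2,2,2)

seqs <- list(x1 = x1, x2 = x2, x3 = x3, x4 = x4, x5 = x5, x6 = x6)

# Number of runs = number of entries in the run-length encoding.
n_runs <- sapply(seqs, function(x) length(rle(x)$lengths))
n_runs
#> x1 x2 x3 x4 x5 x6
#>  1  7  8  7  3  2

# Number of changes of state is one less than the number of runs.
n_changes <- n_runs - 1
```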

Number of different categories. You can count the number of distinct categories in R by using unique to remove "redundant" categories, and then length to find the answer, as below, using x3 as an example:

x3
[1] 1 9 4 8 7 6 1 1 5
unique(x3)
[1] 1 9 4 8 7 6 5
length(unique(x3))
[1] 7

There are nine values in the sequence, but only seven distinct categories.

For the first five of your sequences, the counts of distinct categories are as follows:

length(unique(x1))
[1] 1
length(unique(x2))
[1] 3
length(unique(x3))
[1] 7
length(unique(x4))
[1] 2
length(unique(x5))
[1] 2

According to numbers of runs, sequences x2 and x4 are not distinguishable, but you might say that x4 < x2, according to the number of uniquely different categories.
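One possible way to combine the two counts into a single ordering is to sort by number of runs and break ties by number of distinct categories. That tie-breaking rule is just an assumption for illustration, as are the names `seqs`, `summary_tab` and `ranked`; you may prefer a different rule.

```r
seqs <- list(
  x1 = c(1,1,1, 1,1,1, 1,1,1),
  x2 = c(1,2,3, 1,2,3, 3,3,1),
  x3 = c(1,9,4, 8,7,6, 1,1,5),
  x4 = c(3,4,3, 4,3,4, 4,3,3),
  x5 = c(1,1,1, 1,2,1, 1,1,1),
  x6 = c(1,1,1, 1,1,2, 2,2,2)
)

# One row per sequence: number of runs and number of distinct categories.
summary_tab <- data.frame(
  runs       = sapply(seqs, function(x) length(rle(x)$lengths)),
  categories = sapply(seqs, function(x) length(unique(x)))
)

# Sort by runs first; ties are broken by the number of distinct categories.
ranked <- summary_tab[order(summary_tab$runs, summary_tab$categories), ]
rownames(ranked)
#> [1] "x1" "x6" "x5" "x4" "x2" "x3"
```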

Notes: (1) If you will have strings of different lengths in future work, you will have to take the lengths of the strings into account, perhaps by using runs per unit of string length or distinct categories per unit of string length. (2) A more sophisticated way of taking the 'diversity' of categories into account might be to use a diversity index. One of the simplest is Simpson's. You can look at Wikipedia's discussion of diversity indexes if you want to experiment with them.
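Both notes can be tried out quickly. The sketch below uses the Gini-Simpson form of the index, $1 - \sum_i p_i^2$, where $p_i$ is the relative frequency of category $i$; other variants of Simpson's index exist, and the function names `gini_simpson` and `runs_per_length` are made up here.

```r
# Gini-Simpson diversity: 1 minus the sum of squared category proportions.
# 0 means a single category; values near 1 mean many evenly used categories.
gini_simpson <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  1 - sum(p^2)
}

# Runs per observation, so that sequences of different lengths are comparable.
runs_per_length <- function(x) length(rle(x)$lengths) / length(x)

x3 <- c(1,9,4, 8,7,6, 1,1,5)
x5 <- c(1,1,1, 1,2,1, 1,1,1)
x6 <- c(1,1,1, 1,1,2, 2,2,2)

gini_simpson(x3)     # 22/27, about 0.815
gini_simpson(x5)     # 16/81, about 0.198
gini_simpson(x6)     # 40/81, about 0.494
runs_per_length(x5)  # 3/9
```

A diversity index ignores the order of the values: x6 scores higher than x5 here even though it has fewer runs, so it complements the run count rather than replacing it.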
