Here are a couple of simple ideas, both easily implemented in R, that may be of use.
Number of runs. In statistics, a run is a maximal sequence of consecutive identical values.
R will count runs in a sequence. In R, it can be a little easier to use numbers as category labels than letters. So I'll "translate" your A, B, D, etc. to the numbers 1, 2, 4, etc.,
still taking them to be nominal categorical. [Spaces in input are just for ease of reading.]
x1 = c(1,1,1, 1,1,1, 1,1,1)
x2 = c(1,2,3, 1,2,3, 3,3,1)
x3 = c(1,9,4, 8,7,6, 1,1,5)
x4 = c(3,4,3, 4,3,4, 4,3,3)
x5 = c(1,1,1, 1,2,1, 1,1,1)
x6 = c(1,1,1, 1,1,2, 2,2,2)
In R, there is a function called rle (for "run length encoding").
Here is how it works for x5:
x5
[1] 1 1 1 1 2 1 1 1 1
rle(x5)
Run Length Encoding
lengths: int [1:3] 4 1 4
values : num [1:3] 1 2 1
So there are three runs: specifically a run of four 1s, a run of one
2, and a run of four 1s.
We can capture just the number, three, of runs using $-notation, as follows:
rle(x5)$val
[1] 1 2 1
length(rle(x5)$val)
[1] 3
Here are the numbers of runs in your examples:
length(rle(x1)$val)
[1] 1
length(rle(x2)$val)
[1] 7
length(rle(x3)$val)
[1] 8
length(rle(x4)$val)
[1] 7
length(rle(x5)$val)
[1] 3
We might write x1 < x5 < x2 = x4 < x3 to put your five
sequences in order of increasing 'variability' or 'complexity' according to
their increasing numbers of runs.
This method using runs does not account for the number of different
categories or for the frequencies of the various categories.
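If you have many sequences, it may be easier to count runs for all of them in one step. The following is only a minimal sketch: the list name seqs and the helper n_runs are names I made up for illustration, and x6 (defined above but not tabulated) is included as well.

```r
# The six sequences from above, collected in a named list
# (the name 'seqs' is hypothetical, chosen for this sketch)
seqs <- list(
  x1 = c(1,1,1, 1,1,1, 1,1,1),
  x2 = c(1,2,3, 1,2,3, 3,3,1),
  x3 = c(1,9,4, 8,7,6, 1,1,5),
  x4 = c(3,4,3, 4,3,4, 4,3,3),
  x5 = c(1,1,1, 1,2,1, 1,1,1),
  x6 = c(1,1,1, 1,1,2, 2,2,2)
)

# Hypothetical helper: the number of runs is the number of
# entries in the 'lengths' component returned by rle()
n_runs <- function(x) length(rle(x)$lengths)

# Apply the helper to every sequence; the result is a named integer vector
runs <- sapply(seqs, n_runs)
print(runs)
# values: x1 = 1, x2 = 7, x3 = 8, x4 = 7, x5 = 3, x6 = 2
```

Here rle(x)$lengths is used instead of rle(x)$val; both components have one entry per run, so either gives the same count.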
Number of different categories. You can count the number of distinct categories in R
by using unique to remove repeated categories, and then length
to find the answer, as below, using x3 as an example:
x3
[1] 1 9 4 8 7 6 1 1 5
unique(x3)
[1] 1 9 4 8 7 6 5
length(unique(x3))
[1] 7
There are nine values in the sequence, but only seven distinct categories.
For all five of your sequences, the counts of uniquely different categories are
as follows:
length(unique(x1))
[1] 1
length(unique(x2))
[1] 3
length(unique(x3))
[1] 7
length(unique(x4))
[1] 2
length(unique(x5))
[1] 2
According to numbers of runs, sequences x2 and x4 are not distinguishable,
but you might say that x4 < x2, according to the number of uniquely different categories.
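One possible way to combine the two counts is to put them side by side and sort by runs first, using the number of distinct categories only to break ties. This is just a sketch of that idea, not the only sensible ordering; the names seqs, n_runs, n_cats, and summary_tab are hypothetical.

```r
# The sequences from above (list name 'seqs' is hypothetical)
seqs <- list(
  x1 = c(1,1,1, 1,1,1, 1,1,1),
  x2 = c(1,2,3, 1,2,3, 3,3,1),
  x3 = c(1,9,4, 8,7,6, 1,1,5),
  x4 = c(3,4,3, 4,3,4, 4,3,3),
  x5 = c(1,1,1, 1,2,1, 1,1,1),
  x6 = c(1,1,1, 1,1,2, 2,2,2)
)

# Hypothetical helpers for the two counts discussed above
n_runs <- function(x) length(rle(x)$lengths)   # number of runs
n_cats <- function(x) length(unique(x))        # number of distinct categories

# One row per sequence; row names come from the names of 'seqs'
summary_tab <- data.frame(
  runs       = sapply(seqs, n_runs),
  categories = sapply(seqs, n_cats)
)

# Sort by runs, breaking ties by the number of distinct categories
ord <- order(summary_tab$runs, summary_tab$categories)
ranked <- rownames(summary_tab)[ord]
print(summary_tab[ord, ])
print(ranked)
# ranked is: "x1" "x6" "x5" "x4" "x2" "x3"
```

With this tie-break, x4 comes before x2, which matches the ordering x4 < x2 suggested above.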
Notes: (1) If you have strings of different lengths in future work, you will have to take the lengths of the strings into account. Maybe use runs per unit of string length, or distinct categories per unit of string length.
(2) A more sophisticated way of taking the 'diversity' of categories into account
might be to use a diversity index. One of the simplest is Simpson's.
You can look at Wikipedia's discussion of diversity indexes if you want to experiment with them.
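To make the two notes concrete, here is a rough sketch in R. I am assuming the common definition of Simpson's index as the sum of squared category proportions, and reporting 1 minus that (often called the Gini-Simpson index) so that larger values mean more diversity; other variants exist, so check which one suits your purpose. The function names are hypothetical.

```r
# The sequences from above (list name 'seqs' is hypothetical)
seqs <- list(
  x1 = c(1,1,1, 1,1,1, 1,1,1),
  x2 = c(1,2,3, 1,2,3, 3,3,1),
  x3 = c(1,9,4, 8,7,6, 1,1,5),
  x4 = c(3,4,3, 4,3,4, 4,3,3),
  x5 = c(1,1,1, 1,2,1, 1,1,1),
  x6 = c(1,1,1, 1,1,2, 2,2,2)
)

# Assumed definition: Simpson's index = sum of squared proportions,
# i.e. the chance that two draws (with replacement) are the same category
simpson <- function(x) {
  p <- as.numeric(table(x)) / length(x)   # proportion of each category
  sum(p^2)
}

# Gini-Simpson form: 0 for a single category, larger = more diverse
gini_simpson <- function(x) 1 - simpson(x)

# Note (1): runs per unit of sequence length, for unequal lengths
runs_per_length <- function(x) length(rle(x)$lengths) / length(x)

div <- sapply(seqs, gini_simpson)
rpl <- sapply(seqs, runs_per_length)
print(round(div, 3))
# values: x1 = 0, x2 = 0.642, x3 = 0.815, x4 = 0.494, x5 = 0.198, x6 = 0.494
print(round(rpl, 3))
```

Notice that x4 and x6 get the same diversity value even though x4 has seven runs and x6 only two: a diversity index looks only at category frequencies and ignores the order of the values, so it complements the run count rather than replacing it.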
With AAAABBBBB you have only two categories and only one 'change' (2 runs). That's why I asked what is "important" to the OP. – BruceET Jan 05 '21 at 19:39