I have 20 years of data from observing dolphins. When a group is seen, it gets a unique number identifying it, and all identified (marked) dolphins in it are also recorded. So I have a table like

Group Dolphin
1     1
1     10
1     14
2     10
2     23

I have about 20,000 groups and 500 dolphins. I made a matrix with groups in rows and dolphins in columns, with 1 where a dolphin is in a group and 0 where it's not. I then ran a multivariate analysis (prcomp in R), and it gave me two very distinct blocks.

[Figure: PCA plot, PC1 on the horizontal axis and PC2 on the vertical axis]

I know that the vertical axis (PC2) is related to the size of the group: the small groups are at the bottom of the triangles. But the two blocks are separated along the horizontal axis (PC1), and I haven't figured out yet what it represents. I suppose the dolphin population might be separated into two "clans" or something like that, but I haven't been able to confirm it. I colored the points according to group size, water level, and place of sighting, but all patterns appear equally on both sides. The block on the right has about 2400 points.

My questions are: such a separation must mean something, right? What other tools can I use to understand what is going on?

Thanks very much!

Rodrigo

1 Answer

I recommend looking at prcomp(M)$rotation[,1] (where M is your groups-by-dolphins matrix), which tells you how much each dolphin contributes to the first principal component. (These numbers are usually called loadings; the scores are the coordinates of the groups, stored in $x.) This might help you to interpret PC1 in your plot.
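As a minimal sketch of that inspection, the dolphins can be ranked by the absolute size of their PC1 loading. The matrix name `M`, the column labels and the random stand-in data are all assumptions made for illustration, not your data:

```r
# Stand-in data: 200 groups x 8 dolphins, 0/1 membership (illustrative only)
set.seed(1)
M <- matrix(rbinom(200 * 8, 1, 0.3), nrow = 200, ncol = 8)
colnames(M) <- paste0("dolphin", 1:8)   # hypothetical labels

pca <- prcomp(M)

# pca$rotation holds the loadings (one row per dolphin, one column per PC);
# pca$x holds the scores (one row per group)
pc1 <- pca$rotation[, 1]

# Dolphins ordered from largest to smallest absolute contribution to PC1
ranked <- pc1[order(abs(pc1), decreasing = TRUE)]
head(ranked, 5)
```

With 500 dolphins, a sorted list like this is probably easier to scan than the full rotation matrix.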

Here is an example where there are ten dolphins and 100 groups, and the dolphins are divided into two clans, with groups tending to be made up mostly of dolphins from one clan.

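set.seed(100)
M <- matrix(0, 100, 10)  # rows = groups, columns = dolphins
# Groups 1-50: dolphins 1-5 appear with a random probability, dolphins 6-10 with probability 0.05
for (i in 1:50) {
  M[i, ] <- c(sample(0:1, 5, replace = TRUE, prob = runif(2)),
              sample(0:1, 5, replace = TRUE, prob = c(0.95, 0.05)))
}
# Groups 51-100: the same construction reversed, so dolphins 6-10 are the common ones
for (i in 51:100) {
  M[i, ] <- rev(c(sample(0:1, 5, replace = TRUE, prob = runif(2)),
                  sample(0:1, 5, replace = TRUE, prob = c(0.95, 0.05))))
}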

image(M)

biplot(prcomp(M))

prcomp(M)$rotation[,1:2]

             PC1       PC2
 [1,] -0.2088296 0.1042746
 [2,] -0.3581626 0.1974796
 [3,] -0.3615619 0.2843145
 [4,] -0.3754459 0.3319092
 [5,] -0.2892279 0.5214507
 [6,]  0.3206343 0.4149111
 [7,]  0.2782524 0.2736374
 [8,]  0.3384193 0.3517811
 [9,]  0.2716513 0.2545807
[10,]  0.3228274 0.2272208

Here, PC2 is related to the size of the group, since all the dolphins contribute to it with the same sign. PC1, however, distinguishes between dolphins 1-5 (negative loadings) and dolphins 6-10 (positive loadings). You can also see this on the biplot (although in your case a biplot might not be very helpful, because there will be so many arrows).
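When a biplot is too crowded, one alternative is to split the groups by the sign of their PC1 score and compare how often each dolphin appears on each side. The following is only a sketch under assumed data: the two-clan matrix is simulated, and the names `M`, `side` and `freq` are hypothetical.

```r
set.seed(2)
# Stand-in data: dolphins 1-3 common in the first 60 groups, 4-6 in the last 60
left  <- cbind(matrix(rbinom(60 * 3, 1, 0.60), 60),
               matrix(rbinom(60 * 3, 1, 0.05), 60))
right <- cbind(matrix(rbinom(60 * 3, 1, 0.05), 60),
               matrix(rbinom(60 * 3, 1, 0.60), 60))
M <- rbind(left, right)
colnames(M) <- paste0("d", 1:6)

pca  <- prcomp(M)
side <- pca$x[, 1] > 0   # which side of PC1 each group falls on

# Fraction of groups on each side that contain each dolphin
freq <- rbind(negative = colMeans(M[!side, , drop = FALSE]),
              positive = colMeans(M[side, , drop = FALSE]))
round(freq, 2)

# Dolphins whose sighting frequency differs most between the two sides
sort(abs(freq["positive", ] - freq["negative", ]), decreasing = TRUE)
```

Dolphins at the top of the last list are the ones most associated with the split along PC1.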

However, it's also possible that the separation might be an artefact and doesn't actually mean anything; this can happen with PCA. See Shalizi's PCA tutorials lecture 10 for an example.

Another thing you could consider is social network analysis. People have worked on dolphin social networks before, so you might get some ideas from there. There is a data set called Dolphin in the UCI network data repository which might have links to interesting work.
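As a starting point for a network view, the same groups-by-dolphins matrix can be turned into a co-occurrence matrix in base R. This is a minimal sketch on a tiny made-up matrix; the names `M`, `co` and `edges` are hypothetical.

```r
# Tiny illustrative matrix: 5 groups (rows) x 4 dolphins (columns)
M <- matrix(c(1, 1, 0, 0,
              1, 1, 0, 0,
              0, 0, 1, 1,
              0, 1, 1, 1,
              1, 0, 0, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(NULL, paste0("d", 1:4)))

# co[i, j] = number of groups containing both dolphin i and dolphin j;
# the diagonal counts how many groups each dolphin was seen in
co <- crossprod(M)

# Edge list: one row per pair of dolphins seen together at least once
pairs <- which(upper.tri(co) & co > 0, arr.ind = TRUE)
edges <- data.frame(from   = colnames(co)[pairs[, "row"]],
                    to     = colnames(co)[pairs[, "col"]],
                    weight = co[pairs])
edges
```

An edge list of this shape could then be passed to a graph package such as igraph for community detection, if you want to go further in that direction.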

[Edit: I also wonder whether PC1 might be related to time, since your data are spread over 20 years?]
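A quick way to check this would be to compare the PC1 scores with the sighting year of each group. The sketch below assumes a vector `year` with one entry per group. That vector, the year range and the random stand-in matrix are all hypothetical.

```r
set.seed(3)
n <- 120
year <- sample(1994:2013, n, replace = TRUE)   # hypothetical sighting year per group
M <- matrix(rbinom(n * 6, 1, 0.3), nrow = n)   # stand-in membership matrix

pc1 <- prcomp(M)$x[, 1]   # PC1 score of each group

# Mean PC1 score per year; a trend or a jump would point to a time effect
by_year <- tapply(pc1, year, mean)
by_year

# Rank correlation between PC1 score and year
rho <- cor(pc1, year, method = "spearman")
rho
```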

Flounderer
  • Thank you very much, Flounderer! Digging into the data a little more, I found out that only 3 individuals were causing that separation. They were seen alone (or as the only recognized individual in the group) more than 1000 times each. After removing them from the analysis, it all became a single, pretty cloud of points. But your insights will surely be useful. Thanks for the links! – Rodrigo Feb 12 '14 at 12:57
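
The check described in this comment can be sketched as follows: count how often each dolphin was the only identified individual in a group, then rerun the PCA without the dolphins that dominate that count. The tiny matrix and the threshold of 2 are arbitrary choices for illustration, not values from the original data.

```r
# Tiny illustrative matrix: 6 groups (rows) x 4 dolphins (columns)
M <- matrix(c(1, 0, 0, 0,
              1, 0, 0, 0,
              1, 0, 0, 0,
              0, 1, 1, 0,
              0, 0, 1, 1,
              1, 1, 0, 0),
            nrow = 6, byrow = TRUE,
            dimnames = list(NULL, paste0("d", 1:4)))

# Groups with exactly one identified dolphin
solo <- rowSums(M) == 1

# Number of times each dolphin was the only identified individual
solo_counts <- colSums(M[solo, , drop = FALSE])
sort(solo_counts, decreasing = TRUE)

# Drop dolphins above the (arbitrary) threshold, then drop groups left empty
keep <- solo_counts <= 2
M2 <- M[, keep, drop = FALSE]
M2 <- M2[rowSums(M2) > 0, , drop = FALSE]

pca2 <- prcomp(M2)   # PCA without the frequently-solo dolphins
```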