Your clarifications help. Here are some approximations - though I am not sure they are justified:
- Of the $n$ solutions, you expect $pn-p$ to be copied and $n-np+p$ to be original.
- Each original solution might appear on average about $\frac1{1-p}$ times, though with a truncated geometric distribution, and the probability it appears $m$ times is about $(1-p)p^{m-1}$.
- If a particular original solution occurs $m$ times, then the probability that a teaching assistant does not see its type is about $(1-\frac1k)^m$ using a binomial rather than a hypergeometric distribution as an approximation
- So the probability a teaching assistant does not see a particular type of solution is, ignoring the truncation, about $\sum_{m=1}^{\infty} (1-p)p^{m-1}(1-\frac1k)^m = \frac{(1-p)(1-\frac1k)}{1- p(1-\frac1k)} = 1-\frac{1}{k-kp+p}$
- making the probability a teaching assistant does see a particular type of solution about $\frac{1}{k-kp+p}$, and the expected number of types of solutions seen about $\frac{n-np+p}{k-kp+p}$
- so about $\frac{(n-k)p}{k^2 - k(k-1)p}$ fewer types than the $\frac{n}{k}$ total solutions that the teaching assistant actually sees
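The algebra behind the last three bullets can be written out as follows. Here $q = 1-\frac1k$ is a symbol introduced only for brevity, and the truncated geometric distribution is treated as if it were untruncated, so this is a sketch of the approximation rather than an exact result:

$$\sum_{m=1}^{\infty}(1-p)p^{m-1}q^m = (1-p)q\sum_{j=0}^{\infty}(pq)^j = \frac{(1-p)q}{1-pq}, \qquad 1-\frac{(1-p)q}{1-pq} = \frac{1-q}{1-pq} = \frac{1}{k-kp+p}$$

$$\frac nk - \frac{n-np+p}{k-kp+p} = \frac{n(k-kp+p) - k(n-np+p)}{k(k-kp+p)} = \frac{(n-k)p}{k^2-k(k-1)p}$$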
Apart from some of the arbitrary approximations, I also worry that the estimates made by an individual teaching assistant will vary widely. So it might be sensible to test this by simulation.
Suppose there are $n=100$ students with all but the first having a probability of copying of $p=\frac23$, and $k=5$ teaching assistants so each receives $\frac{100}{5}=20$ solutions. Using R:
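# simulate one TA's batch and return the number of obvious duplicates seen
tadupes <- function(p, n, k){
  # solution types: the first is original; each later one repeats the current type with probability p
  sols <- cumsum(c(1,sample(c(0,1), n-1, replace=TRUE, prob=c(p,1-p))))
  tasols <- sample(sols, n/k, replace=FALSE)  # the n/k solutions one TA receives
  return(n/k - length(unique(tasols)))        # solutions seen minus distinct types seen
}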
set.seed(2023)
probcheat <- 2/3
students <- 100
numTAs <- 5
(students-numTAs)*probcheat / (numTAs^2-numTAs*(numTAs-1)*probcheat)
# 5.428571
numsims <- 10^6
dupesims <- replicate(numsims, tadupes(p=probcheat, n=students, k=numTAs))
mean(dupesims)
# 5.474884
which suggests that the approximation is close even if it is not exact. Changing $p,n,k$ and rerunning the simulation gives similar conclusions.
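As a rough sketch of what "changing $p,n,k$ and rerunning" might look like, the following compares the approximation with a smaller simulation for a few settings. The helper names `approx_dupes` and `sim_dupes`, the seed, the number of simulations, and the two extra parameter settings are illustrative assumptions, not part of the analysis above:

```r
# Hypothetical helpers; only the (2/3, 100, 5) setting comes from the example above.
approx_dupes <- function(p, n, k) (n - k) * p / (k^2 - k * (k - 1) * p)

sim_dupes <- function(p, n, k, nsims = 20000) {
  one_run <- function() {
    # solution types: first is original, each later one repeats the current type with probability p
    sols <- cumsum(c(1, sample(c(0, 1), n - 1, replace = TRUE, prob = c(p, 1 - p))))
    tasols <- sample(sols, n / k, replace = FALSE)  # one TA's batch
    n / k - length(unique(tasols))                  # obvious duplicates in the batch
  }
  mean(replicate(nsims, one_run()))
}

set.seed(1)
# the second and third rows are arbitrary illustrative settings (n divisible by k)
settings <- data.frame(p = c(2/3, 0.5, 0.3), n = c(100, 60, 120), k = c(5, 3, 4))
settings$approx <- with(settings, approx_dupes(p, n, k))
settings$simulated <- mapply(sim_dupes, settings$p, settings$n, settings$k)
print(settings)
```

Fewer simulations than in the main run keep this quick, at the cost of more noise in the `simulated` column.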
But a TA may potentially see a wide range of obvious duplicates. These simulations range from $0$ to $15$ out of the $20$ solutions seen:
table(dupesims)/numsims
# dupesims
# 0 1 2 3 4 5 6 7
# 0.000858 0.008154 0.035128 0.092989 0.165634 0.212801 0.203577 0.147844
# 8 9 10 11 12 13 14 15
# 0.082219 0.035507 0.011652 0.002979 0.000573 0.000075 0.000009 0.000001
If a teaching assistant sees $d$ obviously duplicated solutions (not counting the first of each type, i.e. the difference between solutions and types seen), they could estimate $p$ by setting $d$ equal to the expected shortfall $\frac{(n-k)p}{k^2 - k(k-1)p}$ and solving for $p$, giving something like $$\hat p =\frac{dk^2}{dk^2 +n-(d+1)k }$$ as an approximate method of moments estimator. In terms of your question, $p'=\frac{dk}{n}$ so $\hat p =\frac{knp'}{knp'-np'+n-k}$. That table of duplicates would then become this table of predicted cheating probabilities, ranging from $0$ to almost $0.95$:
# approximate method of moments estimate of p from d obvious duplicates
predictcheat <- function(d, n, k){
  return( (d*k^2)/(d*k^2 + n - (d+1)*k) )
}
predsims <- predictcheat(d=dupesims, n=students, k=numTAs)
table(predsims)/numsims
# predsims
# 0 0.217391304347826 0.37037037037037 0.483870967741935
# 0.000858 0.008154 0.035128 0.092989
# 0.571428571428571 0.641025641025641 0.697674418604651 0.74468085106383
# 0.165634 0.212801 0.203577 0.147844
# 0.784313725490196 0.818181818181818 0.847457627118644 0.873015873015873
# 0.082219 0.035507 0.011652 0.002979
# 0.895522388059702 0.915492957746479 0.933333333333333 0.949367088607595
# 0.000573 0.000075 0.000009 0.000001
and because the prediction is a non-linear function of $d$, which introduces bias, the expected prediction is not quite so close to the actual probability of $\frac23$:
mean(predsims)
# 0.6495667
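To make the bias and the spread a little more concrete, here is a minimal self-contained sketch that repeats the simulation on a smaller scale. The seed, the number of simulations, and the choice of a central 95% range are assumptions made for illustration. Since the prediction is a concave function of $d$, the average of the predictions should fall below the prediction evaluated at the average number of duplicates:

```r
# Same simulation and estimator as above, restated so the block runs on its own.
tadupes <- function(p, n, k) {
  sols <- cumsum(c(1, sample(c(0, 1), n - 1, replace = TRUE, prob = c(p, 1 - p))))
  tasols <- sample(sols, n / k, replace = FALSE)
  n / k - length(unique(tasols))
}
predictcheat <- function(d, n, k) (d * k^2) / (d * k^2 + n - (d + 1) * k)

set.seed(1)                      # arbitrary seed, for reproducibility only
p <- 2/3; n <- 100; k <- 5
d <- replicate(20000, tadupes(p, n, k))   # smaller run than the 10^6 used above
phat <- predictcheat(d, n, k)

mean(phat)                       # average of the individual predictions
predictcheat(mean(d), n, k)      # prediction at the average number of duplicates
quantile(phat, c(0.025, 0.975))  # central 95% range of a single TA's prediction
```

The gap between the first two numbers is the bias from the non-linearity, while the quantiles give a sense of how far a single teaching assistant's estimate could be from the true value.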