Puzzle: Detecting Duplicate Solutions

Question

This question has been on my mind for a while, so I am asking it here.

Suppose there are n students who are tasked with an assignment. Students can either produce an original solution, or they can copy a solution from one of their colleagues. Let's say they copy a solution with probability p. On a certain date all students hand in their solutions.
Suppose there are k teaching assistants who look at the handed-in solutions from the students. Assume that n > k, so every teaching assistant receives a chunk of n/k solutions to look at. Teaching assistants only know their own chunk and do not know the solutions other teaching assistants receive.
Let's say that a teaching assistant finds that a proportion p' in their chunk of n/k solutions was copied. Can they reasonably predict the overall copying probability p from just observing the copying proportion p' of their chunk? Or put another way: What might that teaching assistant reasonably assert about the total number of copied solutions among all handed-in solutions?

EDIT:

Some additional assumptions / thoughts because of the comments to make the problem more tractable (but hopefully not too distant from reality):

Let's say all students sit in a single line on a very long bench. The first student on that bench is guaranteed to produce an original solution. The second student has a choice, either produce an original solution, or copy from the first student. They choose between these two options with some unknown but fixed probability p. Similarly, the third student can produce an original solution or copy from the second student with probability p and so on.
The teaching assistants do not know where each student was placed on that bench.
The handed-in solutions are assigned randomly to the teaching assistants.

You may need to slightly expand the question. Suppose you tried to simulate it for given $n$ and $k$. One issue would be what would happen if every student decided to copy: there would be nobody to copy from. Ignoring that, you need to establish the process: do copiers only copy from own-workers or potentially from other copiers? (That choice could affect the distribution of the numbers of copies of different items) — Henry, Sep 02 '23 at 15:05
When you state that "they can copy a solution from one of their colleagues", is this colleague part of the $n$ students or outside the $n$ students? — Xi'an, Sep 02 '23 at 15:26
@Henry Thanks for your comments. I expanded a little on my question and added some simplifying assumptions. — r0f1, Sep 02 '23 at 15:37
Are the "chunks" randomly assigned or sequentially assigned from the line of students? (This matters a great deal.) — whuber, Sep 02 '23 at 15:52
That's a great question. Let's give the students an advantage and say that chunks are randomly assigned to the teaching assistants. — r0f1, Sep 02 '23 at 16:05
Thank you. Just as you did before, please include that information directly in the question itself. I am also curious whether the graders know where each student was situated in the line, because that looks like relevant and useful information. — whuber, Sep 02 '23 at 16:06
Thanks, I edited the question and added the clarifications. The teaching assistants do not know where each student was placed on that bench. — r0f1, Sep 02 '23 at 16:13

Henry · Accepted Answer · 2023-09-03T22:01:35.630

Your clarifications help. Here are some approximations - though I am not sure they are justified:

Of the $n$ solutions, you expect $pn-p$ to be copied and $n-np+p$ to be original.
Each original solution might appear on average about $\frac1{1-p}$ times, though with a truncated geometric distribution, and the probability it appears $m$ times is about $(1-p)p^{m-1}$.
If a particular original solution occurs $m$ times, then the probability that a teaching assistant does not see its type is about $(1-\frac1k)^m$ using a binomial rather than a hypergeometric distribution as an approximation
So the probability a teaching assistant does not see a particular type of solution is about $\sum_{m=1}^{\infty} (1-p)p^{m-1}(1-\frac1k)^m = \frac{(1-p)(1-\frac1k)}{1- p(1-\frac1k)} = 1-\frac{1}{k-kp+p}$, ignoring the truncation of the geometric distribution
making the probability that a teaching assistant does see a particular type of solution about $\frac{1}{k-kp+p}$ and the expected number of types of solutions seen about $\frac{n-np+p}{k-kp+p}$
so about $\frac{(n-k)p}{k^2 - k(k-1)p}$ fewer types than the $\frac{n}{k}$ total solutions that the teaching assistant actually sees
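
To make the algebra behind these approximations explicit, here is a sketch of the steps. It writes $q = 1-\frac1k$ (a shorthand introduced here, not in the original answer) and extends the sum to infinity, which ignores the truncation of the geometric distribution:

$$\begin{aligned}
\sum_{m=1}^{\infty}(1-p)p^{m-1}q^m &= (1-p)\,q\sum_{m=0}^{\infty}(pq)^m = \frac{(1-p)q}{1-pq} = \frac{(1-p)(k-1)}{k-kp+p},\\
1-\frac{(1-p)(k-1)}{k-kp+p} &= \frac{(k-kp+p)-(k-1)(1-p)}{k-kp+p} = \frac{1}{k-kp+p},\\
\frac nk-\frac{n-np+p}{k-kp+p} &= \frac{n(k-kp+p)-k(n-np+p)}{k(k-kp+p)} = \frac{(n-k)p}{k^2-k(k-1)p}.
\end{aligned}$$

The first line is the probability of missing a type, the second the probability of seeing it, and the third the expected shortfall of types relative to the $\frac nk$ solutions in a chunk.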

Apart from some of the arbitrary approximations, I also worry that the estimates by a teaching assistant will be wide. So it might be sensible to test this by simulation.

Suppose there are $n=100$ students with all but the first having a probability of copying of $p=\frac23$, and $k=5$ teaching assistants so each receives $\frac{100}{5}=20$ solutions. Using R:

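# Simulate one teaching assistant's chunk and count duplicates
# (solutions seen minus distinct solution types seen)
tadupes <- function(p, n, k){
  # type label per student: start a new type with probability 1-p, else copy the predecessor
  sols <- cumsum(c(1,sample(c(0,1), n-1, replace=TRUE, prob=c(p,1-p))))
  # one TA's random chunk of n/k solutions
  tasols <- sample(sols, n/k, replace=FALSE)
  return(n/k - length(unique(tasols)))
  }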
set.seed(2023)
probcheat <- 2/3
students <- 100
numTAs <- 5
(students-numTAs)*probcheat / (numTAs^2-numTAs*(numTAs-1)*probcheat)
# 5.428571
numsims <- 10^6
dupesims <- replicate(numsims, tadupes(p=probcheat, n=students, k=numTAs))
mean(dupesims)
# 5.474884

which suggests that the approximation is close even if it is not exact. Changing $p,n,k$ and rerunning the simulation gives similar conclusions.
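
A minimal sketch of such a rerun over a small grid of settings, assuming the same bench model as `tadupes`. The helper names `approx_dupes` and `sim_dupes`, the grid values, and the reduced number of simulations are illustrative choices made here, not part of the original answer:

```r
# Approximate expected number of duplicates in one TA's chunk
# (hypothetical helper wrapping the formula derived above)
approx_dupes <- function(p, n, k) (n - k) * p / (k^2 - k * (k - 1) * p)

# Simulated mean number of duplicates in one TA's chunk
# (hypothetical helper; nsims is kept small so the sweep runs quickly)
sim_dupes <- function(p, n, k, nsims = 2000) {
  one_run <- function() {
    # type label per student: new type with probability 1-p, else copy predecessor
    sols <- cumsum(c(1, sample(c(0, 1), n - 1, replace = TRUE, prob = c(p, 1 - p))))
    # duplicates = chunk size minus distinct types in a random chunk
    n / k - length(unique(sample(sols, n / k, replace = FALSE)))
  }
  mean(replicate(nsims, one_run()))
}

set.seed(1)
# every n in the grid is divisible by every k, so chunks have integer size
grid <- expand.grid(p = c(0.2, 0.5, 0.8), n = c(60, 120), k = c(3, 6))
grid$approx <- mapply(approx_dupes, grid$p, grid$n, grid$k)
grid$simulated <- mapply(sim_dupes, grid$p, grid$n, grid$k)
print(grid)
```

Comparing the `approx` and `simulated` columns row by row shows how well the approximation holds for each setting; increasing `nsims` reduces the simulation noise.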

But a TA may potentially see a wide range of obvious duplicates. These simulations range from $0$ to $15$ out of the $20$ solutions seen:

table(dupesims)/numsims
# dupesims
#        0        1        2        3        4        5        6        7 
# 0.000858 0.008154 0.035128 0.092989 0.165634 0.212801 0.203577 0.147844 
#        8        9       10       11       12       13       14       15 
# 0.082219 0.035507 0.011652 0.002979 0.000573 0.000075 0.000009 0.000001

If a teaching assistant sees $d$ obviously duplicated solutions (not counting the first of each type, i.e. the difference between solutions and types seen), they could estimate $p$ as something like $$\hat p =\frac{dk^2}{dk^2 +n-(d+1)k }$$ as an approximate method of moments estimator. In terms of your question, $p'=\frac{dk}{n}$ so $\hat p =\frac{knp'}{knp'-np'+n-k}$. That table of duplicates would then become this table of predicted cheating probabilities, ranging from $0$ to almost $0.95$:

predictcheat <- function(d, n, k){
  return( (d*k^2)/(d*k^2 + n - (d+1)*k) )
  }
predsims <- predictcheat(d=dupesims, n=students, k=numTAs)
table(predsims)/numsims
# predsims
#                 0 0.217391304347826  0.37037037037037 0.483870967741935 
#          0.000858          0.008154          0.035128          0.092989 
# 0.571428571428571 0.641025641025641 0.697674418604651  0.74468085106383 
#          0.165634          0.212801          0.203577          0.147844 
# 0.784313725490196 0.818181818181818 0.847457627118644 0.873015873015873 
#          0.082219          0.035507          0.011652          0.002979 
# 0.895522388059702 0.915492957746479 0.933333333333333 0.949367088607595 
#          0.000573          0.000075          0.000009          0.000001

Because the expression for the prediction is non-linear in $d$, it introduces bias, so the expected prediction is not quite so close to the actual probability of $\frac23$:

mean(predsims)
# 0.6495667
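
For reference, the estimator used in `predictcheat` can be obtained as follows. This is a sketch, assuming the expected-duplicates approximation $\frac{(n-k)p}{k^2-k(k-1)p}$ is set equal to the observed $d$ and solved for $p$:

$$\begin{aligned}
d\left(k^2-k(k-1)p\right) &= (n-k)p\\
dk^2 &= p\left(n-k+dk(k-1)\right) = p\left(dk^2+n-(d+1)k\right)\\
\hat p &= \frac{dk^2}{dk^2+n-(d+1)k}
\end{aligned}$$

For example, $d=5$, $n=100$, $k=5$ gives $\frac{125}{195}\approx 0.641$, which is the entry for five duplicates in the table of predictions.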

Thank you for that elaborate answer. — r0f1, Sep 03 '23 at 20:27
